Frontiers of Computer Science

Front. Comput. Sci.  2024, Vol. 18, Issue (4): 184339    https://doi.org/10.1007/s11704-023-3150-5
Artificial Intelligence
Model gradient: unified model and policy learning in model-based reinforcement learning
Chengxing JIA 1,2, Fuxiang ZHANG 1,2, Tian XU 1,2, Jing-Cheng PANG 1,2, Zongzhang ZHANG 1, Yang YU 1,2
1. National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
2. Polixir Technologies, Nanjing 210000, China
Abstract

Model-based reinforcement learning is a promising direction for improving the sample efficiency of reinforcement learning by learning a model of the environment. Previous model learning methods aim at fitting the transition data and commonly employ a supervised learning approach that minimizes the distance between the predicted state and the real state. Such supervised model learning, however, diverges from the ultimate goal of model learning, i.e., optimizing the policy that is learned within the model. In this work, we investigate how model learning and policy learning can share the same objective of maximizing the expected return in the real environment. We find that model learning towards this objective amounts to enhancing the similarity between the policy gradient computed on model-generated data and the policy gradient computed on real data. We therefore derive the gradient of the model from this target and propose the Model Gradient (MG) algorithm, which integrates this model learning approach with policy-gradient-based policy optimization. We conduct experiments on multiple locomotion control tasks and find that MG not only achieves high sample efficiency but also converges to better performance than traditional model-based reinforcement learning approaches.
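As a rough sketch of this idea (not the authors' released implementation), the model update can be written as maximizing the cosine similarity between the two policy gradients. In the Python/PyTorch sketch below, `policy_loss_fn`, `real_batch`, and `model_batch` are hypothetical placeholders for a policy-gradient surrogate loss and for batches of real and model-generated trajectories.

```python
# A minimal sketch of the model-gradient idea (not the paper's code):
# update the model so that the policy gradient estimated on model-generated data
# points in a direction similar to the policy gradient estimated on real data.
import torch
import torch.nn.functional as F

def model_gradient_loss(policy, policy_loss_fn, real_batch, model_batch):
    """Negative cosine similarity between policy gradients on real vs. model data.

    `policy_loss_fn(policy, batch)` is a hypothetical policy-gradient surrogate
    loss (e.g., REINFORCE); `model_batch` must be produced by a differentiable
    model so that the loss returned here can be back-propagated into the model.
    """
    params = [p for p in policy.parameters() if p.requires_grad]

    # Policy gradient on real environment data, treated as a fixed target.
    g_real = torch.autograd.grad(policy_loss_fn(policy, real_batch), params)
    g_real = torch.cat([g.detach().reshape(-1) for g in g_real])

    # Policy gradient on model-generated data; create_graph=True keeps the graph
    # so this similarity remains differentiable w.r.t. the model parameters.
    g_model = torch.autograd.grad(
        policy_loss_fn(policy, model_batch), params, create_graph=True
    )
    g_model = torch.cat([g.reshape(-1) for g in g_model])

    # Maximizing the similarity is equivalent to minimizing its negative.
    return -F.cosine_similarity(g_real, g_model, dim=0)
```

In the full algorithm this term would be combined with the rest of the MG training loop (e.g., the supervised-learning warm-up mentioned in Fig. 8); the sketch only illustrates the gradient-similarity objective.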

Keywords: reinforcement learning; model-based reinforcement learning; Markov decision process
Corresponding Author(s): Yang YU   
Just Accepted Date: 25 September 2023   Issue Date: 11 December 2023
 Cite this article:   
Chengxing JIA, Fuxiang ZHANG, Tian XU, et al. Model gradient: unified model and policy learning in model-based reinforcement learning[J]. Front. Comput. Sci., 2024, 18(4): 184339.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-023-3150-5
https://academic.hep.com.cn/fcs/EN/Y2024/V18/I4/184339
Fig.1  The intuition of the model update with the model gradient. The similarity between the policy gradients computed from the fake trajectories generated by the model and the policy gradients computed from real data is used to update the model, which drives the model to generate trajectories in a more realistic way
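In our notation (a sketch of the objective implied by the figure and the abstract, not the paper's exact formulation), the model parameters φ are chosen to align the two policy gradients:

```latex
\max_{\phi}\;
\operatorname{sim}\!\left(
  \nabla_{\theta} J_{\widehat{M}_{\phi}}(\pi_{\theta}),\;
  \nabla_{\theta} J_{M^{*}}(\pi_{\theta})
\right)
```

where \widehat{M}_{\phi} is the learned model, M^{*} the real environment, J_{M}(\pi_{\theta}) the expected return of policy \pi_{\theta} under M, and sim(·,·) a gradient-similarity measure such as cosine similarity.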
  
Fig.2  The average episodic returns (average cumulative rewards over running episodes) of MG, PPO, SLBO, and METRPO in three MuJoCo continuous control tasks. The shaded area shows the standard deviation over five random seeds. The curve of METRPO is omitted in Walker2d-v2 because its episodic returns are below −500. (a) HalfCheetah; (b) Hopper; (c) Walker2d
Fig.3  The average episodic returns of MG, SAC, PPO, and MBPO in two MuJoCo continuous control tasks with sparse rewards. The shaded area shows the standard deviation over five random seeds. (a) Hopper; (b) Walker2d
Fig.4  The average episodic returns of MBRL methods with two different model learning approaches, model-gradient model learning (MG) and supervised model learning (SL), in three MuJoCo control tasks. The shaded area shows the standard deviation over five random seeds. (a) HalfCheetah; (b) Hopper; (c) Walker2d
Fig.5  The cosine similarity between the policy gradient computed from data in the real environment and the policy gradient computed from data generated by models trained with the model gradient (MG) and with supervised learning (SL), in the HalfCheetah task of MuJoCo control. The shaded area shows the standard deviation over five random seeds. (a) Cosine similarity of the policy gradient when the roll-out policy is PPO pre-trained for 10k steps; (b) cosine similarity of the policy gradient when the roll-out policy is PPO pre-trained for 50k steps
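For reference, the cosine similarity reported in this figure is the standard quantity (our notation; g_real and g_model denote the flattened policy gradients estimated from real and model-generated data, respectively):

```latex
\cos\!\left(g_{\mathrm{real}}, g_{\mathrm{model}}\right)
= \frac{g_{\mathrm{real}}^{\top} g_{\mathrm{model}}}
       {\lVert g_{\mathrm{real}} \rVert_{2}\, \lVert g_{\mathrm{model}} \rVert_{2}}
```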
Fig.6  The average episodic returns (average cumulative rewards of each episode) of MG with different choices of the hyper-parameter λ in three MuJoCo continuous control tasks. The shaded area shows the standard deviation over five random seeds. (a) HalfCheetah; (b) Hopper; (c) Walker2d
Fig.7  The average episodic returns (average cumulative rewards over running episodes) of MG under different values of β in three MuJoCo continuous control tasks. The shaded area shows the standard deviation over five random seeds. (a) HalfCheetah; (b) Hopper; (c) Walker2d
Fig.8  The average episodic returns of MG with and without supervised-learning warm-up in three MuJoCo continuous control tasks. The shaded area shows the standard deviation over five random seeds. (a) HalfCheetah; (b) Hopper; (c) Walker2d
Hyper-parameter                  HalfCheetah    Hopper         Walker2d
γ                                0.99           0.99           0.99
λ                                [1.0, 0.1]     [1.0, 0.1]     [1.0, 0.1]
β                                0.005          0.005          0.005
Optimizer                        Adam           Adam           Adam
Policy learning rate απ          10^-4          10^-4          10^-4
Model learning rate αm           10^-4          10^-4          10^-4
Network architecture (model)     [32, 32]       [32, 32]       [32, 32]
Network architecture (policy)    [32, 32]       [32, 32]       [32, 32]
Batch size H (real)              2              2              2
Batch size B (model)             8              8              8
Km                               50             50             50
Kπ                               [5, 1]         [5, 1]         [5, 1]

Table A1  Network architecture and hyper-parameters of MG. The notation [1.0, 0.1] means that λ is changed from 1.0 to 0.1 at the 100k-th time step
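For convenience, the settings in Table A1 can be collected into a single configuration sketch; the key names are ours, not identifiers from the authors' released code, all values are shared across the three tasks, and the Kπ schedule is assumed to follow the same notation as λ.

```python
# Hyper-parameters of MG from Table A1, identical for HalfCheetah, Hopper, and Walker2d.
# Key names are illustrative placeholders, not identifiers from the authors' code.
MG_CONFIG = {
    "gamma": 0.99,                   # discount factor γ
    "lambda_schedule": (1.0, 0.1),   # λ changed from 1.0 to 0.1 at the 100k-th step
    "beta": 0.005,
    "optimizer": "Adam",
    "policy_lr": 1e-4,               # απ
    "model_lr": 1e-4,                # αm
    "model_hidden_layers": [32, 32],
    "policy_hidden_layers": [32, 32],
    "batch_size_real_H": 2,
    "batch_size_model_B": 8,
    "K_m": 50,
    "K_pi": (5, 1),                  # assumed to use the same schedule notation as λ
}
```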