Model gradient: unified model and policy learning in model-based reinforcement learning

Chengxing JIA1,2, Fuxiang ZHANG1,2, Tian XU1,2, Jing-Cheng PANG1,2, Zongzhang ZHANG1, Yang YU1,2
1. National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
2. Polixir Technologies, Nanjing 210000, China
Abstract Model-based reinforcement learning is a promising direction for improving the sample efficiency of reinforcement learning by learning a model of the environment. Previous model learning methods aim to fit the transition data and commonly employ a supervised learning approach that minimizes the distance between the predicted state and the real state. Such supervised model learning, however, diverges from the ultimate goal of model learning, i.e., optimizing the policy that is learned in the model. In this work, we investigate how model learning and policy learning can share the same objective of maximizing the expected return in the real environment. We find that model learning towards this objective leads to a target of enhancing the similarity between the policy gradient estimated on model-generated data and the policy gradient estimated on real data. We thus derive the gradient of the model from this target and propose the Model Gradient algorithm (MG), which integrates this novel model learning approach with policy-gradient-based policy optimization. Experiments on multiple locomotion control tasks show that MG not only achieves high sample efficiency but also leads to better convergence performance than traditional model-based reinforcement learning approaches.
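To make the gradient-matching idea in the abstract concrete, the sketch below (Python/PyTorch) trains a dynamics model by penalizing the distance between a REINFORCE-style policy gradient computed on real transitions and the same estimate computed on short rollouts generated by the model. This is only a minimal illustration under simplifying assumptions, not the authors' implementation: the names GaussianPolicy, DynamicsModel, reinforce_grad, model_gradient_loss, and reward_fn are hypothetical, and per-step rewards stand in for full returns.

import torch
import torch.nn as nn


class GaussianPolicy(nn.Module):
    # Toy diagonal-Gaussian policy (hypothetical, for illustration only).
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.mean = nn.Linear(state_dim, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def dist(self, states):
        return torch.distributions.Normal(self.mean(states), self.log_std.exp())


class DynamicsModel(nn.Module):
    # Toy deterministic dynamics model (hypothetical).
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Linear(state_dim + action_dim, state_dim)

    def forward(self, states, actions):
        return self.net(torch.cat([states, actions], dim=-1))


def reinforce_grad(policy, states, actions, returns):
    # REINFORCE-style policy-gradient estimate; create_graph=True keeps the
    # estimate differentiable so a loss defined on it can train the model.
    log_probs = policy.dist(states).log_prob(actions).sum(-1)
    surrogate = -(log_probs * returns).mean()
    return torch.autograd.grad(surrogate, list(policy.parameters()), create_graph=True)


def model_gradient_loss(policy, model, reward_fn, states, actions, returns, horizon=3):
    # Policy gradient estimated on real transitions (the fixed target).
    g_real = reinforce_grad(policy, states, actions, returns)

    # Short policy rollout inside the learned model, kept differentiable
    # with respect to the model parameters.
    s = states
    sim_s, sim_a, sim_r = [], [], []
    for _ in range(horizon):
        a = policy.dist(s).rsample()
        sim_s.append(s)
        sim_a.append(a)
        sim_r.append(reward_fn(s, a))  # per-step reward as a crude return proxy
        s = model(s, a)
    g_sim = reinforce_grad(policy, torch.cat(sim_s), torch.cat(sim_a), torch.cat(sim_r))

    # Model objective: make the two policy-gradient estimates similar.
    return sum(((gr.detach() - gs) ** 2).sum() for gr, gs in zip(g_real, g_sim))

In a full training loop, one would presumably alternate minimizing this loss with an optimizer over the model parameters and performing standard policy-gradient updates on data generated by the learned model.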
Keywords: reinforcement learning; model-based reinforcement learning; Markov decision process

Corresponding Author(s): Yang YU

Just Accepted Date: 25 September 2023
Issue Date: 11 December 2023