Frontiers of Computer Science


Front. Comput. Sci.    2022, Vol. 16 Issue (1) : 161312    https://doi.org/10.1007/s11704-021-0147-9
RESEARCH ARTICLE
LIDAR: learning from imperfect demonstrations with advantage rectification
Xiaoqin ZHANG1, Huimin MA2(), Xiong LUO2, Jian YUAN1
1. Department of EE, Tsinghua University, Beijing 100084, China
2. School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China
Abstract

In actor-critic reinforcement learning (RL) algorithms, function estimation errors are known to cause ineffective random exploration at the beginning of training and to lead to overestimated values and suboptimal policies. In this paper, we address the problem by performing advantage rectification with imperfect demonstrations, thereby reducing function estimation errors. Pretraining with expert demonstrations has been widely adopted to accelerate deep reinforcement learning when simulations are expensive to obtain. However, existing methods such as behavior cloning often assume that the demonstrations carry additional information or labels about their quality, for example that they are optimal, an assumption that is usually unrealistic in the real world. In this paper, we explicitly handle imperfect demonstrations within actor-critic RL frameworks and propose a new method, learning from imperfect demonstrations with advantage rectification (LIDAR). LIDAR uses a rectified loss function to learn only from selected demonstrations, derived from the minimal assumption that the demonstrating policies perform better than the current policy. LIDAR learns from the contradictions caused by estimation errors and, in turn, reduces those errors. We apply LIDAR to three popular actor-critic algorithms, DDPG, TD3 and SAC, and experiments show that our method observably reduces function estimation errors, effectively leverages demonstrations far from optimal, and consistently outperforms state-of-the-art baselines in all scenarios.
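
To make the idea concrete, below is a minimal sketch, in a PyTorch-like style, of an advantage-rectified demonstration loss consistent with the description above: a demonstrated action is imitated only when the critic currently estimates it to have a positive advantage over the current policy's action. The names (actor, critic, demo_states, demo_actions) and the masked behavior-cloning form of the loss are illustrative assumptions, not the paper's exact objective.

    import torch

    def rectified_demo_loss(actor, critic, demo_states, demo_actions):
        # Sketch of an advantage-rectified demonstration loss (illustrative only).
        # actor(s) -> action and critic(s, a) -> Q-value are assumed torch modules.
        with torch.no_grad():
            policy_actions = actor(demo_states)
            q_demo = critic(demo_states, demo_actions).view(-1)
            q_pi = critic(demo_states, policy_actions).view(-1)
            # Advantage of the demonstrated action over the current policy:
            # A(s, a_demo) ~ Q(s, a_demo) - Q(s, pi(s)).
            advantage = q_demo - q_pi
            # Rectification: keep only demonstrations the critic ranks above the policy.
            mask = (advantage > 0).float()
        # Behavior-cloning error, applied only to the selected demonstrations.
        bc_error = ((actor(demo_states) - demo_actions) ** 2).mean(dim=-1)
        return (mask * bc_error).sum() / mask.sum().clamp(min=1.0)

In practice, a term of this kind would be added to the usual actor loss of DDPG, TD3, or SAC, so that demonstrations inconsistent with the critic's current estimates contribute no gradient.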

Keywords learning from demonstrations      actor-critic reinforcement learning      advantage rectification     
Corresponding Author(s): Huimin MA   
Just Accepted Date: 03 March 2021   Issue Date: 19 November 2021
 Cite this article:   
Xiaoqin ZHANG, Huimin MA, Xiong LUO, et al. LIDAR: learning from imperfect demonstrations with advantage rectification[J]. Front. Comput. Sci., 2022, 16(1): 161312.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-021-0147-9
https://academic.hep.com.cn/fcs/EN/Y2022/V16/I1/161312