Frontiers of Computer Science

ISSN 2095-2228 (Print) · ISSN 2095-2236 (Online) · CN 10-1014/TP


Front. Comput. Sci.    2019, Vol. 13 Issue (1) : 106-126    https://doi.org/10.1007/s11704-017-6222-6
RESEARCH ARTICLE
Efficient reinforcement learning in continuous state and action spaces with Dyna and policy approximation
Shan ZHONG 1,2,3,4, Quan LIU 1,4,5, Zongzhang ZHANG 1, Qiming FU 1,3,4,6
1. School of Computer Science and Technology, Soochow University, Suzhou 215000, China
2. School of Computer Science and Engineering, Changshu Institute of Technology, Changshu 215500, China
3. Jiangsu Province Key Laboratory of Intelligent Building Energy Efficiency, Suzhou University of Science and Technology, Suzhou 215006, China
4. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
5. Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
6. College of Electronic & Information Engineering, Suzhou University of Science and Technology, Suzhou 215000, China
Abstract

Dyna is an effective reinforcement learning (RL) approach that combines value function evaluation with model learning. However, most existing work on Dyna addresses only RL problems with discrete action spaces. This paper proposes a novel Dyna variant, called Dyna-LSTD-PA, to handle problems with continuous action spaces; the name stands for Dyna based on least-squares temporal difference (LSTD) and policy approximation. Dyna-LSTD-PA consists of two simultaneous, interacting processes. The learning process models the action distribution as a Gaussian distribution, represents the underlying value function, policy, and model linearly, and updates their parameter vectors online by LSTD(λ). The planning process updates the parameter vector of the value function again by offline LSTD(λ). Dyna-LSTD-PA also uses the Sherman–Morrison formula to improve the efficiency of LSTD(λ), and combines the two processes by weighting the value-function parameter vectors they produce. Theoretically, a global error bound is derived by accounting for approximation, estimation, and model errors. Experimentally, Dyna-LSTD-PA outperforms two representative methods in terms of convergence rate, success rate, and stability on four benchmark RL problems.
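To make the Sherman–Morrison step concrete, the following is a minimal sketch (Python/NumPy, not the authors' code) of how a recursive LSTD(λ) estimator can maintain the inverse of its A matrix directly, so each transition costs O(n²) rather than re-inverting A. The class and function names, the feature vectors, the regularization constant delta, and the Gaussian-policy helper are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

class RecursiveLSTDLambda:
    """Minimal sketch of LSTD(lambda) with Sherman-Morrison updates.

    Maintains A^{-1} directly, so each transition costs O(n^2)
    instead of the O(n^3) of re-inverting A. All names and the
    initialisation constant `delta` are illustrative assumptions.
    """

    def __init__(self, n_features, gamma=0.99, lam=0.8, delta=1e-2):
        self.gamma, self.lam = gamma, lam
        self.A_inv = np.eye(n_features) / delta   # (delta * I)^{-1} at t = 0
        self.b = np.zeros(n_features)
        self.z = np.zeros(n_features)             # eligibility trace

    def update(self, phi, reward, phi_next):
        # Eligibility trace and the rank-one increment A += z d^T.
        self.z = self.gamma * self.lam * self.z + phi
        d = phi - self.gamma * phi_next

        # Sherman-Morrison: (A + z d^T)^{-1} obtained from A^{-1} in O(n^2).
        Az = self.A_inv @ self.z                  # A^{-1} z
        dA = d @ self.A_inv                       # d^T A^{-1}
        self.A_inv -= np.outer(Az, dA) / (1.0 + dA @ self.z)

        self.b += reward * self.z

    @property
    def theta(self):
        # Value-function weights: theta = A^{-1} b.
        return self.A_inv @ self.b


def gaussian_policy_action(phi, theta_mu, sigma, rng):
    # Continuous action sampled from N(theta_mu . phi, sigma^2),
    # mirroring the Gaussian action distribution described in the abstract.
    return rng.normal(theta_mu @ phi, sigma)
```

In the spirit of the abstract, one such estimator would be updated online from real transitions (learning) and another offline from model-generated transitions (planning), with the final value-function weights formed as a weighted combination of the two parameter vectors; the sketch above only illustrates why the Sherman–Morrison update keeps LSTD(λ) efficient.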

Keywords: problem solving; control methods; heuristic search methods; dynamic programming
Corresponding Author(s): Quan LIU   
Just Accepted Date: 20 January 2017   Online First Date: 06 March 2018    Issue Date: 31 January 2019
 Cite this article:   
Shan ZHONG, Quan LIU, Zongzhang ZHANG, et al. Efficient reinforcement learning in continuous state and action spaces with Dyna and policy approximation[J]. Front. Comput. Sci., 2019, 13(1): 106-126.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-017-6222-6
https://academic.hep.com.cn/fcs/EN/Y2019/V13/I1/106