Frontiers of Computer Science

ISSN 2095-2228 (Print) · ISSN 2095-2236 (Online) · CN 10-1014/TP


Front. Comput. Sci.    2019, Vol. 13 Issue (1) : 106-126    https://doi.org/10.1007/s11704-017-6222-6
RESEARCH ARTICLE
Efficient reinforcement learning in continuous state and action spaces with Dyna and policy approximation
Shan ZHONG 1,2,3,4, Quan LIU 1,4,5, Zongzhang ZHANG 1, Qiming FU 1,3,4,6
1. School of Computer Science and Technology, Soochow University, Suzhou 215000, China
2. School of Computer Science and Engineering, Changshu Institute of Technology, Changshu 215500, China
3. Jiangsu Province Key Laboratory of Intelligent Building Energy Efficiency, Suzhou University of Science and Technology, Suzhou 215006, China
4. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
5. Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
6. College of Electronic & Information Engineering, Suzhou University of Science and Technology, Suzhou 215000, China
Abstract

Dyna is an effective reinforcement learning (RL) approach that combines value function evaluation with model learning. However, most existing work on Dyna addresses only RL problems with discrete action spaces. This paper proposes a novel Dyna variant, called Dyna-LSTD-PA, to handle problems with continuous action spaces; the name stands for Dyna based on least-squares temporal difference (LSTD) and policy approximation. Dyna-LSTD-PA consists of two simultaneous, interacting processes. The learning process models the action distribution as a Gaussian distribution, represents the underlying value function, policy, and model linearly, and updates their parameter vectors online by LSTD(λ). The planning process updates the parameter vector of the value function again by offline LSTD(λ). Dyna-LSTD-PA also uses the Sherman–Morrison formula to improve the efficiency of LSTD(λ), and combines the two processes by weighting the value-function parameter vectors they produce. Theoretically, a global error bound is derived by accounting for approximation, estimation, and model errors. Experimentally, Dyna-LSTD-PA outperforms two representative methods in terms of convergence rate, success rate, and stability on four benchmark RL problems.
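To make the Sherman–Morrison step concrete, the following is a minimal sketch (Python/NumPy, not the authors' code) of how a recursive LSTD(λ) estimator can maintain the inverse of its A matrix directly, so each transition costs O(n²) rather than re-inverting A. The class and function names, the feature vectors, the regularization constant delta, and the Gaussian-policy helper are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

class RecursiveLSTDLambda:
    """Minimal sketch of LSTD(lambda) with Sherman-Morrison updates.

    Maintains A^{-1} directly, so each transition costs O(n^2)
    instead of the O(n^3) of re-inverting A. All names and the
    initialisation constant `delta` are illustrative assumptions.
    """

    def __init__(self, n_features, gamma=0.99, lam=0.8, delta=1e-2):
        self.gamma, self.lam = gamma, lam
        self.A_inv = np.eye(n_features) / delta   # (delta * I)^{-1} at t = 0
        self.b = np.zeros(n_features)
        self.z = np.zeros(n_features)             # eligibility trace

    def update(self, phi, reward, phi_next):
        # Eligibility trace and the rank-one increment A += z d^T.
        self.z = self.gamma * self.lam * self.z + phi
        d = phi - self.gamma * phi_next

        # Sherman-Morrison: (A + z d^T)^{-1} obtained from A^{-1} in O(n^2).
        Az = self.A_inv @ self.z                  # A^{-1} z
        dA = d @ self.A_inv                       # d^T A^{-1}
        self.A_inv -= np.outer(Az, dA) / (1.0 + dA @ self.z)

        self.b += reward * self.z

    @property
    def theta(self):
        # Value-function weights: theta = A^{-1} b.
        return self.A_inv @ self.b


def gaussian_policy_action(phi, theta_mu, sigma, rng):
    # Continuous action sampled from N(theta_mu . phi, sigma^2),
    # mirroring the Gaussian action distribution described in the abstract.
    return rng.normal(theta_mu @ phi, sigma)
```

In the spirit of the abstract, one such estimator would be updated online from real transitions (learning) and another offline from model-generated transitions (planning), with the final value-function weights formed as a weighted combination of the two parameter vectors; the sketch above only illustrates why the Sherman–Morrison update keeps LSTD(λ) efficient.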

Keywords: problem solving; control methods; heuristic search methods; dynamic programming
Corresponding Author(s): Quan LIU   
Just Accepted Date: 20 January 2017   Online First Date: 06 March 2018    Issue Date: 31 January 2019
 Cite this article:   
Shan ZHONG, Quan LIU, Zongzhang ZHANG, et al. Efficient reinforcement learning in continuous state and action spaces with Dyna and policy approximation[J]. Front. Comput. Sci., 2019, 13(1): 106-126.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-017-6222-6
https://academic.hep.com.cn/fcs/EN/Y2019/V13/I1/106