Efficient reinforcement learning in continuous state and action spaces with Dyna and policy approximation

Shan ZHONG 1,2,3,4, Quan LIU 1,4,5, Zongzhang ZHANG 1, Qiming FU 1,3,4,6

1. School of Computer Science and Technology, Soochow University, Suzhou 215000, China
2. School of Computer Science and Engineering, Changshu Institute of Technology, Changshu 215500, China
3. Jiangsu Province Key Laboratory of Intelligent Building Energy Efficiency, Suzhou University of Science and Technology, Suzhou 215006, China
4. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
5. Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
6. College of Electronic & Information Engineering, Suzhou University of Science and Technology, Suzhou 215000, China

Abstract  Dyna is an effective reinforcement learning (RL) approach that combines value function evaluation with model learning. However, existing work on Dyna has mostly addressed its efficiency only on RL problems with discrete action spaces. This paper proposes a novel Dyna variant, Dyna-LSTD-PA (Dyna based on least-squares temporal difference (LSTD) and policy approximation), designed for problems with continuous action spaces. Dyna-LSTD-PA consists of two simultaneous, interacting processes. The learning process models the policy as a Gaussian distribution over the action space, estimates the underlying value function, policy, and model with linear representations, and updates their parameter vectors online by LSTD(λ). The planning process updates the value-function parameter vector again with offline LSTD(λ). Dyna-LSTD-PA also applies the Sherman–Morrison formula to improve the efficiency of LSTD(λ), and weights the value-function parameter vectors from the two processes to combine their results. Theoretically, a global error bound is derived by accounting for the approximation, estimation, and model errors. Experimentally, Dyna-LSTD-PA outperforms two representative methods in convergence rate, success rate, and stability on four benchmark RL problems.
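
The efficiency claim for LSTD(λ) in the abstract rests on the Sherman–Morrison formula, which maintains the inverse of the accumulated matrix A through rank-one updates instead of re-inverting it at every step. The sketch below illustrates that general idea for policy evaluation with linear features; the class name, the trace definition, and the initialization constant `delta` are illustrative assumptions, not the paper's exact algorithm, which additionally interleaves planning updates, model learning, and Gaussian policy approximation.

```python
import numpy as np

class RecursiveLSTD:
    """Minimal sketch: LSTD(lambda) policy evaluation with a
    Sherman-Morrison rank-one update of A^{-1}, so no explicit
    matrix inversion is needed per step."""

    def __init__(self, n_features, gamma=0.99, lam=0.8, delta=1.0):
        self.gamma, self.lam = gamma, lam
        self.A_inv = np.eye(n_features) / delta  # inverse of A_0 = delta * I
        self.b = np.zeros(n_features)            # accumulated reward statistics
        self.z = np.zeros(n_features)            # eligibility trace

    def update(self, phi, reward, phi_next):
        # accumulate the eligibility trace over the feature vectors
        self.z = self.gamma * self.lam * self.z + phi
        d = phi - self.gamma * phi_next          # feature-space TD difference
        # Sherman-Morrison: update (A + z d^T)^{-1} from A^{-1} in O(n^2)
        Az = self.A_inv @ self.z
        dA = d @ self.A_inv
        self.A_inv -= np.outer(Az, dA) / (1.0 + dA @ self.z)
        self.b += reward * self.z

    def weights(self):
        # value-function parameter vector: theta = A^{-1} b
        return self.A_inv @ self.b
```

With n basis features, each update costs O(n^2) instead of the O(n^3) required to recompute theta = A^{-1} b by direct inversion, which is the kind of saving the abstract attributes to the Sherman–Morrison formula.
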
Keywords: problem solving, control methods, heuristic search methods, dynamic programming

Corresponding Author(s): Quan LIU

Just Accepted Date: 20 January 2017
Online First Date: 06 March 2018
Issue Date: 31 January 2019