Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2022, Vol. 16 Issue (5) : 165330    https://doi.org/10.1007/s11704-021-0354-4
RESEARCH ARTICLE
Efficient policy evaluation by matrix sketching
Cheng CHEN, Weinan ZHANG, Yong YU
Department of Computer Science, Shanghai Jiao Tong University, Shanghai 200240, China
Abstract

In reinforcement learning, policy evaluation aims to predict the long-term value of a state under a given policy. As high-dimensional representations become increasingly common in reinforcement learning, reducing the computational cost of policy evaluation has become a significant problem. Many recent works adopt matrix sketching methods to accelerate least-squares temporal difference (TD) algorithms and quasi-Newton temporal difference algorithms. Among these sketching methods, truncated incremental SVD shows the best performance because it is stable and efficient. However, the convergence properties of incremental SVD remain open. In this paper, we first show that conventional incremental SVD algorithms can incur enormous approximation errors in the worst case. We then propose a variant of incremental SVD with better theoretical guarantees, which periodically shrinks the singular values. Moreover, we employ our improved incremental SVD to accelerate least-squares TD and quasi-Newton TD algorithms. The experimental results verify the correctness and effectiveness of our methods.
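The core idea the abstract describes — maintaining a truncated SVD of a streamed matrix and periodically shrinking the singular values to bound the approximation error — can be sketched in the frequent-directions style. This is a minimal illustration, not the authors' exact algorithm; the function name, buffer size, and shrinkage rule are assumptions for the sketch.

```python
import numpy as np

def sketch_rows(rows, k):
    """Maintain a 2k-row sketch B of a stream of d-dimensional rows.

    When the buffer fills, take an SVD and shrink every squared
    singular value by the k-th largest one (frequent-directions
    style shrinkage), which keeps B^T B close to A^T A.
    Assumes d >= 2k.
    """
    d = len(rows[0])
    B = np.zeros((2 * k, d))
    next_free = 0
    for row in rows:
        if next_free == 2 * k:
            # Periodic shrinking step: SVD, then damp singular values.
            U, s, Vt = np.linalg.svd(B, full_matrices=False)
            s = np.sqrt(np.maximum(s ** 2 - s[k - 1] ** 2, 0.0))
            B = s[:, None] * Vt  # rows k-1..2k-1 become (near) zero
            next_free = k
        B[next_free] = row
        next_free += 1
    return B
```

With buffer size 2k, this shrinkage gives the standard frequent-directions guarantee that the spectral norm of A^T A - B^T B is at most the squared Frobenius norm of A divided by k; the paper's contribution concerns the analogous (and previously open) guarantees for truncated incremental SVD.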

Keywords: temporal difference learning; policy evaluation; matrix sketching
Corresponding Author(s): Weinan ZHANG   
Just Accepted Date: 16 March 2021   Issue Date: 31 December 2021
 Cite this article:   
Cheng CHEN, Weinan ZHANG, Yong YU. Efficient policy evaluation by matrix sketching[J]. Front. Comput. Sci., 2022, 16(5): 165330.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-021-0354-4
https://academic.hep.com.cn/fcs/EN/Y2022/V16/I5/165330
Fig.7  Comparison between iSVD and RiSVD
Fig.8  Comparison between MaiSVD and MaRiSVD
Fig.9  Comparison between different policy evaluation methods on the Mountain Car domain
Fig.10  Comparison between different policy evaluation methods on the Energy Allocation domain