Frontiers of Computer Science  2025, Vol. 19 Issue (4): 194313   https://doi.org/10.1007/s11704-024-3194-1
Clustered Reinforcement Learning
Xiao MA1, Shen-Yi ZHAO1, Zhao-Heng YIN2, Wu-Jun LI1
1. National Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, Nanjing 210023, China
2. Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720-1770, USA
Full text: PDF (2856 KB) | HTML
Abstract

Exploration strategy design is a challenging problem in reinforcement learning (RL), especially when the environment has a large state space or sparse rewards. During exploration, the agent tries to discover unexplored (novel) areas or high-reward (quality) areas. Most existing methods perform exploration by utilizing only the novelty of states; the novelty and quality in the neighboring area of the current state have not been well exploited to jointly guide the agent’s exploration. To address this problem, this paper proposes a novel RL framework, called clustered reinforcement learning (CRL), for efficient exploration in RL. CRL uses clustering to divide the collected states into several clusters and, based on this partition, gives the agent a bonus reward that reflects both the novelty and the quality of the neighboring area (cluster) of the current state. CRL leverages these bonus rewards to guide the agent toward efficient exploration. Moreover, CRL can be combined with existing exploration strategies to improve their performance, since the bonus rewards employed by those strategies capture only the novelty of states. Experiments on four continuous control tasks and six hard-exploration Atari-2600 games show that our method outperforms other state-of-the-art methods and achieves the best performance.
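As a rough illustration of the idea sketched above (this is not the authors' implementation; the weighting scheme and all names such as ClusterBonus, beta, and eta are assumptions made here for exposition), the following Python snippet clusters visited states with K-means, tracks visit counts and extrinsic rewards per cluster, and combines a count-based novelty term with the cluster's mean reward as a quality term into a single bonus:

```python
# Minimal illustrative sketch of a cluster-based exploration bonus in the
# spirit of CRL. Not the authors' code: the weighting scheme and the names
# (beta, eta, n_clusters, ...) are assumptions for illustration only.
import numpy as np
from sklearn.cluster import KMeans

class ClusterBonus:
    def __init__(self, n_clusters=32, beta=0.1, eta=0.5):
        self.n_clusters = n_clusters              # K in K-means
        self.beta = beta                          # overall bonus scale (assumed role)
        self.eta = eta                            # novelty/quality trade-off (assumed role)
        self.kmeans = None
        self.counts = np.zeros(n_clusters)        # visit count per cluster
        self.reward_sums = np.zeros(n_clusters)   # accumulated extrinsic reward per cluster

    def fit(self, states, rewards):
        """Cluster the collected states and accumulate per-cluster statistics."""
        states = np.asarray(states, dtype=np.float32)
        self.kmeans = KMeans(n_clusters=self.n_clusters, n_init=10).fit(states)
        for c, r in zip(self.kmeans.labels_, rewards):
            self.counts[c] += 1.0
            self.reward_sums[c] += r

    def bonus(self, state):
        """Bonus combining the novelty and quality of the state's cluster."""
        c = int(self.kmeans.predict(np.asarray(state, dtype=np.float32)[None, :])[0])
        novelty = 1.0 / np.sqrt(max(self.counts[c], 1.0))          # count-based novelty
        quality = self.reward_sums[c] / max(self.counts[c], 1.0)   # mean reward in the cluster
        return self.beta * (self.eta * novelty + (1.0 - self.eta) * quality)
```

In an actual training loop, such a bonus would be added to the environment reward before the policy update, and the clustering would be refreshed periodically as new states are collected; the exact bonus definition used by CRL is given in the full paper.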

Key words: deep reinforcement learning; exploration; count-based method; clustering; K-means
Received: 2023-03-06      Published online: 2024-05-20
Corresponding Author(s): Wu-Jun LI   
Cite this article:
Xiao MA, Shen-Yi ZHAO, Zhao-Heng YIN, Wu-Jun LI. Clustered Reinforcement Learning. Front. Comput. Sci., 2025, 19(4): 194313.
Link to this article:
https://academic.hep.com.cn/fcs/CN/10.1007/s11704-024-3194-1
https://academic.hep.com.cn/fcs/CN/Y2025/V19/I4/194313
Method      | Mountain Car | Cart-Pole Swing Up | Half-Cheetah | Double Pendulum
TRPO [58]   | 0            | 145.16             | 0            | 294.71
VIME [36]   | 1            | 256.04             | 19.46        | 298.77
CRL         | 1 (0.1)      | 346.58 (0.1)       | 2.06 (0.1)   | 375.51 (0.1)
Hash [24]   | 0.40         | 268.01             | 0            | 279.14
CRL-Hash    | 0.40 (0.5)   | 356.15 (0.9)       | 0 (0.9)      | 367.42 (0.5)
RND [41]    | 0.65         | 310.96             | 0            | 368.81
CRL-RND     | 1 (0.5)      | 331.52 (0.5)       | 0 (0.9)      | 381.02 (0.5)
NovelD [42] | 0.27         | 326.38             | 0            | 366.96
CRL-NovelD  | 0.38 (0.25)  | 336.39 (0.5)       | 0 (0.9)      | 392.23 (0.5)
Tab.1
Layer  | Configuration
conv 1 | filter 32×8×8, stride 4, Leaky ReLU
conv 2 | filter 64×4×4, stride 2, Leaky ReLU
conv 3 | filter 64×3×3, stride 1, Leaky ReLU
full 4 | 256 units
Tab.2
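For readers who want to reproduce the layer stack listed in Tab.2, a minimal PyTorch sketch might look as follows; the 84×84 single-channel input, the absence of padding, and the class name Encoder are assumptions not specified by the table:

```python
# Sketch of the convolutional encoder described in Tab.2. The table only fixes
# filter counts, kernel sizes, strides, Leaky ReLU activations, and the 256-unit
# fully connected layer; the input shape (1 x 84 x 84) is assumed here.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, in_channels=1, embedding_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4),  # conv 1
            nn.LeakyReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),           # conv 2
            nn.LeakyReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),           # conv 3
            nn.LeakyReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, embedding_dim),                 # full 4: 256 units
        )

    def forward(self, x):
        return self.net(x)

# Usage example: Encoder()(torch.zeros(1, 1, 84, 84)) has shape (1, 256).
```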
Method      | Freeway      | Frostbite     | Gravitar       | Montezuma     | Solaris       | Venture
TRPO [58]   | 17.55        | 1229.66       | 500.33         | 0             | 2110.22       | 283.48
CRL         | 30.80 (0.75) | 4337.98 (0.1) | 552.46 (0.1)   | 0 (0.75)      | 3672.55 (0.5) | 312.40 (0.1)
Hash [24]   | 22.29        | 2954.10       | 577.47         | 0             | 2619.32       | 299.61
CRL-Hash    | 28.38 (0.75) | 4148.90 (0.1) | 585.79 (0.1)   | 0 (0.75)      | 2741.48 (0.5) | 328.50 (0.1)
RND [41]    | 21.52        | 2837.70       | 867.30         | 2188.80       | 765.47        | 966.00
CRL-RND     | 20.85 (0.9)  | 4076.60 (0.9) | 1002.40 (0.75) | 2453.30 (0.5) | 1021.60 (0.5) | 981.20 (0.9)
NovelD [42] | 21.39        | 3476.46       | 677.90         | 1744.80       | 975.52        | 283.60
CRL-NovelD  | 19.97 (0.9)  | 3520.06 (0.9) | 971.50 (0.5)   | 2323.40 (0.5) | 980.16 (0.5)  | 498.60 (0.9)
Tab.3
Method        | Freeway | Frostbite | Gravitar | Montezuma | Solaris | Venture
Hash [24]     | 22.29   | 2954.10   | 577.47   | 0         | 2619.32 | 299.61
CRL           | 30.80   | 4337.98   | 552.46   | 0         | 3672.55 | 312.40
HashRF [24]   | 27.28   | 5530.79   | 520.67   | 0         | 2470.54 | 72.30
CRLRF         | 28.60   | 4444.63   | 572.74   | 0         | 2891.14 | 190.18
HashBASS [24] | 32.18   | 2958.44   | 524.28   | 265.16    | 2372.05 | 401.08
CRLBASS       | 31.60   | 6173.75   | 602.60   | 379.68    | 3397.51 | 582.69
Tab.4
MAR of CRL (rows: β, columns: η)
β \ η | 0.1   | 0.25  | 0.5   | 0.75  | 0.9
0     | 17.55 | 17.55 | 17.55 | 17.55 | 17.55
0.01  | 25.60 | 27.67 | 24.21 | 25.09 | 24.38
0.1   | 30.10 | 30.72 | 28.43 | 30.80 | 29.66
1     | 25.19 | 28.52 | 28.75 | 23.79 | 24.35

MAR of CRLBASS (rows: β, columns: η)
β \ η | 0.1   | 0.25  | 0.5   | 0.75  | 0.9
0     | 17.55 | 17.55 | 17.55 | 17.55 | 17.55
0.01  | 23.52 | 24.42 | 27.15 | 24.92 | 23.54
0.1   | 30.07 | 31.35 | 29.89 | 31.60 | 23.28
1     | 23.53 | 29.15 | 22.85 | 22.82 | 24.22
Tab.5
MAR  | β=0.01, η=0.25 | β=0.01, η=0.75 | β=0.1, η=0.25 | β=0.1, η=0.75
K=16 | 1.0            | 1.0            | 1.0           | 1.0
K=24 | 1.0            | 1.0            | 1.0           | 1.0
K=32 | 1.0            | 1.0            | 1.0           | 1.0
K=40 | 1.0            | 1.0            | 1.0           | 1.0
Tab.6
1 R S, Sutton A G Barto . Reinforcement Learning: An Introduction. Cambridge: MIT Press, 1998
2 V, Mnih K, Kavukcuoglu D, Silver A A, Rusu J, Veness M G, Bellemare A, Graves M, Riedmiller A K, Fidjeland G, Ostrovski S, Petersen C, Beattie A, Sadik I, Antonoglou H, King D, Kumaran D, Wierstra S, Legg D Hassabis . Human-level control through deep reinforcement learning. Nature, 2015, 518( 7540): 529–533
3 D, Silver A, Huang C J, Maddison A, Guez L, Sifre G, Van Den Driessche J, Schrittwieser I, Antonoglou V, Panneershelvam M, Lanctot S, Dieleman D, Grewe J, Nham N, Kalchbrenner I, Sutskever T P, Lillicrap M, Leach K, Kavukcuoglu T, Graepel D Hassabis . Mastering the game of Go with deep neural networks and tree search. Nature, 2016, 529( 7587): 484–489
4 G, Lample D S Chaplot . Playing FPS games with deep reinforcement learning. In: Proceedings of the 31st AAAI Conference on Artificial Intelligence. 2017, 2140−2146
5 A P, Badia B, Piot S, Kapturowski P, Sprechmann A, Vitvitskyi D, Guo C Blundell . Agent57: Outperforming the Atari human benchmark. In: Proceedings of the 37th International Conference on Machine Learning. 2020, 48
6 X, Ma W J Li . State-based episodic memory for multi-agent reinforcement learning. Machine Learning, 2023, 112( 12): 5163–5190
7 B, Singh R, Kumar V P Singh . Reinforcement learning in robotic applications: a comprehensive survey. Artificial Intelligence Review, 2022, 55( 2): 945–990
8 Y, Wen J, Si A, Brandt X, Gao H H Huang . Online reinforcement learning control for the personalization of a robotic knee prosthesis. IEEE Transactions on Cybernetics, 2020, 50( 6): 2346–2356
9 T P, Lillicrap J J, Hunt A, Pritzel N, Heess T, Erez Y, Tassa D, Silver D Wierstra . Continuous control with deep reinforcement learning. In: Proceedings of the 4th International Conference on Learning Representations. 2016
10 Y, Duan X, Chen R, Houthooft J, Schulman P Abbeel . Benchmarking deep reinforcement learning for continuous control. In: Proceedings of the 33rd International Conference on Machine Learning. 2016, 1329−1338
11 H, Modares I, Ranatunga F L, Lewis D O Popa . Optimized assistive human-robot interaction using reinforcement learning. IEEE Transactions on Cybernetics, 2016, 46( 3): 655–667
12 S Amarjyoti . Deep reinforcement learning for robotic manipulation-the state of the art. 2017, arXiv preprint arXiv: 1701.08878
13 Y, Xu M, Fang L, Chen Y, Du J, Zhou C Zhang . Perceiving the world: Question-guided reinforcement learning for text-based games. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 2022, 538−560
14 D, Ghalandari C, Hokamp G Ifrim . Efficient unsupervised sentence compression by fine-tuning transformers with reinforcement learning. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 2022, 1267−1280
15 H, Li Y, Hu Y, Cao G, Zhou P Luo . Rich-text document styling restoration via reinforcement learning. Frontiers of Computer Science, 2021, 15( 4): 154328
16 K L A, Yau K H, Kwong C Shen . Reinforcement learning models for scheduling in wireless networks. Frontiers of Computer Science, 2013, 7( 5): 754–766
17 Y, Qin H, Wang S, Yi X, Li L Zhai . A multi-objective reinforcement learning algorithm for deadline constrained scientific workflow scheduling in clouds. Frontiers of Computer Science, 2021, 15( 5): 155105
18 Y C, Lin C T, Chen C Y, Sang S H Huang . Multiagent-based deep reinforcement learning for risk-shifting portfolio management. Applied Soft Computing, 2022, 123: 108894
19 Y, Zhang P, Zhao Q, Wu B, Li J, Huang M Tan . Cost-sensitive portfolio selection via deep reinforcement learning. IEEE Transactions on Knowledge and Data Engineering, 2022, 34( 1): 236–248
20 X, Li C, Cui D, Cao J, Du C Zhang . Hypergraph-based reinforcement learning for stock portfolio selection. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. 2022, 4028−4032
21 K, Xu Y, Zhang D, Ye P, Zhao M Tan . Relation-aware transformer for portfolio policy learning. In: Proceedings of the 29th International Joint Conference on Artificial Intelligence. 2020, 641
22 Z, Wang B, Huang S, Tu K, Zhang L Xu . DeepTrader: A deep reinforcement learning approach for risk-return balanced portfolio management with market conditions embedding. In: Proceedings of the 35th AAAI Conference on Artificial Intelligence. 2021, 643−650
23 L, Ouyang J, Wu X, Jiang D, Almeida C L, Wainwright P, Mishkin C, Zhang S, Agarwal K, Slama A, Ray J, Schulman J, Hilton F, Kelton L, Miller M, Simens A, Askell P, Welinder P F, Christiano J, Leike R Lowe . Training language models to follow instructions with human feedback. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. 2022
24 H R, Tang R, Houthooft D, Foote A, Stooke X, Chen Y, Duan J, Schulman F, De Turck P Abbeel . #Exploration: a study of count-based exploration for deep reinforcement learning. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017, 2753−2762
25 H, Qian Y Yu . Derivative-free reinforcement learning: a review. Frontiers of Computer Science, 2021, 15( 6): 156336
26 O, Chapelle L Li . An empirical evaluation of Thompson sampling. In: Proceedings of the 24th International Conference on Neural Information Processing Systems. 2011, 2249−2257
27 V, Mnih A P, Badia M, Mirza A, Graves T, Harley T P, Lillicrap D, Silver K Kavukcuoglu . Asynchronous methods for deep reinforcement learning. In: Proceedings of the 33rd International Conference on Machine Learning. 2016, 1928−1937
28 M, Fortunato M G, Azar B, Piot J, Menick M, Hessel I, Osband A, Graves V, Mnih R, Munos D, Hassabis O, Pietquin C, Blundell S Legg . Noisy networks for exploration. In: Proceedings of the 6th International Conference on Learning Representations. 2018
29 M, Plappert R, Houthooft P, Dhariwal S, Sidor R Y, Chen X, Chen T, Asfour P, Abbeel M Andrychowicz . Parameter space noise for exploration. In: Proceedings of the 6th International Conference on Learning Representations. 2018
30 I, Osband C, Blundell A, Pritzel B Van Roy . Deep exploration via bootstrapped DQN. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. 2016, 4033−4041
31 I, Osband B, Van Roy D J, Russo Z Wen . Deep exploration via randomized value functions. Journal of Machine Learning Research, 2019, 20( 124): 1–62
32 M, Kearns S Singh . Near-optimal reinforcement learning in polynomial time. Machine Learning, 2002, 49(2−3): 209−232
33 R I, Brafman M Tennenholtz . R-MAX - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 2003, 3: 213–231
34 M G, Bellemare S, Srinivasan G, Ostrovski T, Schaul D, Saxton R Munos . Unifying count-based exploration and intrinsic motivation. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. 2016, 1479−1487
35 G, Ostrovski M G, Bellemare A, Van Den Oord R Munos . Count-based exploration with neural density models. In: Proceedings of the 34th International Conference on Machine Learning. 2017, 2721−2730
36 R, Houthooft X, Chen Y, Duan J, Schulman F, De Turck P Abbeel . VIME: variational information maximizing exploration. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. 2016, 1117−1125
37 B C, Stadie S, Levine P Abbeel . Incentivizing exploration in reinforcement learning with deep predictive models. 2015, arXiv preprint arXiv: 1507.00814
38 D, Pathak P, Agrawal A A, Efros T Darrell . Curiosity-driven exploration by self-supervised prediction. In: Proceedings of the 34th International Conference on Machine Learning. 2017, 2778−2787
39 A S, Klyubin D, Polani C L Nehaniv . Empowerment: a universal agent-centric measure of control. In: Proceedings of the IEEE Congress on Evolutionary Computation. 2005, 128−135
40 J, Fu J D, Co-Reyes S Levine . EX2: exploration with exemplar models for deep reinforcement learning. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017, 2577−2587
41 Y, Burda H, Edwards A J, Storkey O Klimov . Exploration by random network distillation. In: Proceedings of the 7th International Conference on Learning Representations. 2019
42 T, Zhang H, Xu X, Wang Y, Wu K, Keutzer J E, Gonzalez Y Tian . NovelD: A simple yet effective exploration criterion. In: Proceedings of the 35th International Conference on Neural Information Processing Systems. 2021, 25217−25230
43 P, Auer R Ortner . Logarithmic online regret bounds for undiscounted reinforcement learning. In: Proceedings of the 19th International Conference on Neural Information Processing Systems. 2006, 49−56
44 I, Osband D, Russo B Van Roy . (More) efficient reinforcement learning via posterior sampling. In: Proceedings of the 26th International Conference on Neural Information Processing Systems. 2013, 3003−3011
45 A, Ecoffet J, Huizinga J, Lehman K O, Stanley J Clune . Go-explore: a new approach for hard-exploration problems. 2019, arXiv preprint arXiv: 1901.10995
46 A, Ecoffet J, Huizinga J, Lehman K O, Stanley J Clune . First return, then explore. Nature, 2021, 590( 7847): 580–586
47 M G, Bellemare Y, Naddaf J, Veness M Bowling . The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 2013, 47: 253–279
48 A L, Strehl M L Littman . An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 2008, 74( 8): 1309–1331
49 R Ortner . Adaptive aggregation for reinforcement learning in average reward Markov decision processes. Annals of Operations Research, 2013, 208( 1): 321–336
50 A G Barto . Intrinsic motivation and reinforcement learning. In: Baldassarre G, Mirolli M, eds. Intrinsically Motivated Learning in Natural and Artificial Systems. Berlin: Springer, 2013, 17−47
51 D E Berlyne . Structure and Direction in Thinking. Hoboken: Wiley, 1965
52 S, Mannor I, Menache A, Hoze U Klein . Dynamic abstraction in reinforcement learning via clustering. In: Proceedings of the 21st International Conference on Machine Learning. 2004
53 N, Tziortziotis K Blekas . A model based reinforcement learning approach using on-line clustering. In: Proceedings of the IEEE International Conference on Tools with Artificial Intelligence. 2012, 712−718
54 T, Wang T, Gupta A, Mahajan B, Peng S, Whiteson C J Zhang . RODE: learning roles to decompose multi-agent tasks. In: Proceedings of the 9th International Conference on Learning Representations. 2021
55 F, Christianos G, Papoudakis A, Rahman S V Albrecht . Scaling multi-agent reinforcement learning with selective parameter sharing. In: Proceedings of the 38th International Conference on Machine Learning. 2021, 1989−1998
56 T, Mandel Y E, Liu E, Brunskill Z Popovic . Efficient Bayesian clustering for reinforcement learning. In: Proceedings of the 25th International Joint Conference on Artificial Intelligence. 2016, 1830−1838
57 A, Coates A Y Ng . Learning feature representations with K-means. In: Montavon G, Orr G B, Müller K R, eds. Neural Networks: Tricks of the Trade. 2nd ed. Berlin: Springer, 2012, 561−580
58 J, Schulman S, Levine P, Moritz M, Jordan P Abbeel . Trust region policy optimization. In: Proceedings of the 32nd International Conference on Machine Learning. 2015, 1889−1897
59 Y, Burda H, Edwards D, Pathak A J, Storkey T, Darrell A A Efros . Large-scale study of curiosity-driven learning. In: Proceedings of the 7th International Conference on Learning Representations. 2019
60 K, Wang K, Zhou B, Kang J, Feng S Yan . Revisiting intrinsic reward for exploration in procedurally generated environments. In: Proceedings of the 11th International Conference on Learning Representations. 2023
61 M S Charikar . Similarity estimation techniques from rounding algorithms. In: Proceedings of the 34th Annual ACM Symposium on Theory of Computing. 2002, 380−388
62 C, Voloshin H M, Le N, Jiang Y Yue . Empirical study of off-policy policy evaluation for reinforcement learning. In: Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks. 2021
63 V, Nair G E Hinton . Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning. 2010, 807−814
64 A L, Maas A Y, Hannun A Y Ng . Rectifier nonlinearities improve neural network acoustic models. In: Proceedings of the 30th International Conference on Machine Learning. 2013
65 L, Van Der Maaten G Hinton . Visualizing data using t-SNE. Journal of Machine Learning Research, 2008, 9( 86): 2579–2605
Supplementary material: FCS-23194-OF-XM_suppl_1