Frontiers of Computer Science

Front. Comput. Sci.    2024, Vol. 18 Issue (6) : 186331    https://doi.org/10.1007/s11704-023-2733-5
Artificial Intelligence
Communication-robust multi-agent learning by adaptable auxiliary multi-agent adversary generation
Lei YUAN 1,2, Feng CHEN 1, Zongzhang ZHANG 1, Yang YU 1,2
1. National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
2. Polixir Technologies, Nanjing 211106, China
Abstract

Communication can promote coordination in cooperative Multi-Agent Reinforcement Learning (MARL). Existing works, however, mainly focus on improving the communication efficiency of agents, neglecting that real-world communication is much more challenging, as noise or potential attackers may be present. The robustness of communication-based policies is therefore an urgent and severe issue that needs more exploration. In this paper, we posit that the ego system (here, the ego system means the multi-agent communication system itself; we use the word ego to distinguish it from the generated adversaries) trained with auxiliary adversaries may overcome this limitation, and we propose an adaptable method of Multi-Agent Auxiliary Adversaries Generation for robust Communication, dubbed MA3C, to obtain a robust communication-based policy. Specifically, we introduce a novel message-attacking approach that models the learning of the auxiliary attacker as a cooperative problem under a shared goal of minimizing the coordination ability of the ego system, under which every information channel may suffer from distinct message attacks. Furthermore, as naive adversarial training may impede the generalization ability of the ego system, we design an attacker-population generation approach based on evolutionary learning. Finally, the ego system is paired with an attacker population and alternately trained against the continuously evolving attackers to improve its robustness, meaning that both the ego system and the attackers are adaptable. Extensive experiments on multiple benchmarks indicate that our proposed MA3C provides comparable or better robustness and generalization ability than other baselines.
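The alternating scheme described above can be summarized in a short Python sketch. This is only an illustration under our own naming (EgoSystem, Attacker, alternate_training and their update rules are hypothetical stand-ins), not the authors' released implementation:

```python
# Minimal sketch of the alternating training described above: an ego
# communication policy is repeatedly paired with attackers drawn from an
# evolving population. All classes and update rules are illustrative
# placeholders.
import random


class Attacker:
    """Stand-in for one auxiliary message attacker (a cooperative adversary)."""

    def __init__(self, strength: float):
        self.strength = strength

    def evolve(self) -> "Attacker":
        # Placeholder for one evolutionary update toward lowering the ego return.
        return Attacker(self.strength + random.uniform(-0.1, 0.1))


class EgoSystem:
    """Stand-in for the communication-based multi-agent policy."""

    def update_against(self, attacker: Attacker) -> float:
        # Placeholder: train while `attacker` perturbs messages; return team reward.
        return random.random() - attacker.strength


def alternate_training(ego: EgoSystem, population: list, loops: int = 15,
                       evolutions_per_loop: int = 10) -> None:
    for _ in range(loops):
        # 1) Let the attacker population evolve against the current ego system.
        for _ in range(evolutions_per_loop):
            population = [attacker.evolve() for attacker in population]
        # 2) Train the ego system against attackers sampled from the population.
        for attacker in random.sample(population, k=min(4, len(population))):
            ego.update_against(attacker)


if __name__ == "__main__":
    alternate_training(EgoSystem(), [Attacker(0.5) for _ in range(20)])
```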

Keywords: multi-agent communication; adversarial training; robustness validation; reinforcement learning
Corresponding Author(s): Yang YU   
Just Accepted Date: 19 June 2023   Issue Date: 18 September 2023
 Cite this article:   
Lei YUAN, Feng CHEN, Zongzhang ZHANG, et al. Communication-robust multi-agent learning by adaptable auxiliary multi-agent adversary generation[J]. Front. Comput. Sci., 2024, 18(6): 186331.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-023-2733-5
https://academic.hep.com.cn/fcs/EN/Y2024/V18/I6/186331
Fig.1  The overall relationship between the attacker and the ego system. The black solid arrows indicate the direction of data flow, the red solid ones indicate the direction of gradient flow, and the red dotted ones denote the attack actions applied by the attacker to specific communication channels
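To make "attack actions applied to specific communication channels" concrete, the following NumPy sketch shows a bounded perturbation added only to attacked channels; the function name, the channel mask, and the clipping rule are our own assumptions for illustration, not the paper's exact attack model:

```python
# Toy sketch of a message attack on selected communication channels: a
# perturbation clipped to a fixed magnitude is added only to the attacked
# channels (cf. the magnitudes listed in Table A2).
import numpy as np


def attack_messages(messages: np.ndarray, channel_mask: np.ndarray,
                    perturbation: np.ndarray, magnitude: float) -> np.ndarray:
    """Perturb only the masked channels, keeping the attack within `magnitude`."""
    bounded = np.clip(perturbation, -magnitude, magnitude)
    return messages + channel_mask[:, None] * bounded


if __name__ == "__main__":
    msgs = np.zeros((3, 4))                      # 3 channels, 4-dim messages
    mask = np.array([1.0, 0.0, 1.0])             # attack channels 0 and 2 only
    noise = np.random.uniform(-5.0, 5.0, size=msgs.shape)
    print(attack_messages(msgs, mask, noise, magnitude=2.0))
```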
  
Fig.2  The overall framework for attacker population optimization. (a) We use the representation of the attacked ego system's trajectories to identify different attacker instances; specifically, an encoder-decoder architecture learns the trajectory representation. The black solid arrows indicate the direction of data flow and the red solid ones indicate the direction of gradient flow. (b) A simple visualization of one population update. The positions of the points reflect the distances between representations, and the color shades indicate attack ability, i.e., darker points correspond to stronger attackers. For example, new Attacker 3 is accepted because it is sufficiently distant from the other attackers, and the oldest member, Attacker 1, is removed; new Attacker 2 is accepted and the closest member, old Attacker 2, is removed because it is weaker
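The acceptance rule visualized in Fig. 2(b) can be written as a short routine. The sketch below assumes each attacker is summarized by a representation vector and a scalar attack score; the Euclidean distance and the dictionary layout are illustrative choices, not the paper's exact procedure:

```python
# Toy version of the population-update rule in Fig. 2(b): a newly evolved
# attacker is accepted if its trajectory representation is far enough from all
# current members (the oldest member is dropped when the population is full);
# otherwise it replaces its closest neighbour only when it attacks more strongly.
import numpy as np


def update_population(population, candidate, distance_threshold=0.25, max_size=20):
    """Members are dicts with a 'repr' vector and a scalar 'attack_score'."""
    dists = [float(np.linalg.norm(candidate["repr"] - m["repr"])) for m in population]
    if not dists or min(dists) > distance_threshold:
        population.append(candidate)              # novel enough: accept it
        if len(population) > max_size:
            population.pop(0)                     # remove the oldest attacker
    else:
        closest = int(np.argmin(dists))
        if candidate["attack_score"] > population[closest]["attack_score"]:
            population[closest] = candidate       # replace the weaker neighbour
    return population
```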
  
Fig.3  Experimental environments used in this paper. (a) Hallway; (b) SMAC; (c) Gold Panner (GP); (d) Traffic Junction (TJ)
| Attack mode | Method | Hallway-6x6 | Hallway-4x5x9 | SMAC-1o_2r_vs_4r | SMAC-1o_10b_vs_1r | GP-4r | GP-9r |
| Normal | MA3C | 0.94±0.05 | 0.97±0.05 | 0.86±0.02 | 0.62±0.01 | 0.87±0.02 | 0.82±0.01 |
| | Vanilla | 1.00±0.00 | 1.00±0.00 | 0.81±0.06 | 0.63±0.04 | 0.88±0.03 | 0.82±0.02 |
| | Noise Adv. | 1.00±0.00 | 0.99±0.01 | 0.88±0.04 | 0.60±0.05 | 0.88±0.03 | 0.85±0.02 |
| | MA3C w/o div. | 0.98±0.02 | 0.66±0.46 | 0.86±0.02 | 0.62±0.03 | 0.86±0.09 | 0.81±0.03 |
| | Instance Adv. | 0.52±0.48 | 0.67±0.47 | 0.84±0.02 | 0.57±0.04 | 0.86±0.03 | 0.82±0.03 |
| | AME | 1.00±0.00 | 0.98±0.02 | 0.81±0.05 | 0.60±0.01 | 0.23±0.37 | 0.00±0.00 |
| Random noise | MA3C | 0.91±0.07 | 0.79±0.18 | 0.87±0.01 | 0.67±0.03 | 0.88±0.01 | 0.80±0.07 |
| | Vanilla | 0.58±0.03 | 0.53±0.06 | 0.73±0.07 | 0.60±0.02 | 0.86±0.03 | 0.79±0.02 |
| | Noise Adv. | 0.97±0.02 | 1.00±0.00 | 0.82±0.02 | 0.56±0.02 | 0.88±0.01 | 0.82±0.01 |
| | MA3C w/o div. | 0.68±0.07 | 0.68±0.29 | 0.73±0.07 | 0.53±0.01 | 0.82±0.06 | 0.80±0.07 |
| | Instance Adv. | 0.56±0.34 | 0.67±0.47 | 0.79±0.07 | 0.60±0.08 | 0.90±0.03 | 0.81±0.02 |
| | AME | 0.61±0.06 | 0.79±0.03 | 0.71±0.13 | 0.59±0.08 | 0.22±0.37 | 0.00±0.00 |
| Aggressive attackers | MA3C | 0.91±0.22 | 0.98±0.01 | 0.67±0.03 | 0.62±0.03 | 0.81±0.02 | 0.76±0.03 |
| | Vanilla | 0.09±0.19 | 0.00±0.00 | 0.26±0.12 | 0.57±0.03 | 0.38±0.02 | 0.30±0.05 |
| | Noise Adv. | 0.61±0.37 | 0.13±0.14 | 0.51±0.02 | 0.54±0.03 | 0.41±0.13 | 0.48±0.11 |
| | MA3C w/o div. | 0.57±0.39 | 0.96±0.03 | 0.54±0.05 | 0.61±0.02 | 0.68±0.06 | 0.71±0.01 |
| | Instance Adv. | 0.63±0.42 | 0.88±0.14 | 0.28±0.01 | 0.61±0.04 | 0.81±0.02 | 0.76±0.03 |
| | AME | 0.13±0.03 | 0.00±0.00 | 0.39±0.05 | 0.59±0.07 | 0.10±0.16 | 0.00±0.00 |
Tab.1  Performance comparison under different attack modes
Fig.4  (a) The curve traces the average attack performance of the attacker population as the evolution rounds increase; (b) population visualization. Specifically, each scatter point corresponds to an attacker instance, and the color depth represents the training stage of the attackers, i.e., the lighter the color, the earlier the attacker model. The horizontal coordinate indicates the identification feature after dimension reduction, and the vertical coordinate indicates the attack performance of the attacker model
Fig.5  Robustness comparison when employing NDQ on SMAC-1o_2r_vs_4r and TarMAC on TJ, respectively. (a) NDQ: SMAC-1o_2r_vs_4r; (b) TarMAC: TJ
Fig.6  Comparison of the attack ability of different methods. (a) SMAC-1o_2r_vs_4r; (b) GP-4r
Fig.7  Generalization test to different perturbation ranges. (a) SMAC-1o_2r_vs_4r; (b) GP-4r
Fig.8  Transfer to a larger perturbation range. (a) Hallway-4x5x9; (b) GP-4r
Fig.9  Test results of parameter sensitivity studies. (a) Studies of population size; (b) studies of reproduction ratio; (c) studies of distance threshold
Fig.10  The architecture of the ordinary trajectory encoder. We feed the state information at each time step into a shared MLP network, and then perform mean pooling on the outputs to obtain the embedding vector of the trajectory.
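A minimal PyTorch sketch of this encoder is given below; the module name and layer sizes are illustrative assumptions, assuming per-step states of a fixed dimension:

```python
# Minimal sketch of the ordinary trajectory encoder in Fig. 10: each per-step
# state passes through a shared MLP and the step embeddings are mean-pooled
# into one trajectory vector.
import torch
import torch.nn as nn


class OrdinaryTrajectoryEncoder(nn.Module):
    def __init__(self, state_dim: int, hidden_dim: int = 64, embed_dim: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(                # shared across all time steps
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        # states: (batch, time, state_dim) -> trajectory embedding (batch, embed_dim)
        return self.mlp(states).mean(dim=1)      # mean pooling over time


if __name__ == "__main__":
    encoder = OrdinaryTrajectoryEncoder(state_dim=10)
    print(encoder(torch.randn(4, 25, 10)).shape)  # torch.Size([4, 32])
```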
| Environment | Attack mode | MA3C | MA3C w/ ordinary encoder | MA3C w/o div. |
| GP-4r | Normal | 0.87±0.02 | 0.90±0.02 | 0.86±0.09 |
| | Random Noise | 0.88±0.01 | 0.90±0.03 | 0.82±0.06 |
| | Aggressive Attackers | 0.81±0.02 | 0.73±0.04 | 0.68±0.06 |
| GP-9r | Normal | 0.82±0.01 | 0.82±0.03 | 0.81±0.03 |
| | Random Noise | 0.80±0.07 | 0.80±0.05 | 0.80±0.07 |
| | Aggressive Attackers | 0.76±0.03 | 0.70±0.06 | 0.71±0.01 |
Tab.2  Test results for the trajectory encoder studies
| Hyper-parameter | Description | Value |
| Population Size | The number of attackers one population contains | 20 |
| Reproduction Ratio | The proportion of attackers updated in the population during each evolution | 0.25 |
| Distance Threshold | The threshold to determine whether a new attacker is novel enough | 0.25 |
| Critic Update Times | The number of times the critic is updated for each actor update in MATD3 | 5 |
| Num of Sampled Trajectories | The number of sampled trajectories used to encode attacker identification | 10 |
| Alternate Update Times | The number of iterations of alternate updates between the ego system and the attacker population | 15 (other experiments); 30 (Hallway-4x5x9); 20 (TarMAC: TJ); 6 (GP-4r, GP-9r) |
| Num of Samples for Ego System in One Loop | The number of samples used to update the ego system in each iteration | 205000 / 505000 |
| Evolution Times in One Loop | The number of times the population conducts evolution in each iteration | 10 |
| Num of Samples for Attacker in One Evolution | The number of samples used to update the attacker in each evolution | 10000 |
Table A1  Selected hyper-parameters in our experiments
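For convenience, the hyper-parameters in Table A1 can be collected into a single configuration object; the key names below are our own, not taken from any released code:

```python
# Hyper-parameters from Table A1 gathered into one Python config dict.
MA3C_CONFIG = {
    "population_size": 20,            # attackers in one population
    "reproduction_ratio": 0.25,       # fraction of attackers updated per evolution
    "distance_threshold": 0.25,       # novelty threshold for accepting a new attacker
    "critic_update_times": 5,         # critic updates per actor update in MATD3
    "num_sampled_trajectories": 10,   # trajectories used to encode attacker identity
    "evolution_times_per_loop": 10,   # population evolutions in each iteration
    "attacker_samples_per_evolution": 10_000,
    # Alternate update times differ per benchmark (see Table A1):
    "alternate_update_times": {
        "other_experiments": 15,
        "Hallway-4x5x9": 30,
        "TarMAC:TJ": 20,
        "GP-4r_GP-9r": 6,
    },
    # Samples used to update the ego system per loop are 205000 or 505000,
    # depending on the benchmark (see Table A1).
}
```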
| Comm. Alg. | Task | Magnitude |
| Full-Comm | Hallway-6x6 | 1.5 |
| Full-Comm | Hallway-4x5x9 | 1.0 |
| Full-Comm | SMAC-1o_2r_vs_4r | 10 |
| Full-Comm | SMAC-1o_10b_vs_1r | 25 |
| Full-Comm | GP-4r | 2 |
| Full-Comm | GP-9r | 2 |
| NDQ | SMAC-1o_2r_vs_4r | 6 |
| TarMAC | TJ | 16 |
Table A2  Adopted attack magnitude for each experiment. Comm. Alg. is short for communication algorithm
| Task | Attack mode | MA3C | AME |
| SMAC-1o_2r_vs_4r | Normal | 0.86±0.02 | 0.81±0.05 |
| | Random Noise | 0.84±0.02 | 0.76±0.07 |
| | Aggressive Attackers | 0.81±0.01 | 0.60±0.06 |
| GP-4r | Normal | 0.87±0.02 | 0.23±0.37 |
| | Random Noise | 0.87±0.02 | 0.24±0.40 |
| | Aggressive Attackers | 0.86±0.01 | 0.17±0.29 |
Table A3  Additional test results for the AME baseline
| Environment | Algorithm | Time | Memory /GB | Graphics memory /MB |
| GP-4r | MA3C | 5d 2h 56m | 2.56 | 1178 |
| | MA3C w/ ordinary encoder | 5d 2h 13m | 2.46 | 1178 |
| | Instance Adv. | 1d 5h 36m | 2.41 | 1163 |
| | Full-Comm | 9h 9m | 2.28 | 1148 |
| GP-9r | MA3C | 4d 13h 16m | 2.51 | 1182 |
| | MA3C w/ ordinary encoder | 4d 9h 18m | 2.43 | 1182 |
| | Instance Adv. | 1d 5h 18m | 2.42 | 1168 |
| | Full-Comm | 8h 31m | 2.28 | 1148 |
Table A4  Training efficiency comparison of algorithms