Open and real-world human-AI coordination by heterogeneous training with communication
Cong GUAN1,2, Ke XUE1,2, Chunpeng FAN3, Feng CHEN1,2, Lichao ZHANG3, Lei YUAN1,2,3, Chao QIAN1,3, Yang YU1,2,3
1. National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
2. School of Artificial Intelligence, Nanjing University, Nanjing 210023, China
3. Polixir Technologies, Nanjing 211106, China
Abstract Human-AI coordination aims to develop AI agents capable of coordinating effectively with human partners, making it a crucial aspect of cooperative multi-agent reinforcement learning (MARL). Achieving satisfactory performance from AI agents remains a long-standing challenge. Recently, ad hoc teamwork and zero-shot coordination have shown promising advances in open-world settings, where agents must coordinate efficiently with a range of unseen human partners. However, these methods usually rest on the overly idealistic assumption that the agent and its partner are homogeneous, which deviates from real-world conditions. To facilitate the practical deployment of human-AI coordination in open and real-world environments, we propose the first benchmark for open and real-world human-AI coordination (ORC), called ORCBench. ORCBench comprises widely used human-AI coordination environments. Notably, to reflect real-world scenarios, ORCBench considers heterogeneity between AI agents and partners, encompassing variations in both capabilities and observations, which aligns more closely with real-world applications. Furthermore, we introduce a framework, Heterogeneous training with Communication (HeteC), for ORC. HeteC builds upon a heterogeneous training framework and enhances partner population diversity through mixed partner training and frozen historical partners. Additionally, HeteC incorporates a communication module that enables human partners to communicate with AI agents, mitigating the adverse effects of partial observability. Through a series of experiments, we demonstrate the effectiveness of HeteC in improving coordination performance. Our contribution serves as an initial but important step towards addressing the challenges of ORC.
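To make the training scheme concrete, the following is a minimal, hypothetical Python sketch of the partner-population loop outlined above: a pool of co-trained partners is mixed with frozen historical snapshots so the agent keeps facing diverse behaviors. All names here (Partner, train_one_iteration, hetec_population_loop, snapshot_every) are illustrative assumptions, not the paper's actual API.

```python
# A minimal sketch of mixed partner training with frozen historical
# partners, under assumed names; not the authors' implementation.
import copy
import random


class Partner:
    """Placeholder for a partner policy; frozen partners are never updated."""

    def __init__(self, policy=None, frozen=False):
        self.policy = policy if policy is not None else {}
        self.frozen = frozen

    def clone_frozen(self):
        # Deep-copy the current policy so later updates cannot alter it.
        return Partner(copy.deepcopy(self.policy), frozen=True)


def train_one_iteration(learner, partner):
    """Stand-in for one MARL update of `learner` paired with `partner`,
    e.g., collecting joint rollouts and applying a policy-gradient step."""
    pass


def hetec_population_loop(agent, population, iterations, snapshot_every=10):
    history = []  # frozen historical partners
    for t in range(iterations):
        # Mixed partner training: sample from live partners and frozen history.
        partner = random.choice(population + history)
        train_one_iteration(agent, partner)
        if not partner.frozen:
            # Co-train only the live partners; snapshots stay fixed.
            train_one_iteration(partner, agent)
        # Periodically freeze a snapshot so earlier behaviors stay in the pool.
        if t % snapshot_every == 0:
            history.append(random.choice(population).clone_frozen())
    return agent, population, history
```

Under this sketch, robustness comes from the agent repeatedly training against both current and past partner behaviors; the communication module described above would sit inside train_one_iteration, letting the partner's messages augment the agent's partial observations.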
Keywords
human-AI coordination
multi-agent reinforcement learning
communication
open-environment coordination
real-world coordination
Corresponding Author(s): Yang YU
Just Accepted Date: 22 March 2024
Issue Date: 05 June 2024