A Monte Carlo Neural Fictitious Self-Play approach to approximate Nash Equilibrium in imperfect-information dynamic games
Li ZHANG, Yuxuan CHEN, Wei WANG, Ziliang HAN, Shijian Li, Zhijie PAN, Gang PAN
College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China

Abstract  Solving the optimization problem of approaching a Nash Equilibrium point plays an important role in imperfect-information games, e.g., StarCraft and poker. Neural Fictitious Self-Play (NFSP) is an effective algorithm that learns an approximate Nash Equilibrium of imperfect-information games purely from self-play, without prior domain knowledge. However, it needs to train a neural network in an off-policy manner to approximate the action values. For games with large search spaces, the training may suffer from unnecessary exploration and sometimes fails to converge. In this paper, we propose a new Neural Fictitious Self-Play algorithm, called MC-NFSP, that combines Monte Carlo tree search with NFSP to improve performance in real-time zero-sum imperfect-information games. Through experiments and empirical analysis, we demonstrate that the proposed MC-NFSP algorithm can approximate a Nash Equilibrium in games with large search depth, whereas NFSP cannot. Furthermore, we develop an Asynchronous Neural Fictitious Self-Play framework (ANFSP), which uses an asynchronous, parallel architecture to collect game experience and to improve both training efficiency and policy quality. Experiments on games with hidden state information (Texas Hold'em) and on FPS (first-person shooter) games demonstrate the effectiveness of our algorithms.
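To make the self-play scheme described above concrete, the Python sketch below illustrates the NFSP-style mixture that MC-NFSP builds on: with a small anticipatory probability a player acts with a search-based best response (here reduced to plain Monte Carlo rollouts on a toy matching-pennies game, as a stand-in for the full Monte Carlo tree search), and otherwise acts with an average policy fitted to its own past best-response actions. All names, the toy game, and the parameter values are illustrative assumptions, not the authors' implementation.

import random
from collections import defaultdict

ETA = 0.1          # anticipatory parameter: probability of playing the best response
ACTIONS = (0, 1)   # toy zero-sum game: matching pennies

def payoff_p0(a0, a1):
    # Payoff to player 0: +1 if the two actions match, -1 otherwise; player 1 receives the negation.
    return 1.0 if a0 == a1 else -1.0

def sample(policy):
    # Draw an action from a {action: probability} dictionary.
    return random.choices(ACTIONS, weights=[policy[a] for a in ACTIONS])[0]

def normalize(counts):
    total = sum(counts.values())
    return {a: counts[a] / total for a in ACTIONS}

def mc_best_response(opponent_policy, sign, n_rollouts=100):
    # Estimate a best response to the opponent's average policy from sampled rollouts;
    # in MC-NFSP proper this role is played by Monte Carlo tree search.
    value = defaultdict(float)
    for a in ACTIONS:
        for _ in range(n_rollouts):
            value[a] += sign * payoff_p0(a, sample(opponent_policy))
    return max(ACTIONS, key=lambda a: value[a])

# Average policies kept as per-player empirical action counts (the role the
# supervised network plays in NFSP), initialized uniformly.
avg_counts = [{a: 1.0 for a in ACTIONS}, {a: 1.0 for a in ACTIONS}]

for episode in range(2000):
    for p in (0, 1):
        sign = 1.0 if p == 0 else -1.0            # player 1 maximizes the negated payoff
        opponent = normalize(avg_counts[1 - p])
        if random.random() < ETA:
            a = mc_best_response(opponent, sign)   # best-response (search) mode
            avg_counts[p][a] += 1.0                # the average policy imitates it
        else:
            a = sample(normalize(avg_counts[p]))   # average-policy mode

print([normalize(c) for c in avg_counts])  # both averages should drift toward the (0.5, 0.5) equilibrium

In this toy setting the empirical average policies of both players approach the (0.5, 0.5) mixed equilibrium; in NFSP and MC-NFSP the tabular counts correspond to a supervised average-policy network trained from a reservoir buffer of best-response behaviour, and the rollouts are replaced by neural-network-guided tree search.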
Keywords: approximate Nash Equilibrium; imperfect-information games; dynamic games; Monte Carlo tree search; Neural Fictitious Self-Play; reinforcement learning
Corresponding Author(s): Shijian Li
Just Accepted Date: 24 July 2020
Issue Date: 30 August 2021