A Monte Carlo Neural Fictitious Self-Play approach to approximate Nash Equilibrium in imperfect-information dynamic games
Li ZHANG, Yuxuan CHEN, Wei WANG, Ziliang HAN, Shijian Li, Zhijie PAN, Gang PAN
College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China

Abstract  Solving the optimization problem of approaching a Nash Equilibrium point plays an important role in imperfect-information games, e.g., StarCraft and poker. Neural Fictitious Self-Play (NFSP) is an effective algorithm that learns an approximate Nash Equilibrium of imperfect-information games purely from self-play, without prior domain knowledge. However, it needs to train a neural network in an off-policy manner to approximate the action values. For games with large search spaces, the training may suffer from unnecessary exploration and sometimes fails to converge. In this paper, we propose a new Neural Fictitious Self-Play algorithm, called MC-NFSP, that combines Monte Carlo tree search with NFSP to improve performance in real-time zero-sum imperfect-information games. Through experiments and empirical analysis, we demonstrate that the proposed MC-NFSP algorithm can approximate a Nash Equilibrium in games with large search depth, whereas NFSP cannot. Furthermore, we develop an Asynchronous Neural Fictitious Self-Play framework (ANFSP), which uses an asynchronous, parallel architecture to collect game experience and to improve both training efficiency and policy quality. Experiments on games with hidden state information (Texas Hold'em) and on FPS (first-person shooter) games demonstrate the effectiveness of our algorithms.
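To make the self-play scheme described above concrete, the Python sketch below illustrates the NFSP-style mixture that MC-NFSP builds on: with a small anticipatory probability a player acts with a search-based best response (here reduced to plain Monte Carlo rollouts on a toy matching-pennies game, as a stand-in for the full Monte Carlo tree search), and otherwise acts with an average policy fitted to its own past best-response actions. All names, the toy game, and the parameter values are illustrative assumptions, not the authors' implementation.

import random
from collections import defaultdict

ETA = 0.1          # anticipatory parameter: probability of playing the best response
ACTIONS = (0, 1)   # toy zero-sum game: matching pennies

def payoff_p0(a0, a1):
    # Payoff to player 0: +1 if the two actions match, -1 otherwise; player 1 receives the negation.
    return 1.0 if a0 == a1 else -1.0

def sample(policy):
    # Draw an action from a {action: probability} dictionary.
    return random.choices(ACTIONS, weights=[policy[a] for a in ACTIONS])[0]

def normalize(counts):
    total = sum(counts.values())
    return {a: counts[a] / total for a in ACTIONS}

def mc_best_response(opponent_policy, sign, n_rollouts=100):
    # Estimate a best response to the opponent's average policy from sampled rollouts;
    # in MC-NFSP proper this role is played by Monte Carlo tree search.
    value = defaultdict(float)
    for a in ACTIONS:
        for _ in range(n_rollouts):
            value[a] += sign * payoff_p0(a, sample(opponent_policy))
    return max(ACTIONS, key=lambda a: value[a])

# Average policies kept as per-player empirical action counts (the role the
# supervised network plays in NFSP), initialized uniformly.
avg_counts = [{a: 1.0 for a in ACTIONS}, {a: 1.0 for a in ACTIONS}]

for episode in range(2000):
    for p in (0, 1):
        sign = 1.0 if p == 0 else -1.0            # player 1 maximizes the negated payoff
        opponent = normalize(avg_counts[1 - p])
        if random.random() < ETA:
            a = mc_best_response(opponent, sign)   # best-response (search) mode
            avg_counts[p][a] += 1.0                # the average policy imitates it
        else:
            a = sample(normalize(avg_counts[p]))   # average-policy mode

print([normalize(c) for c in avg_counts])  # both averages should drift toward the (0.5, 0.5) equilibrium

In this toy setting the empirical average policies of both players approach the (0.5, 0.5) mixed equilibrium; in NFSP and MC-NFSP the tabular counts correspond to a supervised average-policy network trained from a reservoir buffer of best-response behaviour, and the rollouts are replaced by neural-network-guided tree search.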
Keywords: approximate Nash Equilibrium; imperfect-information games; dynamic games; Monte Carlo tree search; Neural Fictitious Self-Play; reinforcement learning
Corresponding Author(s): Shijian Li
Just Accepted Date: 24 July 2020
Issue Date: 30 August 2021