{"title": "Convergence of Monte Carlo Tree Search in Simultaneous Move Games", "book": "Advances in Neural Information Processing Systems", "page_first": 2112, "page_last": 2120, "abstract": "In this paper, we study Monte Carlo tree search (MCTS) in zero-sum extensive-form games with perfect information and simultaneous moves. We present a general template of MCTS algorithms for these games, which can be instantiated by various selection methods. We formally prove that if a selection method is $\\epsilon$-Hannan consistent in a matrix game and satisfies additional requirements on exploration, then the MCTS algorithm eventually converges to an approximate Nash equilibrium (NE) of the extensive-form game. We empirically evaluate this claim using regret matching and Exp3 as the selection methods on randomly generated and worst case games. We confirm the formal result and show that additional MCTS variants also converge to approximate NE on the evaluated games.", "full_text": "Convergence of Monte Carlo Tree Search in Simultaneous Move Games\n\nViliam Lis\u00fd1, Vojt\u011bch Kova\u0159\u00edk1, Marc Lanctot2, Branislav Bo\u0161ansk\u00fd1\n\n1Agent Technology Center, Dept. of Computer Science and Engineering, FEE, Czech Technical University in Prague\n<name>.<surname>@agents.fel.cvut.cz\n\n2Department of Knowledge Engineering, Maastricht University, The Netherlands\nmarc.lanctot@maastrichtuniversity.nl\n\nAbstract\n\nWe study Monte Carlo tree search (MCTS) in zero-sum extensive-form games with perfect information and simultaneous moves. We present a general template of MCTS algorithms for these games, which can be instantiated by various selection methods. We formally prove that if a selection method is \u03b5-Hannan consistent in a matrix game and satisfies additional requirements on exploration, then the MCTS algorithm eventually converges to an approximate Nash equilibrium (NE) of the extensive-form game. 
We empirically evaluate this claim using regret matching and Exp3 as the selection methods on randomly generated games and empirically selected worst case games. We confirm the formal result and show that additional MCTS variants also converge to approximate NE on the evaluated games.\n\n1 Introduction\n\nNon-cooperative game theory is a formal mathematical framework for describing the behavior of interacting self-interested agents. Recent interest has brought significant advancements from the algorithmic perspective and new algorithms have led to many successful applications of game-theoretic models in security domains [1] and to near-optimal play of very large games [2]. We focus on an important class of two-player, zero-sum extensive-form games (EFGs) with perfect information and simultaneous moves. Games in this class capture sequential interactions that can be visualized as a game tree. The nodes correspond to the states of the game, in which both players act simultaneously. We can represent these situations using the normal form (i.e., as matrix games), where the values are computed from the successor sub-games. Many well-known games are instances of this class, including card games such as Goofspiel [3, 4], variants of pursuit-evasion games [5], and several games from the general game-playing competition [6].\n\nSimultaneous-move games can be solved exactly in polynomial time using the backward induction algorithm [7, 4], recently improved with alpha-beta pruning [8, 9]. However, the depth-limited search algorithms based on backward induction require domain knowledge (an evaluation function) and computing the cutoff conditions requires linear programming [8] or using a double-oracle method [9], both of which are computationally expensive. 
For practical applications and in situations with limited domain knowledge, variants of simulation-based algorithms such as Monte Carlo Tree Search (MCTS) are typically used [10, 11, 12, 13]. Despite the practical success of MCTS, and in particular of its variant UCT [14], there is a lack of theory analyzing MCTS outside two-player perfect-information sequential games. To the best of our knowledge, no convergence guarantees are known for MCTS in games with simultaneous moves or general EFGs.\n\nFigure 1: A game tree of a game with perfect information and simultaneous moves. Only the leaves contain the actual rewards; the remaining numbers are the expected reward for the optimal strategy.\n\nIn this paper, we present a general template of MCTS algorithms for zero-sum perfect-information simultaneous move games. It can be instantiated using any regret minimizing procedure for matrix games as a function for selecting the next actions to be sampled. We formally prove that if the algorithm uses an \u03b5-Hannan consistent selection function, which assures attempting each action infinitely many times, the MCTS algorithm eventually converges to a subgame perfect \u03b5-Nash equilibrium of the extensive form game. We empirically evaluate this claim using two different \u03b5-Hannan consistent procedures: regret matching [15] and Exp3 [16]. In the experiments on randomly generated and worst case games, we show that the empirical speed of convergence of the algorithms based on our template is comparable to recently proposed MCTS algorithms for these games. 
We conjecture that many of these algorithms also converge to \u03b5-Nash equilibrium and that our formal analysis could be extended to include them.\n\n2 Definitions and background\n\nA finite zero-sum game with perfect information and simultaneous moves can be described by a tuple (N, H, Z, A, T, u1, h0), where N = {1, 2} contains player labels, H is a set of inner states and Z denotes the terminal states. A = A1 \u00d7 A2 is the set of joint actions of individual players and we denote A1(h) = {1 . . . mh} and A2(h) = {1 . . . nh} the actions available to individual players in state h \u2208 H. The transition function T : H \u00d7 A1 \u00d7 A2 \u21a6 H \u222a Z defines the successor state given a current state and actions for both players. For brevity, we sometimes denote T(h, i, j) \u2261 hij. The utility function u1 : Z \u21a6 [vmin, vmax] \u2286 R gives the utility of player 1, with vmin and vmax denoting the minimum and maximum possible utility respectively. Without loss of generality we assume vmin = 0, vmax = 1, and \u2200z \u2208 Z, u2(z) = 1 \u2212 u1(z). The game starts in an initial state h0.\n\nA matrix game is a single-stage simultaneous move game with action sets A1 and A2. Each entry in the matrix M = (aij) where (i, j) \u2208 A1 \u00d7 A2 and aij \u2208 [0, 1] corresponds to a payoff (to player 1) if row i is chosen by player 1 and column j by player 2. A strategy \u03c3q \u2208 \u2206(Aq) is a distribution over the actions in Aq. If \u03c31 is represented as a row vector and \u03c32 as a column vector, then the expected value to player 1 when both players play with these strategies is u1(\u03c31, \u03c32) = \u03c31M\u03c32. Given a profile \u03c3 = (\u03c31, \u03c32), define the utilities against best response strategies to be $u_1(br, \\sigma_2) = \\max_{\\sigma_1' \\in \\Delta(A_1)} \\sigma_1' M \\sigma_2$ and $u_1(\\sigma_1, br) = \\min_{\\sigma_2' \\in \\Delta(A_2)} \\sigma_1 M \\sigma_2'$. 
A strategy profile (\u03c31, \u03c32) is an \u03b5-Nash equilibrium of the matrix game M if and only if\n\n$$u_1(br, \\sigma_2) - u_1(\\sigma_1, \\sigma_2) \\le \\epsilon \\quad \\text{and} \\quad u_1(\\sigma_1, \\sigma_2) - u_1(\\sigma_1, br) \\le \\epsilon \\qquad (1)$$\n\nTwo-player perfect information games with simultaneous moves are sometimes appropriately called stacked matrix games because at every state h each joint action from set A1(h) \u00d7 A2(h) either leads to a terminal state or to a subgame which is itself another stacked matrix game (see Figure 1).\n\nA behavioral strategy for player q is a mapping from states h \u2208 H to a probability distribution over the actions Aq(h), denoted \u03c3q(h). Given a profile \u03c3 = (\u03c31, \u03c32), define the probability of reaching a terminal state z under \u03c3 as $\\pi^\\sigma(z) = \\pi_1(z) \\pi_2(z)$, where each $\\pi_q(z)$ is a product of probabilities of the actions taken by player q along the path to z. Define \u03a3q to be the set of behavioral strategies for player q. Then for any strategy profile \u03c3 = (\u03c31, \u03c32) \u2208 \u03a31 \u00d7 \u03a32 we define the expected utility of the strategy profile (for player 1) as\n\n$$u(\\sigma) = u(\\sigma_1, \\sigma_2) = \\sum_{z \\in Z} \\pi^\\sigma(z) u_1(z) \\qquad (2)$$\n\nAn \u03b5-Nash equilibrium profile (\u03c31, \u03c32) in this case is defined analogously to (1). In other words, none of the players can improve their utility by more than \u03b5 by deviating unilaterally. If the strategies are an \u03b5-NE in each subgame starting in an arbitrary game state, the equilibrium strategy is termed subgame perfect. If \u03c3 = (\u03c31, \u03c32) is an exact Nash equilibrium (i.e., \u03b5-NE with \u03b5 = 0), then we denote the unique value of the game vh0 = u(\u03c31, \u03c32). 
For any h \u2208 H, we denote vh the value of the subgame rooted in state h.\n\n3 Simultaneous move Monte-Carlo Tree Search\n\nMonte Carlo Tree Search (MCTS) is a simulation-based state space search algorithm often used in game trees. The nodes in the tree represent game states. The main idea is to iteratively run simulations to a terminal state, incrementally growing a tree rooted at the initial state of the game. In its simplest form, the tree is initially empty and a single leaf is added each iteration. Each simulation starts by visiting nodes in the tree, selecting which actions to take based on a selection function and information maintained in the node. Consequently, it transitions to the successor states. When a node is visited whose immediate children are not all in the tree, the node is expanded by adding a new leaf to the tree. Then, a rollout policy (e.g., random action selection) is applied from the new leaf to a terminal state. The outcome of the simulation is then returned as a reward to the new leaf and the information stored in the tree is updated.\n\nIn Simultaneous Move MCTS (SM-MCTS), the main difference is that a joint action of both players is selected. The algorithm has been previously applied, for example in the game of Tron [12], Urban Rivals [11], and in general game-playing [10]. However, guarantees of convergence to NE remain unknown. The convergence to a NE depends critically on the selection and update policies applied, which are even more non-trivial than in purely sequential games. The most popular selection policy in this context (UCB) performs very well in some games [12], but Shafiei et al. [17] show that it does not converge to Nash equilibrium, even in a simple one-stage simultaneous move game. In this paper, we focus on variants of MCTS, which provably converge to (approximate) NE; hence we do not discuss UCB any further. 
Instead, we describe variants of two other selection algorithms after explaining the abstract SM-MCTS algorithm.\n\nAlgorithm 1 describes a single simulation of SM-MCTS. T represents the MCTS tree in which each state is represented by one node. Every node h maintains a cumulative reward sum over all simulations through it, Xh, and a visit count nh, both initially set to 0. As depicted in Figure 1, a matrix of references to the children is maintained at each inner node. The critical parts of the algorithm are the updates on lines 8 and 14 and the selection on line 10. Each variant below will describe a different way to select an action and update a node. The standard way of defining the value to send back is RetVal(u1, Xh, nh) = u1, but we discuss also RetVal(u1, Xh, nh) = Xh/nh, which is required for the formal analysis in Section 4. We denote this variant of the algorithms with additional \u201cM\u201d for mean. Algorithm 1 and the variants below are expressed from player 1\u2019s perspective. Player 2 does the same except using negated utilities.\n\n 1  SM-MCTS(node h)\n 2    if h \u2208 Z then return u1(h)\n 3    else if h \u2208 T and \u2203(i, j) \u2208 A1(h) \u00d7 A2(h) not previously selected then\n 4      Choose one of the previously unselected (i, j) and h\u2032 \u2190 T(h, i, j)\n 5      Add h\u2032 to T\n 6      u1 \u2190 Rollout(h\u2032)\n 7      Xh\u2032 \u2190 Xh\u2032 + u1; nh\u2032 \u2190 nh\u2032 + 1\n 8      Update(h, i, j, u1)\n 9      return RetVal(u1, Xh\u2032, nh\u2032)\n10    (i, j) \u2190 Select(h)\n11    h\u2032 \u2190 T(h, i, j)\n12    u1 \u2190 SM-MCTS(h\u2032)\n13    Xh \u2190 Xh + u1; nh \u2190 nh + 1\n14    Update(h, i, j, u1)\n15    return RetVal(u1, Xh, nh)\n\nAlgorithm 1: Simultaneous Move Monte Carlo Tree Search\n\n3.1 Regret matching\n\nThis variant applies regret-matching [15] to the current estimated matrix game at each stage. 
Suppose iterations are numbered from s \u2208 {1, 2, 3, \u00b7\u00b7\u00b7} and at each iteration and each inner node h there is a mixed strategy $\\sigma^s(h)$ used by each player, initially set to uniform random: $\\sigma^0(h, i) = 1/|A(h)|$. Each player maintains a cumulative regret rh[i] for having played $\\sigma^s(h)$ instead of i \u2208 A1(h). The values are initially set to 0.\n\nOn iteration s, the Select function (line 10 in Algorithm 1) first builds the player\u2019s current strategies from the cumulative regret. Define $x^+ = \\max(x, 0)$,\n\n$$\\sigma^s(h, a) = \\begin{cases} r_h^+[a] / R^+_{sum} & \\text{if } R^+_{sum} > 0, \\\\ 1 / |A_1(h)| & \\text{otherwise,} \\end{cases} \\quad \\text{where } R^+_{sum} = \\sum_{i \\in A_1(h)} r_h^+[i]. \\qquad (3)$$\n\nThe strategy is computed by assigning higher weight proportionally to actions based on the regret of having not taken them over the long-term. To ensure exploration, a \u03b3-on-policy sampling procedure is used, choosing action i with probability $\\gamma / |A(h)| + (1 - \\gamma) \\sigma^s(h, i)$, for some \u03b3 > 0.\n\nThe Updates on lines 8 and 14 add regret accumulated at the iteration to the regret tables rh. Suppose joint action (i1, j2) is sampled from the selection policy and utility u1 is returned from the recursive call on line 12. Define x(h, i, j) = Xhij if (i, j) \u2260 (i1, j2), or u1 otherwise. The updates to the regret are:\n\n\u2200i\u2032 \u2208 A1(h), rh[i\u2032] \u2190 rh[i\u2032] + (x(h, i\u2032, j) \u2212 u1).\n\n3.2 Exp3\n\nIn Exp3 [16], a player maintains an estimate of the sum of rewards, denoted xh,i, and visit counts nh,i for each of their actions i \u2208 A1. The joint action selected on line 10 is composed of an action independently selected for each player. 
The probability of sampling action a in Select is\n\n$$\\sigma^s(h, a) = \\frac{(1 - \\gamma) \\exp(\\eta w_{h,a})}{\\sum_{i \\in A_1(h)} \\exp(\\eta w_{h,i})} + \\frac{\\gamma}{|A_1(h)|}, \\qquad (4)$$\n\nwhere $\\eta = \\gamma / |A_1(h)|$ and $w_{h,i} = x_{h,i}$.\u00b9\n\nThe Update after selecting actions (i, j) and obtaining a result (u1, u2) updates the visit count (nh,i \u2190 nh,i + 1) and adds to the corresponding reward sum estimates the reward divided by the probability that the action was played by the player ($x_{h,i} \\leftarrow x_{h,i} + u_1 / \\sigma^s(h, i)$). Dividing the value by the probability of selecting the corresponding action makes xh,i estimate the sum of rewards over all iterations, not only the ones where action i was selected.\n\n4 Formal analysis\n\nWe focus on the eventual convergence to approximate NE, which allows us to make an important simplification: we disregard the incremental building of the tree and assume we have built the complete tree. We show that this will eventually happen with probability 1 and that the statistics collected during the tree building phase cannot prevent the eventual convergence.\n\nThe main idea of the proof is to show that the algorithm will eventually converge close to the optimal strategy in the leaf nodes and inductively prove that it will converge also in higher levels of the tree. In order to do that, after introducing the necessary notation, we start by analyzing the situation in simple matrix games, which corresponds mainly to the leaf nodes of the tree. In the inner nodes of the tree, the observed payoffs are imprecise because of the stochastic nature of the selection functions and bias caused by exploration, but the error can be bounded. Hence, we continue with analysis of repeated matrix games with bounded error. 
Finally, we compose the matrices with bounded errors in a multi-stage setting to prove convergence guarantees of SM-MCTS. Any proofs that are omitted in the paper are included in the appendix available in the supplementary material and on http://arxiv.org (arXiv:1310.8613).\n\n\u00b9 In practice, we set $w_{h,i} = x_{h,i} - \\max_{i' \\in A_1(h)} x_{h,i'}$ since exp(xh,i) can easily cause numerical overflows. This reformulation computes the same values as the original algorithm but is more numerically stable.\n\n4.1 Notation and definitions\n\nConsider a repeatedly played matrix game where at time s players 1 and 2 choose actions $i_s$ and $j_s$ respectively. We will use the convention (|A1|, |A2|) = (m, n). Define\n\n$$G(t) = \\sum_{s=1}^{t} a_{i_s j_s}, \\qquad g(t) = \\frac{1}{t} G(t), \\qquad \\text{and} \\qquad G_{max}(t) = \\max_{i \\in A_1} \\sum_{s=1}^{t} a_{i j_s},$$\n\nwhere G(t) is the cumulative payoff, g(t) is the average payoff, and Gmax is the maximum cumulative payoff over all actions, each to player 1 and at time t. We also denote gmax(t) = Gmax(t)/t and by R(t) = Gmax(t) \u2212 G(t) and r(t) = gmax(t) \u2212 g(t) the cumulative and average regrets. For actions i of player 1 and j of player 2, we denote ti, tj the number of times these actions were chosen up to the time t and tij the number of times both of these actions have been chosen at once. By empirical frequencies we mean the strategy profile $(\\hat\\sigma_1(t), \\hat\\sigma_2(t)) \\in \\langle 0, 1 \\rangle^m \\times \\langle 0, 1 \\rangle^n$ given by the formulas $\\hat\\sigma_1(t, i) = t_i / t$, $\\hat\\sigma_2(t, j) = t_j / t$. By average strategies, we mean the strategy profile $(\\bar\\sigma_1(t), \\bar\\sigma_2(t))$ given by the formulas $\\bar\\sigma_1(t, i) = \\sum_{s=1}^{t} \\sigma_1^s(i) / t$, $\\bar\\sigma_2(t, j) = \\sum_{s=1}^{t} \\sigma_2^s(j) / t$, where $\\sigma_1^s$, $\\sigma_2^s$ are the strategies used at time s.\n\nDefinition 4.1. 
We say that a player is \u03b5-Hannan-consistent if, for any payoff sequences (e.g., against any opponent strategy), $\\limsup_{t \\to \\infty} r(t) \\le \\epsilon$ holds almost surely. An algorithm A is \u03b5-Hannan consistent, if a player who chooses his actions based on A is \u03b5-Hannan consistent.\n\nHannan consistency (HC) is a commonly studied property in the context of online learning in repeated (single stage) decisions. In particular, RM and variants of Exp3 have been shown to be Hannan consistent in matrix games [15, 16]. In order to ensure that the MCTS algorithm will eventually visit each node infinitely many times, we need the selection function to satisfy the following property.\n\nDefinition 4.2. We say that A is an algorithm with guaranteed exploration, if for players 1 and 2 both using A for action selection $\\lim_{t \\to \\infty} t_{ij} = \\infty$ holds almost surely \u2200(i, j) \u2208 A1 \u00d7 A2.\n\nNote that most of the HC algorithms, namely RM and Exp3, guarantee exploration without any modification. If there is an algorithm without this property, it can be adjusted the following way.\n\nDefinition 4.3. Let A be an algorithm used for choosing action in a matrix game M. For fixed exploration parameter \u03b3 \u2208 (0, 1) we define a modified algorithm A\u2217 as follows: At each time step, with probability (1 \u2212 \u03b3) run one iteration of A and with probability \u03b3 choose the action randomly uniformly over available actions, without updating any of the variables belonging to A.\n\n4.2 Repeated matrix games\n\nFirst we show that the \u03b5-Hannan consistency is not lost due to the additional exploration.\n\nLemma 4.4. Let A be an \u03b5-Hannan consistent algorithm. 
Then A\u2217 is an (\u03b5 + \u03b3)-Hannan consistent algorithm with guaranteed exploration.\n\nIn previous works on MCTS in our class of games, RM variants generally suggested using the average strategy and Exp3 variants the empirical frequencies to obtain the strategy to be played. The following lemma says there eventually is no difference between the two.\n\nLemma 4.5. As t approaches infinity, the empirical frequencies and average strategies will almost surely be equal. That is, $\\limsup_{t \\to \\infty} \\max_{i \\in A_1} |\\hat\\sigma_1(t, i) - \\bar\\sigma_1(t, i)| = 0$ holds with probability 1.\n\nThe proof is a consequence of the martingale version of the Strong Law of Large Numbers.\n\nIt is well known that two Hannan consistent players will eventually converge to NE (see [18, p. 11] and [19]). We prove a similar result for the approximate versions of the notions.\n\nLemma 4.6. Let \u03b5 > 0 be a real number. If both players in a matrix game with value v are \u03b5-Hannan consistent, then the following inequalities hold for the empirical frequencies almost surely:\n\n$$\\limsup_{t \\to \\infty} u(br, \\hat\\sigma_2(t)) \\le v + 2\\epsilon \\quad \\text{and} \\quad \\liminf_{t \\to \\infty} u(\\hat\\sigma_1(t), br) \\ge v - 2\\epsilon. \\qquad (5)$$\n\nThe proof shows that if the value caused by the empirical frequencies was outside of the interval infinitely many times with positive probability, it would be in contradiction with the definition of \u03b5-HC. The following corollary is then a direct consequence of this lemma.\n\nCorollary 4.7. 
If both players in a matrix game are \u03b5-Hannan consistent, then there almost surely exists t0 \u2208 N, such that for every t \u2265 t0 the empirical frequencies and average strategies form a (4\u03b5 + \u03b4)-equilibrium for arbitrarily small \u03b4 > 0.\n\nThe constant 4 is caused by going from a pair of strategies with best responses within 2\u03b5 of the game value guaranteed by Lemma 4.6 to the approximate NE, which multiplies the distance by two.\n\n4.3 Repeated matrix games with bounded error\n\nAfter defining the repeated games with error, we present a variant of Lemma 4.6 for these games.\n\nDefinition 4.8. We define M(t) = (aij(t)) to be a game, in which if players choose actions i and j, they receive randomized payoffs aij(t, (i1, ...it\u22121), (j1, ...jt\u22121)). We will denote these simply as aij(t), but in fact they are random variables with values in [0, 1] and their distribution in time t depends on the previous choices of actions. We say that M(t) = (aij(t)) is a repeated game with error \u03b7, if there is a matrix game M = (aij) and almost surely exists t0 \u2208 N, such that |aij(t) \u2212 aij| < \u03b7 holds for all t \u2265 t0.\n\nIn this context, we will denote $G(t) = \\sum_{s \\in \\{1...t\\}} a_{i_s j_s}(s)$ etc. and use tilde for the corresponding variables without errors ($\\tilde G(t) = \\sum a_{i_s j_s}$ etc.). Symbols v and u(\u00b7,\u00b7) will still be used with respect to M without errors. The following lemma states that even with the errors, \u03b5-HC algorithms still converge to an approximate NE of the game.\n\nLemma 4.9. Let \u03b5 > 0 and c \u2265 0. 
If M(t) is a repeated game with error c\u03b5 and both players are \u03b5-Hannan consistent then the following inequalities hold almost surely:\n\n$$\\limsup_{t \\to \\infty} u(br, \\hat\\sigma_2) \\le v + 2(c + 1)\\epsilon, \\qquad \\liminf_{t \\to \\infty} u(\\hat\\sigma_1, br) \\ge v - 2(c + 1)\\epsilon \\qquad (6)$$\n\n$$\\text{and} \\quad v - (c + 1)\\epsilon \\le \\liminf_{t \\to \\infty} g(t) \\le \\limsup_{t \\to \\infty} g(t) \\le v + (c + 1)\\epsilon. \\qquad (7)$$\n\nThe proof is similar to the proof of Lemma 4.6. It needs an additional claim that if the algorithm is \u03b5-HC with respect to the observed values with errors, it still has a bounded regret with respect to the exact values. In the same way as in the previous subsection, a direct consequence of the lemma is the convergence to an approximate Nash equilibrium.\n\nTheorem 4.10. Let \u03b5, c > 0 be real numbers. If M(t) is a repeated game with error c\u03b5 and both players are \u03b5-Hannan consistent, then for any \u03b4 > 0 there almost surely exists t0 \u2208 N, such that for all t \u2265 t0 the empirical frequencies form a (4(c + 1)\u03b5 + \u03b4)-equilibrium of the game M.\n\n4.4 Perfect-information extensive-form games with simultaneous moves\n\nNow we have all the necessary components to prove the main theorem.\n\nTheorem 4.11. Let $(M^h)_{h \\in H}$ be a game with perfect information and simultaneous moves with maximal depth D. Then for every \u03b5-Hannan consistent algorithm A with guaranteed exploration and arbitrarily small \u03b4 > 0, there almost surely exists t0, so that the average strategies $(\\hat\\sigma_1(t), \\hat\\sigma_2(t))$ form a subgame perfect $(2D^2 + \\delta)\\epsilon$-Nash equilibrium for all t \u2265 t0.\n\nOnce we have established the convergence of the \u03b5-HC algorithms in games with errors, we can proceed by induction. 
The games in the leaf nodes are simple matrix games, so they will eventually converge and they will return the mean reward values in a bounded distance from the actual value of the game (Lemma 4.9 with c = 0). As a result, in the level just above the leaf nodes, the \u03b5-HC algorithms are playing a matrix game with a bounded error and by Lemma 4.9, they will also eventually return the mean values within a bounded interval. On level d from the leaf nodes, the errors of returned values will be in the order of d\u03b5 and players can gain 2d\u03b5 by deviating. Summing the possible gain of deviations on each level leads to the bound in the theorem. The subgame perfection of the equilibrium results from the fact that for proving the bound on approximation in the whole game (i.e., in the root of the game tree), a smaller bound on approximation of the equilibrium is proven for all subgames in the induction. The formal proof is presented in the appendix.\n\nFigure 2: Exploitability of strategies given by the empirical frequencies of Regret matching with propagating values (RM) and means (RMM) for various depths and branching factors.\n\n5 Empirical analysis\n\nIn this section, we first evaluate the influence of propagating the mean values instead of the current sample value in MCTS on the speed of convergence to Nash equilibrium. Afterwards, we try to assess the convergence rate of the algorithms in the worst case. In most of the experiments, we use Regret matching as the selection strategy of the SM-MCTS algorithm, because a superior convergence rate bound is known for this algorithm and it has been reported to be very successful also empirically in [20]. 
We always use the empirical frequencies to create the evaluated strategy and measure the exploitability of the first player\u2019s strategy (i.e., vh0 \u2212 u(\u03c3\u03021, br)).\n\n5.1 Influence of propagation of the mean\n\nThe formal analysis presented in the previous section requires the algorithms to return the mean of all the previous samples instead of the value of the current sample. The latter is generally the case in previous works on SM-MCTS [20, 11]. We run both variants with the Regret matching algorithm on a set of randomly generated games parameterized by depth and branching factor. Branching factor was always the same for both players. For the following experiments, the utility values are randomly selected uniformly from interval \u27e80, 1\u27e9. Each experiment uses 100 random games and 100 runs of the algorithm.\n\nFigure 2 presents how the exploitability of the strategies produced by Regret matching with propagation of the mean (RMM) and current sample value (RM) develops with increasing number of iterations. Note that both axes are in logarithmic scale. The top graph is for depth of 2, different branching factors (BF) and \u03b3 \u2208 {0.05, 0.1, 0.2}. The bottom one presents different depths for BF = 2. The results show that both methods converge to the approximate Nash equilibrium of the game. RMM converges slightly slower in all cases. The difference is very small in small games, but becomes more apparent in games with larger depth.\n\n5.2 Empirical convergence rate\n\nAlthough the formal analysis guarantees the convergence to an \u03b5-NE of the game, the rate of the convergence is not given. 
Therefore, we give an empirical analysis of the convergence and specifically focus on the cases that reached the slowest convergence from a set of evaluated games.\n\n[Figure 2 panels: BF = 2, 3, 5 and Depth = 2, 3, 4; x-axis: t; y-axis: Exploitability; methods: RM, RMM]\n\nFigure 3: The games with maximal exploitability after 1000 iterations with RM (left) and RMM (right) and the corresponding exploitability for all evaluated methods.\n\nWe have performed a brute force search through all games of depth 2 with branching factor 2 and utilities from the set {0, 0.5, 1}. We made 100 runs of RM and RMM with exploration set to \u03b3 = 0.05 for 1000 iterations and computed the mean exploitability of the strategy. The games with the highest exploitability for each method are presented in Figure 3. These games are not guaranteed to be the exact worst case, because of possible error caused by only 100 runs of the algorithm, but they are representatives of particularly difficult cases for the algorithms. In general, the games that are most difficult for one method are difficult also for the other. Note that we systematically searched also for games in which RMM performs better than RM, but this was never the case with a sufficient number of runs of the algorithms in the selected games.\n\nFigure 3 shows the convergence of RM and Exp3 with propagating the current sample values and the mean values (RMM and Exp3M) on the empirically worst games for the RM variants. The RM variants converge to the minimal achievable values (0.0119 and 0.0367) after a million iterations. These values correspond exactly to the exploitability of the optimal strategy combined with the uniform exploration with probability 0.05. The Exp3 variants most likely converge to the same values; however, they did not fully make it in the first million iterations in WC_RM. 
The convergence rate of all the variants is similar and the variants with propagating means always converge a little slower.\n\n6 Conclusion\n\nWe present the first formal analysis of convergence of MCTS algorithms in zero-sum extensive-form games with perfect information and simultaneous moves. We show that any \u03b5-Hannan consistent algorithm can be used to create an MCTS algorithm that provably converges to an approximate Nash equilibrium of the game. This justifies the use of MCTS as an approximation algorithm for this class of games from the perspective of algorithmic game theory. We complement the formal analysis with experimental evaluation that shows that other MCTS variants for this class of games, which are not covered by the proof, also converge to the approximate NE of the game. Hence, we believe that the presented proofs can be generalized to include these cases as well. Besides this, we will focus our future research on providing finite time convergence bounds for these algorithms and generalizing the results to more general classes of extensive-form games with imperfect information.\n\nAcknowledgments\n\nThis work is partially funded by the Czech Science Foundation (grant no. P202/12/2054), the Grant Agency of the Czech Technical University in Prague (grant no. OHK3-060/12), and the Netherlands Organisation for Scientific Research (NWO) in the framework of the project Go4Nature, grant number 612.000.938. 
Access to computing and storage facilities owned by parties and projects\ncontributing to the National Grid Infrastructure MetaCentrum, provided under the programme \u201cProjects\nof Large Infrastructure for Research, Development, and Innovations\u201d (LM2010005), is appreciated.\n\n[Figure 3 plot panels: WC_RM, WC_RMM; axes: t, Exploitability; methods: Exp3, Exp3M, RM, RMM]\n\nReferences\n\n[1] Manish Jain, Dmytro Korzhyk, Ondrej Vanek, Vincent Conitzer, Michal Pechoucek, and Milind Tambe. A\ndouble oracle algorithm for zero-sum security games. In Tenth International Conference on Autonomous\nAgents and Multiagent Systems (AAMAS 2011), pages 327\u2013334, 2011.\n\n[2] Michael Johanson, Nolan Bard, Neil Burch, and Michael Bowling. Finding optimal abstract strategies in\nextensive-form games. In Proceedings of the Twenty-Sixth Conference on Artificial Intelligence\n(AAAI-12), pages 1371\u20131379, 2012.\n\n[3] S. M. Ross. Goofspiel \u2014 the game of pure strategy. Journal of Applied Probability, 8(3):621\u2013625, 1971.\n\n[4] Glenn C. Rhoads and Laurent Bartholdi. Computer solution to the game of pure strategy. Games,\n3(4):150\u2013156, 2012.\n\n[5] Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In\nProceedings of the Eleventh International Conference on Machine Learning (ICML-1994), pages 157\u2013163.\nMorgan Kaufmann, 1994.\n\n[6] M. Genesereth and N. Love. General game-playing: Overview of the AAAI competition. AI Magazine,\n26:62\u201372, 2005.\n\n[7] Michael Buro. Solving the Oshi-Zumo game. In Proceedings of Advances in Computer Games 10, pages\n361\u2013366, 2003.\n\n[8] Abdallah Saffidine, Hilmar Finnsson, and Michael Buro. Alpha-beta pruning for games with simultaneous\nmoves. 
In Proceedings of the Twenty-Sixth Conference on Artificial Intelligence (AAAI-12), pages 556\u2013562,\n2012.\n\n[9] Branislav Bosansky, Viliam Lisy, Jiri Cermak, Roman Vitek, and Michal Pechoucek. Using double-oracle\nmethod and serialized alpha-beta search for pruning in simultaneous moves games. In Proceedings of the\nTwenty-Third International Joint Conference on Artificial Intelligence (IJCAI), pages 48\u201354, 2013.\n\n[10] H. Finnsson and Y. Bj\u00f6rnsson. Simulation-based approach to general game-playing. In The Twenty-Third\nAAAI Conference on Artificial Intelligence, pages 259\u2013264. AAAI Press, 2008.\n\n[11] Olivier Teytaud and S\u00e9bastien Flory. Upper confidence trees with short term partial information. In\nApplications of Evolutionary Computation (EvoApplications 2011), Part I, volume 6624 of LNCS, pages\n153\u2013162, Berlin, Heidelberg, 2011. Springer-Verlag.\n\n[12] Pierre Perick, David L. St-Pierre, Francis Maes, and Damien Ernst. Comparison of different selection\nstrategies in Monte-Carlo tree search for the game of Tron. In Proceedings of the IEEE Conference on\nComputational Intelligence and Games (CIG), pages 242\u2013249, 2012.\n\n[13] Hilmar Finnsson. Simulation-Based General Game Playing. PhD thesis, Reykjavik University, 2012.\n\n[14] L. Kocsis and C. Szepesv\u00e1ri. Bandit-based Monte Carlo planning. In 15th European Conference on\nMachine Learning, volume 4212 of LNCS, pages 282\u2013293, 2006.\n\n[15] S. Hart and A. Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica,\n68(5):1127\u20131150, 2000.\n\n[16] Peter Auer, Nicol\u00f2 Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed\nbandit problem. SIAM Journal on Computing, 32(1):48\u201377, 2002.\n\n[17] M. Shafiei, N. R. Sturtevant, and J. Schaeffer. Comparing UCT versus CFR in simultaneous games. 
In Proceedings of the IJCAI Workshop on General Game-Playing (GIGA), pages 75\u201382, 2009.\n\n[18] Kevin Waugh. Abstraction in large extensive games. Master\u2019s thesis, University of Alberta, 2009.\n\n[19] A. Blum and Y. Mansour. Learning, regret minimization, and equilibria. In Noam Nisan, Tim\nRoughgarden, Eva Tardos, and Vijay V. Vazirani, editors, Algorithmic Game Theory, chapter 4. Cambridge\nUniversity Press, 2007.\n\n[20] Marc Lanctot, Viliam Lis\u00fd, and Mark H.M. Winands. Monte Carlo tree search in simultaneous move\ngames with applications to Goofspiel. In Workshop on Computer Games at IJCAI, 2013.", "award": [], "sourceid": 1046, "authors": [{"given_name": "Viliam", "family_name": "Lisy", "institution": "CTU in Prague"}, {"given_name": "Vojta", "family_name": "Kovarik", "institution": "CTU in Prague"}, {"given_name": "Marc", "family_name": "Lanctot", "institution": "Maastricht University"}, {"given_name": "Branislav", "family_name": "Bosansky", "institution": "CTU in Prague"}]}