{"title": "Monte Carlo Sampling for Regret Minimization in Extensive Games", "book": "Advances in Neural Information Processing Systems", "page_first": 1078, "page_last": 1086, "abstract": "Sequential decision-making with multiple agents and imperfect information is commonly modeled as an extensive game. One efficient method for computing Nash equilibria in large, zero-sum, imperfect information games is counterfactual regret minimization (CFR). In the domain of poker, CFR has proven effective, particularly when using a domain-specific augmentation involving chance outcome sampling. In this paper, we describe a general family of domain independent CFR sample-based algorithms called Monte Carlo counterfactual regret minimization (MCCFR) of which the original and poker-specific versions are special cases. We start by showing that MCCFR performs the same regret updates as CFR on expectation. Then, we introduce two sampling schemes: {\\it outcome sampling} and {\\it external sampling}, showing that both have bounded overall regret with high probability. Thus, they can compute an approximate equilibrium using self-play. Finally, we prove a new tighter bound on the regret for the original CFR algorithm and relate this new bound to MCCFRs bounds. We show empirically that, although the sample-based algorithms require more iterations, their lower cost per iteration can lead to dramatically faster convergence in various games.", "full_text": "Monte Carlo Sampling for Regret Minimization in\n\nExtensive Games\n\nMarc Lanctot\n\nDepartment of Computing Science\n\nUniversity of Alberta\n\nEdmonton, Alberta, Canada T6G 2E8\n\nlanctot@ualberta.ca\n\nKevin Waugh\n\nSchool of Computer Science\nCarnegie Mellon University\nPittsburgh PA 15213-3891\nwaugh@cs.cmu.edu\n\nMartin Zinkevich\nYahoo! 
Research\n\nSanta Clara, CA, USA 95054\n\nmaz@yahoo-inc.com\n\nMichael Bowling\n\nDepartment of Computing Science\n\nUniversity of Alberta\n\nEdmonton, Alberta, Canada T6G 2E8\n\nbowling@cs.ualberta.ca\n\nAbstract\n\nSequential decision-making with multiple agents and imperfect information is\ncommonly modeled as an extensive game. One ef\ufb01cient method for computing\nNash equilibria in large, zero-sum, imperfect information games is counterfactual\nregret minimization (CFR). In the domain of poker, CFR has proven effective, par-\nticularly when using a domain-speci\ufb01c augmentation involving chance outcome\nsampling. In this paper, we describe a general family of domain-independent CFR\nsample-based algorithms called Monte Carlo counterfactual regret minimization\n(MCCFR) of which the original and poker-speci\ufb01c versions are special cases. We\nstart by showing that MCCFR performs the same regret updates as CFR on expec-\ntation. Then, we introduce two sampling schemes: outcome sampling and external\nsampling, showing that both have bounded overall regret with high probability.\nThus, they can compute an approximate equilibrium using self-play. Finally, we\nprove a new tighter bound on the regret for the original CFR algorithm and re-\nlate this new bound to MCCFR\u2019s bounds. We show empirically that, although the\nsample-based algorithms require more iterations, their lower cost per iteration can\nlead to dramatically faster convergence in various games.\n\n1 Introduction\n\nExtensive games are a powerful model of sequential decision-making with imperfect information,\nsubsuming \ufb01nite-horizon MDPs, \ufb01nite-horizon POMDPs, and perfect information games. The past\nfew years have seen dramatic algorithmic improvements in solving, i.e., \ufb01nding an approximate\nNash equilibrium, in two-player, zero-sum extensive games. 
Multiple techniques [1, 2] now exist\nfor solving games with up to 10^12 game states, which is about four orders of magnitude larger than\nthe previous state-of-the-art of using sequence-form linear programs [3].\nCounterfactual regret minimization (CFR) [1] is one such recent technique that exploits the fact that\nthe time-averaged strategy profile of regret minimizing algorithms converges to a Nash equilibrium.\nThe key insight is the fact that minimizing per-information set counterfactual regret results in\nminimizing overall regret. However, the vanilla form presented by Zinkevich and colleagues requires\nthe entire game tree to be traversed on each iteration. It is possible to avoid a full game-tree\ntraversal. In their accompanying technical report, Zinkevich and colleagues discuss a poker-specific CFR\nvariant that samples chance outcomes on each iteration [4]. They claim that the per-iteration cost\nreduction far exceeds the additional number of iterations required, and all of their empirical studies\nfocus on this variant. The sampling variant and its derived bound are limited to poker-like games\nwhere chance plays a prominent role in the size of the games. This limits the practicality of CFR\nminimization outside of its initial application of poker or moderately sized games. An additional\ndisadvantage of CFR is that it requires the opponent\u2019s policy to be known, which makes it\nunsuitable for online regret minimization in an extensive game. Online regret minimization in extensive\ngames is possible using online convex programming techniques, such as Lagrangian Hedging [5],\nbut these techniques can require costly optimization routines at every time step.\nIn this paper, we present a general framework for sampling in counterfactual regret minimization.\nWe define a family of Monte Carlo CFR minimizing algorithms (MCCFR) that differ in how they\nsample the game tree on each iteration. 
Zinkevich\u2019s vanilla CFR and a generalization of their chance-sampled\nCFR are both members of this family. We then introduce two additional members of this\nfamily: outcome-sampling, where only a single playing of the game is sampled on each iteration; and\nexternal-sampling, which samples chance nodes and the opponent\u2019s actions. We show that under a\nreasonable sampling strategy, any member of this family minimizes overall regret, and so can be used\nfor equilibrium computation. Additionally, external-sampling is proven to require only a constant-factor\nincrease in iterations yet achieves an order reduction in the cost per iteration, thus resulting in an\nasymptotic improvement in equilibrium computation time. Furthermore, since outcome-sampling\ndoes not need knowledge of the opponent\u2019s strategy beyond samples of play from the strategy, we\ndescribe how it can be used for online regret minimization. We then evaluate these algorithms\nempirically by using them to compute approximate equilibria in a variety of games.\n\n2 Background\n\nAn extensive game is a general model of sequential decision-making with imperfect information. As\nwith perfect information games (such as Chess or Checkers), extensive games consist primarily of a\ngame tree: each non-terminal node has an associated player (possibly chance) that makes the\ndecision at that node, and each terminal node has associated utilities for the players. Additionally, game\nstates are partitioned into information sets where a player cannot distinguish between two states in\nthe same information set. The players, therefore, must choose actions with the same distribution at\neach state in the same information set. We now define an extensive game formally, introducing the\nnotation we use throughout the paper.\n\nDefinition 1 [6, p. 200] A finite extensive game with imperfect information has the following\ncomponents:\n\n\u2022 A finite set N of players. 
\n\u2022 A finite set H of sequences, the possible histories of actions, such\nthat the empty sequence is in H and every prefix of a sequence in H is also in H. Define\nh \u2291 h\u2032 to mean h is a prefix of h\u2032. Z \u2286 H are the terminal histories (those which are not\na prefix of any other sequences). A(h) = {a : ha \u2208 H} are the actions available after a\nnon-terminal history, h \u2208 H \\ Z.\n\u2022 A function P that assigns to each non-terminal history a member of N \u222a {c}. P is the\nplayer function. P (h) is the player who takes an action after the history h. If P (h) = c\nthen chance determines the action taken after history h.\n\u2022 For each player i \u2208 N \u222a {c} a partition \u2110i of {h \u2208 H : P (h) = i} with the property that\nA(h) = A(h\u2032) whenever h and h\u2032 are in the same member of the partition. For Ii \u2208 \u2110i\nwe denote by A(Ii) the set A(h) and by P (Ii) the player P (h) for any h \u2208 Ii. \u2110i is the\ninformation partition of player i; a set Ii \u2208 \u2110i is an information set of player i.\n\u2022 A function fc that associates with every information set I where P (I) = c a probability\nmeasure fc(\u00b7|I) on A(h) (fc(a|I) is the probability that a occurs given some h \u2208 I), where\neach such probability measure is independent of every other such measure.\u00b9\n\n\u00b9 Traditionally, an information partition is not specified for chance. In fact, as long as the same chance\ninformation set cannot be revisited, it has no strategic effect on the game itself. However, this extension allows\nus to consider using the same sampled chance outcome for an entire set of histories, which is an important part\nof Zinkevich and colleagues\u2019 chance-sampling CFR variant.\n\n\u2022 For each player i \u2208 N a utility function ui from the terminal states Z to the reals R. If\nN = {1, 2} and u1 = \u2212u2, it is a zero-sum extensive game. 
Define $\\Delta_{u,i} = \\max_z u_i(z) - \\min_z u_i(z)$ to be the range of utilities to player i.\n\nIn this paper, we will only concern ourselves with two-player, zero-sum extensive games. Furthermore,\nwe will assume perfect recall, a restriction on the information partitions such that a player\ncan always distinguish between game states where they previously took a different action or were\npreviously in a different information set.\n\n2.1 Strategies and Equilibria\n\nA strategy of player i, $\\sigma_i$, in an extensive game is a function that assigns a distribution over $A(I_i)$\nto each $I_i \\in \\mathcal{I}_i$. We denote $\\Sigma_i$ as the set of all strategies for player i. A strategy profile, $\\sigma$, consists\nof a strategy for each player, $\\sigma_1, \\ldots, \\sigma_n$. We let $\\sigma_{-i}$ refer to the strategies in $\\sigma$ excluding $\\sigma_i$.\nLet $\\pi^\\sigma(h)$ be the probability of history h occurring if all players choose actions according to $\\sigma$. We\ncan decompose $\\pi^\\sigma(h) = \\prod_{i \\in N \\cup \\{c\\}} \\pi_i^\\sigma(h)$ into each player\u2019s contribution to this probability. Here,\n$\\pi_i^\\sigma(h)$ is the contribution to this probability from player i when playing according to $\\sigma$. Let $\\pi_{-i}^\\sigma(h)$\nbe the product of all players\u2019 contributions (including chance) except that of player i. For $I \\subseteq H$,\ndefine $\\pi^\\sigma(I) = \\sum_{h \\in I} \\pi^\\sigma(h)$ as the probability of reaching a particular information set given all\nplayers play according to $\\sigma$, with $\\pi_i^\\sigma(I)$ and $\\pi_{-i}^\\sigma(I)$ defined similarly. Finally, let\n$\\pi^\\sigma(h, z) = \\pi^\\sigma(z)/\\pi^\\sigma(h)$ if $h \\sqsubseteq z$, and zero otherwise. Let $\\pi_i^\\sigma(h, z)$ and $\\pi_{-i}^\\sigma(h, z)$ be defined similarly. Using\nthis notation, we can define the expected payoff for player i as $u_i(\\sigma) = \\sum_{h \\in Z} u_i(h) \\pi^\\sigma(h)$.\n\nGiven a strategy profile, $\\sigma$, we define a player\u2019s best response as a strategy that maximizes their\nexpected payoff assuming all other players play according to $\\sigma$. The best-response value for player\ni is the value of that strategy, $b_i(\\sigma_{-i}) = \\max_{\\sigma_i' \\in \\Sigma_i} u_i(\\sigma_i', \\sigma_{-i})$. An $\\epsilon$-Nash equilibrium is an\napproximation of a Nash equilibrium; it is a strategy profile $\\sigma$ that satisfies\n\n$$\\forall i \\in N \\quad u_i(\\sigma) + \\epsilon \\ge \\max_{\\sigma_i' \\in \\Sigma_i} u_i(\\sigma_i', \\sigma_{-i}) \\tag{1}$$\n\nIf $\\epsilon = 0$ then $\\sigma$ is a Nash Equilibrium: no player has any incentive to deviate as they are all playing\nbest responses. If a game is two-player and zero-sum, we can use exploitability as a metric for\ndetermining how close $\\sigma$ is to an equilibrium, $\\epsilon_\\sigma = b_1(\\sigma_2) + b_2(\\sigma_1)$.\n\n2.2 Counterfactual Regret Minimization\n\nRegret is an online learning concept that has triggered a family of powerful learning algorithms. To\ndefine this concept, first consider repeatedly playing an extensive game. Let $\\sigma_i^t$ be the strategy used\nby player i on round t. The average overall regret of player i at time T is:\n\n$$R_i^T = \\frac{1}{T} \\max_{\\sigma_i^* \\in \\Sigma_i} \\sum_{t=1}^{T} \\left( u_i(\\sigma_i^*, \\sigma_{-i}^t) - u_i(\\sigma^t) \\right) \\tag{2}$$\n\nMoreover, define $\\bar\\sigma_i^T$ to be the average strategy for player i from time 1 to T. In particular, for each\ninformation set $I \\in \\mathcal{I}_i$, for each $a \\in A(I)$, define:\n\n$$\\bar\\sigma_i^T(a|I) = \\frac{\\sum_{t=1}^{T} \\pi_i^{\\sigma^t}(I) \\, \\sigma^t(a|I)}{\\sum_{t=1}^{T} \\pi_i^{\\sigma^t}(I)}. \\tag{3}$$\n\nThere is a well-known connection between regret, average strategies, and Nash equilibria.\n\nTheorem 1 In a zero-sum game, if $R_i^T \\le \\epsilon$ for both players $i \\in \\{1, 2\\}$, then $\\bar\\sigma^T$ is a $2\\epsilon$ equilibrium.\n\nAn algorithm for selecting $\\sigma_i^t$ for player i is regret minimizing if player i\u2019s average overall regret\n(regardless of the sequence $\\sigma_{-i}^t$) goes to zero as t goes to infinity. Regret minimizing algorithms in\nself-play can be used as a technique for computing an approximate Nash equilibrium. Moreover, an\nalgorithm\u2019s bound on the average overall regret bounds the convergence rate of the approximation.\nZinkevich and colleagues [1] used the above approach in their counterfactual regret algorithm (CFR).\nThe basic idea of CFR is that overall regret can be bounded by the sum of positive per-information-set\nimmediate counterfactual regret. Let I be an information set of player i. Define $\\sigma_{(I \\to a)}$ to be\na strategy profile identical to $\\sigma$ except that player i always chooses action a from information set\nI. Let $Z_I$ be the subset of all terminal histories where a prefix of the history is in the set I; for\n$z \\in Z_I$ let $z[I]$ be that prefix. Since we are restricting ourselves to perfect recall games $z[I]$ is\nunique. Define counterfactual value $v_i(\\sigma, I)$ as,\n\n$$v_i(\\sigma, I) = \\sum_{z \\in Z_I} \\pi_{-i}^\\sigma(z[I]) \\, \\pi^\\sigma(z[I], z) \\, u_i(z). \\tag{4}$$\n\nThe immediate counterfactual regret is then $R_{i,\\mathrm{imm}}^T(I) = \\max_{a \\in A(I)} R_{i,\\mathrm{imm}}^T(I, a)$, where\n\n$$R_{i,\\mathrm{imm}}^T(I, a) = \\frac{1}{T} \\sum_{t=1}^{T} \\left( v_i(\\sigma_{(I \\to a)}^t, I) - v_i(\\sigma^t, I) \\right) \\tag{5}$$\n\nLet $x^+ = \\max(x, 0)$. The key insight of CFR is the following result.\n\nTheorem 2 [1, Theorem 3] $R_i^T \\le \\sum_{I \\in \\mathcal{I}_i} R_{i,\\mathrm{imm}}^{T,+}(I)$\n\nUsing regret-matching\u00b2 the positive per-information set immediate counterfactual regrets can be\ndriven to zero, thus driving average overall regret to zero. This results in an average overall regret\nbound [1, Theorem 4]: $R_i^T \\le \\Delta_{u,i} |\\mathcal{I}_i| \\sqrt{|A_i|} / \\sqrt{T}$, where $|A_i| = \\max_{h : P(h) = i} |A(h)|$. We return to\nthis bound, tightening it further, in Section 4.\nThis result suggests an algorithm for computing equilibria via self-play, which we will refer to as\nvanilla CFR. The idea is to traverse the game tree computing counterfactual values using Equation 4.\nGiven a strategy, these values define regret terms for each player for each of their information sets\nusing Equation 5. These regret values accumulate and determine the strategies at the next iteration\nusing the regret-matching formula. Since both players are regret minimizing, Theorem 1 applies\nand so computing the strategy profile $\\bar\\sigma^T$ gives us an approximate Nash Equilibrium. Since CFR\nonly needs to store values at each information set, its space requirement is $O(|\\mathcal{I}|)$. However, as\npreviously mentioned, vanilla CFR requires a complete traversal of the game tree on each iteration,\nwhich prohibits its use in many large games. Zinkevich and colleagues [4] made steps to alleviate\nthis concern with a chance-sampled variant of CFR for poker-like games.\n\n3 Monte Carlo CFR\n\nThe key to our approach is to avoid traversing the entire game tree on each iteration while still having\nthe immediate counterfactual regrets be unchanged in expectation. In general, we want to restrict\nthe terminal histories we consider on each iteration. Let $\\mathcal{Q} = \\{Q_1, \\ldots, Q_r\\}$ be a set of subsets of\nZ, such that their union spans the set Z. We will call one of these subsets a block. On each iteration\nwe will sample one of these blocks and only consider the terminal histories in that block. 
Let $q_j > 0$\nbe the probability of considering block $Q_j$ for the current iteration (where $\\sum_{j=1}^{r} q_j = 1$).\nLet $q(z) = \\sum_{j : z \\in Q_j} q_j$, i.e., $q(z)$ is the probability of considering terminal history z on the current\niteration. The sampled counterfactual value when updating block j is:\n\n$$\\tilde v_i(\\sigma, I|j) = \\sum_{z \\in Q_j \\cap Z_I} \\frac{1}{q(z)} u_i(z) \\, \\pi_{-i}^\\sigma(z[I]) \\, \\pi^\\sigma(z[I], z) \\tag{6}$$\n\nSelecting a set $\\mathcal{Q}$ along with the sampling probabilities defines a complete sample-based CFR algorithm.\nRather than doing full game tree traversals the algorithm samples one of these blocks, and\nthen examines only the terminal histories in that block.\nSuppose we choose $\\mathcal{Q} = \\{Z\\}$, i.e., one block containing all terminal histories and $q_1 = 1$. In\nthis case, sampled counterfactual value is equal to counterfactual value, and we have vanilla CFR.\nSuppose instead we choose each block to include all terminal histories with the same sequence of\nchance outcomes (where the probability of a chance outcome is independent of players\u2019 actions as\nin poker-like games). Hence $q_j$ is the product of the probabilities in the sampled sequence of chance\noutcomes (which cancels with these same probabilities in the definition of counterfactual value) and\nwe have Zinkevich and colleagues\u2019 chance-sampled CFR.\n\n\u00b2 Regret-matching selects actions with probability proportional to their positive regret, i.e., $\\sigma_i^t(a|I) = R_{i,\\mathrm{imm}}^{T,+}(I, a) / \\sum_{a' \\in A(I)} R_{i,\\mathrm{imm}}^{T,+}(I, a')$. Regret-matching satisfies Blackwell\u2019s approachability criteria [7, 8].\n\nSampled counterfactual value was designed to match counterfactual value on expectation. 
We show\nthis here, and then use this fact to prove a probabilistic bound on the algorithm\u2019s average overall\nregret in the next section.\n\nLemma 1 $\\mathbb{E}_{j \\sim q_j}[\\tilde v_i(\\sigma, I|j)] = v_i(\\sigma, I)$\n\nProof:\n\n$$\\mathbb{E}_{j \\sim q_j}[\\tilde v_i(\\sigma, I|j)] = \\sum_j q_j \\, \\tilde v_i(\\sigma, I|j) = \\sum_j \\sum_{z \\in Q_j \\cap Z_I} \\frac{q_j}{q(z)} \\pi_{-i}^\\sigma(z[I]) \\, \\pi^\\sigma(z[I], z) \\, u_i(z) \\tag{7}$$\n\n$$= \\sum_{z \\in Z_I} \\frac{\\sum_{j : z \\in Q_j} q_j}{q(z)} \\pi_{-i}^\\sigma(z[I]) \\, \\pi^\\sigma(z[I], z) \\, u_i(z) \\tag{8}$$\n\n$$= \\sum_{z \\in Z_I} \\pi_{-i}^\\sigma(z[I]) \\, \\pi^\\sigma(z[I], z) \\, u_i(z) = v_i(\\sigma, I) \\tag{9}$$\n\nEquation 8 follows from the fact that $\\mathcal{Q}$ spans Z. Equation 9 follows from the definition of $q(z)$.\n\nThis results in the following MCCFR algorithm. We sample a block and for each information\nset that contains a prefix of a terminal history in the block we compute the sampled immediate\ncounterfactual regrets of each action, $\\tilde r(I, a) = \\tilde v_i(\\sigma_{(I \\to a)}^t, I) - \\tilde v_i(\\sigma^t, I)$. We accumulate these\nregrets, and the player\u2019s strategy on the next iteration applies the regret-matching algorithm to the\naccumulated regrets. We now present two specific members of this family, giving details on how the\nregrets can be updated efficiently.\n\nOutcome-Sampling MCCFR. In outcome-sampling MCCFR we choose $\\mathcal{Q}$ so that each block\ncontains a single terminal history, i.e., $\\forall Q \\in \\mathcal{Q}, |Q| = 1$. On each iteration we sample one terminal\nhistory and only update each information set along that history. The sampling probabilities, $q_j$, must\nspecify a distribution over terminal histories. We will specify this distribution using a sampling\nprofile, $\\sigma'$, so that $q(z) = \\pi^{\\sigma'}(z)$. Note that any choice of sampling policy will induce a particular\ndistribution over the block probabilities $q(z)$. As long as $\\sigma_i'(a|I) > \\epsilon$, then there exists a $\\delta > 0$ such\nthat $q(z) > \\delta$, thus ensuring Equation 6 is well-defined.\nThe algorithm works by sampling z using policy $\\sigma'$, storing $\\pi^{\\sigma'}(z)$. The single history is then\ntraversed forward (to compute each player\u2019s probability of playing to reach each prefix of the history,\n$\\pi_i^\\sigma(h)$) and backward (to compute each player\u2019s probability of playing the remaining actions of the\nhistory, $\\pi_i^\\sigma(h, z)$). During the backward traversal, the sampled counterfactual regrets at each visited\ninformation set are computed (and added to the total regret).\n\n$$\\tilde r(I, a) = \\begin{cases} w_I \\cdot \\left(1 - \\sigma(a|z[I])\\right) & \\text{if } (z[I]a) \\sqsubseteq z \\\\\\\\ -w_I \\cdot \\sigma(a|z[I]) & \\text{otherwise} \\end{cases}, \\quad \\text{where } w_I = \\frac{u_i(z) \\, \\pi_{-i}^\\sigma(z) \\, \\pi_i^\\sigma(z[I]a, z)}{\\pi^{\\sigma'}(z)} \\tag{10}$$\n\nOne advantage of outcome-sampling MCCFR is that if our terminal history is sampled according to\nthe opponent\u2019s policy, so $\\sigma_{-i}' = \\sigma_{-i}$, then the update no longer requires explicit knowledge of $\\sigma_{-i}$ as\nit cancels with the $\\sigma_{-i}'$. So, $w_I$ becomes $u_i(z) \\, \\pi_i^\\sigma(z[I], z) / \\pi_i^{\\sigma'}(z)$. Therefore, we can use outcome-sampling\nMCCFR for online regret minimization. We would have to choose our own actions so that\n$\\sigma_i' \\approx \\sigma_i^t$, but with some exploration to guarantee $q_j \\ge \\delta > 0$. By balancing the regret caused by\nexploration with the regret caused by a small $\\delta$ (see Section 4 for how MCCFR\u2019s bound depends\nupon $\\delta$), we can bound the average overall regret as long as the number of playings T is known in\nadvance. This effectively mimics the approach taken by Exp3 for regret minimization in normal-form\ngames [9]. An alternative form for Equation 10 is recommended for implementation. This and\nother implementation details can be found in the paper\u2019s supplemental material or the appendix of\nthe associated technical report [10].\n\nExternal-Sampling MCCFR. In external-sampling MCCFR we sample only the actions of the\nopponent and chance (those choices external to the player). We have a block $Q_\\tau \\in \\mathcal{Q}$ for each\npure strategy of the opponent and chance, i.e., for each deterministic mapping $\\tau$ from $I \\in \\mathcal{I}_c \\cup \\mathcal{I}_{N \\setminus \\{i\\}}$\nto $A(I)$. The block probabilities are assigned based on the distributions $f_c$ and $\\sigma_{-i}$, so\n$q_\\tau = \\prod_{I \\in \\mathcal{I}_c} f_c(\\tau(I)|I) \\prod_{I \\in \\mathcal{I}_{N \\setminus \\{i\\}}} \\sigma_{-i}(\\tau(I)|I)$. The block $Q_\\tau$ then contains all terminal histories\nz consistent with $\\tau$, that is if ha is a prefix of z with $h \\in I$ for some $I \\in \\mathcal{I}_{-i}$ then $\\tau(I) = a$. In\npractice, we will not actually sample $\\tau$ but rather sample the individual actions that make up $\\tau$ only\nas needed. The key insight is that these block probabilities result in $q(z) = \\pi_{-i}^\\sigma(z)$. The algorithm\niterates over $i \\in N$ and for each does a post-order depth-first traversal of the game tree, sampling\nactions at each history h where $P(h) \\neq i$ (storing these choices so the same actions are sampled at\nall h in the same information set). Due to perfect recall it can never visit more than one history from\nthe same information set during this traversal. 
For each such visited information set the sampled\ncounterfactual regrets are computed (and added to the total regrets).\n\n$$\\tilde r(I, a) = (1 - \\sigma(a|I)) \\sum_{z \\in Q \\cap Z_I} u_i(z) \\, \\pi_i^\\sigma(z[I]a, z) \\tag{11}$$\n\nNote that the summation can be easily computed during the traversal by always maintaining a\nweighted sum of the utilities of all terminal histories rooted at the current history.\n\n4 Theoretical Analysis\n\nWe now present regret bounds for members of the MCCFR family, starting with an improved bound\nfor vanilla CFR that depends more explicitly on the exact structure of the extensive game. Let $\\vec a_i$ be\na subsequence of a history such that it contains only player i\u2019s actions in that history, and let $\\vec A_i$ be\nthe set of all such player i action subsequences. Let $\\mathcal{I}_i(\\vec a_i)$ be the set of all information sets where\nplayer i\u2019s action sequence up to that information set is $\\vec a_i$. Define the M-value for player i of the\ngame to be $M_i = \\sum_{\\vec a_i \\in \\vec A_i} \\sqrt{|\\mathcal{I}_i(\\vec a_i)|}$. Note that $\\sqrt{|\\mathcal{I}_i|} \\le M_i \\le |\\mathcal{I}_i|$ with both sides of this bound\nbeing realized by some game. We can strengthen vanilla CFR\u2019s regret bound using this constant,\nwhich also appears in the bounds for the MCCFR variants.\n\nTheorem 3 When using vanilla CFR for player i, $R_i^T \\le \\Delta_{u,i} M_i \\sqrt{|A_i|} / \\sqrt{T}$.\n\nWe now turn our attention to the MCCFR family of algorithms, for which we can provide probabilistic\nregret bounds. 
We begin with the most exciting result: showing that external-sampling requires\nonly a constant factor more iterations than vanilla CFR (where the constant depends on the desired\nconfidence in the bound).\n\nTheorem 4 For any $p \\in (0, 1]$, when using external-sampling MCCFR, with probability at least\n$1 - p$, average overall regret is bounded by $R_i^T \\le \\left(1 + \\frac{\\sqrt{2}}{\\sqrt{p}}\\right) \\Delta_{u,i} M_i \\sqrt{|A_i|} / \\sqrt{T}$.\n\nAlthough requiring the same order of iterations, note that external-sampling need only traverse a\nfraction of the tree on each iteration. For balanced games where players make roughly equal numbers\nof decisions, the iteration cost of external-sampling is $O(\\sqrt{|H|})$, while vanilla CFR is $O(|H|)$,\nmeaning external-sampling MCCFR requires asymptotically less time to compute an approximate\nequilibrium than vanilla CFR (and consequently chance-sampling CFR, which is identical to vanilla\nCFR in the absence of chance nodes).\n\nTheorem 5 For any $p \\in (0, 1]$, when using outcome-sampling MCCFR where $\\forall z \\in Z$ either\n$\\pi_{-i}^\\sigma(z) = 0$ or $q(z) \\ge \\delta > 0$ at every timestep, with probability $1 - p$, average overall regret\nis bounded by $R_i^T \\le \\left(1 + \\frac{\\sqrt{2}}{\\sqrt{p}}\\right) \\left(\\frac{1}{\\delta}\\right) \\Delta_{u,i} M_i \\sqrt{|A_i|} / \\sqrt{T}$.\n\nThe proofs for the theorems in this section can be found in the paper\u2019s supplemental material and\nas an appendix of the associated technical report [10]. 
The supplemental material also presents a\nslightly complicated, but general result for any member of the MCCFR family, from which the two\ntheorems presented above are derived.\n\nGame  |H| (10^6)  |I| (10^3)  l   M1       M2       tvc   tos     tes\nOCP   22.4        2           5   45       32       28s   46\u00b5s    99\u00b5s\nGoof  98.3        3294        14  89884    89884    110s  150\u00b5s   150ms\nLTTT  70.4        16039       18  1333630  1236660  38s   62\u00b5s    70ms\nPAM   91.8        20          13  9541     2930     120s  85\u00b5s    28ms\n\nTable 1: Game properties. The value of |H| is in millions and |I| in thousands, and $l = \\max_{h \\in H} |h|$.\ntvc, tos, and tes are the average wall-clock time per iteration\u2074 for vanilla CFR, outcome-sampling\nMCCFR, and external-sampling MCCFR.\n\n5 Experimental Results\n\nWe evaluate the performance of MCCFR compared to vanilla CFR on four different games.\nGoofspiel [11] is a bidding card game where players have a hand of cards numbered 1 to N, and take\nturns secretly bidding on the top point-valued card in a point card stack using cards in their hands.\nOur version is less informational: players only find out the result of each bid and not which cards\nwere used to bid, and the player with the highest total points wins. We use N = 7 in our\nexperiments. One-Card Poker [12] is a generalization of Kuhn Poker [13]; we use a deck of size 500.\nPrincess and Monster [14, Research Problem 12.4.1] is a pursuit-evasion game on a graph, neither\nplayer ever knowing the location of the other. In our experiments we use random starting positions,\na 4-connected 3 by 3 grid graph, and a horizon of 13 steps. The payoff to the evader is the number of\nsteps uncaptured. Latent Tic-Tac-Toe is a twist on the classic game where moves are not disclosed\nuntil after the opponent\u2019s next move, and are lost if invalid at the time they are revealed. 
While all of\nthese games have imperfect information and are of roughly similar size, they are a diverse set of games,\nvarying both in the degree (the ratio of the number of information sets to the number of histories)\nand nature (whether due to chance or opponent actions) of imperfect information. The left columns\nof Table 1 show various constants, including the number of histories, information sets, game length,\nand M-values, for each of these domains.\nWe used outcome-sampling MCCFR, external-sampling MCCFR, and vanilla CFR to compute an\napproximate equilibrium in each of the four games. For outcome-sampling MCCFR we used an\nepsilon-greedy sampling profile $\\sigma'$. At each information set, we sample an action uniformly at\nrandom with probability $\\epsilon$ and according to the player\u2019s current strategy $\\sigma^t$ otherwise. Through experimentation\nwe found that $\\epsilon = 0.6$ worked well across all games; this is interesting because the regret bound\nsuggests $\\delta$ should be as large as possible. This implies that putting some bias on the most likely\noutcome to occur is helpful. With vanilla CFR we used an implementation trick called pruning\nto dramatically reduce the work done per iteration. When updating one player\u2019s regrets, if the other\nplayer has no probability of reaching the current history, the entire subtree at that history can be\npruned for the current iteration, with no effect on the resulting computation. We also used vanilla\nCFR without pruning to see the effects of pruning in our domains.\nFigure 1 shows the results of all four algorithms on all four domains, plotting approximation quality\nas a function of the number of nodes of the game tree the algorithm touched while computing.\nNodes touched is an implementation-independent measure of computation; however, the results are\nnearly identical if total wall-clock time is used instead. 
Since the algorithms take radically different\namounts of time per iteration, this comparison directly answers whether the sampling variants\u2019 lower cost\nper iteration outweighs the required increase in the number of iterations. Furthermore, for any\nfixed game (and degree of confidence that the bound holds), the algorithms\u2019 average overall regret\nis falling at the same rate, $O(1/\\sqrt{T})$, meaning that only their short-term rather than asymptotic\nperformance will differ.\n\n\u2074 As measured on an 8-core Intel Xeon 2.5 GHz machine running Linux x86_64 kernel 2.6.27.\n\nFigure 1: Convergence rates of Vanilla CFR, outcome-sampled MCCFR, and external-sampled MCCFR\nfor various games. The y axis in each graph represents the exploitability of the strategies for\nthe two players $\\epsilon_\\sigma$ (see Section 2.1).\n\nThe graphs show that the MCCFR variants often dramatically outperform vanilla CFR. For example,\nin Goofspiel, both MCCFR variants require only a few million nodes to reach $\\epsilon_\\sigma < 0.5$ where CFR\ntakes 2.5 billion nodes, three orders of magnitude more.\nIn fact, external-sampling, which has\nthe tightest theoretical computation-time bound, outperformed CFR, and by considerable margins\n(excepting LTTT), in all of the games. Note that pruning is key to vanilla CFR being at all practical\nin these games. For example, in Latent Tic-Tac-Toe the first iteration of CFR touches 142 million\nnodes, but later iterations touch as few as 5 million nodes. This is because pruning is not possible\nin the first iteration. We believe this is due to dominated actions in the game. After one or two\ntraversals, the players identify and eliminate dominated actions from their policies, allowing these\nsubtrees to be pruned. Finally, it is interesting to note that external-sampling was not uniformly the best\nchoice, with outcome-sampling performing better in Goofspiel. 
With outcome-sampling performing\nworse than vanilla CFR in LTTT, this raises the question of what specific game properties might\nfavor one algorithm over another and whether it might be possible to incorporate additional\ngame-specific constants into the bounds.\n\n6 Conclusion\n\nIn this paper we defined a family of sample-based CFR algorithms for computing approximate\nequilibria in extensive games, subsuming all previous CFR variants. We also introduced two sampling\nschemes: outcome-sampling, which samples only a single history for each iteration, and\nexternal-sampling, which samples a deterministic strategy for the opponent and chance. In addition to\npresenting a tighter bound for vanilla CFR, we presented regret bounds for both sampling variants,\nwhich showed that external sampling with high probability gives an asymptotic computational time\nimprovement over vanilla CFR. We then showed empirically in very different domains that the\nreduction in iteration time outweighs the increase in required iterations, leading to faster convergence.\nThere are three interesting directions for future work. First, we would like to examine how the\nproperties of the game affect the algorithms\u2019 convergence. Such an analysis could offer further\nalgorithmic or theoretical improvements, as well as practical suggestions, such as how to choose\na sampling policy in outcome-sampled MCCFR. Second, using outcome-sampled MCCFR as a\ngeneral online regret minimizing technique in extensive games (when the opponents\u2019 strategy is not\nknown or controlled) appears promising. It would be interesting to compare the approach, in terms\nof bounds, computation, and practical convergence, to Gordon\u2019s Lagrangian hedging [5]. 
Lastly,\nit seems like this work could be naturally extended to cases where we don\u2019t assume perfect recall.\nImperfect recall could be used as a mechanism for abstraction over actions, where information sets\nare grouped by important partial sequences rather than their full sequences.\n\n[Figure 1 panels: Goofspiel, Latent Tic-Tac-Toe, One-Card Poker, Princess and Monster. x axis: Nodes Touched. Curves: CFR, CFR with pruning, MCCFR-outcome, MCCFR-external.]\n\nReferences\n[1] Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization\nin games with incomplete information. In Advances in Neural Information Processing\nSystems 20 (NIPS), 2008.\n\n[2] Andrew Gilpin, Samid Hoda, Javier Pe\u00f1a, and Tuomas Sandholm. Gradient-based algorithms\nfor finding Nash equilibria in extensive form games. In 3rd International Workshop on Internet\nand Network Economics (WINE\u201907), 2007.\n\n[3] D. Koller, N. Megiddo, and B. von Stengel. Fast algorithms for finding randomized strategies\nin game trees. In Proceedings of the 26th ACM Symposium on Theory of Computing (STOC\n\u201994), pages 750\u2013759, 1994.\n\n[4] Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization\nin games with incomplete information. Technical Report TR07-14, University of\nAlberta, 2007. http://www.cs.ualberta.ca/research/techreports/2007/TR07-14.php.\n\n[5] Geoffrey J. Gordon. No-regret algorithms for online convex programs. In Neural Information\nProcessing Systems 19, 2007.\n\n[6] Martin J. Osborne and Ariel Rubinstein. 
A Course in Game Theory. MIT Press, 1994.\n\n[7] Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium.\nEconometrica, 68(5):1127\u20131150, September 2000.\n\n[8] D. Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics,\n6:1\u20138, 1956.\n\n[9] Peter Auer, Nicol\u00f2 Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. Gambling in a rigged\ncasino: The adversarial multi-arm bandit problem. In 36th Annual Symposium on Foundations\nof Computer Science, pages 322\u2013331, 1995.\n\n[10] Marc Lanctot, Kevin Waugh, Martin Zinkevich, and Michael Bowling. Monte Carlo sampling\nfor regret minimization in extensive games. Technical Report TR09-15, University of\nAlberta, 2009. http://www.cs.ualberta.ca/research/techreports/2009/TR09-15.php.\n\n[11] S. M. Ross. Goofspiel \u2014 the game of pure strategy. Journal of Applied Probability, 8(3):621\u2013625,\n1971.\n\n[12] Geoffrey J. Gordon. No-regret algorithms for structured prediction problems. Technical Report\nCMU-CALD-05-112, Carnegie Mellon University, 2005.\n\n[13] H. W. Kuhn. Simplified two-person poker. Contributions to the Theory of Games, 1:97\u2013103,\n1950.\n\n[14] Rufus Isaacs. Differential Games: A Mathematical Theory with Applications to Warfare and\nPursuit, Control and Optimization. John Wiley & Sons, 1965.\n", "award": [], "sourceid": 363, "authors": [{"given_name": "Marc", "family_name": "Lanctot", "institution": null}, {"given_name": "Kevin", "family_name": "Waugh", "institution": null}, {"given_name": "Martin", "family_name": "Zinkevich", "institution": null}, {"given_name": "Michael", "family_name": "Bowling", "institution": null}]}