{"title": "Regret Minimization in Games with Incomplete Information", "book": "Advances in Neural Information Processing Systems", "page_first": 1729, "page_last": 1736, "abstract": null, "full_text": "Regret Minimization in Games with Incomplete\n\nInformation\n\nMartin Zinkevich\n\nmaz@cs.ualberta.ca\n\nMichael Johanson\n\njohanson@cs.ualberta.ca\n\nMichael Bowling\n\nComputing Science Department\n\nUniversity of Alberta\n\nEdmonton, AB Canada T6G2E8\nbowling@cs.ualberta.ca\n\nCarmelo Piccione\n\nComputing Science Department\n\nUniversity of Alberta\n\nEdmonton, AB Canada T6G2E8\n\ncarm@cs.ualberta.ca\n\nAbstract\n\nExtensive games are a powerful model of multiagent decision-making scenarios\nwith incomplete information. Finding a Nash equilibrium for very large instances\nof these games has received a great deal of recent attention. In this paper, we\ndescribe a new technique for solving large games based on regret minimization.\nIn particular, we introduce the notion of counterfactual regret, which exploits the\ndegree of incomplete information in an extensive game. We show how minimizing\ncounterfactual regret minimizes overall regret, and therefore in self-play can be\nused to compute a Nash equilibrium. We demonstrate this technique in the domain\nof poker, showing we can solve abstractions of limit Texas Hold\u2019em with as many\nas 1012 states, two orders of magnitude larger than previous methods.\n\n1 Introduction\n\nExtensive games are a natural model for sequential decision-making in the presence of other\ndecision-makers, particularly in situations of imperfect information, where the decision-makers have\ndiffering information about the state of the game. As with other models (e.g., MDPs and POMDPs),\nits usefulness depends on the ability of solution techniques to scale well in the size of the model. 
Solution techniques for very large extensive games have received considerable attention recently, with poker becoming a common measuring stick for performance. Poker games can be modeled very naturally as an extensive game, with even small variants, such as two-player, limit Texas Hold\u2019em, being impractically large with just under 10^18 game states.\n\nThe state of the art in solving extensive games has traditionally made use of linear programming using a realization plan representation [1]. The representation is linear in the number of game states, rather than exponential, but considerable additional technology is still needed to handle games the size of poker. Abstraction, both hand-chosen [2] and automated [3], is commonly employed to reduce the game from 10^18 to a tractable number of game states (e.g., 10^7), while still producing strong poker programs. In addition, dividing the game into multiple subgames each solved independently or in real-time has also been explored [2, 4]. Solving larger abstractions yields better approximate Nash equilibria in the original game, making techniques for solving larger games the focus of research in this area. Recent iterative techniques have been proposed as an alternative to the traditional linear programming methods. These techniques have been shown capable of finding approximate solutions to abstractions with as many as 10^10 game states [5, 6, 7], resulting in the first significant improvement in poker programs in the past four years.\n\nIn this paper we describe a new technique for finding approximate solutions to large extensive games. The technique is based on regret minimization, using a new concept called counterfactual regret. We show that minimizing counterfactual regret minimizes overall regret, and therefore can be used to compute a Nash equilibrium. We then present an algorithm for minimizing counterfactual regret in poker. 
We use the algorithm to solve poker abstractions with as many as 10^12 game states, two orders of magnitude larger than previous methods. We also show that this translates directly into an improvement in the strength of the resulting poker playing programs. We begin with a formal description of extensive games followed by an overview of regret minimization and its connections to Nash equilibria.\n\n2 Extensive Games, Nash Equilibria, and Regret\n\nExtensive games provide a general yet compact model of multiagent interaction, which explicitly represents the often sequential nature of these interactions. Before presenting the formal definition, we first give some intuitions. The core of an extensive game is a game tree just as in perfect information games (e.g., Chess or Go). Each non-terminal game state has an associated player choosing actions and every terminal state has associated payoffs for each of the players. The key difference is the additional constraint of information sets, which are sets of game states that the controlling player cannot distinguish and so must choose actions for all such states with the same distribution. In poker, for example, the first player to act does not know which cards the other players were dealt, and so all game states immediately following the deal where the first player holds the same cards would be in the same information set. We now describe the formal model as well as notation that will be useful later.\n\nDefinition 1 [8, p. 200] A finite extensive game with imperfect information has the following components:\n\n\u2022 A finite set N of players.\n\n\u2022 A finite set H of sequences, the possible histories of actions, such that the empty sequence is in H and every prefix of a sequence in H is also in H. Z \u2286 H are the terminal histories (those which are not a prefix of any other sequences). 
A(h) = {a : (h, a) \u2208 H} are the actions available after a nonterminal history h \u2208 H.\n\n\u2022 A function P that assigns to each nonterminal history (each member of H\\Z) a member of N \u222a {c}. P is the player function. P(h) is the player who takes an action after the history h. If P(h) = c then chance determines the action taken after history h.\n\n\u2022 A function fc that associates with every history h for which P(h) = c a probability measure fc(\u00b7|h) on A(h) (fc(a|h) is the probability that a occurs given h), where each such probability measure is independent of every other such measure.\n\n\u2022 For each player i \u2208 N a partition I_i of {h \u2208 H : P(h) = i} with the property that A(h) = A(h') whenever h and h' are in the same member of the partition. For I \u2208 I_i we denote by A(I) the set A(h) and by P(I) the player P(h) for any h \u2208 I. I_i is the information partition of player i; a set I \u2208 I_i is an information set of player i.\n\n\u2022 For each player i \u2208 N a utility function ui from the terminal states Z to the reals R. If N = {1, 2} and u1 = \u2212u2, it is a zero-sum extensive game. Define \u2206_{u,i} = max_z ui(z) \u2212 min_z ui(z) to be the range of utilities to player i.\n\nNote that the partitions of information as described can result in some odd and unrealistic situations where a player is forced to forget her own past decisions. If all players can recall their previous actions and the corresponding information sets, the game is said to be one of perfect recall. This work will focus on finite, zero-sum extensive games with perfect recall.\n\n2.1 Strategies\n\nA strategy of player i, \u03c3i, in an extensive game is a function that assigns a distribution over A(I) to each I \u2208 I_i, and \u03a3i is the set of strategies for player i. A strategy profile \u03c3 consists of a strategy for each player, \u03c31, \u03c32, . . 
., with \u03c3_{\u2212i} referring to all the strategies in \u03c3 except \u03c3i.\n\nLet \u03c0^\u03c3(h) be the probability of history h occurring if players choose actions according to \u03c3. We can decompose \u03c0^\u03c3(h) = \u220f_{i \u2208 N\u222a{c}} \u03c0^\u03c3_i(h) into each player\u2019s contribution to this probability. Hence, \u03c0^\u03c3_i(h) is the probability that if player i plays according to \u03c3 then for all histories h' that are a proper prefix of h with P(h') = i, player i takes the corresponding action in h. Let \u03c0^\u03c3_{\u2212i}(h) be the product of all players\u2019 contributions (including chance) except player i. For I \u2286 H, define \u03c0^\u03c3(I) = \u2211_{h \u2208 I} \u03c0^\u03c3(h) as the probability of reaching a particular information set given \u03c3, with \u03c0^\u03c3_i(I) and \u03c0^\u03c3_{\u2212i}(I) defined similarly.\n\nThe overall value to player i of a strategy profile is then the expected payoff of the resulting terminal node, ui(\u03c3) = \u2211_{h \u2208 Z} ui(h) \u03c0^\u03c3(h).\n\n2.2 Nash Equilibrium\n\nThe traditional solution concept of a two-player extensive game is that of a Nash equilibrium. A Nash equilibrium is a strategy profile \u03c3 where\n\nu1(\u03c3) \u2265 max_{\u03c3'1 \u2208 \u03a31} u1(\u03c3'1, \u03c32)    u2(\u03c3) \u2265 max_{\u03c3'2 \u2208 \u03a32} u2(\u03c31, \u03c3'2).    (1)\n\nAn approximation of a Nash equilibrium or \u03b5-Nash equilibrium is a strategy profile \u03c3 where\n\nu1(\u03c3) + \u03b5 \u2265 max_{\u03c3'1 \u2208 \u03a31} u1(\u03c3'1, \u03c32)    u2(\u03c3) + \u03b5 \u2265 max_{\u03c3'2 \u2208 \u03a32} u2(\u03c31, \u03c3'2).    (2)\n\n2.3 Regret Minimization\n\nRegret is an online learning concept that has triggered a family of powerful learning algorithms. To define this concept, first consider repeatedly playing an extensive game. 
Let \u03c3^t_i be the strategy used by player i on round t. The average overall regret of player i at time T is:\n\nR^T_i = (1/T) max_{\u03c3*_i \u2208 \u03a3i} \u2211_{t=1}^{T} ( ui(\u03c3*_i, \u03c3^t_{\u2212i}) \u2212 ui(\u03c3^t) )    (3)\n\nMoreover, define \u00af\u03c3^T_i to be the average strategy for player i from time 1 to T. In particular, for each information set I \u2208 I_i, for each a \u2208 A(I), define:\n\n\u00af\u03c3^T_i(I)(a) = ( \u2211_{t=1}^{T} \u03c0^{\u03c3^t}_i(I) \u03c3^t(I)(a) ) / ( \u2211_{t=1}^{T} \u03c0^{\u03c3^t}_i(I) ).    (4)\n\nThere is a well-known connection between regret and the Nash equilibrium solution concept.\n\nTheorem 2 In a zero-sum game at time T, if both players\u2019 average overall regret is less than \u03b5, then \u00af\u03c3^T is a 2\u03b5 equilibrium.\n\nAn algorithm for selecting \u03c3^t_i for player i is regret minimizing if player i\u2019s average overall regret (regardless of the sequence \u03c3^t_{\u2212i}) goes to zero as t goes to infinity. As a result, regret minimizing algorithms in self-play can be used as a technique for computing an approximate Nash equilibrium. Moreover, an algorithm\u2019s bounds on the average overall regret bound the rate of convergence of the approximation.\n\nTraditionally, regret minimization has focused on bandit problems more akin to normal-form games. Although it is conceptually possible to convert any finite extensive game to an equivalent normal-form game, the exponential increase in the size of the representation makes the use of regret algorithms on the resulting game impractical. Recently, Gordon has introduced the Lagrangian Hedging (LH) family of algorithms, which can be used to minimize regret in extensive games by working with the realization plan representation [5]. We also propose a regret minimization procedure that exploits the compactness of the extensive game. 
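The average-strategy bookkeeping behind Equation 4 can be sketched in a few lines (an illustrative sketch, not the authors' implementation; `cum` accumulates each action's reach-weighted probability \u03c0^{\u03c3^t}_i(I)\u00b7\u03c3^t(I)(a) at one information set, and normalizing it recovers the average strategy):

```python
def update_average(cum, reach_i, sigma_I):
    """Accumulate reach-weighted action probabilities at one information set."""
    for a, p in sigma_I.items():
        cum[a] = cum.get(a, 0.0) + reach_i * p

def average_strategy(cum):
    """Normalize the accumulated weights into the average strategy (Eq. 4)."""
    total = sum(cum.values())
    if total <= 0:
        return {a: 1.0 / len(cum) for a in cum}  # never reached: uniform
    return {a: w / total for a, w in cum.items()}
```

For example, averaging one round that always bets with one that always checks (equal reach probability) yields the 50/50 average strategy, which is what Theorem 2 evaluates.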
However, our technique doesn\u2019t require the costly quadratic programming optimization needed with LH, allowing it to scale more easily, while achieving even tighter regret bounds.\n\n3 Counterfactual Regret\n\nThe fundamental idea of our approach is to decompose overall regret into a set of additive regret terms, which can be minimized independently. In particular, we introduce a new regret concept for extensive games called counterfactual regret, which is defined on an individual information set. We show that overall regret is bounded by the sum of counterfactual regret, and also show how counterfactual regret can be minimized at each information set independently.\n\nWe begin by considering one particular information set I \u2208 I_i and player i\u2019s choices made in that information set. Define ui(\u03c3, h) to be the expected utility given that the history h is reached and then all players play using strategy \u03c3. Define counterfactual utility ui(\u03c3, I) to be the expected utility given that information set I is reached and all players play using strategy \u03c3 except that player i plays to reach I; formally, if \u03c0^\u03c3(h, h') is the probability of going from history h to history h', then:\n\nui(\u03c3, I) = ( \u2211_{h \u2208 I, h' \u2208 Z} \u03c0^\u03c3_{\u2212i}(h) \u03c0^\u03c3(h, h') ui(h') ) / \u03c0^\u03c3_{\u2212i}(I)    (5)\n\nFinally, for all a \u2208 A(I), define \u03c3|_{I\u2192a} to be a strategy profile identical to \u03c3 except that player i always chooses action a when in information set I. The immediate counterfactual regret is:\n\nR^T_{i,imm}(I) = (1/T) max_{a \u2208 A(I)} \u2211_{t=1}^{T} \u03c0^{\u03c3^t}_{\u2212i}(I) ( ui(\u03c3^t|_{I\u2192a}, I) \u2212 ui(\u03c3^t, I) )    (6)\n\nIntuitively, this is the player\u2019s regret in its decisions at information set I in terms of counterfactual utility, with an additional weighting term for the counterfactual probability that I would be reached on that round if the player had tried to do so. As we will often be most concerned about regret when it is positive, let R^{T,+}_{i,imm}(I) = max(R^T_{i,imm}(I), 0) be the positive portion of immediate counterfactual regret. We can now state our first key result.\n\nTheorem 3 R^T_i \u2264 \u2211_{I \u2208 I_i} R^{T,+}_{i,imm}(I)\n\nThe proof is in the full version. Since minimizing immediate counterfactual regret minimizes the overall regret, we can find an approximate Nash equilibrium so long as we can minimize the immediate counterfactual regret.\n\nThe key feature of immediate counterfactual regret is that it can be minimized by controlling only \u03c3i(I). To this end, we can use Blackwell\u2019s algorithm for approachability to minimize this regret independently on each information set. In particular, we maintain for all I \u2208 I_i, for all a \u2208 A(I):\n\nR^T_i(I, a) = (1/T) \u2211_{t=1}^{T} \u03c0^{\u03c3^t}_{\u2212i}(I) ( ui(\u03c3^t|_{I\u2192a}, I) \u2212 ui(\u03c3^t, I) )    (7)\n\nDefine R^{T,+}_i(I, a) = max(R^T_i(I, a), 0); then the strategy for time T + 1 is:\n\n\u03c3^{T+1}_i(I)(a) = R^{T,+}_i(I, a) / \u2211_{a' \u2208 A(I)} R^{T,+}_i(I, a')  if \u2211_{a' \u2208 A(I)} R^{T,+}_i(I, a') > 0;  1/|A(I)|  otherwise.    (8)\n\nIn other words, actions are selected in proportion to the amount of positive counterfactual regret for not playing that action. 
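The regret-matching update in Equation 8 is short enough to sketch directly (an illustrative sketch, not the authors' implementation; `cum_regret` maps each action at one information set to its cumulative counterfactual regret R^T_i(I, a)):

```python
def regret_matching(cum_regret):
    """Next strategy at one information set, per Equation 8: play each
    action in proportion to the positive part of its cumulative regret."""
    positive = {a: max(r, 0.0) for a, r in cum_regret.items()}
    total = sum(positive.values())
    if total > 0:
        return {a: r / total for a, r in positive.items()}
    # No action has positive regret: fall back to the uniform distribution.
    n = len(cum_regret)
    return {a: 1.0 / n for a in cum_regret}
```

For instance, `regret_matching({"fold": -10.0, "call": 30.0, "raise": 10.0})` gives `{"fold": 0.0, "call": 0.75, "raise": 0.25}`: the negative-regret action is never played, and the rest split in a 3:1 ratio.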
If no actions have any positive counterfactual regret, then the action is selected randomly. This leads us to our second key result.\n\nTheorem 4 If player i selects actions according to Equation 8 then R^T_{i,imm}(I) \u2264 \u2206_{u,i} \u221a|A_i| / \u221aT and consequently R^T_i \u2264 \u2206_{u,i} |I_i| \u221a|A_i| / \u221aT, where |A_i| = max_{h:P(h)=i} |A(h)|.\n\nThe proof is in the full version. This result establishes that the strategy in Equation 8 can be used in self-play to compute a Nash equilibrium. In addition, the bound on the average overall regret is linear in the number of information sets. These are similar bounds to what\u2019s achievable by Gordon\u2019s Lagrangian Hedging algorithms. Meanwhile, minimizing counterfactual regret does not require a costly quadratic program projection on each iteration. In the next section we demonstrate our technique in the domain of poker.\n\n4 Application To Poker\n\nWe now describe how we use counterfactual regret minimization to compute a near equilibrium solution in the domain of poker. The poker variant we focus on is heads-up limit Texas Hold\u2019em, as it is used in the AAAI Computer Poker Competition [9]. The game consists of two players (zero-sum), four rounds of cards being dealt, and four rounds of betting, and has just under 10^18 game states [2]. As with all previous work on this domain, we will first abstract the game and find an equilibrium of the abstracted game. In the terminology of extensive games, we will merge information sets; in the terminology of poker, we will bucket card sequences. The quality of the resulting near equilibrium solution depends on the coarseness of the abstraction. In general, the less abstraction used, the higher the quality of the resulting strategy. 
Hence, the ability to solve a larger\ngame means less abstraction is required, translating into a stronger poker playing program.\n\n4.1 Abstraction\n\nThe goal of abstraction is to reduce the number of information sets for each player to a tractable\nsize such that the abstract game can be solved. Early poker abstractions [2, 4] involved limiting\nthe possible sequences of bets, e.g., only allowing three bets per round, or replacing all \ufb01rst-round\ndecisions with a \ufb01xed policy. More recently, abstractions involving full four round games with the\nfull four bets per round have proven to be a signi\ufb01cant improvement [7, 6]. We also will keep the\nfull game\u2019s betting structure and focus abstraction on the dealt cards.\nOur abstraction groups together observed card sequences based on a metric called hand strength\nsquared. Hand strength is the expected probability of winning1 given only the cards a player has\nseen. This was used a great deal in previous work on abstraction [2, 4]. Hand strength squared\nis the expected square of the hand strength after the last card is revealed, given only the cards a\nplayer has seen. Intuitively, hand strength squared is similar to hand strength but gives a bonus to\ncard sequences whose eventual hand strength has higher variance. Higher variance is preferred as it\nmeans the player eventually will be more certain about their ultimate chances of winning prior to a\nshowdown. More importantly, we will show in Section 5 that this metric for abstraction results in\nstronger poker strategies.\nThe \ufb01nal abstraction is generated by partitioning card sequences based on the hand strength squared\nmetric. First, all round-one card sequences (i.e., all private card holdings) are partitioned into ten\nequally sized buckets based upon the metric. Then, all round-two card sequences that shared a\nround-one bucket are partitioned into ten equally sized buckets based on the metric now applied at\nround two. 
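The nested bucketing procedure can be sketched as follows (a simplified illustration, not the authors' code; `metric` stands in for the hand-strength-squared value of a card sequence, and decile boundaries and ties are handled naively):

```python
from collections import defaultdict

def bucket_round(sequences, metric, n_buckets=10):
    """Split sequences into n_buckets equally sized buckets (1..n) by metric rank."""
    ranked = sorted(sequences, key=metric)
    size = len(ranked) / n_buckets
    return {s: min(int(i / size) + 1, n_buckets) for i, s in enumerate(ranked)}

def nested_buckets(prev_assignment, sequences, metric, n_buckets=10):
    """Re-partition only among sequences that shared a previous-round bucket,
    so each sequence accumulates one bucket number per round."""
    groups = defaultdict(list)
    for s in sequences:
        groups[prev_assignment[s]].append(s)
    out = {}
    for prev_b, group in groups.items():
        for s, b in bucket_round(group, metric, n_buckets).items():
            out[s] = (prev_b, b)
    return out
```

Applying `nested_buckets` once per later round yields exactly the bucket-sequence labels described in the text: a pair after round two, a triple after round three, and so on.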
Thus, a partition of card sequences in round two is a pair of numbers: its bucket in the previous round and its bucket in the current round given its bucket in the previous round. This is repeated after each round, continuing to partition card sequences that agreed on the previous rounds\u2019 buckets into ten equally sized buckets based on the metric applied in that round. Thus, card sequences are partitioned into bucket sequences: a bucket from {1, . . . , 10} for each round. The resulting abstract game has approximately 1.65 \u00d7 10^12 game states, and 5.73 \u00d7 10^7 information sets. In the full game of poker, there are approximately 9.17 \u00d7 10^17 game states and 3.19 \u00d7 10^14 information sets. So although this represents a significant abstraction of the original game, it is two orders of magnitude larger than previously solved abstractions.\n\n4.2 Minimizing Counterfactual Regret\n\nNow that we have specified an abstraction, we can use counterfactual regret minimization to compute an approximate equilibrium for this game. The basic procedure involves having two players repeatedly play the game using the counterfactual regret minimizing strategy from Equation 8. After T repetitions of the game, or simply iterations, we return (\u00af\u03c3^T_1, \u00af\u03c3^T_2) as the resulting approximate equilibrium. Repeated play requires storing R^t_i(I, a) for every information set I and action a, and updating it after each iteration.^2\n\n^1 Where a tie is considered \u201chalf a win\u201d.\n^2 The bound from Theorem 4 for the basic procedure can actually be made significantly tighter in the specific case of poker. 
In the full version, we show that the bound for poker is actually independent of the size of the card abstraction.\n\nFor our experiments, we actually use a variation of this basic procedure, which exploits the fact that our abstraction has a small number of information sets relative to the number of game states. Although each information set is crucial, many consist of a hundred or more individual histories. This fact suggests it may be possible to get a good idea of the correct behavior for an information set by only sampling a fraction of the associated game states. In particular, for each iteration, we sample deterministic actions for the chance player. Thus, \u03c3^t_c is set to be a deterministic strategy, but chosen according to the distribution specified by fc. For our abstraction this amounts to choosing a joint bucket sequence for the two players. Once the joint bucket sequence is specified, there are only 18,496 reachable states and 6,378 reachable information sets. Since \u03c0^{\u03c3^t}_{\u2212i}(I) is zero for all other information sets, no updates need to be made for these information sets.^3\n\nThis sampling variant allows approximately 750 iterations of the algorithm to be completed in a single second on a single core of a 2.4GHz Dual Core AMD Opteron 280 processor. In addition, a straightforward parallelization is possible and was used when noted in the experiments. Since betting is public information, the flop-onward information sets for a particular preflop betting sequence can be computed independently. With four processors we were able to complete approximately 1700 iterations in one second. The complete algorithmic details with pseudocode can be found in the full version.\n\n5 Experimental Results\n\nBefore discussing the results, it is useful to consider how one evaluates the strength of a near equilibrium poker strategy. 
One natural method is to measure the strategy\u2019s exploitability, or its performance against its worst-case opponent. In a symmetric, zero-sum game like heads-up poker^4, a perfect equilibrium has zero exploitability, while an \u03b5-Nash equilibrium has exploitability \u03b5. A convenient measure of exploitability is millibets-per-hand (mb/h), where a millibet is one thousandth of a small-bet, the fixed magnitude of bets used in the first two rounds of betting. To provide some intuition for these numbers, a player that always folds will lose 750 mb/h, while a player that is 10 mb/h stronger than another would require over one million hands to be 95% certain to have won overall.\n\nIn general, it is intractable to compute a strategy\u2019s exploitability within the full game. For strategies in a reasonably sized abstraction it is possible to compute their exploitability within their own abstract game. Such a measure is a useful evaluation of the equilibrium computation technique that was used to generate the strategy. However, it does not imply the technique cannot be exploited by a strategy outside of its abstraction. It is therefore common to compare the performance of the strategy in the full game against a battery of known strong poker playing programs. Although positive expected value against an opponent is not transitive, winning against a large and diverse range of opponents suggests a strong program.\n\nWe used the sampled counterfactual regret minimization procedure to find an approximate equilibrium for our abstract game as described in the previous section. The algorithm was run for 2 billion iterations (T = 2 \u00d7 10^9), or less than 14 days of computation when parallelized across four CPUs. The resulting strategy\u2019s exploitability within its own abstract game is 2.2 mb/h. After only 200 million iterations, or less than 2 days of computation, the strategy was already exploitable by less than 13 mb/h. 
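The hands-to-significance intuition for mb/h figures follows from a simple normal approximation (a back-of-the-envelope sketch; the assumed per-hand standard deviation of roughly 6 small-bets, i.e., 6,000 mb, is our illustrative number, not taken from the paper):

```python
from math import ceil

def hands_needed(edge_mb, stdev_mb=6000.0, z=1.645):
    """Hands needed before a true edge of edge_mb mb/h is a one-sided
    95%-significant winner under a normal approximation: we need
    edge > z * stdev / sqrt(n), i.e. n > (z * stdev / edge) ** 2."""
    return ceil((z * stdev_mb / edge_mb) ** 2)

# hands_needed(10) is on the order of a million hands for a 10 mb/h edge.
```

With these assumed numbers, a 10 mb/h edge needs on the order of a million hands to be detected, which is consistent with the magnitude quoted above.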
Notice that the algorithm visits only 18,496 game states per iteration. After 200 million iterations each game state has been visited less than 2.5 times on average, yet the algorithm has already computed a relatively accurate solution.\n\n5.1 Scaling the Abstraction\n\nIn addition to finding an approximate equilibrium for our large abstraction, we also found approximate equilibria for a number of smaller abstractions. These abstractions used fewer buckets per round to partition the card sequences. In addition to ten buckets, we also solved eight, six, and five bucket variants.\n\n^3 A regret analysis of this variant in poker is included in the full version. We show that the quadratic decrease in the cost per iteration only causes a linear increase in the required number of iterations. The experimental results in the next section coincide with this analysis.\n^4 A single hand of poker is not a symmetric game as the order of betting is strategically significant. However, a pair of hands where the betting order is reversed is symmetric.\n\nAbs  Size (\u00d710^9)  Iterations (\u00d710^6)  Time (h)  Exp (mb/h)\n5    6.45          100                 33        3.4\n6    27.7          200                 75        3.1\n8    276           750                 261       2.7\n10   1646          2000                326\u2020      2.2\n\u2020: parallel implementation with 4 CPUs\n\nFigure 1: (a) Number of game states, number of iterations, computation time, and exploitability (in its own abstract game) of the resulting strategy for different sized abstractions. (b) Convergence rates for three different sized abstractions. The x-axis shows the number of iterations divided by the number of information sets in the abstraction.\n\nAs these abstractions are smaller, they require fewer iterations to compute a similarly accurate equilibrium. For example, the program computed with the five bucket approximation (CFR5) is about 250 times smaller with just under 10^10 game states. 
After 100 million iterations,\nor 33 hours of computation without any parallelization, the \ufb01nal strategy is exploitable by 3.4 mb/h.\nThis is approximately the same size of game solved by recent state-of-the-art algorithms [6, 7] with\nmany days of computation.\nFigure 1b shows a graph of the convergence rates for the \ufb01ve, eight, and ten partition abstractions.\nThe y-axis is exploitability while the x-axis is the number of iterations normalized by the number\nof information sets in the particular abstraction being plotted. The rates of convergence almost\nexactly coincide showing that, in practice, the number of iterations needed is growing linearly with\nthe number of information sets. Due to the use of sampled bucket sequences, the time per iteration\nis nearly independent of the size of the abstraction. This suggests that, in practice, the overall\ncomputational complexity is only linear in the size of the chosen card abstraction.\n\n5.2 Performance in Full Texas Hold\u2019em\n\nWe have noted that the ability to solve larger games means less abstraction is necessary, resulting\nin an overall stronger poker playing program. We have played our four near equilibrium bots with\nvarious abstraction sizes against each other and two other known strong programs: PsOpti4 and\nS2298. PsOpti4 is a variant of the equilibrium strategy described in [2]. It was the stronger half\nof Hyperborean, the AAAI 2006 Computer Poker Competition\u2019s winning program. It is available\nunder the name SparBot in the entertainment program Poker Academy, published by BioTools. We\nhave calculated strategies that exploit it at 175 mb/h. S2298 is the equilibrium strategy described in\n[6]. We have calculated strategies that exploit it at 52.5 mb/h. In terms of the size of the abstract\ngame PsOpti4 is the smallest consisting of a small number of merged three round games. 
S2298 restricts the number of bets per round to 3 and uses a five bucket per round card abstraction based on hand strength, resulting in an abstraction slightly smaller than CFR5.\n\nTable 1 shows a cross table with the results of these matches. Strategies from larger abstractions consistently, and significantly, outperform their smaller counterparts. The larger abstractions also consistently exploit weaker bots by a larger margin (e.g., CFR10 wins 19 mb/h more from S2298 than CFR5 does).\n\n         PsOpti4  S2298  CFR5  CFR6  CFR8  CFR10  Average\nPsOpti4     0     -28    -36   -40   -52   -55    -35\nS2298      28       0    -17   -24   -30   -36    -13\nCFR5       36      17      0    -5   -13   -20      2\nCFR6       40      24      5     0    -9   -14      7\nCFR8       52      30     13     9     0    -6     16\nCFR10      55      36     20    14     6     0     22\nMax        55      36     20    14     6     0\n\nTable 1: Winnings in mb/h for the row player in full Texas Hold\u2019em. Matches with PsOpti4 used 10 duplicate matches of 10,000 hands each and are significant to 20 mb/h. Other matches used 10 duplicate matches of 500,000 hands each and are significant to 2 mb/h.\n\nFinally, we also played CFR8 against the four bots that competed in the bankroll portion of the 2006 AAAI Computer Poker Competition, which are available on the competition\u2019s benchmark server [9]. The results are shown in Table 2, along with S2298\u2019s previously published performance against the same bots [6]. The program not only beats all of the bots from the competition but does so by a larger margin than S2298.\n\n        Hyperborean  BluffBot  Monash  Teddy  Average\nS2298        61         113      695    474     336\nCFR8        106         170      746    517     385\n\nTable 2: Winnings in mb/h for the row player in full Texas Hold\u2019em.\n\n6 Conclusion\n\nWe introduced a new regret concept for extensive games called counterfactual regret. 
We showed that minimizing counterfactual regret minimizes overall regret and presented a general and poker-specific algorithm for efficiently minimizing counterfactual regret. We demonstrated the technique in the domain of poker, showing that the technique can compute an approximate equilibrium for abstractions with as many as 10^12 states, two orders of magnitude larger than previous methods. We also showed that the resulting poker playing program outperforms other strong programs, including all of the competitors from the bankroll portion of the 2006 AAAI Computer Poker Competition.\n\nReferences\n\n[1] D. Koller and N. Megiddo. The complexity of two-person zero-sum games in extensive form. Games and Economic Behavior, pages 528\u2013552, 1992.\n\n[2] D. Billings, N. Burch, A. Davidson, R. Holte, J. Schaeffer, T. Schauenberg, and D. Szafron. Approximating game-theoretic optimal strategies for full-scale poker. In International Joint Conference on Artificial Intelligence, pages 661\u2013668, 2003.\n\n[3] A. Gilpin and T. Sandholm. Finding equilibria in large sequential games of imperfect information. In ACM Conference on Electronic Commerce, 2006.\n\n[4] A. Gilpin and T. Sandholm. A competitive Texas Hold\u2019em poker player via automated abstraction and real-time equilibrium computation. In National Conference on Artificial Intelligence, 2006.\n\n[5] G. Gordon. No-regret algorithms for online convex programs. In Neural Information Processing Systems 19, 2007.\n\n[6] M. Zinkevich, M. Bowling, and N. Burch. A new algorithm for generating strong strategies in massive zero-sum games. In Proceedings of the Twenty-Seventh Conference on Artificial Intelligence (AAAI), 2007. To appear.\n\n[7] A. Gilpin, S. Hoda, J. Pena, and T. Sandholm. Gradient-based algorithms for finding Nash equilibria in extensive form games. In Proceedings of the Eighteenth International Conference on Game Theory, 2007.\n\n[8] M. Osborne and A. 
Rubinstein. A Course in Game Theory. The MIT Press, Cambridge, Massachusetts, 1994.\n\n[9] M. Zinkevich and M. Littman. The AAAI computer poker competition. Journal of the International Computer Games Association, 29, 2006. News item.\n", "award": [], "sourceid": 682, "authors": [{"given_name": "Martin", "family_name": "Zinkevich", "institution": null}, {"given_name": "Michael", "family_name": "Johanson", "institution": null}, {"given_name": "Michael", "family_name": "Bowling", "institution": null}, {"given_name": "Carmelo", "family_name": "Piccione", "institution": null}]}