{"title": "Safe and Nested Subgame Solving for Imperfect-Information Games", "book": "Advances in Neural Information Processing Systems", "page_first": 689, "page_last": 699, "abstract": "In imperfect-information games, the optimal strategy in a subgame may depend on the strategy in other, unreached subgames. Thus a subgame cannot be solved in isolation and must instead consider the strategy for the entire game as a whole, unlike perfect-information games. Nevertheless, it is possible to first approximate a solution for the whole game and then improve it in individual subgames. This is referred to as subgame solving. We introduce subgame-solving techniques that outperform prior methods both in theory and practice. We also show how to adapt them, and past subgame-solving techniques, to respond to opponent actions that are outside the original action abstraction; this significantly outperforms the prior state-of-the-art approach, action translation. Finally, we show that subgame solving can be repeated as the game progresses down the game tree, leading to far lower exploitability. These techniques were a key component of Libratus, the first AI to defeat top humans in heads-up no-limit Texas hold'em poker.", "full_text": "Safe and Nested Subgame Solving for\n\nImperfect-Information Games\n\nNoam Brown\n\nComputer Science Department\nCarnegie Mellon University\n\nPittsburgh, PA 15217\nnoamb@cs.cmu.edu\n\nTuomas Sandholm\n\nComputer Science Department\nCarnegie Mellon University\n\nPittsburgh, PA 15217\n\nsandholm@cs.cmu.edu\n\nAbstract\n\nIn imperfect-information games, the optimal strategy in a subgame may depend\non the strategy in other, unreached subgames. Thus a subgame cannot be solved\nin isolation and must instead consider the strategy for the entire game as a whole,\nunlike perfect-information games. Nevertheless, it is possible to \ufb01rst approximate\na solution for the whole game and then improve it in individual subgames. 
This is referred to as subgame solving. We introduce subgame-solving techniques that outperform prior methods both in theory and practice. We also show how to adapt them, and past subgame-solving techniques, to respond to opponent actions that are outside the original action abstraction; this significantly outperforms the prior state-of-the-art approach, action translation. Finally, we show that subgame solving can be repeated as the game progresses down the game tree, leading to far lower exploitability. These techniques were a key component of Libratus, the first AI to defeat top humans in heads-up no-limit Texas hold\u2019em poker.\n\n1 Introduction\nImperfect-information games model strategic settings that have hidden information. They have a myriad of applications including negotiation, auctions, cybersecurity, and physical security.\n\nIn perfect-information games, determining the optimal strategy at a decision point only requires knowledge of the game tree\u2019s current node and the remaining game tree beyond that node (the subgame rooted at that node). This fact has been leveraged by nearly every AI for perfect-information games, including AIs that defeated top humans in chess [7] and Go [29]. In checkers, the ability to decompose the game into smaller independent subgames was even used to solve the entire game [27].\n\nHowever, it is not possible to determine a subgame\u2019s optimal strategy in an imperfect-information game using only knowledge of that subgame, because the game tree\u2019s exact node is typically unknown. Instead, the optimal strategy may depend on the value an opponent could have received in some other, unreached subgame. Although this is counter-intuitive, we provide a demonstration in Section 2.\n\nRather than rely on subgame decomposition, past approaches for imperfect-information games typically solved the game as a whole upfront. 
For example, heads-up limit Texas hold\u2019em, a relatively simple form of poker with $10^{13}$ decision points, was essentially solved without decomposition [2]. However, this approach cannot extend to larger games, such as heads-up no-limit Texas hold\u2019em\u2014the primary benchmark in imperfect-information game solving\u2014which has $10^{161}$ decision points [16].\n\nThe standard approach to computing strategies in such large games is to first generate an abstraction of the game, which is a smaller version of the game that retains as much as possible the strategic characteristics of the original game [24, 26, 25]. For example, a continuous action space might be discretized. This abstract game is solved and its solution is used when playing the full game by mapping states in the full game to states in the abstract game. We refer to the solution of an abstraction (or more generally any approximate solution to a game) as a blueprint strategy.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fIn heavily abstracted games, a blueprint strategy may be far from the true solution. Subgame solving attempts to improve upon the blueprint strategy by solving in real time a more fine-grained abstraction for an encountered subgame, while fitting its solution within the overarching blueprint strategy.\n\n2 Coin Toss\nIn this section we provide intuition for why an imperfect-information subgame cannot be solved in isolation. We demonstrate this in a simple game we call Coin Toss, shown in Figure 1a, which will be used as a running example throughout the paper.\n\nCoin Toss is played between players P1 and P2. The figure shows rewards only for P1; P2 always receives the negation of P1\u2019s reward. A coin is flipped and lands either Heads or Tails with equal probability, but only P1 sees the outcome. 
P1 then chooses between actions \u201cSell\u201d and \u201cPlay.\u201d The\nSell action leads to a subgame whose details are not important, but the expected value (EV) of\nchoosing the Sell action will be important. (For simplicity, one can equivalently assume in this\nsection that Sell leads to an immediate terminal reward, where the value depends on whether the\ncoin landed Heads or Tails). If the coin lands Heads, it is considered lucky and P1 receives an EV of\n$0.50 for choosing Sell. On the other hand, if the coin lands Tails, it is considered unlucky and P1\nreceives an EV of \u2212$0.50 for action Sell. (That is, P1 must on average pay $0.50 to get rid of the\ncoin). If P1 instead chooses Play, then P2 may guess how the coin landed. If P2 guesses correctly,\nthen P1 receives a reward of \u2212$1. If P2 guesses incorrectly, then P1 receives $1. P2 may also forfeit,\nwhich should never be chosen but will be relevant in later sections. We wish to determine the optimal\nstrategy for P2 in the subgame S that occurs after P1 chooses Play, shown in Figure 1a.\n\nFigure 1: (a) The example game of Coin Toss. \u201cC\u201d represents a chance node. S is a Player 2 (P2) subgame.\nThe dotted line between the two P2 nodes means that P2 cannot distinguish between them. (b) The public game\ntree of Coin Toss. The two outcomes of the coin \ufb02ip are only observed by P1.\n\nWere P2 to always guess Heads, P1 would receive $0.50 for choosing Sell when the coin lands Heads,\nand $1 for Play when it lands Tails. This would result in an average of $0.75 for P1. Alternatively,\nwere P2 to always guess Tails, P1 would receive $1 for choosing Play when the coin lands Heads,\nand \u2212$0.50 for choosing Sell when it lands Tails. This would result in an average reward of $0.25 for\nP1. However, P2 would do even better by guessing Heads with 25% probability and Tails with 75%\nprobability. 
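The equilibrium reasoning above can be checked numerically. Below is a minimal sketch (not from the paper; the function names are ours) that computes P1's value for each option as a function of P2's guessing strategy:

```python
# P1's expected values in Coin Toss, as a function of the probability p_heads
# that P2 guesses Heads after P1 chooses Play. Function names are ours, not
# the paper's; payoffs follow Section 2 (Heads is the lucky outcome).

SELL_EV = {"Heads": 0.5, "Tails": -0.5}  # P1's EV for Sell, per coin outcome

def play_ev(coin, p_heads):
    """P1's EV for Play: -1 if P2 guesses the coin correctly, +1 otherwise."""
    return (1.0 - 2.0 * p_heads) if coin == "Heads" else (2.0 * p_heads - 1.0)

def p1_best_response_value(p_heads):
    """P1 sees the coin, so she takes max(Sell, Play) separately for each
    outcome; the coin itself lands 50/50."""
    return 0.5 * sum(max(SELL_EV[c], play_ev(c, p_heads))
                     for c in ("Heads", "Tails"))
```

With `p_heads = 1.0` this returns 0.75 and with `p_heads = 0.0` it returns 0.25, matching the two pure-guess cases above, while `p_heads = 0.25` equalizes Sell and Play for both outcomes and yields 0.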
In that case, P1 could only receive $0.50 (on average) by choosing Play when the coin lands Heads\u2014the same value received for choosing Sell. Similarly, P1 could only receive \u2212$0.50 by choosing Play when the coin lands Tails, which is the same value received for choosing Sell. This would yield an average reward of $0 for P1. It is easy to see that this is the best P2 can do, because P1 can average $0 by always choosing Sell. Therefore, choosing Heads with 25% probability and Tails with 75% probability is an optimal strategy for P2 in the \u201cPlay\u201d subgame.\n\nNow suppose the coin is considered lucky if it lands Tails and unlucky if it lands Heads. That is, the expected reward for selling the coin when it lands Heads is now \u2212$0.50 and when it lands Tails is now $0.50. It is easy to see that P2\u2019s optimal strategy for the \u201cPlay\u201d subgame is now to guess Heads with 75% probability and Tails with 25% probability. This shows that a player\u2019s optimal strategy in a subgame can depend on the strategies and outcomes in other parts of the game. Thus, one cannot solve a subgame using information about that subgame alone. This is the central challenge of imperfect-information games as opposed to perfect-information games.\n\n\f3 Notation and Background\nIn a two-player zero-sum extensive-form game there are two players, $\mathcal{P} = \{1, 2\}$. $H$ is the set of all possible nodes, represented as a sequence of actions. $A(h)$ is the actions available in a node and $P(h) \in \mathcal{P} \cup c$ is the player who acts at that node, where $c$ denotes chance. Chance plays an action $a \in A(h)$ with a fixed probability. If action $a \in A(h)$ leads from $h$ to $h'$, then we write $h \cdot a = h'$. If a sequence of actions leads from $h$ to $h'$, then we write $h \sqsubseteq h'$. The set of nodes $Z \subseteq H$ are terminal nodes. 
For each player $i \in \mathcal{P}$, there is a payoff function $u_i : Z \to \mathbb{R}$ where $u_1 = -u_2$.\n\nImperfect information is represented by information sets (infosets). Every node $h \in H$ belongs to exactly one infoset for each player. For any infoset $I_i$, nodes $h, h' \in I_i$ are indistinguishable to player $i$. Thus the same player must act at all the nodes in an infoset, and the same actions must be available. Let $P(I_i)$ and $A(I_i)$ be such that for all $h \in I_i$, $P(I_i) = P(h)$ and $A(I_i) = A(h)$.\n\nA strategy $\sigma_i(I_i)$ is a probability vector over $A(I_i)$ for infosets where $P(I_i) = i$. The probability of action $a$ is denoted by $\sigma_i(I_i, a)$. For all $h \in I_i$, $\sigma_i(h) = \sigma_i(I_i)$. A full-game strategy $\sigma_i \in \Sigma_i$ defines a strategy for each player $i$ infoset. A strategy profile $\sigma$ is a tuple of strategies, one for each player. The expected payoff for player $i$ if all players play the strategy profile $\langle \sigma_i, \sigma_{-i} \rangle$ is $u_i(\sigma_i, \sigma_{-i})$, where $\sigma_{-i}$ denotes the strategies in $\sigma$ of all players other than $i$.\n\nLet $\pi^\sigma(h) = \prod_{h' \cdot a \sqsubseteq h} \sigma_{P(h')}(h', a)$ denote the probability of reaching $h$ if all players play according to $\sigma$. $\pi_i^\sigma(h)$ is the contribution of player $i$ to this probability (that is, the probability of reaching $h$ if chance and all players other than $i$ always chose actions leading to $h$). $\pi_{-i}^\sigma(h)$ is the contribution of all players, and chance, other than $i$. $\pi^\sigma(h, h')$ is the probability of reaching $h'$ given that $h$ has been reached, and 0 if $h \not\sqsubseteq h'$. This paper focuses on perfect-recall games, where a player never forgets past information. Thus, for every $I_i$, $\forall h, h' \in I_i$, $\pi_i^\sigma(h) = \pi_i^\sigma(h')$. We define $\pi_i^\sigma(I_i) = \pi_i^\sigma(h)$ for $h \in I_i$. Also, $I'_i \sqsubseteq I_i$ if for some $h' \in I'_i$ and some $h \in I_i$, $h' \sqsubseteq h$. Similarly, $I'_i \cdot a \sqsubseteq I_i$ if $h' \cdot a \sqsubseteq h$.\n\nA Nash equilibrium [22] is a strategy profile $\sigma^*$ where no player can improve by shifting to a different strategy, so $\sigma^*$ satisfies $\forall i$, $u_i(\sigma_i^*, \sigma_{-i}^*) = \max_{\sigma'_i \in \Sigma_i} u_i(\sigma'_i, \sigma_{-i}^*)$. A best response $BR(\sigma_{-i})$ is a strategy for player $i$ that is optimal against $\sigma_{-i}$. Formally, $BR(\sigma_{-i})$ satisfies $u_i(BR(\sigma_{-i}), \sigma_{-i}) = \max_{\sigma'_i \in \Sigma_i} u_i(\sigma'_i, \sigma_{-i})$. In a two-player zero-sum game, the exploitability $\exp(\sigma_i)$ of a strategy $\sigma_i$ is how much worse $\sigma_i$ does against an opponent best response than a Nash equilibrium strategy would do. Formally, the exploitability of $\sigma_i$ is $u_i(\sigma^*) - u_i(\sigma_i, BR(\sigma_i))$, where $\sigma^*$ is a Nash equilibrium.\n\nThe expected value of a node $h$ when players play according to $\sigma$ is $v_i^\sigma(h) = \sum_{z \in Z} \pi^\sigma(h, z)\, u_i(z)$. An infoset's value is the weighted average of the values of the nodes in the infoset, where a node is weighed by the player's belief that she is in that node. Formally, $v_i^\sigma(I_i) = \frac{\sum_{h \in I_i} \pi_{-i}^\sigma(h)\, v_i^\sigma(h)}{\sum_{h \in I_i} \pi_{-i}^\sigma(h)}$ and $v_i^\sigma(I_i, a) = \frac{\sum_{h \in I_i} \pi_{-i}^\sigma(h)\, v_i^\sigma(h \cdot a)}{\sum_{h \in I_i} \pi_{-i}^\sigma(h)}$. A counterfactual best response [21] $CBR(\sigma_{-i})$ is a best response that also maximizes value in unreached infosets. Specifically, a counterfactual best response is a best response $\sigma_i$ with the additional condition that if $\sigma_i(I_i, a) > 0$ then $v_i^\sigma(I_i, a) = \max_{a'} v_i^\sigma(I_i, a')$. We further define the counterfactual best response value $CBV^{\sigma_{-i}}(I_i)$ as the value player $i$ expects to achieve by playing according to $CBR(\sigma_{-i})$, having already reached infoset $I_i$. Formally, $CBV^{\sigma_{-i}}(I_i) = v_i^{\langle CBR(\sigma_{-i}), \sigma_{-i} \rangle}(I_i)$ and $CBV^{\sigma_{-i}}(I_i, a) = v_i^{\langle CBR(\sigma_{-i}), \sigma_{-i} \rangle}(I_i, a)$.\n\nAn imperfect-information subgame, which we refer to simply as a subgame in this paper, can in most cases (but not all) be described as including all nodes which share prior public actions (that is, actions viewable to both players). In poker, for example, a subgame is uniquely defined by a sequence of bets and public board cards. Figure 1b shows the public game tree of Coin Toss. Formally, an imperfect-information subgame is a set of nodes $S \subseteq H$ such that for all $h \in S$, if $h \sqsubseteq h'$, then $h' \in S$, and for all $h \in S$ and all $i \in \mathcal{P}$, if $h' \in I_i(h)$ then $h' \in S$. Define $S_{top}$ as the set of earliest-reachable nodes in $S$. That is, $h \in S_{top}$ if $h \in S$ and $h' \notin S$ for any $h' \sqsubset h$.\n\n\f4 Prior Approaches to Subgame Solving\nThis section reviews prior techniques for subgame solving in imperfect-information games, which we build upon. Throughout this section, we refer to the Coin Toss game shown in Figure 1a.\n\nAs discussed in Section 1, a standard approach to dealing with large imperfect-information games is to solve an abstraction of the game. The abstract solution is a (probably suboptimal) strategy profile in the full game. We refer to this full-game strategy profile as the blueprint. 
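To make the infoset-value definitions above concrete: an infoset's value is a belief-weighted average of node values, and a counterfactual best response maximizes that value at every infoset, reached or not. A minimal sketch (our own names and data layout, not the paper's):

```python
# Belief-weighted infoset values, following the Section 3 definitions:
# v(I) = sum_h pi_{-i}(h) * v(h) / sum_h pi_{-i}(h), and a counterfactual
# best response picks argmax_a v(I, a) at every infoset, even unreached ones.

def infoset_value(reach_minus_i, node_values):
    """reach_minus_i[h]: opponent-and-chance reach probability of node h.
    node_values[h]: player i's expected value at h."""
    total = sum(reach_minus_i.values())
    return sum(reach_minus_i[h] * node_values[h] for h in reach_minus_i) / total

def cbr_action_value(reach_minus_i, action_node_values):
    """action_node_values[a][h]: value of taking action a at node h.
    Returns max_a v(I, a) and the maximizing action."""
    values = {a: infoset_value(reach_minus_i, vals)
              for a, vals in action_node_values.items()}
    best = max(values, key=values.get)
    return values[best], best

# Example: P2's infoset after Play in Coin Toss under the Section 4 blueprint.
# The coin is 50/50 and P1 plays Play 3/4 with Heads, 1/2 with Tails.
reach = {"Heads": 0.5 * 0.75, "Tails": 0.5 * 0.5}
guess = {"guess_heads": {"Heads": 1.0, "Tails": -1.0},  # P2: +1 if correct
         "guess_tails": {"Heads": -1.0, "Tails": 1.0}}
```

Against these beliefs, `cbr_action_value(reach, guess)` returns `guess_heads` with value 0.2, which is why the unsafe approach described next would make P2 always guess Heads.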
The goal of subgame solving is to improve upon the blueprint by changing the strategy only in a subgame.\n\nFigure 2: The blueprint strategy we refer to in the game of Coin Toss. The Sell action leads to a subgame that is not displayed. Probabilities are shown for all actions. The dotted line means the two P2 nodes share an infoset. The EV of each P1 action is also shown.\n\nAssume that a blueprint strategy profile $\sigma$ (shown in Figure 2) has already been computed for Coin Toss, in which P1 chooses Play $\frac{3}{4}$ of the time with Heads and $\frac{1}{2}$ of the time with Tails, and P2 chooses Heads $\frac{1}{2}$ of the time, Tails $\frac{1}{4}$ of the time, and Forfeit $\frac{1}{4}$ of the time after P1 chooses Play. The details of the blueprint strategy in the Sell subgame are not relevant in this section, but the EV for choosing the Sell action is relevant. We assume that if P1 chose the Sell action and played optimally thereafter, then she would receive an expected payoff of 0.5 if the coin is Heads, and \u22120.5 if the coin is Tails. We will attempt to improve P2\u2019s strategy in the subgame $S$ that follows P1 choosing Play.\n\n4.1 Unsafe Subgame Solving\nWe first review the most intuitive form of subgame solving, which we refer to as Unsafe subgame solving [1, 12, 13, 10]. This form of subgame solving assumes both players played according to the blueprint strategy prior to reaching the subgame. That defines a probability distribution over the nodes at the root of the subgame $S$, representing the probability that the true game state matches that node. A strategy for the subgame is then calculated which assumes that this distribution is correct. In all subgame solving algorithms, an augmented subgame containing $S$ and a few additional nodes is solved to determine the strategy for $S$. 
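The root distribution assumed by this unsafe approach is just the blueprint reach probabilities, normalized over the subgame's earliest-reachable nodes. A minimal sketch (hypothetical names, not the paper's code):

```python
# Root distribution for Unsafe subgame solving: assuming both players
# followed the blueprint, node h in S_top is reached with probability
# pi_sigma(h) / sum_{h'} pi_sigma(h').

def unsafe_root_distribution(blueprint_reach):
    """blueprint_reach[h]: pi^sigma(h), the joint (players + chance)
    blueprint probability of reaching each node h in S_top."""
    total = sum(blueprint_reach.values())
    if total == 0:
        raise ValueError("subgame unreachable under the blueprint")
    return {h: p / total for h, p in blueprint_reach.items()}

# Coin Toss, Play subgame, under the Section 4 blueprint:
# Heads: 0.5 * 3/4 = 0.375, Tails: 0.5 * 1/2 = 0.25.
dist = unsafe_root_distribution({"Heads": 0.5 * 0.75, "Tails": 0.5 * 0.5})
```

Here the normalized belief is 60% Heads / 40% Tails, which is exactly what pushes the resolved P2 strategy toward always guessing Heads.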
Applying Unsafe subgame solving to the blueprint strategy in Coin Toss (after P1 chooses Play) means solving the augmented subgame shown in Figure 3a. Specifically, the augmented subgame consists of only an initial chance node and $S$. The initial chance node reaches $h \in S_{top}$ with probability $\frac{\pi^\sigma(h)}{\sum_{h' \in S_{top}} \pi^\sigma(h')}$. The augmented subgame is solved and its strategy for P2 is used in $S$ rather than the blueprint strategy.\n\nUnsafe subgame solving lacks theoretical solution quality guarantees and there are many situations where it performs extremely poorly. Indeed, if it were applied to the blueprint strategy of Coin Toss then P2 would always choose Heads\u2014which P1 could exploit severely by only choosing Play with Tails. Despite the lack of theoretical guarantees and potentially bad performance, Unsafe subgame solving is simple and can sometimes produce low-exploitability strategies, as we show later.\n\nWe now move to discussing safe subgame-solving techniques, that is, ones that ensure that the exploitability of the strategy is no higher than that of the blueprint strategy.\n\n(a) Unsafe subgame solving (b) Resolve subgame solving\nFigure 3: The augmented subgames solved to find a P2 strategy in the Play subgame of Coin Toss.\n\n4.2 Subgame Resolving\nIn subgame Resolving [6], a safe strategy is computed for P2 in the subgame by solving the augmented subgame shown in Figure 3b, producing an equilibrium strategy $\sigma^S$. This augmented subgame differs from Unsafe subgame solving by giving P1 the option to \u201copt out\u201d from entering $S$ and instead receive the EV of playing optimally against P2\u2019s blueprint strategy in $S$.\n\nSpecifically, the augmented subgame for Resolving differs from Unsafe subgame solving as follows. For each $h_{top} \in S_{top}$ we insert a new P1 node $h_r$, which exists only in the augmented subgame, between the initial chance node and $h_{top}$. The set of these $h_r$ nodes is $S_r$. The initial chance node connects to each node $h_r \in S_r$ in proportion to the probability that player P1 could reach $h_{top}$ if P1 tried to do so (that is, in proportion to $\pi_{-1}^\sigma(h_{top})$). At each node $h_r \in S_r$, P1 has two possible actions. Action $a'_S$ leads to $h_{top}$, while action $a'_T$ leads to a terminal payoff that awards the value of playing optimally against P2\u2019s blueprint strategy, which is $CBV^{\sigma_2}(I_1(h_{top}))$. In the blueprint strategy of Coin Toss, P1 choosing Play after the coin lands Heads results in an EV of 0, and $\frac{1}{2}$ if the coin is Tails. Therefore, $a'_T$ leads to a terminal payoff of 0 for Heads and $\frac{1}{2}$ for Tails. After the equilibrium strategy $\sigma^S$ is computed in the augmented subgame, P2 plays according to the computed subgame strategy $\sigma_2^S$ rather than the blueprint strategy when in $S$. The P1 strategy $\sigma_1^S$ is not used.\n\nClearly P1 cannot do worse than always picking action $a'_T$ (which awards the highest EV P1 could achieve against P2\u2019s blueprint). But P1 also cannot do better than always picking $a'_T$, because P2 could simply play according to the blueprint in $S$, which means action $a'_S$ would give the same EV to P1 as action $a'_T$ (if P1 played optimally in $S$). In this way, the strategy for P2 in $S$ is pressured to be no worse than that of the blueprint. In Coin Toss, if P2 were to always choose Heads (as was the case in Unsafe subgame solving), then P1 would always choose $a'_T$ with Heads and $a'_S$ with Tails.\n\nResolving guarantees that P2\u2019s exploitability will be no higher than the blueprint\u2019s (and may be better). 
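The Resolve gadget can be assembled mechanically from the blueprint: each root infoset gets an entry node where P1 chooses between the alternative payoff $CBV^{\sigma_2}(I_1)$ and entering the subgame. A minimal sketch of the construction (our own data layout, not the paper's):

```python
# Build the root of the Resolve augmented subgame: for each P1 infoset at the
# top of S, insert a P1 decision node h_r with actions a_T (terminal payoff
# equal to the blueprint counterfactual best-response value CBV) and a_S
# (enter S). Chance picks h_r in proportion to pi_{-1}(h_top).

def build_resolve_gadget(top_infosets):
    """top_infosets: {infoset: {"reach_minus_1": float, "cbv": float}}.
    Returns, per infoset, the chance probability and the two P1 actions."""
    total = sum(v["reach_minus_1"] for v in top_infosets.values())
    gadget = {}
    for I, v in top_infosets.items():
        gadget[I] = {
            "chance_prob": v["reach_minus_1"] / total,
            "a_T": v["cbv"],   # opt out: terminal alternative payoff
            "a_S": "enter S",  # opt in: play the subgame against sigma_2^S
        }
    return gadget

# Coin Toss: pi_{-1} ignores P1's own Play probability, so the coin outcomes
# enter 50/50; the blueprint CBVs for Play are 0 (Heads) and 1/2 (Tails).
gadget = build_resolve_gadget({
    "Heads": {"reach_minus_1": 0.5, "cbv": 0.0},
    "Tails": {"reach_minus_1": 0.5, "cbv": 0.5},
})
```

Solving this gadget with any equilibrium finder, then discarding the P1 part of the solution, yields the safe P2 subgame strategy.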
However, it may miss opportunities for improvement. For example, if we apply Resolving to the example blueprint in Coin Toss, one solution to the augmented subgame is the blueprint itself, so P2 may choose Forfeit 25% of the time even though Heads and Tails dominate that action. Indeed, the original purpose of Resolving was not to improve upon a blueprint strategy in a subgame, but rather to compactly store it by keeping only the EV at the root of the subgame and then reconstructing the strategy in real time when needed rather than storing the whole subgame strategy.\n\nMaxmargin subgame solving [21], discussed in Appendix A, can improve performance by defining a margin $M^{\sigma^S}(I_1) = CBV^{\sigma_2}(I_1) - CBV^{\sigma_2^S}(I_1)$ for each $I_1 \in S_{top}$ and maximizing $\min_{I_1 \in S_{top}} M^{\sigma^S}(I_1)$. Resolving only makes all margins nonnegative. However, Maxmargin does worse in practice when using estimates of equilibrium values as discussed in Appendix C.\n\n5 Reach Subgame Solving\nAll of the subgame-solving techniques described in Section 4 only consider the target subgame in isolation, which can lead to suboptimal strategies. For example, Maxmargin solving applied to $S$ in Coin Toss results in P2 choosing Heads with probability $\frac{5}{8}$ and Tails with $\frac{3}{8}$ in $S$. This results in P1 receiving an EV of $-\frac{1}{4}$ by choosing Play in the Heads state, and an EV of $\frac{1}{4}$ in the Tails state. However, P1 could simply always choose Sell in the Heads state (earning an EV of 0.5) and Play in the Tails state and receive an EV of $\frac{3}{8}$ for the entire game. In this section we introduce Reach subgame solving, an improvement to past subgame-solving techniques that considers what the opponent could have alternatively received from other subgames.\u00b9 For example, a better strategy for P2 would be to choose Heads with probability $\frac{1}{4}$ and Tails with probability $\frac{3}{4}$. Then P1 is indifferent between choosing Sell and Play in both cases and overall receives an expected payoff of 0 for the whole game.\n\nHowever, that strategy is only optimal if P1 would indeed achieve an EV of 0.5 for choosing Sell in the Heads state and \u22120.5 in the Tails state. That would be the case if P2 played according to the blueprint in the Sell subgame (which is not shown), but in reality we would apply subgame solving to the Sell subgame if the Sell action were taken, which would change P2\u2019s strategy there and therefore P1\u2019s EVs. Applying subgame solving to any subgame encountered during play is equivalent to applying it to all subgames independently; ultimately, the same strategy is played in both cases. Thus, we must consider that the EVs from other subgames may differ from what the blueprint says because subgame solving would be applied to them as well.\n\n\u00b9 Other subgame-solving methods have also considered the cost of reaching a subgame [31, 15]. However, those approaches are not correct in theory when applied in real time to any subgame reached during play.\n\n\fFigure 4: Left: A modified game of Coin Toss with two subgames. The nodes $C_1$ and $C_2$ are public chance nodes whose outcomes are seen by both P1 and P2. Right: An augmented subgame for one of the subgames according to Reach subgame solving. If only one of the subgames is being solved, then the alternative payoff for Heads can be at most 1. However, if both are solved independently, then the gift must be split among the subgames and must sum to at most 1. 
For example, the alternative payoff in both subgames can be 0.5.\n\nAs an example of this issue, consider the game shown in Figure 4, which contains two identical subgames $S_1$ and $S_2$ where the blueprint has P2 pick Heads and Tails with 50% probability. The Sell action leads to an EV of 0.5 from the Heads state, while Play leads to an EV of 0. If we were to solve just $S_1$, then P2 could afford to always choose Tails in $S_1$, thereby letting P1 achieve an EV of 1 for reaching that subgame from Heads, because, due to the chance node $C_1$, $S_1$ is only reached with 50% probability. Thus, P1\u2019s EV for choosing Play would be 0.5 from Heads and \u22120.5 from Tails, which is optimal. We can achieve this strategy in $S_1$ by solving an augmented subgame in which the alternative payoff for Heads is 1. In that augmented subgame, P2 always choosing Tails would be a solution (though not the only solution).\n\nHowever, if the same reasoning were applied independently to $S_2$ as well, then P2 might always choose Tails in both subgames and P1\u2019s EV for choosing Play from Heads would become 1 while the EV for Sell would only be 0.5. Instead, we could allow P1 to achieve an EV of 0.5 for reaching each subgame from Heads (by setting the alternative payoff for Heads to 0.5). In that case, P1\u2019s overall EV for choosing Play could only increase to 0.5, even if both $S_1$ and $S_2$ were solved independently.\n\nWe capture this intuition by considering for each $I_1 \in S_{top}$ all the infosets and actions $I'_1 \cdot a' \sqsubseteq I_1$ that P1 would have taken along the path to $I_1$. If, at some $I'_1 \cdot a' \sqsubseteq I_1$ where P1 acted, there was a different action $a^* \in A(I'_1)$ that leads to a higher EV, then P1 would have taken a suboptimal action if they reached $I_1$. The difference in value between $a^*$ and $a'$ is referred to as a gift. 
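Gifts along the path to an infoset can be accumulated with a simple walk over P1's earlier decision points. The sketch below uses our own names (the paper's lower-bound variant is introduced later) and treats a gift as the EV the best alternative action gives up:

```python
# A "gift" at an earlier P1 decision (I', a') on the path to I1: how much
# better P1's best alternative action a* was than the action a' actually
# leading toward I1, under blueprint values. Clamped to be nonnegative.

def gift(action_values, taken_action):
    """action_values: {action: blueprint CBV(I', a)}; taken_action: a'."""
    best_alternative = max(v for a, v in action_values.items()
                           if a != taken_action)
    return max(0.0, best_alternative - action_values[taken_action])

def total_gift_along_path(path):
    """path: list of (action_values, taken_action) for P1's decisions above I1.
    Reach subgame solving may raise I1's alternative payoff by (a share of)
    this sum."""
    return sum(gift(av, a) for av, a in path)

# Figure 4, Heads state: at the Sell/Play choice, Sell is worth 0.5 and Play 0,
# so reaching either subgame from Heads comes with a gift of 0.5.
g = gift({"Sell": 0.5, "Play": 0.0}, "Play")
```

From the Tails state the same computation gives a gift of 0, since Play was already P1's best blueprint option there.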
We can afford to let P1\u2019s value for $I_1$ increase beyond the blueprint value (and in the process lower P1\u2019s value in some other infoset in $S_{top}$), so long as the increase to $I_1$\u2019s value is small enough that choosing actions leading to $I_1$ is still suboptimal for P1. Critically, we must ensure that the increase in value is small enough even when the potential increase across all subgames is summed together, as in Figure 4.\u00b2\n\nA complicating factor is that gifts we assumed were present may actually not exist. For example, in Coin Toss, suppose applying subgame solving to the Sell subgame results in P1\u2019s value for Sell from the Heads state decreasing from 0.5 to 0.25. If we independently solve the Play subgame, we have no way of knowing that P1\u2019s value for Sell is lower than the blueprint suggested, so we may still assume there is a gift of 0.5 from the Heads state based on the blueprint. Thus, in order to guarantee a theoretical result on exploitability that is as strong as possible, we use in our theory and experiments a lower bound on what gifts could be after subgame solving was applied to all other subgames.\n\nFormally, let $\sigma_2$ be a P2 blueprint and let $\sigma_2^{-S}$ be the P2 strategy that results from applying subgame solving independently to a set of disjoint subgames other than $S$. Since we do not want to compute $\sigma_2^{-S}$ in order to apply subgame solving to $S$, let $\lfloor g^{\sigma_2^{-S}}(I'_1, a') \rfloor$ be a lower bound of $CBV^{\sigma_2^{-S}}(I'_1) - CBV^{\sigma_2^{-S}}(I'_1, a')$ that does not require knowledge of $\sigma_2^{-S}$. In our experiments we use $\lfloor g^{\sigma_2^{-S}}(I'_1, a') \rfloor = \max_{a \in A_z(I'_1) \cup \{a'\}} CBV^{\sigma_2}(I'_1, a) - CBV^{\sigma_2}(I'_1, a')$, where $A_z(I'_1) \subseteq A(I'_1)$ is the set of actions leading immediately to terminal nodes. Reach subgame solving modifies the augmented subgame in Resolving and Maxmargin by increasing the alternative payoff for infoset $I_1 \in S_{top}$ by $\sum_{I'_1 \cdot a' \sqsubseteq I_1 \,|\, P(I'_1) = P_1} \lfloor g^{\sigma_2^{-S}}(I'_1, a') \rfloor$. Formally, we define a reach margin as\n\n$$M_r^{\sigma^S}(I_1) = M^{\sigma^S}(I_1) + \sum_{I'_1 \cdot a' \sqsubseteq I_1 \,|\, P(I'_1) = P_1} \lfloor g^{\sigma_2^{-S}}(I'_1, a') \rfloor. \quad (1)$$\n\nThis margin is larger than or equal to the one for Maxmargin, because $\lfloor g^{\sigma_2^{-S}}(I', a') \rfloor$ is nonnegative. We refer to the modified algorithms as Reach-Resolve and Reach-Maxmargin.\n\nUsing a lower bound on gifts is not necessary to guarantee safety. So long as we use a gift value $g^{\sigma'_2}(I'_1, a') \le CBV^{\sigma_2}(I'_1) - CBV^{\sigma_2}(I'_1, a')$ in place of $\lfloor g^{\sigma_2^{-S}}(I'_1, a') \rfloor$, the resulting strategy will be safe. However, using a lower bound further guarantees a reduction to exploitability when a P1 best response reaches with positive probability an infoset $I_1 \in S_{top}$ that has positive margin, as proven in Theorem 1. In practice, it may be best to use an accurate estimate of gifts. One option is to use $\hat{g}^{\sigma_2}(I'_1, a') = \widetilde{CBV}^{\sigma_2}(I'_1) - \widetilde{CBV}^{\sigma_2}(I'_1, a')$, where $\widetilde{CBV}^{\sigma_2}$ is the closest P1 can get to the value of a counterfactual best response while P1 is constrained to playing within the abstraction that generated the blueprint. Using estimates is covered in more detail in Appendix C.\n\nTheorem 1 shows that when subgames are solved independently and using lower bounds on gifts, Reach-Maxmargin solving has exploitability lower than or equal to past safe techniques. The theorem statement is similar to that of Maxmargin [21], but the margins are now larger (or equal) in size.\n\nTheorem 1. Given a strategy $\sigma_2$ in a two-player zero-sum game, a set of disjoint subgames $\mathcal{S}$, and a strategy $\sigma_2^S$ for each subgame $S \in \mathcal{S}$ produced via Reach-Maxmargin solving using lower bounds for gifts, let $\sigma'_2$ be the strategy that plays according to $\sigma_2^S$ for each subgame $S \in \mathcal{S}$, and $\sigma_2$ elsewhere. Moreover, let $\sigma_2^{-S}$ be the strategy that plays according to $\sigma'_2$ everywhere except for P2 nodes in $S$, where it instead plays according to $\sigma_2$. If $\pi^{BR(\sigma'_2)}(I_1) > 0$ for some $I_1 \in S_{top}$, then $\exp(\sigma'_2) \le \exp(\sigma_2^{-S}) - \sum_{h \in I_1} \pi_{-1}^{\sigma_2}(h)\, M_r^{\sigma^S}(I_1)$.\n\nSo far the described techniques have guaranteed a reduction in exploitability over the blueprint by setting the value of $a'_T$ equal to the value of P1 playing optimally to P2\u2019s blueprint. Relaxing this guarantee by instead setting the value of $a'_T$ equal to an estimate of P1\u2019s value when both players play optimally leads to far lower exploitability in practice. We discuss this approach in Appendix C.\n\n\u00b2 In this paper and in our experiments, we allow any infoset that descends from a gift to increase by the size of the gift (e.g., in Figure 4 the gift from Heads is 0.5, so we allow P1\u2019s value for Heads in both $S_1$ and $S_2$ to increase by 0.5). However, any division of the gift among subgames is acceptable so long as the potential increase across all subgames (multiplied by the probability of P1 reaching that subgame) does not exceed the original gift. For example, in Figure 4, if we only apply Reach subgame solving to $S_1$, then we could allow the Heads state in $S_1$ to increase by 1 rather than just by 0.5. In practice, some divisions may do better than others. The division we use in this paper (applying gifts equally to all subgames) did well in practice.\n\n\f6 Nested Subgame Solving\nAs we have discussed, large games must be abstracted to reduce the game to a tractable size. This is particularly common in games with large or continuous action spaces. Typically the action space is discretized by action abstraction so that only a few actions are included in the abstraction. While we might limit ourselves to the actions we included in the abstraction, an opponent might choose actions that are not in the abstraction. In that case, the off-tree action can be mapped to an action that is in the abstraction, and the strategy from that in-abstraction action can be used. For example, in an auction game we might include a bid of $100 in our abstraction. If a player bids $101, we simply treat that as a bid of $100. This is referred to as action translation [14, 28, 8]. Action translation is the state-of-the-art prior approach to dealing with this issue. 
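At its simplest, action translation snaps an off-tree action to the nearest in-abstraction action and reuses that action's strategy. A minimal sketch of this baseline (the mapping functions used in practice [14, 28, 8] are more sophisticated, e.g. randomized and scale-invariant):

```python
# Naive deterministic action translation: map an off-tree bet/bid size to the
# closest size in the action abstraction and play that action's strategy.
# Real translation schemes typically randomize between the two neighboring
# in-abstraction sizes instead of rounding deterministically.

def translate_action(off_tree_size, abstraction_sizes):
    """Return the in-abstraction size nearest to the observed size."""
    return min(abstraction_sizes, key=lambda s: abs(s - off_tree_size))

# An opponent bid of 101 is treated as the abstract bid of 100.
mapped = translate_action(101, [50, 100, 200])
```

Deterministic rounding like this is exactly what an opponent can exploit by choosing sizes just past the rounding boundaries, which motivates the subgame-solving alternative developed in this section.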
Action translation has been used, for example, by all the leading competitors in the Annual Computer Poker Competition (ACPC).

In this section, we develop techniques for applying subgame solving to calculate responses to opponent off-tree actions, thereby obviating the need for action translation. That is, rather than simply treat a bid of $101 as $100, we calculate in real time a unique response to the bid of $101. This can also be done in a nested fashion in response to subsequent opponent off-tree actions. Additionally, these techniques can be used to solve finer-grained models as play progresses down the game tree.

We refer to the first method as the inexpensive method.3 When P1 chooses an off-tree action a, a subgame S is generated following that action such that for any infoset I_1 that P1 might be in, I_1 · a ∈ S_top. This subgame may itself be an abstraction. A solution σ^S is computed via subgame solving, and σ^S is combined with σ to form a new blueprint σ' in the expanded abstraction that now includes action a. The process repeats whenever P1 again chooses an off-tree action.

3 Following our study, the AI DeepStack used a technique similar to this form of nested subgame solving [20].

To conduct safe subgame solving in response to off-tree action a, we could calculate CBV^{σ_2}(I_1, a) by defining, via action translation, a P2 blueprint following a and best responding to it [4]. However, that could be computationally expensive and would likely perform poorly in practice because, as we show later, action translation is highly exploitable.
Instead, we relax the guarantee of safety and use C̃BV^{σ_2}(I_1) for the alternative payoff, where C̃BV^{σ_2}(I_1) is P1's counterfactual best response value in I_1 when constrained to playing in the blueprint abstraction (which excludes action a). In this case, exploitability depends on how well C̃BV^{σ_2}(I_1) approximates CBV^{σ*_2}(I_1), where σ*_2 is an optimal P2 strategy (see Appendix C).4 In general, we find that only a small number of near-optimal actions need to be included in the blueprint abstraction for C̃BV^{σ_2}(I_1) to be close to CBV^{σ*_2}(I_1). We can then approximate a near-optimal response to any opponent action, even in a continuous action space.

The "inexpensive" approach cannot be combined with Unsafe subgame solving because the probability of reaching an action outside of a player's abstraction is undefined. Nevertheless, a similar approach is possible with Unsafe subgame solving (as well as all the other subgame-solving techniques) by starting the subgame solving at h rather than at h · a. In other words, if action a taken in node h is not in the abstraction, then Unsafe subgame solving is conducted in the smallest subgame containing h (and action a is added to that abstraction). This increases the size of the subgame compared to the inexpensive method because a strategy must be recomputed for every action a' ∈ A(h) in addition to a. We therefore call this method the expensive method. We present experiments with both methods.

7 Experiments

Our experiments were conducted on heads-up no-limit Texas hold'em, as well as two smaller-scale poker games we call No-Limit Flop Hold'em (NLFH) and No-Limit Turn Hold'em (NLTH). The descriptions of these games can be found in Appendix G. For equilibrium finding, we used CFR+ [30].

Our first experiment compares the performance of the subgame-solving techniques when applied to information abstraction (which is card abstraction in the case of poker). Specifically, we solve NLFH with no information abstraction on the preflop.
On the flop, there are 1,286,792 infosets for each betting sequence; the abstraction buckets them into 200, 2,000, or 30,000 abstract ones (using a leading information abstraction algorithm [9]). We then apply subgame solving immediately after the flop community cards are dealt. We experiment with two versions of the game, one small and one large, which include only a few of the available actions in each infoset. We also experimented on abstractions of NLTH. In that case, we solve NLTH with no information abstraction on the preflop or flop. On the turn, there are 55,190,538 infosets for each betting sequence; the abstraction buckets them into 200, 2,000, or 20,000 abstract ones. We apply subgame solving immediately after the turn community card is dealt. Table 1 shows the performance of each technique when using 30,000 buckets (20,000 for NLTH). The full results are presented in Appendix E. In all our experiments, exploitability is measured in the standard units used in this field: milli big blinds per hand (mbb/h).

                                           Small Flop Holdem   Large Flop Holdem   Turn Holdem
Blueprint Strategy                               91.28               41.41            345.5
Unsafe                                           5.514               396.8            79.34
Resolve                                          54.07               23.11            251.8
Maxmargin                                        43.43               19.50            234.4
Reach-Maxmargin                                  41.47               18.80            233.5
Reach-Maxmargin (no split)                       25.88               16.41            175.5
Estimate                                         24.23               30.09            76.44
Estimate+Distributional                          34.30               10.54            74.35
Reach-Estimate+Distributional                    22.58               9.840            72.59
Reach-Estimate+Distributional (no split)         17.33               8.777            70.68

Table 1: Exploitability of various subgame-solving techniques in three different games.

Estimate and Estimate+Distributional are techniques introduced in Appendix C.
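For reference, the equilibrium-finding subroutine used throughout these experiments, CFR+ [30], is built on the regret-matching+ update, which clips cumulative regrets at zero. The sketch below shows only that update, on a toy 2×2 zero-sum matrix game; it omits the tree recursion, alternating updates, and weighted averaging of full CFR+:

```python
def regret_matching_plus(payoff, iters=100000):
    """Two-player self-play with regret-matching+ (the core update of CFR+)
    on a zero-sum matrix game; `payoff` is the row player's payoff matrix.
    Cumulative regrets are clipped at zero after every update (the '+'),
    and the average strategies converge to a Nash equilibrium."""
    rows, cols = len(payoff), len(payoff[0])
    r_regret, c_regret = [0.0] * rows, [0.0] * cols
    r_sum, c_sum = [0.0] * rows, [0.0] * cols

    def current(regret, n):
        # Play proportionally to positive regret; uniform if none accumulated.
        total = sum(regret)
        return [r / total for r in regret] if total > 0 else [1.0 / n] * n

    for _ in range(iters):
        x, y = current(r_regret, rows), current(c_regret, cols)
        # Value of each pure action against the opponent's current strategy.
        row_vals = [sum(payoff[i][j] * y[j] for j in range(cols)) for i in range(rows)]
        col_vals = [-sum(payoff[i][j] * x[i] for i in range(rows)) for j in range(cols)]
        ev = sum(x[i] * row_vals[i] for i in range(rows))  # row's expected value
        for i in range(rows):  # the "+" of CFR+: clip regrets at zero
            r_regret[i] = max(r_regret[i] + row_vals[i] - ev, 0.0)
        for j in range(cols):  # column's ev is -ev (zero-sum)
            c_regret[j] = max(c_regret[j] + col_vals[j] + ev, 0.0)
        for i in range(rows):
            r_sum[i] += x[i]
        for j in range(cols):
            c_sum[j] += y[j]
    return [s / iters for s in r_sum], [s / iters for s in c_sum]

# A 2x2 zero-sum game whose unique equilibrium mixes (0.4, 0.6) for both players.
game = [[2, -1], [-1, 1]]
avg_row, avg_col = regret_matching_plus(game)
# Both average strategies approach the (0.4, 0.6) equilibrium.
print([round(p, 2) for p in avg_row], [round(p, 2) for p in avg_col])
```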
We use a normal distribution in the Distributional subgame solving experiments, with standard deviation determined by the heuristic presented in Appendix C.1.

Since subgame solving begins immediately after a chance node with an extremely high branching factor (1,755 in NLFH), the gifts for the Reach algorithms are divided among subgames inefficiently.

4 We estimate CBV^{σ*_2}(I_1) rather than CBV^{σ*_2}(I_1, a) because CBV^{σ*_2}(I_1) − CBV^{σ*_2}(I_1, a) is a gift that may be added to the alternative payoff anyway.

Many subgames do not use the gifts at all, while others could make use of more. In the experiments we show results both for the theoretically safe splitting of gifts, as well as a more aggressive version where gifts are scaled up by the branching factor of the chance node (1,755). This weakens the theoretical guarantees of the algorithm, but in general did better than splitting gifts in a theoretically correct manner. However, this is not universally true. Appendix F shows that in at least one case, exploitability increased when gifts were scaled up too aggressively. In all cases, using Reach subgame solving in at least the theoretically safe manner led to lower exploitability.

Despite lacking theoretical guarantees, Unsafe subgame solving did surprisingly well in most games. However, it did substantially worse in Large NLFH with 30,000 buckets. This exemplifies its variability. Among the safe methods, all of the changes we introduce show improvement over past techniques. The Reach-Estimate+Distributional algorithm generally resulted in the lowest exploitability among the various choices, and in most cases beat Unsafe subgame solving.

The second experiment evaluates nested subgame solving, and compares it to action translation.
In order to also evaluate action translation, in this experiment we create an NLFH game that includes 3 bet sizes at every point in the game tree (0.5, 0.75, and 1.0 times the size of the pot); a player can also decide not to bet. Only one bet (i.e., no raises) is allowed on the preflop, and three bets are allowed on the flop. There is no information abstraction anywhere in the game. We also created a second, smaller abstraction of the game in which there is still no information abstraction, but the 0.75× pot bet is never available. We calculate the exploitability of one player using the smaller abstraction, while the other player uses the larger abstraction. Whenever the large-abstraction player chooses a 0.75× pot bet, the small-abstraction player generates and solves a subgame for the remainder of the game (which again does not include any subsequent 0.75× pot bets) using the nested subgame-solving techniques described above. This subgame strategy is then used as long as the large-abstraction player plays within the small abstraction, but if she chooses the 0.75× pot bet again later, then the subgame solving is used again, and so on.

Table 2 shows that all the subgame-solving techniques substantially outperform action translation. We did not test distributional alternative payoffs in this experiment, since the calculated best response values are likely quite accurate. These results suggest that nested subgame solving is preferable to action translation (if there is sufficient time to solve the subgame).

                                       mbb/h
Randomized Pseudo-Harmonic Mapping     1,465
Resolve                                150.2
Reach-Maxmargin (Expensive)            149.2
Unsafe (Expensive)                     148.3
Maxmargin                              122.0
Reach-Maxmargin                        119.1

Table 2: Exploitability of the various subgame-solving techniques in nested subgame solving.
The performance of the pseudo-harmonic action translation is also shown.

We used the techniques presented in this paper to develop Libratus, an AI that competed against four top human professionals in heads-up no-limit Texas hold'em [5]. Heads-up no-limit Texas hold'em has been the primary benchmark challenge for AI in imperfect-information games. The competition involved 120,000 hands of poker and a prize pool of $200,000 split among the humans to incentivize strong play. The AI decisively defeated the human team by 147 mbb/hand, with 99.98% statistical significance. This was the first, and so far only, time an AI defeated top humans in no-limit poker.

8 Conclusion

We introduced a subgame-solving technique for imperfect-information games that has stronger theoretical guarantees and better practical performance than prior subgame-solving methods. We presented results on exploitability of both safe and unsafe subgame-solving techniques. We also introduced a method for nested subgame solving in response to the opponent's off-tree actions, and demonstrated that this leads to dramatically better performance than the usual approach of action translation. This is, to our knowledge, the first time that exploitability of subgame-solving techniques has been measured in large games.

Finally, we demonstrated the effectiveness of these techniques in practice in heads-up no-limit Texas hold'em poker, the main benchmark challenge for AI in imperfect-information games. We developed the first AI to reach the milestone of defeating top humans in heads-up no-limit Texas hold'em.

9 Acknowledgments

This material is based on work supported by the National Science Foundation under grants IIS-1718457, IIS-1617590, and CCF-1733556, and the ARO under award W911NF-17-1-0082, as well as XSEDE computing resources provided by the Pittsburgh Supercomputing Center.
The Brains vs.\nAI competition was sponsored by Carnegie Mellon University, Rivers Casino, GreatPoint Ventures,\nAvenue4Analytics, TNG Technology Consulting, Arti\ufb01cial Intelligence, Intel, and Optimized Markets,\nInc. We thank Kristen Gardner, Marcelo Gutierrez, Theo Gutman-Solo, Eric Jackson, Christian Kroer,\nTim Reiff, and the anonymous reviewers for helpful feedback.\n\nReferences\n[1] Darse Billings, Neil Burch, Aaron Davidson, Robert Holte, Jonathan Schaeffer, Terence\nSchauenberg, and Duane Szafron. Approximating game-theoretic optimal strategies for full-\nscale poker. In Proceedings of the 18th International Joint Conference on Arti\ufb01cial Intelligence\n(IJCAI), 2003.\n\n[2] Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. Heads-up limit\n\nhold\u2019em poker is solved. Science, 347(6218):145\u2013149, January 2015.\n\n[3] Noam Brown, Christian Kroer, and Tuomas Sandholm. Dynamic thresholding and pruning\nfor regret minimization. In AAAI Conference on Arti\ufb01cial Intelligence (AAAI), pages 421\u2013429,\n2017.\n\n[4] Noam Brown and Tuomas Sandholm. Simultaneous abstraction and equilibrium \ufb01nding in\ngames. In Proceedings of the International Joint Conference on Arti\ufb01cial Intelligence (IJCAI),\n2015.\n\n[5] Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus\n\nbeats top professionals. Science, page eaao1733, 2017.\n\n[6] Neil Burch, Michael Johanson, and Michael Bowling. Solving imperfect information games\nusing decomposition. In AAAI Conference on Arti\ufb01cial Intelligence (AAAI), pages 602\u2013608,\n2014.\n\n[7] Murray Campbell, A Joseph Hoane, and Feng-Hsiung Hsu. Deep Blue. Arti\ufb01cial intelligence,\n\n134(1-2):57\u201383, 2002.\n\n[8] Sam Ganzfried and Tuomas Sandholm. Action translation in extensive-form games with large\naction spaces: axioms, paradoxes, and the pseudo-harmonic mapping. 
In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, pages 120–128. AAAI Press, 2013.

[9] Sam Ganzfried and Tuomas Sandholm. Potential-aware imperfect-recall abstraction with earth mover's distance in imperfect-information games. In AAAI Conference on Artificial Intelligence (AAAI), 2014.

[10] Sam Ganzfried and Tuomas Sandholm. Endgame solving in large imperfect-information games. In International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), pages 37–45, 2015.

[11] Andrew Gilpin, Javier Peña, and Tuomas Sandholm. First-order algorithm with O(ln(1/ε)) convergence for ε-equilibrium in two-person zero-sum games. Mathematical Programming, 133(1–2):279–298, 2012. Conference version appeared in AAAI-08.

[12] Andrew Gilpin and Tuomas Sandholm. A competitive Texas Hold'em poker player via automated abstraction and real-time equilibrium computation. In Proceedings of the National Conference on Artificial Intelligence (AAAI), pages 1007–1013, 2006.

[13] Andrew Gilpin and Tuomas Sandholm. Better automated abstraction techniques for imperfect information games, with application to Texas Hold'em poker. In International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), pages 1168–1175, 2007.

[14] Andrew Gilpin, Tuomas Sandholm, and Troels Bjerre Sørensen. A heads-up no-limit Texas hold'em poker player: discretized betting models and automatically generated equilibrium-finding programs. In Proceedings of the Seventh International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 2, pages 911–918. International Foundation for Autonomous Agents and Multiagent Systems, 2008.

[15] Eric Jackson. A time and space efficient algorithm for approximately solving large imperfect information games.
In AAAI Workshop on Computer Poker and Imperfect Information, 2014.

[16] Michael Johanson. Measuring the size of large no-limit poker games. Technical report, University of Alberta, 2013.

[17] Michael Johanson, Nolan Bard, Neil Burch, and Michael Bowling. Finding optimal abstract strategies in extensive-form games. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, pages 1371–1379. AAAI Press, 2012.

[18] Christian Kroer, Kevin Waugh, Fatma Kılınç-Karzan, and Tuomas Sandholm. Theoretical and practical advances on smoothing for extensive-form games. In Proceedings of the ACM Conference on Economics and Computation (EC), 2017.

[19] Nick Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.

[20] Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 2017.

[21] Matej Moravčík, Martin Schmid, Karel Ha, Milan Hladík, and Stephen Gaukrodger. Refining subgames in large imperfect information games. In AAAI Conference on Artificial Intelligence (AAAI), 2016.

[22] John Nash. Equilibrium points in n-person games. Proceedings of the National Academy of Sciences, 36:48–49, 1950.

[23] Yurii Nesterov. Excessive gap technique in nonsmooth convex minimization. SIAM Journal on Optimization, 16(1):235–249, 2005.

[24] Tuomas Sandholm. The state of solving large incomplete-information games, and application to poker. AI Magazine, pages 13–32, Winter 2010. Special issue on Algorithmic Game Theory.

[25] Tuomas Sandholm. Abstraction for solving large incomplete-information games. In AAAI Conference on Artificial Intelligence (AAAI), pages 4127–4131, 2015. Senior Member Track.

[26] Tuomas Sandholm.
Solving imperfect-information games. Science, 347(6218):122–123, 2015.

[27] Jonathan Schaeffer, Neil Burch, Yngvi Björnsson, Akihiro Kishimoto, Martin Müller, Robert Lake, Paul Lu, and Steve Sutphen. Checkers is solved. Science, 317(5844):1518–1522, 2007.

[28] David Schnizlein, Michael Bowling, and Duane Szafron. Probabilistic state translation in extensive games with large action sets. In Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence, pages 278–284, 2009.

[29] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[30] Oskari Tammelin, Neil Burch, Michael Johanson, and Michael Bowling. Solving heads-up limit Texas hold'em. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 645–652, 2015.

[31] Kevin Waugh, Nolan Bard, and Michael Bowling. Strategy grafting in extensive games. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2009.

[32] Martin Zinkevich, Michael Johanson, Michael H Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), pages 1729–1736, 2007.