{"title": "Computing Robust Counter-Strategies", "book": "Advances in Neural Information Processing Systems", "page_first": 721, "page_last": 728, "abstract": "Adaptation to other initially unknown agents often requires computing an effective counter-strategy. In the Bayesian paradigm, one must find a good counter-strategy to the inferred posterior of the other agents' behavior. In the experts paradigm, one may want to choose experts that are good counter-strategies to the other agents' expected behavior. In this paper we introduce a technique for computing robust counter-strategies for adaptation in multiagent scenarios under a variety of paradigms. The strategies can take advantage of a suspected tendency in the decisions of the other agents, while bounding the worst-case performance when the tendency is not observed. The technique involves solving a modified game, and therefore can make use of recently developed algorithms for solving very large extensive games. We demonstrate the effectiveness of the technique in two-player Texas Hold'em. We show that the computed poker strategies are substantially more robust than best response counter-strategies, while still exploiting a suspected tendency. We also compose the generated strategies in an experts algorithm showing a dramatic improvement in performance over using simple best responses.", "full_text": "Computing Robust Counter-Strategies\n\nMichael Johanson\n\njohanson@cs.ualberta.ca\n\nMartin Zinkevich\n\nmaz@cs.ualberta.ca\n\nComputing Science Department\n\nUniversity of Alberta\n\nMichael Bowling\n\nEdmonton, AB Canada T6G2E8\nbowling@cs.ualberta.ca\n\nAbstract\n\nAdaptation to other initially unknown agents often requires computing an effec-\ntive counter-strategy. 
In the Bayesian paradigm, one must \ufb01nd a good counter-\nstrategy to the inferred posterior of the other agents\u2019 behavior.\nIn the experts\nparadigm, one may want to choose experts that are good counter-strategies to\nthe other agents\u2019 expected behavior. In this paper we introduce a technique for\ncomputing robust counter-strategies for adaptation in multiagent scenarios under\na variety of paradigms. The strategies can take advantage of a suspected tendency\nin the decisions of the other agents, while bounding the worst-case performance\nwhen the tendency is not observed. The technique involves solving a modi\ufb01ed\ngame, and therefore can make use of recently developed algorithms for solving\nvery large extensive games. We demonstrate the effectiveness of the technique in\ntwo-player Texas Hold\u2019em. We show that the computed poker strategies are sub-\nstantially more robust than best response counter-strategies, while still exploiting\na suspected tendency. We also compose the generated strategies in an experts al-\ngorithm showing a dramatic improvement in performance over using simple best\nresponses.\n\n1 Introduction\n\nMany applications for autonomous decision making (e.g., assistive technologies, electronic com-\nmerce, interactive entertainment) involve other agents interacting in the same environment. The\nagents\u2019 choices are often not independent, and good performance may necessitate adapting to the\nbehavior of the other agents. A number of paradigms have been proposed for adaptive decision\nmaking in multiagent scenarios. The agent modeling paradigm proposes to learn a predictive model\nof other agents\u2019 behavior from observations of their decisions. The model is then used to compute\nor select a counter-strategy that will perform well given the model. An alternative paradigm is the\nmixture of experts. In this approach, a set of expert strategies is identi\ufb01ed a priori. 
These experts can be thought of as counter-strategies for the range of expected tendencies in the other agents' behavior. The decision maker then chooses amongst the counter-strategies based on their online performance, commonly using techniques for regret minimization (e.g., UCB1 [ACBF02]). In either approach, finding counter-strategies is an important subcomponent.

The most common approach to choosing a counter-strategy is best response: the performance-maximizing strategy if the other agents' behavior is known [Rob51, CM96]. In large domains where best response computations are not tractable, they are often approximated with "good responses" from a computationally tractable set, where performance maximization remains the only criterion [RV02]. The problem with this approach is that best response strategies can be very brittle. While maximizing performance against the model, they can (and often do) perform poorly when the model is wrong. The use of best response counter-strategies therefore puts an impossible burden on a priori choices: either the agent model bias or the set of expert counter-strategies. McCracken and Bowling [MB04] proposed ε-safe strategies to address this issue. Their technique chooses the best performance-maximizing strategy from the set of strategies that do not lose more than ε in the worst case. The strategy balances exploiting the agent model with a safety guarantee in case the model is wrong. Although conceptually appealing, it is computationally infeasible even for moderately sized domains, and has only been employed in the simple game of Ro-Sham-Bo.

In this paper, we introduce a new technique for computing robust counter-strategies. The counter-strategies, called restricted Nash responses, balance performance maximization against the model with reasonable performance even when the model is wrong.
The technique involves computing a\nNash equilibrium of a modi\ufb01ed game, and therefore can exploit recent advances in solving large\nextensive games [GHPS07, ZBB07, ZJBP08]. We demonstrate the practicality of the approach in\nthe challenging domain of poker. We begin by reviewing the concepts of extensive form games,\nbest responses, and Nash equilibria, as well as describing how these concepts apply in the poker\ndomain. We then describe a technique for computing an approximate best response to an arbitrary\npoker strategy, and show that this, indeed, produces brittle counter-strategies. We then introduce\nrestricted Nash responses, describe how they can be computed ef\ufb01ciently, and show that they are\nsigni\ufb01cantly more robust while still being effective counter-strategies. Finally, we demonstrate that\nthese strategies can be used in an experts algorithm to make a more effective adaptive player than\nwhen using simple best response.\n\n2 Background\n\nA perfect information extensive game consists of a tree of game states. At each game state, an\naction is made either by nature, or by one of the players, or the state is a terminal state where each\nplayer receives a \ufb01xed utility. A strategy for a player consists of a distribution over actions for every\ngame state. In an imperfect information extensive game, the states where a player makes an action\nare divided into information sets. When a player chooses an action, it does not know the state of\nthe game, only the information set, and therefore its strategy is a mapping from information sets\nto distributions over actions. A common restriction on imperfect information extensive games is\nperfect recall, where two states can only be in the same information set for a player if that player\ntook the same actions from the same information sets to reach the two game states. 
In the remainder of the paper, we will be considering imperfect information extensive games with perfect recall.

Let σi be a strategy for player i, where σi(I, a) is the probability that the strategy assigns to action a in information set I. Let Σi be the set of strategies for player i, and define ui(σ1, σ2) to be the expected utility of player i if player 1 uses σ1 ∈ Σ1 and player 2 uses σ2 ∈ Σ2. Define BR(σ2) ⊆ Σ1 to be the set of best responses to σ2, i.e.:

    BR(σ2) = argmax_{σ1 ∈ Σ1} u1(σ1, σ2)    (1)

and define BR(σ1) ⊆ Σ2 similarly. If σ1 ∈ BR(σ2) and σ2 ∈ BR(σ1), then (σ1, σ2) is a Nash equilibrium. A zero-sum extensive game is an extensive game where u1 = −u2. In this type of game, for any two equilibria (σ1, σ2) and (σ′1, σ′2), we have u1(σ1, σ2) = u1(σ′1, σ′2), and (σ1, σ′2) and (σ′1, σ2) are also equilibria. Define the value of the game to player 1 (v1) to be the expected utility of player 1 in equilibrium. In a zero-sum extensive game, the exploitability of a strategy σ1 ∈ Σ1 is:

    ex(σ1) = max_{σ2 ∈ Σ2} (v1 − u1(σ1, σ2)).    (2)

The value of the game to player 2 (v2) and the exploitability of a strategy σ2 ∈ Σ2 are defined similarly. A strategy which can be exploited for no more than ε is ε-safe. An ε-Nash equilibrium in a zero-sum extensive game is a strategy pair where both strategies are ε-safe.

In the remainder of the work, we will be dealing with mixing two strategies. Informally, one can think of mixing two strategies as performing the following operation: first, flip a (possibly biased) coin; if it comes up heads, use the first strategy, otherwise use the second strategy.
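Before formalizing this mixing operation, the definitions of best response and exploitability above can be illustrated with a small sketch of ours (not the paper's implementation, which operates on extensive-form poker games) in the zero-sum matrix game Ro-Sham-Bo, where the game value v1 is 0.

```python
# Toy illustration of Equations (1) and (2) in a zero-sum matrix game.
# Rows/columns: rock, paper, scissors; entries are player 1's utility u1,
# and u2 = -u1 since the game is zero-sum.

A = [[0, -1, 1],    # rock     vs rock, paper, scissors
     [1, 0, -1],    # paper
     [-1, 1, 0]]    # scissors

def utility(x, y):
    """Expected utility u1(x, y) for mixed strategies x (P1) and y (P2)."""
    return sum(x[i] * A[i][j] * y[j] for i in range(3) for j in range(3))

def best_response_value(y):
    """Value of Equation (1): max over P1's strategies of u1 against y.
    In a matrix game the max is attained at a pure strategy."""
    return max(sum(A[i][j] * y[j] for j in range(3)) for i in range(3))

def exploitability(x, v1=0.0):
    """ex(x) = max over P2's strategies of (v1 - u1(x, y)) -- Equation (2)."""
    return max(v1 - sum(x[i] * A[i][j] for i in range(3)) for j in range(3))

uniform = [1/3, 1/3, 1/3]
rock_heavy = [0.8, 0.1, 0.1]
# The uniform equilibrium strategy is unexploitable; the rock-heavy
# strategy can be exploited (by always playing paper) for 0.7 per game.
print(exploitability(uniform))     # 0.0
print(exploitability(rock_heavy))  # 0.7
```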
Formally, define πσi(I) to be the probability that player i, when following strategy σi, chooses the actions necessary to make information set I reachable from the root of the game tree. Given σ1, σ′1 ∈ Σ1 and p ∈ [0, 1], define mixp(σ1, σ′1) ∈ Σ1 such that for any information set I of player 1, for all actions a:

    mixp(σ1, σ′1)(I, a) = [p · πσ1(I) · σ1(I, a) + (1 − p) · πσ′1(I) · σ′1(I, a)] / [p · πσ1(I) + (1 − p) · πσ′1(I)].    (3)

Given an event E, define Pr_{σ1,σ2}[E] to be the probability of the event E given that player 1 uses σ1 and player 2 uses σ2. Given the above definition of mix, it is the case that for all σ1, σ′1 ∈ Σ1, all σ2 ∈ Σ2, all p ∈ [0, 1], and all events E:

    Pr_{mixp(σ1,σ′1),σ2}[E] = p · Pr_{σ1,σ2}[E] + (1 − p) · Pr_{σ′1,σ2}[E].    (4)

So probabilities of outcomes can simply be combined linearly. As a result, the utility of a mixture of strategies is just u(mixp(σ1, σ′1), σ2) = p · u(σ1, σ2) + (1 − p) · u(σ′1, σ2).

3 Texas Hold'Em

While the techniques in this paper apply to general extensive games, our empirical results will focus on the domain of poker. In particular, we look at heads-up limit Texas Hold'em, the game used in the AAAI Computer Poker Competition [ZL06]. A single hand of this poker variant consists of two players each being dealt two private cards, followed by five community cards being revealed. Each player tries to form the best five-card poker hand from the community cards and her private cards: if the hand goes to a showdown, the player with the best five-card hand wins the pot.
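Before continuing with the betting rules, the mixing operation of Equation (3) can be made concrete. The sketch below is our own illustration on a two-level toy game (not poker): player 1 acts at information set "root" (actions L/R) and, after L, at information set "L" (actions l/r). It mixes two behavioral strategies using reach probabilities and checks the linearity property of Equation (4).

```python
# Our illustration of Equation (3): mixing two behavioral strategies by
# weighting each information set's action probabilities with the mixture's
# reach probabilities, then verifying Equation (4)'s linearity on a leaf.

def reach(strategy, infoset):
    """pi_sigma(I): probability the player's own actions make I reachable."""
    return 1.0 if infoset == "root" else strategy["root"]["L"]

def mix(p, s1, s2):
    """mix_p(s1, s2) per Equation (3), computed information set by set."""
    out = {}
    for I in ("root", "L"):
        w1, w2 = p * reach(s1, I), (1 - p) * reach(s2, I)
        out[I] = {a: (w1 * s1[I][a] + w2 * s2[I][a]) / (w1 + w2)
                  for a in s1[I]}
    return out

def prob_Ll(s):
    """Probability of the leaf reached by playing L then l."""
    return s["root"]["L"] * s["L"]["l"]

s1 = {"root": {"L": 0.9, "R": 0.1}, "L": {"l": 0.2, "r": 0.8}}
s2 = {"root": {"L": 0.3, "R": 0.7}, "L": {"l": 0.6, "r": 0.4}}
p = 0.75
m = mix(p, s1, s2)
# Equation (4): outcome probabilities combine linearly under the mixture.
assert abs(prob_Ll(m) - (p * prob_Ll(s1) + (1 - p) * prob_Ll(s2))) < 1e-12
```

The reach-probability weighting is what makes the behavioral mixture behave exactly like the coin-flip mixture of whole strategies: a naive per-infoset average of action probabilities would not satisfy Equation (4).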
The key to\ngood play is on average to have more chips in the pot when you win than are in the pot when you\nlose. The players\u2019 actions control the pot size through betting. After the private cards are dealt, a\nround of betting occurs, followed by additional betting rounds after the third (\ufb02op), fourth (turn),\nand \ufb01fth (river) community cards are revealed. Betting rounds involve players alternately deciding\nto either fold (letting the other player win the chips in the pot), call (matching the opponent\u2019s chips\nin the pot), or raise (matching, and then adding an additional \ufb01xed amount into the pot). No more\nthan four raises are allowed in a single betting round. Notice that heads-up limit Texas Hold\u2019em is\nan example of a \ufb01nite imperfect information extensive game with perfect recall. When evaluating\nthe results of a match (several hands of poker) between two players, we \ufb01nd it convenient to state\nthe result in millibets won per hand. A millibet is one thousandth of a small-bet, the \ufb01xed magnitude\nof bets used in the \ufb01rst two rounds of betting. To provide some intuition for these numbers, a player\nthat always folds will lose 750 mb/h while a typical player that is 10 mb/h stronger than another\nwould require over one million hands to be 95% certain to have won overall.\n\nAbstraction. While being a relatively small variant of poker, the game tree for heads-up limit\nTexas Hold\u2019em is still very large, having approximately 9.17\u00d71017 states. Fundamental operations,\nsuch as computing a best response strategy or a Nash equilibrium as described in Section 2, are\nintractable on the full game. Common practice is to de\ufb01ne a more reasonably sized abstraction by\nmerging information sets (e.g., by treating certain hands as indistinguishable). If the abstraction\ninvolves the same betting structure, a strategy for an abstract game can be played directly in the full\ngame. 
If the abstraction is small enough, Nash equilibrium and best response computations become feasible. Finding an approximate Nash equilibrium in an abstract game has proven to be an effective way to construct a strong program for the full game [BBD+03, GS06]. Recent solution techniques have been able to compute approximate Nash equilibria for abstractions with as many as 10^10 game states [ZBB07, GHPS07]. Given a strategy defined in a small enough abstraction, it is also possible to compute a best response to the strategy in the abstract game. This can be done in time linear in the size of the extensive game. The abstraction used in this paper has approximately 6.45 × 10^9 game states, and is described in an accompanying technical report [JZB07].

The Competitors. Since this work focuses on adapting to other agents' behavior, our experiments make use of a battery of different poker playing programs. We give a brief description of these programs here. PsOpti4 [BBD+03] is one of the earliest successful near equilibrium programs for poker, and is available as "Sparbot" in the commercial title Poker Academy. PsOpti6 is a later and weaker variant, but one whose weaknesses are thought to be less obvious to human players. Together, PsOpti4 and PsOpti6 formed Hyperborean, the winner of the AAAI 2006 Computer Poker Competition. S1239, S1399, and S2298 are similar near equilibrium strategies generated by a new equilibrium computation method [ZBB07] using a much larger abstraction than is used in PsOpti4 and PsOpti6. A60 and A80 are two past failed attempts at generating interesting exploitive strategies, and are highly exploitable for over 1000 mb/h. CFR5 is a new near Nash equilibrium [ZJBP08], and uses the abstraction described in the accompanying technical report [JZB07].
We will also experiment with two programs, Bluffbot and Monash, which placed second and third respectively in the AAAI 2006 Computer Poker Competition's bankroll event [ZL06].

4 Frequentist Best Response

In the introduction, we described best response counter-strategies as brittle, performing poorly when playing against a different strategy from the one they were computed to exploit. In this section, we examine this claim empirically in the domain of poker. Since a best response computation is intractable in the full game, we first describe a technique, called frequentist best response, for finding a "good response" using an abstract game. As described in the previous section, given a strategy in an abstract game we can compute a best response to that strategy within the abstraction. The challenge is that the abstraction used by an arbitrary opponent is not known. In addition, it may be beneficial to find a best response in an alternative, possibly more powerful, abstraction.

Suppose we want to find a "good response" to some strategy P. The basic idea of frequentist best response (FBR) is to observe P playing the full game of poker, construct a model of it in an abstract game (unrelated to P's own abstraction), and then compute a best response in this abstraction. FBR first needs many examples of the strategy playing the full, unabstracted game. It then iterates through every one of P's actions for every hand. It finds the action's associated information set in the abstract game and increments a counter associated with that information set and action. After observing a sufficient number of hands, we can construct a strategy in the abstract game based on the frequency counts.
At each information set, we set the strategy's probability for performing each action to be the number of observations of that action being chosen from that information set, divided by the total number of observations in the information set. If an information set was never observed, the strategy defaults to the call action. Since this strategy is defined in a known abstraction, FBR can simply calculate a best response to this frequentist strategy.

P's opponent in the observed games greatly affects the quality of the model. We have found it most effective to have P play against a trivial strategy that calls and raises with equal probability. This provides us with the most observations of P's decisions, well distributed throughout the possible betting sequences. Observing P in self-play or against near equilibrium strategies has been shown to require considerably more observed hands. We typically use 5 million hands of training data to compute the model strategy, although reasonable responses can still be computed with as few as 1 million hands.

Evaluation. We computed frequentist best response strategies against seven different opponents. We played each resulting response both against the opponent it was designed to exploit and against the other six opponents and an approximate equilibrium strategy computed using the same abstraction. The results of this tournament are shown as a crosstable in Table 1. Positive numbers are in favor of the row player (the FBR strategies, in this case).

The first thing to notice is that FBR is very successful at exploiting the opponent it was designed to exploit, i.e., the diagonal of the crosstable is positive and often large. In some cases, FBR identified strategies exploiting the opponent for more than was previously known to be possible, e.g., PsOpti4 had only previously been exploited for 75 mb/h [Sch06], while FBR exploits it for 137 mb/h.
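As an aside, the frequency-count model construction described above can be sketched as follows. This is a minimal illustration of ours, not the competition code: real FBR first maps each observed full-game action into its abstract game, and the information-set labels below are invented for the example.

```python
from collections import defaultdict

# Sketch of the frequentist model: count observed (information set, action)
# pairs, normalize the counts into action probabilities, and default to
# "call" at information sets that were never observed.

ACTIONS = ("fold", "call", "raise")

def build_frequentist_strategy(observations):
    counts = defaultdict(lambda: {a: 0 for a in ACTIONS})
    for infoset, action in observations:
        counts[infoset][action] += 1

    def strategy(infoset):
        c = counts.get(infoset)
        if c is None or sum(c.values()) == 0:
            # Unobserved information set: default to the call action.
            return {a: (1.0 if a == "call" else 0.0) for a in ACTIONS}
        total = sum(c.values())
        return {a: c[a] / total for a in ACTIONS}

    return strategy

# Hypothetical observations: the modeled player raised 3 times and called
# once from the same (abstract) information set.
obs = [("AA:raise", "raise")] * 3 + [("AA:raise", "call")]
model = build_frequentist_strategy(obs)
print(model("AA:raise"))   # {'fold': 0.0, 'call': 0.25, 'raise': 0.75}
print(model("72o:raise"))  # unseen -> pure call
```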
The second thing to notice is that when FBR strategies play against other opponents their performance is poor, i.e., the off-diagonal of the crosstable is generally negative, occasionally by a large amount. For example, A60 is not a strong program. It is exploitable for over 2000 mb/h (note that always folding loses only 750 mb/h), and an approximate equilibrium strategy defeats it by 93 mb/h. Yet every FBR strategy besides the one trained on it loses to it, sometimes by a substantial amount. These results give evidence that best response is, in practice, a brittle computation, and can perform poorly when the model is wrong.

                PsOpti4  PsOpti6    A60    A80  S1239  S1399  S2298   CFR5  Average
FBR-PsOpti4         137     -163   -227   -231   -106    -85   -144   -129     -210
FBR-PsOpti6         -79      330    -68    -89    -36    -23    -48    -14      -97
FBR-A60            -442     -499   2170   -701   -359   -305   -377   -620     -142
FBR-A80            -312     -281   -557   1048   -251   -231   -266   -148     -331
FBR-S1239           -20      105    -89    -42    106     91    -32      3      -87
FBR-S1399           -43       38    -48    -77     75    118    -46    -11     -109
FBR-S2298           -39       51    -50    -26     42     50     33      2      -41
CFR5                 36      123     93     41     70     68     17      0       56
Max                 137      330   2170   1048    106    118     33      0

Table 1: Results of frequentist best responses (FBR) against a variety of opponent programs in full Texas Hold'em, with winnings in mb/h for the row player. Results involving PsOpti4 or PsOpti6 used 10 duplicate matches of 10,000 hands and are significant to 20 mb/h. Other results used 10 duplicate matches of 500,000 hands and are significant to 2 mb/h.

One exception to this trend is play within the family of S-bots. In particular, consider S1399 and S1239, which are very similar programs, using the same technique for equilibrium computation with the same abstract game. They differ only in the number of iterations the algorithm was afforded. The results show they do share weaknesses, as FBR-S1399 does beat S1239 by 75 mb/h.
However, this is 30% less than the 106 mb/h by which FBR-S1239 beats the same opponent. Considering the similarity of these opponents, even this apparent exception suggests that best response is not robust to even slight changes in the model.

Finally, consider the performance of the approximate equilibrium player, CFR5. As it was computed from a relatively large abstraction, it performs comparably well, not losing to any of the seven opponents. However, it also does not win by the margins of the correct FBR strategy. As noted, against the highly exploitable A60, it wins by a mere 93 mb/h. What we really want is a compromise: a strategy that can exploit an opponent successfully like FBR, but without the large penalty when playing against a different opponent. The remainder of the paper examines restricted Nash response, a technique for creating such strategies.

5 Restricted Nash Response

Imagine that you had a model of your opponent, but did not believe that this model was perfect. The model may capture the general idea of the adversary you expect to face, but most likely is not identical. For example, maybe you have played a previous version of the same program and have a model of its play, but suspect that the designer is likely to have made some small improvements in the new version. One way to explicitly define our situation is that with the new version we might expect 75 percent of the hands to be played identically to the old version. The other 25 percent is some new modification, against which we want to be robust. This, in itself, can be thought of as a game to which we can apply the usual game-theoretic machinery of equilibria.

Let our model of our opponent be some strategy σfix ∈ Σ2. Define Σ2^{p,σfix} to be those strategies of the form mixp(σfix, σ′2), where σ′2 is an arbitrary strategy in Σ2. Define the set of restricted best responses to σ1 ∈ Σ1 to be:

    BR^{p,σfix}(σ1) = argmax_{σ2 ∈ Σ2^{p,σfix}} u2(σ1, σ2)    (5)

A (p, σfix) restricted Nash equilibrium is a pair of strategies (σ*1, σ*2) where σ*1 ∈ BR(σ*2) and σ*2 ∈ BR^{p,σfix}(σ*1). In this pair, the strategy σ*1 is a p-restricted Nash response (RNR) to σfix. We propose these RNRs as ideal counter-strategies for σfix, where p provides a balance between exploitation and exploitability. This concept is closely related to ε-safe best responses [MB04]. Define Σ1^{ε-safe} ⊆ Σ1 to be the set of all strategies which are ε-safe (with an exploitability less than ε). Then the set of ε-safe best responses is:

    BR^{ε-safe}(σ2) = argmax_{σ1 ∈ Σ1^{ε-safe}} u1(σ1, σ2)    (6)

Theorem 1 For all σ2 ∈ Σ2, for all p ∈ (0, 1], if σ1 is a p-RNR to σ2, then there exists an ε such that σ1 is an ε-safe best response to σ2.

(a) Versus PsOpti4    (b) Versus A80
Figure 1: The tradeoff between ε and utility. For each opponent, we varied p ∈ [0, 1] for the RNR. The labels at each datapoint indicate the value of p used.

The proof of Theorem 1 is in an accompanying technical report [JZB07]. The significance of Theorem 1 is that, among all strategies that are at most ε suboptimal, the RNR strategies are among the best responses. Thus, if we want a strategy that is at most ε suboptimal, we can vary p to produce a strategy that is the best response among all such ε-safe strategies.

Unlike safe best responses, an RNR can be computed by just solving a modification of the original abstract game.
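In matrix-game form the modified game is easy to see directly. This sketch is ours, not the paper's extensive-game implementation: restricting the column player to mixp(σfix, y) makes the row player's payoff p·u1(x, σfix) + (1−p)·u1(x, y), which is an ordinary zero-sum game with a shifted payoff matrix; we then solve it approximately with simple regret-matching self-play on Ro-Sham-Bo against an assumed rock-heavy model σfix.

```python
# Our matrix-game sketch of a restricted Nash response. For a zero-sum
# matrix game A (row player's utility), restricting the column player to
# mix_p(sigma_fix, y) yields an equivalent zero-sum game with matrix
#   B[i][j] = (1 - p) * A[i][j] + p * (A @ sigma_fix)[i],
# whose row-player equilibrium strategy is the p-RNR to sigma_fix.

A = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]  # Ro-Sham-Bo: rock, paper, scissors
sigma_fix = [0.8, 0.1, 0.1]               # rock-heavy opponent model (assumed)

def normalize(regrets):
    """Regret matching: play actions in proportion to positive regret."""
    pos = [max(r, 0.0) for r in regrets]
    s = sum(pos)
    return [r / s for r in pos] if s > 0 else [1.0 / len(regrets)] * len(regrets)

def restricted_nash_response(p, iters=30000):
    n = len(A)
    c = [sum(A[i][j] * sigma_fix[j] for j in range(n)) for i in range(n)]
    B = [[(1 - p) * A[i][j] + p * c[i] for j in range(n)] for i in range(n)]
    reg1, reg2, avg1 = [0.0] * n, [0.0] * n, [0.0] * n
    for _ in range(iters):
        x, y = normalize(reg1), normalize(reg2)
        for k in range(n):
            avg1[k] += x[k]
        u_row = [sum(B[i][j] * y[j] for j in range(n)) for i in range(n)]
        u_col = [-sum(x[i] * B[i][j] for i in range(n)) for j in range(n)]
        v1 = sum(x[i] * u_row[i] for i in range(n))
        v2 = sum(y[j] * u_col[j] for j in range(n))
        for i in range(n):
            reg1[i] += u_row[i] - v1
        for j in range(n):
            reg2[j] += u_col[j] - v2
    total = sum(avg1)
    return [a / total for a in avg1]  # average strategy approximates the RNR

# p = 1 recovers the brittle best response (always paper against rock-heavy);
# p = 0 ignores the model and recovers the uniform equilibrium.
print(restricted_nash_response(1.0))  # ~[0, 1, 0]
print(restricted_nash_response(0.0))  # ~[1/3, 1/3, 1/3]
```

Intermediate values of p interpolate between these extremes, which is exactly the exploitation-versus-exploitability dial explored in Figure 1.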
For example, if using the sequence-form linear programming representation, one just needs to add lower-bound constraints on the restricted player's realization plan probabilities. In our experiments we use a recently developed solution technique based on regret minimization [ZJBP08] with a modified game that starts with an unobserved chance node deciding whether the restricted player is forced to use strategy σfix on the current hand. The RNRs used in our experiments were computed with less than a day of computation on a 2.4 GHz AMD Opteron.

Choosing p. In order to compute an RNR we have to choose a value of p. By varying the value p ∈ [0, 1], we can produce poker strategies that are closer to a Nash equilibrium (when p is near 0) or closer to the best response (when p is near 1). When producing an RNR to a particular opponent, it is useful to consider the tradeoff between the utility of the response against that opponent and the exploitability of the response itself. We explore this tradeoff in Figure 1. In Figure 1a we plot the results of using RNR with various values of p against the model of PsOpti4. The x-axis shows the exploitability of the response, ε. The y-axis shows the exploitation of the model by the response in the abstract game. Note that the actual exploitation and exploitability in the full game may be different, as we explore later. Figure 1b shows this tradeoff against A80.

Notice that by selecting values of p, we can control the tradeoff between ε and the response's exploitation of the strategy. More importantly, the curves are highly concave, meaning that dramatic reductions in exploitability can be achieved with only a small sacrifice in the ability to exploit the model.

Evaluation. 
We used RNR to compute a counter-strategy to each of the same seven opponents used in the FBR experiments, with the p value for each opponent selected such that the resulting ε is close to 100 mb/h. The RNR strategies were played against these seven opponents and the equilibrium CFR5 in the full game of Texas Hold'em. The results of this tournament are displayed as a crosstable in Table 2.

The first thing to notice is that RNR is capable of exploiting the opponent for which it was designed as a counter-strategy, while still performing well against the other opponents. In other words, not only is the diagonal positive and large, most of the crosstable is positive. For the highly exploitable opponents, such as A60 and A80, the degree of exploitation is much reduced from FBR, which is a consequence of choosing p such that ε is at most 100 mb/h. Notice, though, that RNR still exploits these opponents significantly more than the approximate Nash strategy (CFR5) does.

                PsOpti4  PsOpti6    A60    A80  S1239  S1399  S2298   CFR5  Average
RNR-PsOpti4          85      112     39      9     63     61     -1    -23       43
RNR-PsOpti6          26      234     72     34     59     59      1    -28       57
RNR-A60             -17       63    582    -22     37     39     -9    -45       78
RNR-A80              -7       66     22    293     11     12      0    -29       46
RNR-S1239            38      130     68     31    111    106      9    -20       59
RNR-S1399            31      136     66     29    105    112      6    -24       58
RNR-S2298            21      137     72     30     77     76     31    -11       54
CFR5                 36      123     93     41     70     68     17      0       56
Max                  85      234    582    293    111    112     31      0

Table 2: Results of restricted Nash responses (RNR) against a variety of opponent programs in full Texas Hold'em, with winnings in mb/h for the row player. See the caption of Table 1 for match details.

Figure 2: Performance of FBR-experts, RNR-experts, and a near Nash equilibrium strategy (CFR5) against "training" opponents and "holdout" opponents in 50 duplicate matches of 1000 hands.

Revisiting the family of S-bots, we notice that the known similarity of S1239 and S1399 is more apparent with RNR. The performance of RNR with the correct model against these two players is close to that of FBR, while the performance with the similar model drops by only 6 mb/h. Essentially, RNR is forced to exploit only the weaknesses that are general, and so is robust to small changes. Overall, RNR offers a similar degree of exploitation to FBR, but with far more robustness.

6 Restricted Nash Experts

We have shown that RNR can be used to find robust counter-strategies. In this section we investigate their use in an adaptive poker program. We generated four counter-strategies based on the opponents PsOpti4, A80, S1399, and S2298, and then used these as experts among which UCB1 [ACBF02] (a regret-minimizing algorithm) selected. The FBR-experts algorithm used an FBR to each opponent, and the RNR-experts algorithm used an RNR to each opponent. We then played these two expert mixtures in 1000-hand matches against both the four programs used to generate the counter-strategies and two programs from the 2006 AAAI Computer Poker Competition, which have an unknown origin and were developed independently of the other programs. We call the first four programs "training opponents" and the other two programs "holdout opponents", as they are analogous to training error and holdout error in supervised learning.

The results of these matches are shown in Figure 2.
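The expert-selection loop can be sketched as follows. This is our own minimal UCB1 illustration, not the tournament code: the bandit arms stand in for the four counter-strategies, and the simulated per-expert reward means (hand outcomes rescaled to [0, 1]) are invented for the example.

```python
import math
import random

# Minimal UCB1 sketch for choosing among counter-strategy "experts".
# At each hand, play the expert maximizing its empirical mean reward
# plus the exploration bonus sqrt(2 ln t / n_i).

def ucb1(expert_means, hands=2000, seed=0):
    rng = random.Random(seed)
    n = len(expert_means)
    counts = [0] * n
    sums = [0.0] * n
    for t in range(hands):
        if t < n:
            i = t  # play each expert once to initialize
        else:
            i = max(range(n), key=lambda k: sums[k] / counts[k]
                    + math.sqrt(2 * math.log(t) / counts[k]))
        # Simulated, stationary reward: noisy outcome clamped to [0, 1].
        reward = min(1.0, max(0.0, rng.gauss(expert_means[i], 0.1)))
        counts[i] += 1
        sums[i] += reward
    return counts

# Expert 2 is the best counter-strategy against this simulated opponent,
# so UCB1 should come to play it far more often than the others.
counts = ucb1([0.45, 0.50, 0.70, 0.40])
print(counts)
```

In the paper's setting the "reward" for a hand is its mb/h outcome, which must likewise be rescaled into a bounded range for UCB1's confidence bonus to apply.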
As expected, when the opponent matches one of the training models, FBR-experts and RNR-experts perform better, on average, than a near equilibrium strategy (see "Training Average" in Figure 2). However, if we look at the breakdown against individual opponents, we see that all of FBR's performance comes from its ability to significantly exploit one single opponent. Against the other opponents, it actually performs worse than the non-adaptive near equilibrium strategy. RNR does not exploit A80 to the same degree as FBR, but also does not lose to any opponent.

The comparison with the holdout opponents, though, is more realistic and more telling. Since it is unlikely a player will have a model of the exact program it will face in a competition, it is important for its counter-strategies to exploit general weaknesses that might be encountered. Our holdout programs have no explicit relationship to the training programs, yet the RNR counter-strategies are still effective at exploiting them, as demonstrated by the expert mixture exploiting these programs by more than the near equilibrium strategy does. The FBR counter-strategies, on the other hand, performed poorly outside of the training programs, demonstrating once again that RNR counter-strategies are both more robust and more suitable as a basis for adapting behavior to other agents in the environment.

7 Conclusion

We proposed a new technique for generating robust counter-strategies in multiagent scenarios. The restricted Nash responses balance exploiting suspected tendencies in other agents' behavior with bounding the worst-case performance when the tendency is not observed.
The technique involves computing an approximate equilibrium of a modification of the original game, and therefore can make use of recently developed algorithms for solving very large extensive games. We demonstrated the technique in the domain of poker, showing it to generate more robust counter-strategies than traditional best response. We also showed that a simple mixture-of-experts algorithm based on restricted Nash response counter-strategies was far superior to one using best response counter-strategies when the exact opponent was not used in training. Further, the restricted Nash experts algorithm outperformed a static non-adaptive near equilibrium at exploiting the previously unseen programs.

References

[ACBF02] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235-256, 2002.

[BBD+03] D. Billings, N. Burch, A. Davidson, R. Holte, J. Schaeffer, T. Schauenberg, and D. Szafron. Approximating game-theoretic optimal strategies for full-scale poker. In International Joint Conference on Artificial Intelligence, pages 661-668, 2003.

[CM96] David Carmel and Shaul Markovitch. Learning models of intelligent agents. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, Menlo Park, CA, 1996. AAAI Press.

[GHPS07] A. Gilpin, S. Hoda, J. Pena, and T. Sandholm. Gradient-based algorithms for finding Nash equilibria in extensive form games. In Proceedings of the Eighteenth International Conference on Game Theory, 2007.

[GS06] A. Gilpin and T. Sandholm. A competitive Texas Hold'em poker player via automated abstraction and real-time equilibrium computation. In National Conference on Artificial Intelligence, 2006.

[JZB07] Michael Johanson, Martin Zinkevich, and Michael Bowling. Computing robust counter-strategies. Technical Report TR07-15, Department of Computing Science, University of Alberta, 2007.

[MB04] Peter McCracken and Michael Bowling. Safe strategies for agent modelling in games. In AAAI Fall Symposium on Artificial Multi-agent Learning, October 2004.

[Rob51] Julia Robinson. An iterative method of solving a game. Annals of Mathematics, 54:296-301, 1951.

[RV02] Patrick Riley and Manuela Veloso. Planning for distributed execution through use of probabilistic opponent models. In Proceedings of the Sixth International Conference on AI Planning and Scheduling, pages 77-82, April 2002.

[Sch06] T.C. Schauenberg. Opponent modelling and search in poker. Master's thesis, University of Alberta, 2006.

[ZBB07] M. Zinkevich, M. Bowling, and N. Burch. A new algorithm for generating strong strategies in massive zero-sum games. In Proceedings of the Twenty-Second Conference on Artificial Intelligence (AAAI), 2007. To appear.

[ZJBP08] M. Zinkevich, M. Johanson, M. Bowling, and C. Piccione. Regret minimization in games with incomplete information. In Neural Information Processing Systems 21, 2008.

[ZL06] M. Zinkevich and M. Littman. The AAAI Computer Poker Competition. Journal of the International Computer Games Association, 29, 2006. News item.
", "award": [], "sourceid": 807, "authors": [{"given_name": "Michael", "family_name": "Johanson", "institution": null}, {"given_name": "Martin", "family_name": "Zinkevich", "institution": null}, {"given_name": "Michael", "family_name": "Bowling", "institution": null}]}