{"title": "Optimistic Regret Minimization for Extensive-Form Games via Dilated Distance-Generating Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 5221, "page_last": 5231, "abstract": "We study the performance of optimistic regret-minimization algorithms for both minimizing regret in, and computing Nash equilibria of, zero-sum extensive-form games. In order to apply these algorithms to extensive-form games, a distance-generating function is needed. We study the use of the dilated entropy and dilated Euclidean distance functions. For the dilated Euclidean distance function we prove the first explicit bounds on the strong-convexity parameter for general treeplexes. Furthermore, we show that the use of dilated distance-generating functions enable us to decompose the mirror descent algorithm, and its optimistic variant, into local mirror descent algorithms at each information set. This decomposition mirrors the structure of the counterfactual regret minimization framework, and enables important techniques in practice, such as distributed updates and pruning of cold parts of the game tree. Our algorithms provably converge at a rate of $T^{-1}$, which is superior to prior counterfactual regret minimization algorithms. We experimentally compare to the popular algorithm CFR+, which has a theoretical convergence rate of $T^{-0.5}$ in theory, but is known to often converge at a rate of $T^{-1}$, or better, in practice. We give an example matrix game where CFR+ experimentally converges at a relatively slow rate of $T^{-0.74}$, whereas our optimistic methods converge faster than $T^{-1}$. We go on to show that our fast rate also holds in the Kuhn poker game, which is an extensive-form game. For games with deeper game trees however, we find that CFR+ is still faster. 
Finally we show that when the goal is minimizing regret, rather than computing a Nash equilibrium, our optimistic methods can outperform CFR+, even in deep game trees.", "full_text": "Optimistic Regret Minimization for Extensive-Form Games via Dilated Distance-Generating Functions*

Gabriele Farina
Computer Science Department
Carnegie Mellon University
gfarina@cs.cmu.edu

Christian Kroer
IEOR Department
Columbia University
christian.kroer@columbia.edu

Tuomas Sandholm
Computer Science Department, CMU
Strategic Machine, Inc.
Strategy Robot, Inc.
Optimized Markets, Inc.
sandholm@cs.cmu.edu

Abstract

We study the performance of optimistic regret-minimization algorithms for both minimizing regret in, and computing Nash equilibria of, zero-sum extensive-form games. In order to apply these algorithms to extensive-form games, a distance-generating function is needed. We study the use of the dilated entropy and dilated Euclidean distance functions. For the dilated Euclidean distance function we prove the first explicit bounds on the strong-convexity parameter for general treeplexes. Furthermore, we show that the use of dilated distance-generating functions enables us to decompose the mirror descent algorithm, and its optimistic variant, into local mirror descent algorithms at each information set. This decomposition mirrors the structure of the counterfactual regret minimization framework, and enables important techniques in practice, such as distributed updates and pruning of cold parts of the game tree. Our algorithms provably converge at a rate of T^-1, which is superior to prior counterfactual regret minimization algorithms. We experimentally compare to the popular algorithm CFR+, which has a theoretical convergence rate of T^-0.5, but is known to often converge at a rate of T^-1, or better, in practice. 
We give an example matrix game where CFR+ experimentally converges at a relatively slow rate of T^-0.74, whereas our optimistic methods converge faster than T^-1. We go on to show that our fast rate also holds in the Kuhn poker game, which is an extensive-form game. For games with deeper game trees, however, we find that CFR+ is still faster. Finally we show that when the goal is minimizing regret, rather than computing a Nash equilibrium, our optimistic methods can outperform CFR+, even in deep game trees.

*The full version of this paper is available on arXiv.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1 Introduction

Extensive-form games (EFGs) are a broad class of games that can model sequential interaction, imperfect information, and stochastic outcomes. To operationalize them they must be accompanied by techniques for computing game-theoretic equilibria such as Nash equilibrium. A notable success story of this is poker: Bowling et al. [1] computed a near-optimal Nash equilibrium for heads-up limit Texas hold'em, while Brown and Sandholm [3] beat top human specialist professionals at the larger game of heads-up no-limit Texas hold'em. Solving extremely large EFGs relies on many methods for dealing with the scale of the problem: abstraction methods are sometimes used to create smaller games [16, 26, 20, 14, 6, 21], endgame solving is used to compute refined solutions to the end of the game in real time [9, 15, 27], and recently depth-limited subgame solving has been very successfully used in real time [28, 8, 5]. At the core of all these methods is a reliance on a fast algorithm for computing approximate Nash equilibria of the abstraction, endgame, and/or depth-limited subgame [28, 8, 5]. In practice the most popular method has been the CFR+ algorithm [38, 35], which was used within all three two-player poker breakthroughs [1, 28, 3]. 
CFR+ has been shown to converge to a Nash equilibrium at a rate of T^-0.5, but in practice it often performs much better, even outperforming faster methods that have a guaranteed rate of T^-1 [7, 24, 23, 4].
Recently, another class of optimization algorithms has been shown to have appealing theoretical properties. Online convex optimization (OCO) algorithms are online variants of first-order methods: at each timestep t they receive some loss function ℓ^t (often a linear loss which is a gradient of some underlying loss function), and must then recommend a point from some convex set based on the series of past points and losses. While these algorithms are generally known to have a T^-0.5 rate of convergence when solving static problems, a recent series of papers showed that when two optimistic OCO algorithms are played against each other, and they have some estimate of the next loss faced, a rate of T^-1 can be achieved [30, 31, 34]. In this paper we investigate the application of these algorithms to EFG solving, both in the regret-minimization setting and for computing approximate Nash equilibria at the optimal rate of O(T^-1). The only prior attempt at using optimistic OCO algorithms in extensive-form games is due to Farina et al. [13]. In that paper, the authors show that by restricting to the weaker notion of stable-predictive optimism, one can mix and match local stable-predictive optimistic algorithms at every decision point in the game as desired and obtain an overall stable-predictive optimistic algorithm that enables O(T^-0.75) convergence to Nash equilibrium. The approach we adopt in this paper is different from that of Farina et al. 
[13] in that our construction does not allow one to pick different regret minimizers for different decision points; however, our algorithms converge to Nash equilibrium at the improved rate O(T^-1).
The main hurdle to overcome is that in all known OCO algorithms a distance-generating function (DGF) is needed to maintain feasibility via proximal operators and ensure that the stepsizes of the algorithms are appropriate for the convex set at hand. For the case of EFGs, the convex set is known as a treeplex, and the so-called dilated DGFs are known to have appealing properties, including closed-form iterate updates and strong convexity properties [18, 24]. In particular, the dilated entropy DGF, which applies the negative entropy at each information set, is known to lead to the state-of-the-art theoretical rate of convergence for iterative methods [24]. Another potential DGF is the dilated Euclidean DGF, which applies the ℓ2 norm as a DGF at each information set. We show the first explicit bounds on the strong-convexity parameter for the dilated Euclidean DGF when applied to the strategy space of an EFG. We go on to show that when a dilated DGF is paired with the online mirror descent (OMD) algorithm, or its optimistic variant, the resulting algorithm decomposes into a recursive application of local online mirror descent algorithms at each information set of the game. This decomposition is similar to the decomposition achieved in the counterfactual regret minimization framework, where a local regret minimizer is applied to the counterfactual regret at each information set. This localization of the updates along the tree structure enables further techniques, such as distributing the updates [3, 6] or skipping updates on cold parts of the game tree [2].
It is well-known that the entropy DGF is the theoretically superior DGF when applied to optimization over a simplex [18]. 
For the treeplex case where the entropy DGF is used at each information set, Kroer et al. [24] showed that the strong theoretical properties of the simplex entropy DGF generalize to the dilated entropy DGF on a treeplex (with earlier weaker results shown by Kroer et al. [22]). Our results on the dilated Euclidean DGF confirm this finding, as the dilated Euclidean DGF has a similar strong convexity parameter, but with respect to the ℓ2 norm, rather than the ℓ1 norm for dilated entropy (having strong convexity with respect to the ℓ1 norm leads to a tighter convergence-rate bound because it gives a smaller matrix norm, another important constant in the rate).
In contrast to these theoretical results, for the case of computing a Nash equilibrium in matrix games it has been found experimentally that the Euclidean DGF often performs much better than the entropy DGF. This was shown by Chambolle and Pock [11] when using a particular accelerated primal-dual algorithm [10, 11] and using the last iterate (as opposed to the uniformly-averaged iterate, as the theory suggests). Kroer [19] recently showed that this extends to the theoretically-sound case of using linear or quadratic averaging in the same primal-dual algorithm, or in mirror prox [29] (the offline variant of optimistic OMD). In this paper we replicate these results when using OCO algorithms: first we show it on a particular matrix game, where we also exhibit a slow T^-0.74 convergence rate of CFR+ (the slowest CFR+ rate seen, to the best of our knowledge). We show that for the Kuhn poker game the last iterate of optimistic OCO algorithms with the dilated Euclidean DGF also converges extremely fast. In contrast to this, we show that for deeper EFGs CFR+ is still faster. 
Finally we compare the performance of CFR+ and optimistic OCO algorithms for minimizing regret, where we find that OCO algorithms perform better.

2 Regret Minimization Algorithms

In this section we present the regret-minimization algorithms that we will work with. We will operate within the framework of online convex optimization [37]. In this setting, a decision maker repeatedly plays against an unknown environment by making decisions x^1, x^2, ... ∈ X for some convex compact set X. After each decision x^t at time t, the decision maker faces a linear loss x ↦ ⟨ℓ^t, x⟩, where ℓ^t is a vector in X. Summarizing, the decision maker makes a decision x^{t+1} based on the sequence of losses ℓ^1, ..., ℓ^t as well as the sequence of past iterates x^1, ..., x^t.
The quality metric for a regret minimizer is its cumulative regret, which is the difference between the loss cumulated by the sequence of decisions x^1, ..., x^T and the loss that would have been cumulated by playing the best-in-hindsight time-independent decision x̂. Formally, the cumulative regret up to time T is

R^T := Σ_{t=1}^T ⟨ℓ^t, x^t⟩ − min_{x̂ ∈ X} { Σ_{t=1}^T ⟨ℓ^t, x̂⟩ }.

A “good” regret minimizer is such that the cumulative regret grows sublinearly in T.
The algorithms we consider assume access to a distance-generating function d : X → R, which is 1-strongly convex (with respect to some norm) and continuously differentiable on the interior of X. Furthermore, d should be such that the gradient of the convex conjugate, ∇d*(g) = argmax_{x ∈ X} ⟨g, x⟩ − d(x), is easy to compute. Following Hoda et al. [18], we say that a DGF satisfying these properties is a nice DGF for X. 
From d we also construct the Bregman divergence D(x ‖ x′) := d(x) − d(x′) − ⟨∇d(x′), x − x′⟩.
First we present two classical regret minimization algorithms. The online mirror descent (OMD) algorithm produces iterates according to the rule

x^{t+1} = argmin_{x ∈ X} { ⟨ℓ^t, x⟩ + (1/η) D(x ‖ x^t) }.   (1)

The follow the regularized leader (FTRL) algorithm produces iterates according to the rule [32]

x^{t+1} = argmin_{x ∈ X} { ⟨Σ_{τ=1}^t ℓ^τ, x⟩ + (1/η) d(x) }.   (2)

OMD and FTRL satisfy regret bounds of the form R^T ≤ O(√(D(x* ‖ x^1)) · L√T) (e.g. Hazan [17]).
The optimistic variants of the classical regret minimization algorithms take as input an additional vector m^{t+1}, which is an estimate of the loss faced at time t + 1 [12, 30]. Optimistic OMD produces iterates according to the rule [30] (note that x^{t+1} is produced before seeing ℓ^{t+1}, while z^{t+1} is produced after)

x^{t+1} = argmin_{x ∈ X} { ⟨m^{t+1}, x⟩ + (1/η) D(x ‖ z^t) },   z^{t+1} = argmin_{z ∈ X} { ⟨ℓ^{t+1}, z⟩ + (1/η) D(z ‖ z^t) }.   (3)

Thus it is like OMD, except that x^{t+1} is generated by an additional step taken using the loss estimate. This additional step is transient in the sense that x^{t+1} is not used as a center for the next iterate.
OFTRL produces iterates according to the rule [30, 34]

x^{t+1} = argmin_{x ∈ X} { ⟨m^{t+1} + Σ_{τ=1}^t ℓ^τ, x⟩ + (1/η) d(x) }.   (4)

Again the loss estimate is used in a transient way: it is used as if we already saw the loss at time t + 1, but then discarded and not used in future iterations.

2.1 Connection to Saddle Points

A bilinear saddle-point problem is a problem of the form min_{x ∈ X} max_{y ∈ Y} { x^T A y }, where X, Y are closed convex sets. This general formulation allows us to capture, among other settings, several game-theoretical applications such as computing Nash equilibria in two-player zero-sum games. In that setting, X and Y are convex polytopes whose description is provided by the sequence-form constraints, and A is a real payoff matrix [36].
The error metric that we use is the saddle-point residual (or gap) ξ of (x̄, ȳ), defined as ξ(x̄, ȳ) := max_{ŷ ∈ Y} ⟨x̄, A ŷ⟩ − min_{x̂ ∈ X} ⟨x̂, A ȳ⟩. 
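To make these update rules concrete, here is a minimal sketch, in our own code (none of these function names come from the paper), of OMD and optimistic OMD over a probability simplex with the entropy DGF, for which both prox steps have the closed-form multiplicative-weights solution:

```python
import numpy as np

def prox_entropy(g, center, eta):
    """argmin_x <g, x> + (1/eta) * D(x || center) over the simplex, where D is
    the Bregman divergence of the entropy DGF; this is the closed-form
    multiplicative-weights update."""
    x = center * np.exp(-eta * g)
    return x / x.sum()

def omd(losses, n, eta):
    """Online mirror descent (rule (1)) on the n-simplex; returns all iterates."""
    x = np.full(n, 1.0 / n)
    iterates = [x]
    for loss in losses:
        x = prox_entropy(loss, x, eta)
        iterates.append(x)
    return iterates

def optimistic_omd(losses, n, eta):
    """Optimistic OMD (rule (3)) with the common prediction m^{t+1} = loss^t."""
    z = np.full(n, 1.0 / n)   # the "z" iterates are the centers
    m = np.zeros(n)           # prediction of the next loss
    xs = []
    for loss in losses:
        x = prox_entropy(m, z, eta)      # transient step using the prediction
        z = prox_entropy(loss, z, eta)   # step using the observed loss
        m = loss                          # predict that the next loss repeats
        xs.append(x)
    return xs
```

With the Euclidean DGF the same prox steps would instead require a projection onto the simplex rather than a closed-form normalization.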
A well-known folk theorem shows that the average of a sequence of regret-minimizing strategies for the choice of losses ℓ^t_X : X ∋ x ↦ (−Ay^t)^T x, ℓ^t_Y : Y ∋ y ↦ (A^T x^t)^T y leads to a bounded saddle-point residual, since one has

ξ(x̄, ȳ) ≤ (1/T)(R^T_X + R^T_Y).   (5)

When X, Y are the players' sequence-form strategy spaces, this implies that the average strategy profile produced by the regret minimizers is a (1/T)(R^T_X + R^T_Y)-Nash equilibrium. This also implies that by using online mirror descent or follow-the-regularized-leader, one obtains an anytime algorithm for computing a Nash equilibrium. In particular, at each time T, the average strategy output by each of the two regret minimizers forms an ε-Nash equilibrium, where ε = O(T^-0.5).

2.2 RVU Property and Fast Convergence to Saddle Points

Both optimistic OMD and optimistic FTRL satisfy the Regret bounded by Variation in Utilities (RVU) property, as given by Syrgkanis et al.:
Definition 1 (RVU property, [34]). We say that a regret minimizer satisfies the RVU property if there exist constants α > 0 and 0 < β ≤ γ, as well as a pair of dual norms (‖·‖, ‖·‖*), such that, no matter what the loss functions ℓ^1, ..., ℓ^T are,

R^T ≤ α + β Σ_{t=1}^T ‖ℓ^t − m^t‖*^2 − γ Σ_{t=1}^T ‖x^t − x^{t−1}‖^2.   (RVU)

The definition given here is slightly more general than that of Syrgkanis et al. [34]: we allow a general estimate m^t of ℓ^t, whereas their definition requires using m^t = ℓ^{t−1}. While the choice m^t = ℓ^{t−1} is often reasonable, in some cases other definitions of the loss prediction are more natural [13]. 
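As a sanity check of the folk-theorem construction in Section 2.1, here is a small self-play sketch in our own code (all helper names are ours): two multiplicative-weights regret minimizers receive the losses from the construction above, and the saddle-point gap of their average strategies is measured, under the convention that the x-player maximizes x^T A y.

```python
import numpy as np

def mw_step(x, loss, eta):
    """Multiplicative-weights prox step (OMD with the entropy DGF on a simplex)."""
    x = x * np.exp(-eta * loss)
    return x / x.sum()

def saddle_gap(A, x_bar, y_bar):
    """Residual xi(x_bar, y_bar) for the bilinear game x^T A y
    (x maximizes, y minimizes): best response gain for each player."""
    return (A @ y_bar).max() - (A.T @ x_bar).min()

def self_play(A, T, eta):
    """Run both regret minimizers against each other; by the folk theorem the
    gap of the *average* strategies is at most (R_X^T + R_Y^T) / T."""
    m, n = A.shape
    x, y = np.full(m, 1 / m), np.full(n, 1 / n)
    x_sum, y_sum = np.zeros(m), np.zeros(n)
    for _ in range(T):
        lx, ly = -A @ y, A.T @ x   # losses from the construction above
        x, y = mw_step(x, lx, eta), mw_step(y, ly, eta)
        x_sum += x
        y_sum += y
    return x_sum / T, y_sum / T
```

On matching pennies, for example, the gap of the average strategies shrinks roughly like the regret bound divided by T, matching the O(T^-0.5) rate of non-optimistic methods.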
In practice, both optimistic OMD and optimistic FTRL satisfy a parametric notion of the RVU property, which depends on the value of the step-size parameter that was chosen to set up either algorithm.
Theorem 1 (Syrgkanis et al. [34]). For all step-size parameters η > 0, optimistic OMD satisfies the RVU conditions with respect to the primal-dual norm pair (‖·‖_1, ‖·‖_∞) with parameters α = R/η, β = η, γ = 1/(8η), where R is a constant that scales with the maximum allowed norm of any loss function ℓ.
Theorem 2. For all step-size parameters η > 0, OFTRL satisfies the RVU conditions with respect to any primal-dual norm pair (‖·‖, ‖·‖*) with parameters α = Δd/η, β = η, γ = 1/(4η), where Δd := max_{x,y ∈ X} {d(x) − d(y)}.
Our proof, available in the appendix of the full paper, generalizes the work by Syrgkanis et al. [34] by extending the proof beyond simplex domains and beyond the fixed choice m^t = ℓ^{t−1}.
It turns out that this is enough to accelerate the convergence to a saddle point in the construction of Section 2.1. 
In particular, by letting the predictions be defined as m^t_X := ℓ^{t−1}_X, m^t_Y := ℓ^{t−1}_Y, we obtain that the residual ξ of the average decisions (x̄, ȳ) satisfies

T ξ(x̄, ȳ) ≤ 2α′/η + η Σ_{t=1}^T ( ‖−Ay^t + Ay^{t−1}‖*^2 + ‖A^T x^t − A^T x^{t−1}‖*^2 ) − (γ′/η) Σ_{t=1}^T ( ‖x^t − x^{t−1}‖^2 + ‖y^t − y^{t−1}‖^2 )
          ≤ 2α′/η + ( η‖A‖_op^2 − γ′/η ) ( Σ_{t=1}^T ‖x^t − x^{t−1}‖^2 + Σ_{t=1}^T ‖y^t − y^{t−1}‖^2 ),

where the first inequality holds by plugging (RVU) into (5), and the second inequality by noting that the operator norm ‖·‖_op of a linear function is equal to the operator norm of its transpose. This implies that when the step-size parameter is chosen as η = √γ′ / ‖A‖_op, the saddle-point gap ξ(x̄, ȳ) satisfies ξ(x̄, ȳ) ≤ 2α′‖A‖_op / (T√γ′) = O(T^-1).

3 Treeplexes and Sequence Form

We formalize a sequential decision process as follows. We assume that we have a set of decision points J. Each decision point j ∈ J has a set of actions A_j of size n_j. Given a specific action a at j, the set of possible decision points that the agent may next face is denoted by C_{j,a}. It can be an empty set if no more actions are taken after j, a. 
We assume that the decision points form a tree, that is, C_{j,a} ∩ C_{j′,a′} = ∅ for all distinct pairs of decision points and actions (j, a) ≠ (j′, a′). This condition is equivalent to the perfect-recall assumption in extensive-form games, and to conditioning on the full sequence of actions and observations in a finite-horizon partially-observable decision process. In our definition, the decision space starts with a root decision point, whereas in practice multiple root decision points may be needed, for example in order to model different starting hands in card games. Multiple root decision points can be modeled by having a dummy root decision point with only a single action.
The set of possible next decision points after choosing action a ∈ A_j at decision point j ∈ J, denoted C_{j,a}, can be thought of as representing the different decision points that an agent may face after taking action a and then making an observation on which she can condition her next action choice. In addition to games, our model of sequential decision process captures, for example, partially-observable Markov decision processes and Markov decision processes where we condition on the entire history of observations and actions.
As an illustration, consider the game of Kuhn poker [25]. Kuhn poker consists of a three-card deck: king, queen, and jack. The action space for the first player is shown in Figure 1. For instance, we have: J = {0, 1, 2, 3, 4, 5, 6}; n_0 = 1; n_j = 2 for all j ∈ J \ {0}; A_0 = {start}, A_1 = A_2 = A_3 = {check, raise}, A_4 = A_5 = A_6 = {fold, call}; C_{0,start} = {1, 2, 3}, C_{1,raise} = ∅, C_{3,check} = {6}; etc.
The expected loss for a given strategy is non-linear in the vectors of probability masses for each decision point j. This non-linearity is due to the probability of reaching each j, which is computed as the product of the probabilities of all actions on the path from the root to j. 
An alternative formulation which preserves linearity is called the sequence form. In the sequence-form representation, the simplex strategy space at a generic decision point j ∈ J is scaled by the decision variable associated with the last action in the path from the root of the process to j. In this formulation, the value of a particular action represents the probability of playing the whole sequence of actions from the root to that action. This allows each term in the expected loss to be weighted only by the sequence ending in the corresponding action. The sequence form has been used to instantiate linear programming [36] and first-order methods [18, 22, 24] for computing Nash equilibria of zero-sum EFGs.

[Figure 1: Sequential action space for the first player in the game of Kuhn poker. One node marker denotes an observation point; another represents the end of the decision process.]

Formally, the sequence-form representation X of a sequential decision process can be obtained recursively, as follows: for every j ∈ J, a ∈ A_j, we let X_{↓j,a} := ∏_{j′ ∈ C_{j,a}} X_{↓j′}, where ∏ denotes Cartesian product; at every decision point j ∈ J, we let

X_{↓j} := { (λ_1, ..., λ_{n_j}, λ_1 x_{a_1}, ..., λ_{n_j} x_{a_{n_j}}) : (λ_1, ..., λ_{n_j}) ∈ Δ^{n_j}, x_a ∈ X_{↓j,a} ∀ a ∈ A_j },

where we assumed A_j = {a_1, ..., a_{n_j}}.
The sequence-form strategy space for the whole sequential decision process is then X := {1} × X_{↓r}, where r is the root of the process. The first entry, identically equal to 1 for any point in X, corresponds to what is called the empty sequence. Crucially, X is a convex and compact set, and the expected loss of the process is a linear function over X. 
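To illustrate the recursive definition above, the following sketch (our own toy encoding, not code from the paper) converts behavioral strategies at each decision point into sequence-form realization weights: each entry for a sequence (j, a) is the parent-sequence weight scaled by the local simplex probability, so sibling entries sum to their parent entry.

```python
def to_sequence_form(decision_points, behavior):
    """Convert behavioral strategies to the sequence form.

    decision_points: dict j -> (parent_sequence, list_of_actions); the parent
    sequence is None for the root (None stands for the empty sequence, weight 1).
    behavior: dict j -> dict action -> probability (a point in the simplex).
    Returns dict sequence -> realization weight. The weight of (j, a) is the
    product of the probabilities of all actions on the root-to-(j, a) path.
    """
    seq = {None: 1.0}
    pending = dict(decision_points)
    while pending:
        # Process any decision point whose parent-sequence weight is known,
        # so parents are always handled before their descendants.
        for j, (pj, actions) in list(pending.items()):
            if pj in seq:
                for a in actions:
                    seq[(j, a)] = seq[pj] * behavior[j][a]
                del pending[j]
    return seq
```

Note how the output satisfies the sequence-form constraint used throughout the paper: the entries of a decision point's actions sum to the weight of its parent sequence.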
With the sequence-form representation, the problem of computing a Nash equilibrium in an EFG can be formulated as a bilinear saddle-point problem (see Section 2.1), where X and Y are the sequence-form strategy spaces of the sequential decision processes faced by the two players, and A is a sparse matrix encoding the leaf payoffs of the game.
As we have already observed, vectors that pertain to the sequence form have one entry for each sequence of the decision process. We denote with v_φ the entry in v corresponding to the empty sequence, and v_{ja} the entry corresponding to any other sequence (j, a), where j ∈ J, a ∈ A_j. Sometimes, we will need to slice a vector v and isolate only those entries that refer to all decision points j′ and actions a′ ∈ A_{j′} that are at or below some j ∈ J; we will denote such an operation as v_{↓j}. Similarly, we introduce the syntax v_j to denote the subset of n_j = |A_j| entries of v that pertain to all actions a ∈ A_j at decision point j ∈ J. Finally, note that for any j ∈ J − {r} there is a unique sequence (j′, a′), denoted p_j and called the parent sequence of decision point j, such that j ∈ C_{j′,a′}. When j = r is the root decision point, we let p_r := φ, the empty sequence.

4 Dilated Distance Generating Functions

We will be interested in a particular type of DGF which is suitable for sequential decision-making problems: a dilated DGF. A dilated DGF is constructed by taking a sum over suitable local DGFs for each decision point, where each local DGF is dilated by the parent variable leading to the decision point: d(x) = Σ_{j ∈ J} x_{p_j} d_j(x_j / x_{p_j}). Each “local” DGF d_j is given the local variable x_j divided by x_{p_j}, so that x_j / x_{p_j} ∈ Δ^{n_j}. 
The idea is that d_j can be any DGF suitable for Δ^{n_j}; by multiplying d_j by x_{p_j} and taking a sum over J we construct a DGF for the whole treeplex from these local DGFs. Hoda et al. [18] showed that dilated DGFs have many of the desired properties of a DGF for an optimization problem over a treeplex.
We now present two local DGFs for simplexes, which are by far the most common in practice. In the following we let b be a vector in the n-dimensional simplex Δ^n. First, the Euclidean DGF d(b) = ½‖b‖_2^2, which is 1-strongly convex with respect to the ℓ2 norm; secondly, the negative entropy DGF d(b) = Σ_{i=1}^n b_i log(b_i) (we will henceforth drop the “negative” and simply refer to it as the entropy DGF), which is 1-strongly convex with respect to the ℓ1 norm. The strong convexity properties of the dilated entropy DGF were shown by Kroer et al. [24] (with earlier weaker results shown by Kroer et al. [22]). However, for the dilated Euclidean DGF a setup for achieving a strong-convexity parameter of 1 was unknown until now; Hoda et al. [18] show that a strong-convexity parameter exists, but do not show what it is for the general case (they give specific results for a particular class of uniform treeplexes). We now show how to achieve this.
We are now ready to state our first result on dilated regularizers that are strongly convex with respect to the Euclidean norm:
Theorem 3. Let d(x) = Σ_{j ∈ J} x_{p_j} d_j(x_j / x_{p_j}), where for all j, d_j is μ_j-strongly convex with respect to the Euclidean norm over Δ^{n_j}. Furthermore, define σ_{j,a} := μ_j / 2 − Σ_{j′ ∈ C_{j,a}} μ_{j′}, and σ̄ := min_{j,a} σ_{j,a}. Then, d is σ̄-strongly convex with respect to the Euclidean norm over X.
We can immediately use Theorem 3 to prove the following corollary:
Corollary 1. Let σ̄ > 0 be arbitrary, and for all j let d_j be a μ_j-strongly convex function over Δ^{n_j} with respect to the Euclidean norm, where the μ_j's satisfy

μ_j = 2σ̄ + 2 max_{a ∈ A_j} Σ_{j′ ∈ C_{j,a}} μ_{j′}.   (6)

Then, d(x) = Σ_{j ∈ J} x_{p_j} d_j(x_j / x_{p_j}) is σ̄-strongly convex over X with respect to the Euclidean norm.

5 Local Regret Minimization

We now show that OMD and optimistic OMD run on a treeplex X with a dilated DGF can both be interpreted as locally minimizing a modified variant of the loss at each information set, with correspondingly-modified loss predictions. The modified local loss at a given information set j takes into account the loss and DGF below j by adding the expectation with respect to the next iterate x^t_{↓j}. In practice this modified loss is easily handled by computing x^t bottom-up, thereby visiting j after having visited the whole subtree below.
We first show that the problem of computing the prox mapping, the minimizer of a linear term plus the Bregman divergence, decomposes into local prox mappings at each simplex of a treeplex. This will then be used to show that OMD and optimistic OMD can be viewed as a tree of local simplex-instantiations of the respective algorithms.

5.1 Decomposition into Local Prox Mappings with a Dilated DGF

We will be interested in solving the following prox mapping, which takes place in the sequence form:

Prox(g, x̂) = argmin_{x ∈ X} { ⟨g, x⟩ + D(x ‖ x̂) }.   (7)

The reason is that the update applied at each iteration of several OCO algorithms run on the sequence-form polytope X can be described as an instantiation of this prox mapping. 
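As a concrete building block for prox mappings like (7), here is a sketch of the local simplex prox step with the Euclidean DGF, which reduces to a Euclidean projection onto the simplex. We assume the normalization d(b) = ½‖b‖², and all function names are our own:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]                  # entries in decreasing order
    css = np.cumsum(u)
    ks = np.arange(1, len(v) + 1)
    # Largest index whose entry stays positive after the uniform shift.
    rho = np.nonzero(u + (1.0 - css) / ks > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + tau, 0.0)

def prox_euclidean(g, center, eta):
    """argmin_x <g, x> + (1/eta) * D(x || center) over the simplex, for the
    DGF d(b) = 0.5*||b||^2, whose Bregman divergence is D(x||c) = 0.5*||x - c||^2.
    The minimizer is the Euclidean projection of (center - eta * g)."""
    return project_simplex(center - eta * g)
```

Unlike the entropy prox step, this update can set actions to exactly zero probability, which is one reason Euclidean-style updates interact well with pruning in practice.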
We now show that this update can be interpreted as a local prox mapping at each decision point, but with a new loss ĝ_j that depends on the update applied in the subtree beneath j.
Proposition 1 (Decomposition into local prox mappings). A prox mapping (7) on a treeplex with a Bregman divergence constructed from a dilated DGF decomposes into local prox mappings at each decision point j, where the solution is as follows:

x*_j = x*_{p_j} · argmin_{b_j ∈ Δ^{n_j}} { ⟨ĝ_j, b_j⟩ + D_j(b_j ‖ x̂_j / x̂_{p_j}) },

where

ĝ_{j,a} = g_{j,a} + Σ_{j′ ∈ C_{j,a}} [ d*_{↓j′}( −g_{↓j′} + ∇d_{↓j′}(x̂_{↓j′}) ) − d_{j′}(x̂_{j′} / x̂_{p_{j′}}) + ⟨ ∇d_{j′}(x̂_{j′} / x̂_{p_{j′}}), x̂_{j′} / x̂_{p_{j′}} ⟩ ].

Hoda et al. [18] and Kroer et al. [23] gave variations on a similar result: that the convex conjugate d*_{↓j}(−g) can be computed in bottom-up fashion similar to the recursion we show here. Proposition 1 is slightly different in that we additionally show that the Bregman divergence also survives the decomposition and can be viewed as a local Bregman divergence. This latter difference will be necessary for showing that OMD can be interpreted as a local regret minimizer.

5.2 Decomposition into Local Regret Minimizers

With Proposition 1 it follows almost directly that OMD and optimistic OMD can be seen as a set of local regret minimizers, one for each simplex. Each produces iterates from its respective simplex, with the overall strategy produced by then applying the sequence-form transformation to these local iterates.
Theorem 4. 
OMD with a dilated DGF for a treeplex $X$ corresponds to running OMD locally at each simplex $j$, with the local loss $\hat\ell^t_j$ constructed according to Proposition 1, using $x^t$ as Bregman divergence center and $x^{t+1}$ for aggregating losses below each simplex. Optimistic OMD corresponds to the optimistic variant of this local OMD, with local loss $\hat\ell^t_j$ and local loss prediction $\hat m^{t+1}_j$ again constructed according to Proposition 1. Here the modified loss uses $z^t_{\downarrow j'}$ and $x^{t+1}$ as Bregman divergence center and for aggregating losses below, respectively; the prediction $\hat m^{t+1}_j$ uses $z^t_{\downarrow j'}$ and $z^{t+1}$.

Unlike OMD and its optimistic variant, FTRL does not have a nice interpretation as a local regret minimizer. The reason is that the prox mapping in (2) or (4) minimizes the sum of all losses, rather than the most recent loss. Because of this, the expected value $\big\langle \sum_{\tau=1}^t \ell^\tau_{\downarrow j},\, x^{t+1}_{\downarrow j} \big\rangle$ at simplex $j$, which influences the modified loss at parent simplexes, is computed based on $x^{t+1}$ for all $t$ losses. Thus there is no local modified loss that could be received at rounds 1 through $t$ that accurately reflects the modified loss needed in Proposition 1.

6 Experimental Evaluation

We experimentally evaluate the performance of optimistic regret minimization methods instantiated with dilated distance-generating functions. We experiment on three games:

• Smallmatrix, a small $2 \times 2$ matrix game. Given a mixed strategy $x = (x_1, x_2) \in \Delta^2$ for Player 1 and a mixed strategy $y = (y_1, y_2) \in \Delta^2$ for Player 2, the payoff function for Player 1 is $u(x, y) = 5x_1y_1 - x_1y_2 + x_2y_2$.
• Kuhn poker, already introduced in Section 3. In Kuhn poker, each player first has to put a payment of 1 into the pot.
Each player is then dealt one of the three cards, and the third is put aside unseen. A single round of betting then occurs: first, Player 1 can check or bet 1. Then:
  – If Player 1 checks, Player 2 can check or raise 1.
    ∗ If Player 2 checks, a showdown occurs; if Player 2 raises, Player 1 can fold or call.
      · If Player 1 folds, Player 2 takes the pot; if Player 1 calls, a showdown occurs.
  – If Player 1 bets, Player 2 can fold or call.
    ∗ If Player 2 folds, Player 1 takes the pot; if Player 2 calls, a showdown occurs.
If no player has folded, a showdown occurs in which the player with the higher card wins.

• Leduc poker, a standard benchmark in imperfect-information game solving [33]. The game is played with a deck consisting of 5 unique cards with 2 copies of each, and consists of two rounds. In the first round, each player places an ante of 1 in the pot and receives a single private card. A round of betting then takes place with a two-bet maximum, with Player 1 going first. A public shared card is then dealt face up and another round of betting takes place. Again, Player 1 goes first, and there is a two-bet maximum. If one of the players has a pair with the public card, that player wins. Otherwise, the player with the higher card wins. All bets in the first round are 1, while all bets in the second round are 2. This game has 390 decision points and 911 sequences per player.

Fast Last-Iterate Convergence. In the first set of experiments (Figure 2, top row), we compare the saddle-point gap of the strategy profiles produced by optimistic OMD and optimistic FTRL to that produced by CFR and CFR+.
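As a concrete reference for the metric being plotted, the saddle-point gap can be computed directly in Smallmatrix, whose payoff function $u(x, y) = 5x_1y_1 - x_1y_2 + x_2y_2$ corresponds to the $2 \times 2$ payoff matrix below. A small illustrative sketch (the helper name is ours, not code from the paper):

```python
import numpy as np

# Payoff matrix for Player 1 in Smallmatrix: u(x, y) = x^T A y.
A = np.array([[5.0, -1.0],
              [0.0,  1.0]])

def saddle_point_gap(x, y):
    """Best-response gap of the profile (x, y): how much the players could
    jointly gain by deviating. It is zero exactly at a Nash equilibrium."""
    best_x = np.max(A @ y)   # Player 1's best-response value against y
    best_y = np.min(x @ A)   # Player 2's best-response value against x
    return best_x - best_y
```

For instance, the profile $x = (1/7, 6/7)$, $y = (2/7, 5/7)$ has gap zero, so it is an exact equilibrium of Smallmatrix, with game value $5/7$.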
Optimistic OMD and optimistic FTRL were set up with the step-size parameter $\eta = 0.1$ in Smallmatrix and $\eta = 2$ in Kuhn poker, and the plots show the last-iterate convergence for the optimistic algorithms, a quantity that has recently received attention in the works of Chambolle and Pock [11] and Kroer [19]. Finally, we instantiated optimistic OMD and optimistic FTRL with the Euclidean distance-generating function as constructed in Corollary 1. The plots show that, at least in these shallow games, optimistic methods are able to produce approximate saddle points up to 12 orders of magnitude more accurate than those of CFR and CFR+.

Interestingly, Smallmatrix appears to be a hard instance for CFR+: linear regression on the first 20 000 iterations of CFR+ shows, with a coefficient of determination of roughly 0.96, that $\log \xi(x^T_*, y^T_*) \approx -0.7375 \cdot \log(T) - 2.1349$, where $(x^T_*, y^T_*)$ is the average strategy profile (computed using linear averaging, as per CFR+'s construction) up to time $T$. In other words, we have evidence of at least one game in which the approximate saddle point computed by CFR+ experimentally has residual bounded below by $\Omega(T^{-0.74})$. This observation suggests that the analysis of CFR+ might actually be quite tight, and that CFR+ is not an accelerated method.

Figure 2 (bottom left) shows the performance of OFTRL in Leduc poker, compared to CFR and CFR+ (we do not show optimistic OMD, which we found to have worse performance than OFTRL). Here OFTRL performs worse than CFR+. This shows that in deeper games, more work has to be done to fully exploit the accelerated bounds of optimistic regret minimization methods.

Comparing the Cumulative Regret. We also compared the algorithms based on the sum of cumulative regrets (again we omit optimistic OMD, which performed worse than OFTRL). In all three games, OFTRL leads to a lower sum of cumulative regrets.
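In the bilinear (matrix-game) case, the regret-sum metric can be made concrete: given the iterates $(x^t, y^t)$ produced by any of the algorithms, each player's external regret compares realized payoffs against the best fixed strategy in hindsight. A sketch of this computation (our own illustration, not code from the paper):

```python
import numpy as np

def regret_sum(A, xs, ys):
    """Sum R1 + R2 of the two players' external regrets after T iterations
    of a matrix game with payoff matrix A (Player 1 maximizes x^T A y)."""
    xs, ys = np.asarray(xs), np.asarray(ys)
    realized = sum(x @ A @ y for x, y in zip(xs, ys))
    r1 = np.max(sum(A @ y for y in ys)) - realized  # Player 1's regret
    r2 = realized - np.min(sum(x @ A for x in xs))  # Player 2's regret
    return r1 + r2
```

Dividing the result by $T$ recovers the $(R_1 + R_2)/T$ upper bound on the saddle-point gap of the uniformly averaged strategies.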
Figure 2 (bottom right) shows the performance of OFTRL in Leduc poker. Here, we used the usual uniform average of iterates $\bar x := \frac{1}{T} \sum_{t=1}^T x^t$ (note that the choice of averaging strategy has no effect on the bottom-right plot).

Figure 2: (Left and upper right) Saddle-point gap as a function of the number of iterations. The plots show the last-iterate convergence for OOMD and OFTRL. (Lower right) Sum of cumulative regret for both players in Leduc. Optimistic OMD (OOMD) and OFTRL use step-size parameter $\eta = 0.1$ in Smallmatrix and $\eta = 2$ in Kuhn. OFTRL uses step-size parameter $\eta = 200$ in Leduc.

OFTRL's performance matches the theory from Theorem 2 and Section 2.2. In particular, we observe that while OFTRL does not beat the state-of-the-art CFR+ in terms of saddle-point gap, it beats it according to the regret-sum metric. The fact that CFR+ performs worse with respect to the regret-sum metric is somewhat surprising: the entire derivation of CFR and CFR+ is based on showing bounds on the regret sum. However, the connection between regret and saddle-point gap (or exploitability) is one-way: if the two regret minimizers (one per player) have regret $R_1$ and $R_2$, then the saddle-point gap can easily be shown to be at most $(R_1 + R_2)/T$. However, nothing prevents it from being much smaller than $(R_1 + R_2)/T$. What we empirically find is that for CFR+ this bound is very loose. We are not sure why this is the case, and it potentially warrants further investigation in the future.

7 Conclusions

We studied how optimistic regret minimization can be applied in the context of extensive-form games, and introduced the first instantiations of regret-based techniques that achieve $T^{-1}$ convergence to Nash equilibrium in extensive-form games.
These methods rely crucially on having a tractable regularizer to maintain feasibility and control the step sizes on the domain at hand, in our case the sequence-form polytope. We provided the first explicit bound on the strong-convexity properties of dilated distance-generating functions with respect to the Euclidean norm. We also showed that when optimistic regret minimization methods are instantiated with dilated distance-generating functions, the regret updates are local to each information set in the game, mirroring the structure of the counterfactual regret minimization framework. This localization of the updates along the tree structure enables further techniques, such as distributing the updates or skipping updates on cold parts of the game tree. Finally, when used in self play, these optimistic regret minimization methods guarantee an optimal $T^{-1}$ convergence rate to Nash equilibrium.

We demonstrated that in shallow games, methods based on optimistic regret minimization can significantly outperform CFR and CFR+, by up to 12 orders of magnitude. In deeper games, more work has to be done to fully exploit the accelerated bounds of optimistic regret minimization methods. However, while the strong CFR+ performance in large games remains a mystery, we elucidate some points about its performance, including showing that its theoretically slow convergence bound is somewhat tight.
Finally, we showed that when the goal is minimizing regret, rather than computing a Nash equilibrium, optimistic methods can outperform CFR+ even in deep game trees.

Acknowledgments

This material is based on work supported by the National Science Foundation under grants IIS-1718457, IIS-1617590, and CCF-1733556, and the ARO under award W911NF-17-1-0082. Gabriele Farina is supported by a Facebook fellowship.

References

[1] Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. Heads-up limit hold'em poker is solved. Science, 347(6218), January 2015.

[2] Noam Brown and Tuomas Sandholm. Reduced space and faster convergence in imperfect-information games via pruning. In International Conference on Machine Learning (ICML), 2017.

[3] Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, page eaao1733, Dec. 2017.

[4] Noam Brown and Tuomas Sandholm. Solving imperfect-information games via discounted regret minimization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 1829–1836, 2019.

[5] Noam Brown and Tuomas Sandholm. Superhuman AI for multiplayer poker. Science, 365(6456):885–890, 2019. ISSN 0036-8075. doi: 10.1126/science.aay2400. URL https://science.sciencemag.org/content/365/6456/885.

[6] Noam Brown, Sam Ganzfried, and Tuomas Sandholm. Hierarchical abstraction, distributed equilibrium computation, and post-processing, with application to a champion no-limit Texas Hold'em agent.
In International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), 2015.

[7] Noam Brown, Christian Kroer, and Tuomas Sandholm. Dynamic thresholding and pruning for regret minimization. In AAAI Conference on Artificial Intelligence (AAAI), 2017.

[8] Noam Brown, Tuomas Sandholm, and Brandon Amos. Depth-limited solving for imperfect-information games. arXiv preprint arXiv:1805.08195, 2018.

[9] Neil Burch, Michael Johanson, and Michael Bowling. Solving imperfect information games using decomposition. In AAAI Conference on Artificial Intelligence (AAAI), 2014.

[10] Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 2011.

[11] Antonin Chambolle and Thomas Pock. On the ergodic convergence rates of a first-order primal–dual algorithm. Mathematical Programming, 159(1-2):253–287, 2016.

[12] Chao-Kai Chiang, Tianbao Yang, Chia-Jung Lee, Mehrdad Mahdavi, Chi-Jen Lu, Rong Jin, and Shenghuo Zhu. Online optimization with gradual variations. In Conference on Learning Theory, 2012.

[13] Gabriele Farina, Christian Kroer, Noam Brown, and Tuomas Sandholm. Stable-predictive optimistic counterfactual regret minimization. In International Conference on Machine Learning (ICML), 2019.

[14] Sam Ganzfried and Tuomas Sandholm. Potential-aware imperfect-recall abstraction with earth mover's distance in imperfect-information games. In AAAI Conference on Artificial Intelligence (AAAI), 2014.

[15] Sam Ganzfried and Tuomas Sandholm. Endgame solving in large imperfect-information games. In International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), 2015.

[16] Andrew Gilpin and Tuomas Sandholm. Lossless abstraction of imperfect information games. Journal of the ACM, 54(5), 2007.

[17] Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.

[18] Samid Hoda, Andrew Gilpin, Javier Peña, and Tuomas Sandholm. Smoothing techniques for computing Nash equilibria of sequential games. Mathematics of Operations Research, 35(2), 2010.

[19] Christian Kroer. First-order methods with increasing iterate averaging for solving saddle-point problems. arXiv preprint arXiv:1903.10646, 2019.

[20] Christian Kroer and Tuomas Sandholm. Extensive-form game abstraction with bounds. In Proceedings of the ACM Conference on Economics and Computation (EC), 2014.

[21] Christian Kroer and Tuomas Sandholm. Imperfect-recall abstractions with bounds in games. In Proceedings of the ACM Conference on Economics and Computation (EC), 2016.

[22] Christian Kroer, Kevin Waugh, Fatma Kılınç-Karzan, and Tuomas Sandholm. Faster first-order methods for extensive-form game solving. In Proceedings of the ACM Conference on Economics and Computation (EC), 2015.

[23] Christian Kroer, Gabriele Farina, and Tuomas Sandholm. Solving large sequential games with the excessive gap technique. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2018.

[24] Christian Kroer, Kevin Waugh, Fatma Kılınç-Karzan, and Tuomas Sandholm. Faster algorithms for extensive-form game solving via improved smoothing functions. Mathematical Programming, pages 1–33, 2018.

[25] H. W. Kuhn. A simplified two-person poker. In H. W. Kuhn and A. W. Tucker, editors, Contributions to the Theory of Games, volume 1 (Annals of Mathematics Studies, 24), pages 97–103. Princeton University Press, Princeton, New Jersey, 1950.

[26] Marc Lanctot, Richard Gibson, Neil Burch, Martin Zinkevich, and Michael Bowling. No-regret learning in extensive-form games with imperfect recall.
In International Conference on Machine\nLearning (ICML), 2012.\n\n[27] Matej Moravcik, Martin Schmid, Karel Ha, Milan Hladik, and Stephen Gaukrodger. Re\ufb01ning\nsubgames in large imperfect information games. In AAAI Conference on Arti\ufb01cial Intelligence\n(AAAI), 2016.\n\n[28] Matej Morav\u02c7c\u00edk, Martin Schmid, Neil Burch, Viliam Lis\u00fd, Dustin Morrill, Nolan Bard, Trevor\nDavis, Kevin Waugh, Michael Johanson, and Michael Bowling. Deepstack: Expert-level\narti\ufb01cial intelligence in heads-up no-limit poker. Science, 356(6337), May 2017.\n\n[29] Arkadi Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequali-\nties with Lipschitz continuous monotone operators and smooth convex-concave saddle point\nproblems. SIAM Journal on Optimization, 15(1), 2004.\n\n[30] Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In\n\nConference on Learning Theory, pages 993\u20131019, 2013.\n\n[31] Sasha Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable\nsequences. In Advances in Neural Information Processing Systems, pages 3066\u20133074, 2013.\n\n[32] Shai Shalev-Shwartz and Yoram Singer. A primal-dual perspective of online learning algorithms.\n\nMachine Learning, 69(2-3):115\u2013142, 2007.\n\n[33] Finnegan Southey, Michael Bowling, Bryce Larson, Carmelo Piccione, Neil Burch, Darse\nBillings, and Chris Rayner. Bayes\u2019 bluff: Opponent modelling in poker. In Proceedings of the\n21st Annual Conference on Uncertainty in Arti\ufb01cial Intelligence (UAI), July 2005.\n\n[34] Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E Schapire. Fast convergence of\nregularized learning in games. In Advances in Neural Information Processing Systems, pages\n2989\u20132997, 2015.\n\n[35] Oskari Tammelin, Neil Burch, Michael Johanson, and Michael Bowling. Solving heads-up\nlimit Texas hold\u2019em. 
In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), 2015.

[36] Bernhard von Stengel. Efficient computation of behavior strategies. Games and Economic Behavior, 14(2):220–246, 1996.

[37] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning (ICML), pages 928–936, Washington, DC, USA, 2003.

[38] Martin Zinkevich, Michael Bowling, Michael Johanson, and Carmelo Piccione. Regret minimization in games with incomplete information. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2007.