{"title": "Competitive Gradient Descent", "book": "Advances in Neural Information Processing Systems", "page_first": 7625, "page_last": 7635, "abstract": "We introduce a new algorithm for the numerical computation of Nash equilibria of competitive two-player games. Our method is a natural generalization of gradient descent to the two-player setting where the update is given by the Nash equilibrium of a regularized bilinear local approximation of the underlying game. It avoids oscillatory and divergent behaviors seen in alternating gradient descent. Using numerical experiments and rigorous analysis, we provide a detailed comparison to methods based on \\emph{optimism} and \\emph{consensus} and show that our method avoids making any unnecessary changes to the gradient dynamics while achieving exponential (local) convergence for (locally) convex-concave zero sum games. Convergence and stability properties of our method are robust to strong interactions between the players, without adapting the stepsize, which is not the case with previous methods. In our numerical experiments on non-convex-concave problems, existing methods are prone to divergence and instability due to their sensitivity to interactions among the players, whereas we never observe divergence of our algorithm. The ability to choose larger stepsizes furthermore allows our algorithm to achieve faster convergence, as measured by the number of model evaluations.", "full_text": "Competitive Gradient Descent\n\nFlorian Sch\u00e4fer\n\nComputing and Mathematical Sciences\n\nCalifornia Institute of Technology\n\nPasadena, CA 91125\n\nflorian.schaefer@caltech.edu\n\nAnima Anandkumar\n\nComputing and Mathematical Sciences\n\nCalifornia Institute of Technology\n\nPasadena, CA 91125\nanima@caltech.edu\n\nAbstract\n\nWe introduce a new algorithm for the numerical computation of Nash equilibria of\ncompetitive two-player games. 
Our method is a natural generalization of gradient descent to the two-player setting where the update is given by the Nash equilibrium of a regularized bilinear local approximation of the underlying game. It avoids oscillatory and divergent behaviors seen in alternating gradient descent. Using numerical experiments and rigorous analysis, we provide a detailed comparison to methods based on optimism and consensus and show that our method avoids making any unnecessary changes to the gradient dynamics while achieving exponential (local) convergence for (locally) convex-concave zero sum games. Convergence and stability properties of our method are robust to strong interactions between the players, without adapting the stepsize, which is not the case with previous methods. In our numerical experiments on non-convex-concave problems, existing methods are prone to divergence and instability due to their sensitivity to interactions among the players, whereas we never observe divergence of our algorithm. The ability to choose larger stepsizes furthermore allows our algorithm to achieve faster convergence, as measured by the number of model evaluations.

1 Introduction

Competitive optimization: Whereas traditional optimization is concerned with a single agent trying to optimize a cost function, competitive optimization extends this problem to the setting of multiple agents each trying to minimize their own cost function, which in general depends on the actions of all agents.
The present work deals with the case of two such agents:

    min_{x ∈ R^m} f(x, y),    min_{y ∈ R^n} g(x, y)    (1)

for two functions f, g : R^m × R^n → R.
In single agent optimization, the solution of the problem consists of the minimizer of the cost function. In competitive optimization, the right definition of solution is less obvious, but often one is interested in computing Nash or strategic equilibria: pairs of strategies such that no player can decrease their costs by unilaterally changing their strategies. If f and g are not convex, finding a global Nash equilibrium is typically impossible and instead we hope to find a "good" local Nash equilibrium.
The benefits of competition: While competitive optimization problems arise naturally in mathematical economics and game/decision theory (Nisan et al., 2007), they also provide a highly expressive and transparent language to formulate algorithms in a wide range of domains. In optimization (Bertsimas et al., 2011) and statistics (Huber and Ronchetti, 2009) it has long been observed that competitive optimization is a natural way to encode robustness requirements of algorithms. More recently, researchers in machine learning have been using multi-agent optimization to design highly flexible objective functions for reinforcement learning (Liu et al., 2016; Pfau and Vinyals, 2016; Pathak et al., 2017; Wayne and Abbott, 2014; Vezhnevets et al., 2017) and generative models (Goodfellow et al., 2014).

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
We believe that this approach still has a lot of untapped potential, but its full realization depends crucially on the development of efficient and reliable algorithms for the numerical solution of competitive optimization problems.
Gradient descent/ascent and the cycling problem: For differentiable objective functions, the most naive approach to solving (1) is gradient descent ascent (GDA), whereby both players independently change their strategy in the direction of steepest descent of their cost function. Unfortunately, this procedure features oscillatory or divergent behavior even in the simple case of a bilinear game (f(x, y) = x⊤y = −g(x, y)) (see Figure 2). In game-theoretic terms, GDA lets both players choose their new strategy optimally with respect to the last move of the other player. Thus, the cycling behaviour of GDA is not surprising: it is the analogue of "Rock! Paper! Scissors! Rock! Paper! Scissors! Rock! Paper!..." in the eponymous hand game. While gradient descent is a reliable basic workhorse for single-agent optimization, GDA cannot play the same role for competitive optimization. At the moment, the lack of such a workhorse greatly hinders the broader adoption of methods based on competition.
Existing works: Most existing approaches to stabilizing GDA follow one of three lines of attack.
In the special case f = −g, the problem can be written as a minimization problem min_x F(x), where F(x) := max_y f(x, y). For certain structured problems, Gilpin et al. (2007) use techniques from convex optimization (Nesterov, 2005) to minimize the implicitly defined F. For general problems, the two-scale update rules proposed in Goodfellow et al. (2014); Heusel et al.
(2017); Metz et al. (2016) can be seen as an attempt to approximate F and its gradients.
In GDA, players pick their next strategy based on the last strategy picked by the other players. Methods based on follow-the-regularized-leader (Shalev-Shwartz and Singer, 2007; Grnarova et al., 2017), fictitious play (Brown, 1951), predictive updates (Yadav et al., 2017), opponent learning awareness (Foerster et al., 2018), and optimism (Rakhlin and Sridharan, 2013; Daskalakis et al., 2017; Mertikopoulos et al., 2019) propose more sophisticated heuristics that the players could use to predict each other's next move. Algorithmically, many of these methods can be considered variations of the extragradient method (Korpelevich, 1977) (see also Facchinei and Pang (2003)[Chapter 12]).
Finally, some methods directly modify the gradient dynamics, either by promoting convergence through gradient penalties (Mescheder et al., 2017), or by attempting to disentangle convergent potential parts from rotational Hamiltonian parts of the vector field (Balduzzi et al., 2018; Letcher et al., 2019; Gemp and Mahadevan, 2018).
Our contributions: Our main conceptual objection to most existing methods is that they lack a clear game-theoretic motivation, but instead rely on the ad-hoc introduction of additional assumptions, modifications, and model parameters.
Their main practical shortcoming is that to avoid divergence the stepsize has to be chosen inversely proportional to the magnitude of the interaction of the two players (as measured by D²xyf, D²xyg). On the one hand, the small stepsize results in slow convergence. On the other hand, a stepsize small enough to prevent divergence will not be known in advance in most problems.
Instead it has to be discovered through tedious trial and error, which is further aggravated by the lack of a good diagnostic for improvement in multi-agent optimization (which is given by the objective function in single agent optimization).
We alleviate the above-mentioned problems by introducing a novel algorithm, competitive gradient descent (CGD), that is obtained as a natural extension of gradient descent to the competitive setting. Recall that in the single player setting, the gradient descent update is obtained as the optimal solution to a regularized linear approximation of the cost function. In the same spirit, the update of CGD is given by the Nash equilibrium of a regularized bilinear approximation of the underlying game. The use of a bilinear (as opposed to linear) approximation lets the local approximation preserve the competitive nature of the problem, significantly improving stability. We prove (local) convergence results of this algorithm in the case of (locally) convex-concave zero-sum games. We also show that stronger interactions between the two players only improve convergence, without requiring an adaptation of the stepsize. In comparison, the existing methods need to reduce the stepsize to match the increase of the interactions to avoid divergence, which we illustrate on a series of polynomial test cases considered in previous works.
We begin our numerical experiments by trying to use a GAN on a bimodal Gaussian mixture model. Even in this simple example, trying five different (constant) stepsizes under RMSProp, the existing
In order to\nmeasure the convergence speed more quantitatively, we next consider a nonconvex matrix estimation\nproblem, measuring computational complexity in terms of the number of gradient computations\nperformed. We observe that all methods show improved speed of convergence for larger stepsizes,\nwith CGD roughly matching the convergence speed of optimistic gradient descent (Daskalakis\net al., 2017), at the same stepsize. However, as we increase the stepsize, other methods quickly\nstart diverging, whereas CGD continues to improve, thus being able to attain signi\ufb01cantly better\nconvergence rates (more than two times as fast as the other methods in the noiseless case, with the\nratio increasing for larger and more dif\ufb01cult problems). For small stepsize or games with weak\ninteractions on the other hand, CGD automatically invests less computational time per update, thus\ngracefully transitioning to a cheap correction to GDA, at minimal computational overhead. We\nbelieve that the robustness of CGD makes it an excellent candidate for the fast and simple training\nof machine learning systems based on competition, hopefully helping them reach the same level of\nautomatization and ease-of-use that is already standard in minimization based machine learning.\n\n2 Competitive gradient descent\n\nWe propose a novel algorithm, which we call competitive gradient descent (CGD), for the so-\nlution of competitive optimization problems minx\u2208Rm f (x, y), miny\u2208Rn g(x, y), where we have\naccess to function evaluations, gradients, and Hessian-vector products of the objective functions. 
Algorithm 1: Competitive Gradient Descent (CGD)

for 0 ≤ k ≤ N − 1 do
    x_{k+1} = x_k − η (Id − η² D²xyf D²yxg)⁻¹ (∇xf − η D²xyf ∇yg);
    y_{k+1} = y_k − η (Id − η² D²yxg D²xyf)⁻¹ (∇yg − η D²yxg ∇xf);
return (x_N, y_N);

How to linearize a game: To motivate this algorithm, we remind ourselves that gradient descent with stepsize η applied to the function f : R^m → R can be written as

    x_{k+1} = argmin_{x ∈ R^m} (x⊤ − x_k⊤) ∇xf(x_k) + (1/2η) ‖x − x_k‖².

This models a (single) player solving a local linear approximation of the (minimization) game, subject to a quadratic penalty that expresses her limited confidence in the global accuracy of the model. The natural generalization of this idea to the competitive case should then be given by the two players solving a local approximation of the true game, both subject to a quadratic penalty that expresses their limited confidence in the accuracy of the local approximation.
In order to implement this idea, we need to find the appropriate way to generalize the linear approximation in the single agent setting to the competitive setting: how to linearize a game?
Linear or multilinear: GDA answers the above question by choosing a linear approximation of f, g : R^m × R^n → R. This seemingly natural choice has the flaw that linear functions cannot express any interaction between the two players and are thus unable to capture the competitive nature of the underlying problem. From this point of view it is not surprising that the convergent modifications of GDA are, implicitly or explicitly, based on higher order approximations (see also (Li et al., 2017)).
An equally valid generalization of the linear approximation in the single player setting is to use a bilinear approximation in the two-player setting. Since the bilinear approximation is the lowest order approximation that can capture some interaction between the two players, we argue that the natural generalization of gradient descent to competitive optimization is not GDA, but rather the update rule (x_{k+1}, y_{k+1}) = (x_k, y_k) + (x, y), where (x, y) is a Nash equilibrium of the game²

    min_{x∈R^m}  x⊤∇xf + x⊤D²xyf y + y⊤∇yf + (1/2η) x⊤x
    min_{y∈R^n}  y⊤∇yg + y⊤D²yxg x + x⊤∇xg + (1/2η) y⊤y.    (2)

(¹ Here and in the following, unless otherwise mentioned, all derivatives are evaluated at the point (x_k, y_k).)

Indeed, the (unique) Nash equilibrium of the Game (2) can be computed in closed form.
Theorem 2.1. Among all (possibly randomized) strategies with finite first moment, the only Nash equilibrium of the Game (2) is given by

    x = −η (Id − η²D²xyf D²yxg)⁻¹ (∇xf − ηD²xyf ∇yg)
    y = −η (Id − η²D²yxg D²xyf)⁻¹ (∇yg − ηD²yxg ∇xf),    (3)

given that the matrix inverses in the above expression exist.³
Proof. Let X, Y be randomized strategies.
By subtracting and adding E[X]²/(2η), E[Y]²/(2η), and taking expectations, we can rewrite the game as

    min_{E[X]∈R^m}  E[X]⊤∇xf + E[X]⊤D²xyf E[Y] + E[Y]⊤∇yf + (1/2η) E[X]⊤E[X] + (1/2η) Var[X]
    min_{E[Y]∈R^n}  E[Y]⊤∇yg + E[Y]⊤D²yxg E[X] + E[X]⊤∇xg + (1/2η) E[Y]⊤E[Y] + (1/2η) Var[Y].

Thus, the objective value for both players can always be improved by decreasing the variance while keeping the expectation the same, meaning that the optimal value will always (and only) be achieved by a deterministic strategy. We can then replace the E[X], E[Y] with x, y, set the derivative of the first expression with respect to x and of the second expression with respect to y to zero, and solve the resulting system of two equations for the Nash equilibrium (x, y).

According to Theorem 2.1, the Game (2) has exactly one optimal pair of strategies, which is deterministic. Thus, we can use these strategies as an update rule, generalizing the idea of local optimality from the single- to the multi-agent setting and obtaining Algorithm 1.
What I think that they think that I think ... that they do: Another game-theoretic interpretation of CGD follows from the observation that its update rule can be written as

    (Δx; Δy) = − [Id, ηD²xyf; ηD²yxg, Id]⁻¹ [∇xf; ∇yg].    (4)

Applying the expansion λmax(A) < 1 ⇒ (Id − A)⁻¹ = lim_{N→∞} Σ_{k=0}^{N} A^k to the above equation, we observe that the first partial sum (N = 0) corresponds to the optimal strategy if the other player's strategy stays constant (GDA). The second partial sum (N = 1) corresponds to the optimal strategy if the other player thinks that the other player's strategy stays constant (LCGD, see Figure 1).
The third partial sum (N = 2) corresponds to the optimal strategy if the other player thinks that the other player thinks that the other player's strategy stays constant, and so forth, until the Nash equilibrium is recovered in the limit. For small enough η, we could use the above series expansion to solve for (Δx, Δy), which is known as Richardson iteration and would recover high order LOLA (Foerster et al., 2018). However, expressing it as a matrix inverse will allow us to use optimal Krylov subspace methods to obtain far more accurate solutions with fewer gradient evaluations.

Rigorous results on convergence and local stability: We will now show some basic convergence results for CGD, the proofs of which we defer to the appendix. Our results are restricted to the case of a zero-sum game (f = −g), but we expect that they can be extended to games that are dominated by competition. To simplify notation, we define

    D̄ := (Id + η²D²xyf D²yxf)⁻¹ η²D²xyf D²yxf,
    D̃ := (Id + η²D²yxf D²xyf)⁻¹ η²D²yxf D²xyf.

We furthermore define the spectral function h±(λ) := min(3λ, λ)/2.

² We could alternatively use the penalty (x⊤x + y⊤y)/(2η) for both players, without changing the solution.
³ We note that the matrix inverses exist for all but one value of η, and for all η in the case of a zero sum game.

Theorem 2.2.
If f is twice continuously differentiable with L-Lipschitz continuous mixed Hessian, f is convex-concave or D²xxf, D²yyf are L-Lipschitz continuous, and the diagonal blocks of its Hessian are bounded as η‖D²xxf‖, η‖D²yyf‖ ≤ 1, we have

    ‖∇xf(x_{k+1}, y_{k+1})‖² + ‖∇yf(x_{k+1}, y_{k+1})‖² − ‖∇xf‖² − ‖∇yf‖²
        ≤ − ∇xf⊤ (η h±(D²xxf) + D̄ − 32Lη²‖∇xf‖) ∇xf − ∇yf⊤ (η h±(−D²yyf) + D̃ − 32Lη²‖∇yf‖) ∇yf.

Under suitable assumptions on the curvature of f, Theorem 2.2 implies results on the convergence of CGD.
Corollary 2.2.1. Under the assumptions of Theorem 2.2, if for α > 0

    (η h±(D²xxf) + D̄ − 32Lη²‖∇xf(x₀, y₀)‖) ⪰ α Id,    (η h±(−D²yyf) + D̃ − 32Lη²‖∇yf(x₀, y₀)‖) ⪰ α Id,

for all (x, y) ∈ R^{m+n}, then CGD started in (x₀, y₀) converges at exponential rate with exponent α to a critical point.
Furthermore, we can deduce the following local stability result.
Theorem 2.3. Let (x*, y*) be a critical point ((∇xf, ∇yf) = (0, 0)) and assume furthermore that λmin := min(λmin(−ηD²xxf + D̄), λmin(ηD²yyf + D̃)) > 0 and f ∈ C²(R^{m+n}) with Lipschitz continuous mixed Hessian.
Then there exists a neighbourhood U of (x*, y*), such that CGD started in (x₁, y₁) ∈ U converges to a point in U at an exponential rate that depends only on λmin.
The results on local stability for existing modifications of GDA, including those of (Mescheder et al., 2017; Daskalakis et al., 2017; Mertikopoulos et al., 2019) (see also Liang and Stokes (2018)), all require the stepsize to be chosen inversely proportional to an upper bound on σmax(D²xyf), and indeed we will see in our experiments that the existing methods are prone to divergence under strong interactions between the two players (large σmax(D²xyf)). In contrast to these results, our convergence results only improve as the interaction between the players becomes stronger.
Why not use D²xxf and D²yyg?: The use of a bilinear approximation that contains some, but not all second order terms is unusual and begs the question why we do not include the diagonal blocks of the Hessian in Equation (4), resulting in the damped and regularized Newton's method

    (Δx; Δy) = − [Id + ηD²xxf, ηD²xyf; ηD²yxg, Id + ηD²yyg]⁻¹ [∇xf; ∇yg].    (5)

For the following reasons we believe that the bilinear approximation is preferable both from a practical and conceptual point of view.
• Conditioning of matrix inverse: One advantage of competitive gradient descent is that in many cases, including all zero-sum games, the condition number of the matrix inverse in Algorithm 1 is bounded above by 1 + η²‖D²xyf‖². If we include the diagonal blocks of the Hessian in a non-convex-concave problem, the matrix can even be singular as soon as η‖D²xxf‖ ≥ 1 or η‖D²yyg‖ ≥ 1.
• Irrational updates: We can only expect the update rule (5) to correspond to a local Nash equilibrium if the problem is convex-concave or η‖D²xxf‖, η‖D²yyg‖ < 1. If these conditions are violated it can instead correspond to the players playing their worst as opposed to best strategy based on the quadratic approximation, leading to behavior that contradicts the game-interpretation of the problem.
• Lack of regularity: For the inclusion of the diagonal blocks of the Hessian to be helpful at all, we need to make additional assumptions on the regularity of f, for example by bounding the Lipschitz constants of D²xxf and D²yyg. Otherwise, their value at a given point can be totally uninformative about the global structure of the loss functions (consider as an example the minimization of x ↦ x² + ε^{3/2} sin(x/ε) for ε ≪ 1). Many problems in competitive optimization, including GANs, have the form f(x, y) = Φ(G(x), D(y)), g(x, y) = Θ(G(x), D(y)), where Φ, Θ are smooth and simple, but G and D might only have first order regularity. In this setting, the bilinear approximation has the advantage of fully exploiting the first order information of G and D, without assuming them to have higher order regularity.
This is because the bilinear approximations of f and g then contain only the first derivatives of G and D, while the quadratic approximation contains the second derivatives D²xxG and D²yyD and therefore needs stronger regularity assumptions on G and D to be effective.

    GDA:     Δx = −∇xf
    LCGD:    Δx = −∇xf − ηD²xyf ∇yf
    SGA:     Δx = −∇xf − γD²xyf ∇yf
    ConOpt:  Δx = −∇xf − γD²xyf ∇yf − γD²xxf ∇xf
    OGDA:    Δx ≈ −∇xf − ηD²xyf ∇yf + ηD²xxf ∇xf
    CGD:     Δx = (Id + η²D²xyf D²yxf)⁻¹ (−∇xf − ηD²xyf ∇yf)

Figure 1: The update rules of the first player for (from top to bottom) GDA, LCGD, SGA, ConOpt, OGDA, and CGD, in a zero-sum game (f = −g).

• No spurious symmetry: One reason to favor full Taylor approximations of a certain order in single-player optimization is that they are invariant under changes of the coordinate system. For competitive optimization, a change of coordinates of (x, y) ∈ R^{m+n} can correspond, for instance, to taking a decision variable of one player and giving it to the other player. This changes the underlying game significantly and thus we do not want our approximation to be invariant under this transformation. Instead, we want our local approximation to only be invariant to coordinate changes of x ∈ R^m and y ∈ R^n in separation, that is to block-diagonal coordinate changes on R^{m+n}. Mixed order approximations (bilinear, biquadratic, etc.)
have exactly this invariance property and thus are the natural approximation for two-player games.

While we are convinced that the right notion of first order competitive optimization is given by quadratically regularized bilinear approximations, we believe that the right notion of second order competitive optimization is given by cubically regularized biquadratic approximations, in the spirit of Nesterov and Polyak (2006).

3 Consensus, optimism, or competition?

We will now show that many of the convergent modifications of GDA correspond to different subsets of four common ingredients. Consensus optimization (ConOpt) (Mescheder et al., 2017) penalises the players for non-convergence by adding the squared norm of the gradient at the next location, γ‖(∇xf(x_{k+1}, y_{k+1}), ∇yf(x_{k+1}, y_{k+1}))‖², to both players' loss functions (here γ ≥ 0 is a hyperparameter). As we see in Figure 1, the resulting gradient field has two additional Hessian corrections. Balduzzi et al. (2018); Letcher et al. (2019) observe that any game can be written as the sum of a potential game (that is easily solved by GDA) and a Hamiltonian game (that is easily solved by ConOpt). Based on this insight, they propose symplectic gradient adjustment (SGA), which applies (in its simplest form) ConOpt only using the skew-symmetric part of the Hessian, thus alleviating the problematic tendency of ConOpt to converge to spurious solutions. The same algorithm was independently discovered by Gemp and Mahadevan (2018), who also provide a detailed analysis in the case of linear-quadratic GANs.
Daskalakis et al. (2017) proposed to modify GDA as

    Δx = −(∇xf(x_k, y_k) + (∇xf(x_k, y_k) − ∇xf(x_{k−1}, y_{k−1})))
    Δy = −(∇yg(x_k, y_k) + (∇yg(x_k, y_k) − ∇yg(x_{k−1}, y_{k−1}))),

which we will refer to as optimistic gradient descent ascent (OGDA).
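As a concrete sanity check of the OGDA update above, the following numpy sketch (a toy bilinear zero-sum game f(x, y) = xy with g = −f; the stepsize is an arbitrary illustrative choice) applies it and approaches the equilibrium (0, 0), where plain GDA would spiral outward:

```python
import numpy as np

eta = 0.2  # stepsize (arbitrary illustrative choice)

def grads(x, y):
    # (grad_x f, grad_y g) for the zero-sum game f(x, y) = x * y, g = -f
    return np.array([y, -x])

z = np.array([1.0, 1.0])   # current strategies (x, y)
g_prev = grads(*z)         # gradient at the (identical) previous iterate

for _ in range(200):
    g = grads(*z)
    # OGDA: usual gradient step plus the "optimistic" difference term
    z = z - eta * (g + (g - g_prev))
    g_prev = g

print(np.linalg.norm(z))   # distance to the Nash equilibrium (0, 0)
```

The difference term g − g_prev is what the discussion below interprets as a finite-difference approximation of a Hessian-vector product.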
By interpreting the differences appearing in the update rule as finite difference approximations to Hessian vector products, we see that (to leading order) OGDA corresponds to yet another second order correction of GDA (see Figure 1). It will also be instructive to compare the algorithms to linearized competitive gradient descent (LCGD), which is obtained by skipping the matrix inverse in CGD (which corresponds to taking only the leading order term in the limit ηD²xyf → 0) and also coincides with first order LOLA (Foerster et al., 2018). As illustrated in Figure 1, these six algorithms amount to different subsets of the following four terms.
1. The gradient term −∇xf, ∇yf, which corresponds to the most immediate way in which the players can improve their cost.
2. The competitive term −D²xyf ∇yf, D²yxf ∇xf, which can be interpreted either as anticipating the other player to use the naive (GDA) strategy, or as decreasing the other player's influence (by decreasing their gradient).
3. The consensus term ±D²xx ∇xf, ∓D²yy ∇yf, which determines whether the players prefer to decrease their gradient (± = +) or to increase it (± = −). The former corresponds to the players seeking consensus, whereas the latter can be seen as the opposite of consensus. (It also corresponds to an approximate Newton's method.⁴)
4. The equilibrium term (Id + η²D²xy D²yx)⁻¹, (Id + η²D²yx D²xy)⁻¹, which arises from the players solving for the Nash equilibrium. This term lets each player prefer strategies that are less vulnerable to the actions of the other player.

Each of these is responsible for a different feature of the corresponding algorithm, which we can illustrate by applying the algorithms to three prototypical test cases considered in previous works.
• We first consider the bilinear problem f(x, y) = αxy (see Figure 2).
It is well known that GDA will fail on this problem, for any value of η. For α = 1.0, all the other methods converge exponentially towards the equilibrium, with ConOpt and SGA converging at a faster rate due to the stronger gradient correction (γ > η). If we choose α = 3.0, OGDA, ConOpt, and SGA fail. The former diverges, while the latter two begin to oscillate widely. If we choose α = 6.0, all methods but CGD diverge.
• In order to explore the effect of the consensus Term 3, we now consider the convex-concave problem f(x, y) = α(x² − y²) (see Figure 3). For α = 1.0, all algorithms converge at an exponential rate, with ConOpt converging the fastest, and OGDA the slowest. The consensus promoting term of ConOpt accelerates convergence, while the competition promoting term of OGDA slows down the convergence. As we increase α to α = 3.0, OGDA and ConOpt start failing (diverging), while the remaining algorithms still converge at an exponential rate. Upon increasing α further to α = 6.0, all algorithms diverge.
• We further investigate the effect of the consensus Term 3 by considering the concave-convex problem f(x, y) = α(−x² + y²) (see Figure 3). The critical point (0, 0) does not correspond to a Nash equilibrium, since both players are playing their worst possible strategy. Thus it is highly undesirable for an algorithm to converge to this critical point. However, for α = 1.0, ConOpt does converge to (0, 0), which provides an example of the consensus regularization introducing spurious solutions. The other algorithms, instead, diverge away towards infinity, as would be expected. In particular, we see that SGA is correcting the problematic behavior of ConOpt, while maintaining its better convergence rate in the first example.
As we increase \u03b1 to \u03b1 \u2208 {3.0, 6.0}, the radius\nof attraction of (0, 0) under ConOpt decreases and thus ConOpt diverges from the starting point\n(0.5, 0.5), as well.\n\nThe \ufb01rst experiment shows that the inclusion of the competitive Term 2 is enough to solve the cycling\nproblem in the bilinear case. However, as discussed after Theorem 2.2, the convergence results of\nexisting methods in the literature are not break down as the interactions between the players becomes\ntoo strong (for the given \u03b7). The \ufb01rst experiment illustrates that this is not just a lack of theory, but\ncorresponds to an actual failure mode of the existing algorithms. The experimental results in Figure 5\nfurther show that for input dimensions m, n > 1, the advantages of CGD can not be recovered by\nsimply changing the stepsize \u03b7 used by the other methods.\nWhile introducing the competitive term is enough to \ufb01x the cycling behaviour of GDA, OGDA and\nConOpt (for small enough \u03b7) add the additional consensus term to the update rule, with opposite\nsigns.\nIn the second experiment (where convergence is desired), OGDA converges in a smaller parameter\nrange than GDA and SGA, while only diverging slightly faster in the third experiment (where\ndivergence is desired).\nConOpt, on the other hand, converges faster than GDA in the second experiment, for \u03b1 = 1.0\nhowever, it diverges faster for the remaining values of \u03b1 and, what is more problematic, it converges\nto a spurious solution in the third experiment for \u03b1 = 1.0.\nBased on these \ufb01ndings, the consensus term with either sign does not seem to systematically improve\nthe performance of the algorithm, which is why we suggest to only use the competitive term (that is,\nuse LOLA/LCGD, or CGD, or SGA).\n\n4Applying a damped and regularized Newton\u2019s method to the optimization problem of Player 1 would amount\n\nto choosing xk+1 = xk \u2212 \u03b7(Id +\u03b7D2\n\nxx)\u22121f\u2207xf 
\u2248 xk \u2212 \u03b7(\u2207xf \u2212 \u03b7D2\n\nxxf\u2207xf ), for (cid:107)\u03b7D2\n\nxxf(cid:107) (cid:28) 1.\n\n7\n\n\fFigure 2: The \ufb01rst 50 iterations of GDA, LCGD, ConOpt, OGDA, and CGD with parameters \u03b7 = 0.2\nand \u03b3 = 1.0. The objective function is f (x, y) = \u03b1x(cid:62)y for, from left to right, \u03b1 \u2208 {1.0, 3.0, 6.0}.\n(Note that ConOpt and SGA coincide on a bilinear problem)\n\nFigure 3: We measure the (non-)convergence to equilibrium in the separable convex-concave\u2013\n(f (x, y) = \u03b1(x2 \u2212 y2), left three plots) and concave convex problem (f (x, y) = \u03b1(\u2212x2 + y2), right\nthree plots), for \u03b1 \u2208 {1.0, 3.0, 6.0}. (Color coding given by GDA, SGA, LCGD, CGD, ConOpt,\nOGDA, the y-axis measures log10((cid:107)(xk, yk)(cid:107)) and the x-axis the number of iterations k. Note that\nconvergence is desired for the \ufb01rst problem, while divergence is desired for the second problem.\n\n4\n\nImplementation and numerical results\n\nxyf v = \u2202\n\n\u2202h\u2207yf (x + hv, y)(cid:12)(cid:12)h=0, using forward\n\nWe brie\ufb02y discuss the implementation of CGD.\nComputing Hessian vector products: First, our algorithm requires products of the mixed Hessian\nv (cid:55)\u2192 Dxyf v, v (cid:55)\u2192 Dyxgv, which we want to compute using automatic differentiation. As was already\nobserved by Pearlmutter (1994), Hessian vector products can be computed at minimal overhead over\nthe cost of computing gradients, by combining forward\u2013 and reverse mode automatic differentiation.\nTo this end, a function x (cid:55)\u2192 \u2207yf (x, y) is de\ufb01ned using reverse mode automatic differentiation. The\nHessian vector product can then be evaluated as D2\nmode automatic differentiation. 
Many AD frameworks, like Autograd (https://github.com/
HIPS/autograd) and ForwardDiff (https://github.com/JuliaDiff/ForwardDiff.jl; Rev-
els et al., 2016) together with ReverseDiff (https://github.com/JuliaDiff/ReverseDiff.jl),
support this procedure. In settings where we are only given access to gradient evaluations but cannot
use automatic differentiation to compute Hessian vector products, we can instead approximate them
using finite differences.
Matrix inversion for the equilibrium term: Similar to a truncated Newton's method (Nocedal and
Wright, 2006), we propose to use iterative methods to approximate the inverse-matrix vector products
arising in the equilibrium Term 4. We will focus on zero-sum games, where the matrix is always
symmetric positive definite, making the conjugate gradient (CG) algorithm the method of choice. For
non-zero-sum games we recommend using GMRES or BiCGSTAB (see for example Saad (2003)
for details). We suggest terminating the iterative solver after a given relative decrease of the residual
is achieved (‖Mx − y‖ ≤ ε‖x‖ for a small parameter ε, when solving the system Mx = y). In our
experiments we choose ε = 10⁻⁶. Given the strategy ∆x of one player, ∆y is the optimal counter
strategy, which can be found without solving another system of equations. Thus, we recommend in
each update to only solve for the strategy of one of the two players using Equation (3), and then use
the optimal counter strategy for the other player. The computational cost can be further improved by
using the last round's optimal strategy as a warm start of the inner CG solve. An appealing feature
of the above algorithm is that the number of iterations of CG adapts to the difficulty of solving the
equilibrium Term 4. If it is easy, we converge rapidly and CGD thus gracefully reduces to LCGD,
at only a small overhead.
If it is difficult, we might need many iterations, but correspondingly the
problem would be very hard without the preconditioning provided by the equilibrium term.
Experiment: Fitting a bimodal distribution: We use a simple GAN to fit a Gaussian mixture model
with two modes, in two dimensions (see supplement for details). We apply SGA, ConOpt (γ = 1.0),
OGDA, and CGD for stepsizes η ∈ {0.4, 0.1, 0.025, 0.005} together with RMSProp (ρ = 0.9). In
each case, CGD produces a reasonable approximation of the input distribution without any mode
collapse. In contrast, all other methods diverge after some initial cycling behaviour! Reducing the
stepsize to η = 0.001 did not seem to help, either. While we do not claim that the other methods
cannot be made to work with proper hyperparameter tuning, this result substantiates our claim that
CGD is significantly more robust than existing methods for competitive optimization. For more
details and visualizations of the whole trajectories, consult the supplementary material.

Figure 4: For all methods, initially the players cycle between the two modes (first column). For all
methods but CGD, the dynamics eventually become unstable (middle column). Under CGD, the mass
eventually distributes evenly among the two modes (right column). (The arrows show the update of
the generator and the colormap encodes the logit output by the discriminator.)

Figure 5: We plot the decay of the residual after a given number of model evaluations, for increasing
problem sizes and η ∈ {0.005, 0.025, 0.1, 0.4}. Experiments that are not plotted diverged.

Experiment: Estimating a covariance matrix: To show that CGD is also competitive in terms of
computational complexity, we consider the noiseless case of the covariance estimation example used
by Daskalakis et al.
(2017, Appendix C). We study the tradeoff between the number of evaluations
of the forward model (thus accounting for the inner loop of CGD) and the residual, and observe that
for comparable stepsizes, the convergence rate of CGD is similar to that of the other methods. However,
because CGD remains convergent for larger stepsizes, it can beat the other methods by more than a
factor of two (see supplement for details).

5 Conclusion and outlook

We propose a novel and natural generalization of gradient descent to competitive optimization. Besides
its attractive game-theoretic interpretation, the algorithm shows improved robustness properties
compared to the existing methods, which we study using a combination of theoretical analysis and
computational experiments. We see four particularly interesting directions for future work. First, we
would like to further study the practical implementation and performance of CGD, developing it to
become a useful tool for practitioners to solve competitive optimization problems. Second, we would
like to study extensions of CGD to the setting of more than two players. As hinted in Section 2, a
natural candidate would be to simply consider multilinear quadratically regularized local models, but
the practical implementation and evaluation of this idea is still open. Third, we believe that second
order methods can be obtained from biquadratic approximations with cubic regularization, thus
extending the cubically regularized Newton's method of Nesterov and Polyak (2006) to competitive
optimization. Fourth, a convergence proof in the nonconvex case analogous to Lee et al. (2016) is still
out of reach in the competitive setting.
A major obstacle to this end is the identification of a suitable
measure of progress (which is given by the function value in the single-agent setting),
since norms of gradients cannot be expected to decay monotonically for competitive dynamics in
non-convex-concave games.

Acknowledgments

A. Anandkumar is supported in part by Bren endowed chair, Darpa PAI, Raytheon, and Microsoft,
Google and Adobe faculty fellowships. F. Schäfer gratefully acknowledges support by the Air Force
Office of Scientific Research under award number FA9550-18-1-0271 (Games for Computation and
Learning) and by Amazon AWS under the Caltech Amazon Fellows program. We thank the reviewers
for their constructive feedback, which has helped us improve the paper.

References
Balduzzi, D., Racaniere, S., Martens, J., Foerster, J., Tuyls, K., and Graepel, T. (2018). The mechanics of
n-player differentiable games. arXiv preprint arXiv:1802.05642.

Bertsimas, D., Brown, D. B., and Caramanis, C. (2011). Theory and applications of robust optimization. SIAM
Rev., 53(3):464–501.

Brown, G. W. (1951). Iterative solution of games by fictitious play. Activity analysis of production and allocation,
13(1):374–376.

Daskalakis, C., Ilyas, A., Syrgkanis, V., and Zeng, H. (2017). Training GANs with optimism. arXiv preprint
arXiv:1711.00141.

Facchinei, F. and Pang, J.-S. (2003). Finite-dimensional variational inequalities and complementarity problems.
Vol. II. Springer Series in Operations Research. Springer-Verlag, New York.

Foerster, J., Chen, R. Y., Al-Shedivat, M., Whiteson, S., Abbeel, P., and Mordatch, I. (2018). Learning with
opponent-learning awareness. In Proceedings of the 17th International Conference on Autonomous Agents
and MultiAgent Systems, pages 122–130. International Foundation for Autonomous Agents and Multiagent
Systems.

Gemp, I. and Mahadevan, S. (2018).
Global convergence to the equilibrium of GANs using variational inequalities.
arXiv preprint arXiv:1808.01531.

Gilpin, A., Hoda, S., Pena, J., and Sandholm, T. (2007). Gradient-based algorithms for finding Nash equilibria in
extensive form games. In International Workshop on Web and Internet Economics, pages 57–69. Springer.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y.
(2014). Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680.

Grnarova, P., Levy, K. Y., Lucchi, A., Hofmann, T., and Krause, A. (2017). An online learning approach to
generative adversarial networks. arXiv preprint arXiv:1706.03269.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. (2017). GANs trained by a two
time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing
Systems, pages 6626–6637.

Huber, P. J. and Ronchetti, E. M. (2009). Robust statistics. Wiley Series in Probability and Statistics. John Wiley
& Sons, Inc., Hoboken, NJ, second edition.

Korpelevich, G. (1977). Extragradient method for finding saddle points and other problems. Matekon, 13(4):35–
49.

Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. (2016). Gradient descent converges to minimizers.
arXiv preprint arXiv:1602.04915.

Letcher, A., Balduzzi, D., Racanière, S., Martens, J., Foerster, J., Tuyls, K., and Graepel, T. (2019). Differentiable
game mechanics. Journal of Machine Learning Research, 20(84):1–40.

Li, J., Madry, A., Peebles, J., and Schmidt, L. (2017). On the limitations of first-order approximation in GAN
dynamics. arXiv preprint arXiv:1706.09884.

Liang, T. and Stokes, J. (2018). Interaction matters: A note on non-asymptotic local convergence of generative
adversarial networks.
arXiv preprint arXiv:1802.06132.

Liu, B., Liu, J., Ghavamzadeh, M., Mahadevan, S., and Petrik, M. (2016). Proximal gradient temporal difference
learning algorithms. In IJCAI, pages 4195–4199.

Mertikopoulos, P., Zenati, H., Lecouat, B., Foo, C.-S., Chandrasekhar, V., and Piliouras, G. (2019). Optimistic
mirror descent in saddle-point problems: Going the extra (gradient) mile. In ICLR'19: Proceedings of the
2019 International Conference on Learning Representations.

Mescheder, L., Nowozin, S., and Geiger, A. (2017). The numerics of GANs. In Advances in Neural Information
Processing Systems, pages 1825–1835.

Metz, L., Poole, B., Pfau, D., and Sohl-Dickstein, J. (2016). Unrolled generative adversarial networks. arXiv
preprint arXiv:1611.02163.

Nesterov, Y. (2005). Excessive gap technique in nonsmooth convex minimization. SIAM Journal on Optimization,
16(1):235–249.

Nesterov, Y. and Polyak, B. T. (2006). Cubic regularization of Newton method and its global performance. Math.
Program., 108(1, Ser. A):177–205.

Nisan, N., Roughgarden, T., Tardos, E., and Vazirani, V. V. (2007). Algorithmic game theory. Cambridge
university press.

Nocedal, J. and Wright, S. J. (2006). Numerical optimization. Springer Series in Operations Research and
Financial Engineering. Springer, New York, second edition.

Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. (2017). Curiosity-driven exploration by self-supervised
prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops,
pages 16–17.

Pearlmutter, B. A. (1994). Fast exact multiplication by the Hessian. Neural computation, 6(1):147–160.

Pfau, D. and Vinyals, O. (2016). Connecting generative adversarial networks and actor-critic methods. arXiv
preprint arXiv:1610.01945.

Rakhlin, A. and Sridharan, K. (2013). Online learning with predictable sequences.

Revels, J., Lubin, M., and Papamarkou, T.
(2016). Forward-mode automatic differentiation in Julia.
arXiv:1607.07892 [cs.MS].

Saad, Y. (2003). Iterative methods for sparse linear systems. Society for Industrial and Applied Mathematics,
Philadelphia, PA, second edition.

Shalev-Shwartz, S. and Singer, Y. (2007). Convex repeated games and Fenchel duality. In Advances in neural
information processing systems, pages 1265–1272.

Vezhnevets, A. S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., and Kavukcuoglu, K. (2017).
Feudal networks for hierarchical reinforcement learning. In Proceedings of the 34th International Conference
on Machine Learning-Volume 70, pages 3540–3549. JMLR.org.

Wayne, G. and Abbott, L. (2014). Hierarchical control using networks trained with higher-level forward models.
Neural computation, 26(10):2163–2193.

Yadav, A., Shah, S., Xu, Z., Jacobs, D., and Goldstein, T. (2017). Stabilizing adversarial nets with prediction
methods. arXiv preprint arXiv:1705.07364.