{"title": "The Limit Points of (Optimistic) Gradient Descent in Min-Max Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 9236, "page_last": 9246, "abstract": "Motivated by applications in Optimization, Game Theory, and the training of Generative Adversarial Networks, the convergence properties of first order methods in min-max problems have received extensive study. It has been recognized that they may cycle, and there is no good understanding of their limit points when they do not. When they converge, do they converge to local min-max solutions? We characterize the limit points of two basic first order methods, namely Gradient Descent/Ascent (GDA) and Optimistic Gradient Descent Ascent (OGDA). We show that both dynamics avoid unstable critical points for almost all initializations. Moreover, for small step sizes and under mild assumptions, the set of OGDA-stable critical points is a superset of GDA-stable critical points, which is a superset of local min-max solutions (strict in some cases). The connecting thread is that the behavior of these dynamics can be studied from a dynamical systems perspective.", "full_text": "The Limit Points of (Optimistic) Gradient Descent in\n\nMin-Max Optimization\n\nConstantinos Daskalakis\n\nCSAIL\nMIT\n\nCambridge, MA 02138\ncostis@csail.mit.edu\n\nIoannis Panageas\n\nISTD\nSUTD\n\nSingapore, 487371\n\nioannis@sutd.edu.sg\n\nAbstract\n\nMotivated by applications in Optimization, Game Theory, and the training of\nGenerative Adversarial Networks, the convergence properties of \ufb01rst order methods\nin min-max problems have received extensive study. It has been recognized that\nthey may cycle, and there is no good understanding of their limit points when they\ndo not. When they converge, do they converge to local min-max solutions? We\ncharacterize the limit points of two basic \ufb01rst order methods, namely Gradient\nDescent/Ascent (GDA) and Optimistic Gradient Descent Ascent (OGDA). 
We show that both dynamics avoid unstable critical points for almost all initializations. Moreover, for small step sizes and under mild assumptions, the set of OGDA-stable critical points is a superset of GDA-stable critical points, which is a superset of local min-max solutions (strict in some cases). The connecting thread is that the behavior of these dynamics can be studied from a dynamical systems perspective.

1 Introduction

The celebrated min-max theorem was a founding stone in the development of Game Theory [21], and is intimately related to strong linear programming duality [1], Blackwell's approachability theory [3], and the theory of no-regret learning [5]. The theorem states that if f(x, y) is a convex-concave function, and X, Y are compact and convex subsets of Euclidean space, then

min_{x ∈ X} max_{y ∈ Y} f(x, y) = max_{y ∈ Y} min_{x ∈ X} f(x, y).    (1)

If f(x, y) represents the payment of the X player to the Y player under choices of strategies x ∈ X and y ∈ Y by these two players, the min-max theorem reassures us that an equilibrium of the game exists, and that the equilibrium payoffs to both players are unique.

What does not follow directly from the min-max theorem is whether there exist dynamics via which players would arrive at equilibrium if they were to follow some simple rule to update their current strategies. 
This has been the topic of a long line of investigation starting with Julia Robinson\u2019s\ncelebrated analysis of \ufb01ctitious play [4, 20], and leading to the development of no-regret learning [5].\nRenewed interest in this problem has been recently motivated by the task of training Generative\nAdversarial Networks (GANs) [9, 2], where two deep neural networks, the generator and the discrim-\ninator, are trained in tandem using \ufb01rst order methods, aiming at solving a min-max problem, of the\nfollowing form, albeit typically with a non convex-concave objective function f (x, y):\n\nx\u2208X sup\ninf\ny\u2208Y\n\nf (x, y).\n\n(2)\n\nHere x represents the parameters of the generator deep neural net, y represents the parameters of the\ndiscriminator neural net, and f (x, y) is some measure of how close the distribution generated by the\ngenerator appears to the true distribution from the perspective of the discriminator.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fMin-max optimization in non convex-concave settings is a central problem for many research\ncommunities, however our knowledge is very limited from optimization perspective. Moreover, for\nsuch applications of \ufb01rst-order methods to min-max problems in Machine Learning, it is especially\nimportant that the last-iterate maintained by the min and the max dynamics converges to a desirable\nsolution. Unfortunately, even when f (x, y) is convex-concave, it is rare that guarantees are known\nfor the last iterate (see [17, 15, 13] for continuous time learning dynamics that may cycle). Some\nguarantees are known for continuous-time dynamics [6], but for discrete-time dynamics it is typically\nonly shown that the average-iterates converge to min-max equilibrium. 
Recent work of [7] shows that,\nwhile Gradient Descent/Ascent (GDA) dynamics performed by the min/max players may diverge, the\nOptimistic version dynamics of [18] exhibit last iterate convergence to min-max solutions (which\nwe shall call Optimistic Gradient Descent/Ascent (OGDA)), whenever f (x, y) is linear in x and y1).\nThe goal of our paper is to understand the limit points of GDA and OGDA dynamics (points that\nlast iterate might converge to) for general functions f (x, y). In particular, we answer the following\nquestions:\n\n\u2022 are the stable limit points of GDA and OGDA locally min-max solutions?\n\u2022 how do the stable limit points of GDA and OGDA relate to each other?\n\nWe provide answers to these questions after de\ufb01ning our dynamics of interest formally.\nGDA and OGDA Dynamics. Assume from now on that X = Rn, Y = Rm and f is a real-valued\nfunction in C 2, the space of twice-continuously differentiable functions (unconstrained case). Perhaps\nthe most natural approach to solve (2) is by doing gradient descent on x and gradient ascent on y\n(GDA), i.e.,\n\nxt+1 = xt \u2212 \u03b1\u2207xf (xt, yt),\nyt+1 = yt + \u03b1\u2207yf (xt, yt),\n\n(3)\n\nwith some constant step size \u03b1 > 02. However, there are examples (functions f and initial points\n(x0, y0)) in which the system of equations (3) cycles (see [7]). 
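Such cycling is easy to reproduce numerically; the following is a minimal sketch of our own (not an experiment from the paper), assuming the bilinear objective f(x, y) = xy with step size α = 0.1:

```python
# GDA (3) on f(x, y) = x*y: grad_x f = y and grad_y f = x, so the update is
# a small rotation-and-expansion around the unique critical point (0, 0).
import math

def gda_step(x, y, alpha):
    return x - alpha * y, y + alpha * x

x, y, alpha = 1.0, 1.0, 0.1
r0 = math.hypot(x, y)           # initial distance from the critical point
for _ in range(100):
    x, y = gda_step(x, y, alpha)
print(math.hypot(x, y) > r0)    # prints True: the iterates move away from (0, 0)
```

Each step of (3) here multiplies the distance to (0, 0) by sqrt(1 + α²) > 1, since the linearized update has eigenvalues 1 ± iα; the iterates therefore spiral outward instead of converging, illustrating the cycling/divergence noted above.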
To break down this behavior,\nthe authors in [7] analyzed another optimization algorithm which is called Optimistic Gradient\nDescent/Ascent (OGDA)3, the equations of which boil down to the following:\nxt+1 = xt \u2212 2\u03b1\u2207xf (xt, yt) + \u03b1\u2207xf (xt\u22121, yt\u22121),\nyt+1 = yt + 2\u03b1\u2207yf (xt, yt) \u2212 \u03b1\u2207yf (xt\u22121, yt\u22121).\n\n(4)\n\nOne of their key results was to show convergence to the min max solution for the case of bilinear\nobjective functions, namely f (x, y) = x(cid:62)Ay.\n\nOur contribution and techniques: In this paper we analyze Gradient Descent/Ascent (GDA) and\nOptimistic Gradient Descent/Ascent (OGDA) dynamics applied to min-max optimization problems.\nOur starting point is to show that both dynamics avoid their unstable \ufb01xed points (GDA-unstable and\nOGDA-unstable respectively, as de\ufb01ned in Section 1.1). This is shown by using techniques from\ndynamical systems, following the line of work of recent papers in Optimization and Machine Learning\n[14, 11, 10]. In a nutshell we show that the update rules of both dynamics are local diffeomorphisms4\nand we then make use of Center-Stable manifold theorem A.1 (see supplementary material). These\nresults are given as Theorems 2.2, 3.2. We note that it is very crucial to show that the update rule of\nthe dynamics is a diffeomorphism, otherwise avoiding \u201clinearly\" unstable points cannot be shown\nglobally. One important step in our approach is the construction of a dynamical system for OGDA in\norder to apply dynamical systems\u2019 techniques.\nWe next study the set of stable \ufb01xed points of GDA dynamics and their relation to locally min-max\nsolutions, called local min-max.5). 
Informally, a local min-max critical point (x\u2217, y\u2217) satis\ufb01es the\nfollowing: compared to value f (x\u2217, y\u2217), if we \ufb01x x\u2217 and perturb y\u2217 in\ufb01nitesimally, the value of f\n1We note that in their paper the dynamics is called Optimistic Mirror Descent, we changed the name because\n\nthe dynamics is a modi\ufb01ed gradient descent.\n2Note that \u03b1 > 0 for the rest of this paper.\n3We note that OGDA has some resemblance with Polyak\u2019s heavy ball method. However, one important\n\ndifference is that OGDA has \u201cnegative momentum\u201d while the heavy ball method has \u201cpositive momentum.\u201d\n\n4A local diffeomorphism is a function that locally is invertible, smooth and its (local) inverse is also smooth.\n5In optimization literature they are called local saddles.\n\n2\n\n\fdoes not increase and similarly if we \ufb01x y\u2217 and perturb x\u2217 in\ufb01nitesimally, the value of f does not\ndecrease. We show that the set of stable \ufb01xed points of GDA is a superset of the set of local min-max\nand there are functions in which this inclusion is strict. This is given in Lemmas 2.4, 2.7, 2.5.\nFinally, we analyze OGDA dynamics which is a bit trickier than GDA due to the nature of the\ndynamics, namely the existence of memory in the dynamics: the next iterate depends on the gradient\nof the current and previous point. We construct a dynamical system that captures OGDA dynamics (see\nEquation (7)), using a construction that is commonly employed in differential equations. 
Importantly,\nwe establish a mapping (relation) between the eigenvalues of the Jacobian of the update rules of both\nGDA and OGDA, showing that OGDA stable \ufb01xed points is a superset of GDA stable ones (under\nmild assumptions on the stepsize), namely (we suggest the reader to see \ufb01rst Remark 1.5 to avoid\nconfusion):\n\nLocal min-max \u2282 GDA-stable \u2282 OGDA-stable\n\nWe note that the inclusion above are strict.\nNotation: Vectors in Rn, Rm are denoted in boldface x, y. Time indices are denoted by subscripts.\nThus, a time indexed vector x at time t is denoted as xt. We denote by \u2207xf (x, y) the gradient of f\nwith respect to variables in x (of dimension the same as x) and by \u22072\nxyf the part of the Hessian in\nwhich the derivative of f is taken with respect to a variable in x and then a variable in y. We use\nthe letter J to denote the Jacobian of a function (with appropriate subscript), Ik, 0k\u00d7l to denote the\nidentity and zero matrix of sizes k \u00d7 k and k \u00d7 l respectively6, \u03c1(A) for the spectral radius of matrix\nA and \ufb01nally we use f t to denote the composition of f by itself t times.\nFinally, we would like to note that all the missing proofs can be found in the supplementary material.\n\n1.1\n\nImportant De\ufb01nitions\n\nWe have already stated our min-max problem of interest (2) as well as the Gradient Descent/Ascent\n(GDA) dynamics (3) and Optimistic Gradient Descent/Ascent (OGDA) dynamics (4) that we plan to\nanalyze. We provide some further de\ufb01nitions.\n\nDynamical Systems. A recurrence relation of the form xt+1 = w(xt) is a discrete time dynamical\nsystem, with update rule w : S \u2192 S for some convex set S \u2282 Rn. Function w is assumed to\nbe continuously differentiable for the purpose of this paper. The point z is called a \ufb01xed point or\nequilibrium of w if w(z) = z. 
We will be interested in the following standard notions of fixed point stability.

Definition 1.1 ((Linear) stability). Let w be continuously differentiable. We call a fixed point z linearly stable or just stable if, for the Jacobian J of w computed at z, it holds that its spectral radius ρ(J) is at most one; otherwise we call it linearly unstable or just unstable.

Definition 1.2 (Lyapunov and Asymptotic Stability). A fixed point z of w is called Lyapunov stable if, for every ε > 0, there exists a δ = δ(ε) > 0 such that if x ∈ B_δ with B_δ = {y ∈ S : ||y − z|| < δ},7 we have that ||w^n(x) − z|| < ε for every n ≥ 0. That is, if the dynamics starts close enough to z, it remains close for all times.

A fixed point z of w is called (locally) asymptotically stable (or attracting) if it is Lyapunov stable and there exists a δ > 0 such that, for all x ∈ B_δ, we have that ||w^n(x) − z|| → 0 as n → ∞. That is, there is a small neighborhood around z so that, for all initializations in that neighborhood, the dynamics converges to z.

Definition 1.3 (Hyperbolicity). We call a fixed point z hyperbolic iff the Jacobian J of w computed at z has no eigenvalues with absolute value 1.

The following are well-known facts.

Proposition 1.4 (e.g. [8]). If the Jacobian of the update rule at a stable fixed point z has spectral radius less than one, then the fixed point is asymptotically stable. Therefore, if a fixed point z is hyperbolic, then linear stability implies asymptotic stability.

6 We also use 0 to denote the zero vector.
7 Ball of radius δ.

Remark 1.5 (Fixed points of GDA, OGDA dynamics). 
It is easy to see that a \ufb01xed point of the GDA\ndynamics (3) arises whenever (xt+1, yt+1) = (xt, yt), or in other words whenever (xt, yt) = (x, y)\nsuch that \u2207f (x, y) = 0.\nSince the OGDA dynamics (4) has memory, it is more appropriate to think of the dynamics as\nmapping a quadruple (xt, yt, xt\u22121, yt\u22121) to a quadruple (xt+1, yt+1, xt, yt). In this case, a \ufb01xed\npoint arises whenever (xt+1, yt+1, xt, yt) = (xt, yt, xt\u22121, yt\u22121), or in other words whenever\n(xt, yt, xt\u22121, yt\u22121) = (x, y, x, y) and \u2207f (x, y) = 0.\nWe should stress in particular that whenever we say that the set of OGDA-stable \ufb01xed points is a\nsuper-set of the GDA-stable \ufb01xed points, we will be somewhat abusing notation, since the \ufb01xed points\nof OGDA lie in R2n+2m while the \ufb01xed points of GDA lie in Rn+m. However, as discussed above, a\n\ufb01xed point of OGDA is of the form (x, y, x, y), and we can thus project it to its \ufb01rst two components\nwithout any loss of information to obtain a point in Rn+m. When we relate \ufb01xed points of OGDA to\n\ufb01xed points of GDA we will implicitly apply this projection.\n\nGiven Proposition 1.4, it follows that spectral analysis of the Jacobian of the \ufb01xed points can give\nus qualitative information about the local behavior of the dynamics. Unless otherwise speci\ufb01ed,\nthroughout this paper, whenever we say \u201cstable\u201d we mean linearly stable. GDA/OGDA-stable critical\npoints are critical points that are stable with respect to GDA/OGDA dynamics (for \ufb01xed stepsize \u03b1,\notherwise are unstable). Moreover since different choices of stepsize \u03b1 might give different stability\nfor GDA and OGDA dynamics, we are interested in the case \u03b1 is \u201csuf\ufb01ciently\" small. 
Therefore\nin the sections we characterize the GDA/OGDA-stable critical points, a point (x\u2217, y\u2217) is classi\ufb01ed\nas GDA/OGDA-stable if there exists a suf\ufb01ciently small number \u03b2 > 0 such that for all stepsizes\n0 < \u03b1 < \u03b2 we have that the (x\u2217, y\u2217) is a stable \ufb01xed point of GDA/OGDA dynamics (in case there\nexists a small \u03b2 > 0 so that for all stepsizes 0 < \u03b1 < \u03b2 we have that (x\u2217, y\u2217) is an unstable \ufb01xed\npoint of GDA/OGDA dynamics, it is classi\ufb01ed as GDA/OGDA-unstable).\n\nOptimization. We use the following standard terminology.\nDe\ufb01nition 1.6. For a min-max problem (2) where f is twice continuously differentiable,\n\n\u2022 A point (x\u2217, y\u2217) is a critical point of f if \u2207f (x\u2217, y\u2217) = 0.\n\u2022 A critical point (x\u2217, y\u2217) is isolated if there is a neighborhood U around (x\u2217, y\u2217) where\n\n(x\u2217, y\u2217) is the only critical point.8 Otherwise it is called non-isolated.\n\n\u2022 A critical point (x\u2217, y\u2217) is a local min-max point if there exists a neighborhood U around\n\n(x\u2217, y\u2217) so that for all (x, y) \u2208 U we have that f (x\u2217, y) \u2264 f (x\u2217, y\u2217) \u2264 f (x, y\u2217).9\n\u2022 A critical point (x\u2217, y\u2217) is a strongly local min-max point if \u03bbmin(\u22072\n\nxxf (x\u2217, y\u2217)) > 0 and\n\n\u03bbmax(\u22072\n\nyyf (x\u2217, y\u2217)) < 0.\n\n1.2 Formal Statement of Results\n\nWe present our main results for GDA and OGDA, to be proven in Sections 2 and 3. Some of our\nclaims make use of the following assumptions about the objective function f of (2):\nAssumption 1.7 (Invertibility of Hessian of f). \u22072f (the Hessian of f) is invertible for all x, y.\nAssumption 1.8 (Non-Imaginary GDA at a Critical Point). 
GDA is non-imaginary at a critical point (x∗, y∗) of f iff

H = [ −∇²xxf   −∇²xyf
       ∇²yxf     ∇²yyf ]          (5)

has no eigenvalue whose real part is 0. H captures the difference (1/α)(J(x∗, y∗) − I_{n+m}), where J is the Jacobian of the GDA dynamics and I_{n+m} the identity matrix.

Remark 1.9. To illustrate the nature of the above assumptions, we note that Assumption 1.7 is generically true for quadratic functions. Take an arbitrary quadratic function f(x) = (1/2) x^T Q x, and define f̃(x) = f(x) + (1/2) x^T A x, where A is a matrix with random entries from some continuous distribution (say uniform in [−ε, ε] for ε small enough). It is not hard to see that ∇²f̃ is invertible with probability one. This is intuitively a "hyperbolicity" assumption on the fixed points of the dynamics. We note that we use this assumption for Lemma 3.1 and also to show that OGDA avoids its unstable fixed points. The stability characterizations do not need this assumption. Moreover, we note that Assumption 1.8 is satisfied when the critical point (x∗, y∗) is strongly local min-max.

8 If the critical points are isolated then they are countably many or finite.
9 In optimization literature these critical points are also called local saddle points. If U is the whole domain then we call it global min-max.

Our two main results are stated as follows:

Theorem 1.10 (Inclusion). Assume f is twice differentiable and ∇f is Lipschitz with constant L.

• Let (x∗, y∗) be a local min-max critical point that satisfies Assumption 1.8. For α > 0 sufficiently small it holds that (x∗, y∗) is a GDA-stable fixed point. 
There is a function with a critical point (x∗, y∗) which violates Assumption 1.8, where (x∗, y∗) is local min-max but not GDA-stable for any 0 < α < 1/L (Lemmas 2.4, 2.7 and 2.6). Additionally, if (x∗, y∗) is a strongly local min-max critical point then Assumption 1.8 is satisfied and for α > 0 sufficiently small we get that (x∗, y∗) is GDA-stable (Remark 2.8). Finally, there is a function with a critical point (x∗, y∗) which is not local min-max but is GDA-stable (for sufficiently small α > 0, Lemma 2.5).

• Let (x∗, y∗) be a GDA-stable fixed point. For 0 < α < 1/(2L) it holds that (x∗, y∗) is OGDA-stable. Moreover the inclusion is strict, i.e., there is a function with a critical point (x∗, y∗) which is OGDA-stable but not GDA-stable (for small enough α > 0, Lemmas 3.4 and 3.5).

Theorem 1.11 (Avoid unstable). Assume f is twice differentiable and ∇f is Lipschitz with constant L. The set of initial vectors (x0, y0) such that GDA converges to (linearly) GDA-unstable fixed points (critical points) is of measure zero. Under Assumption 1.7, the set of initial vectors (x1, y1, x0, y0) such that OGDA converges to (linearly) OGDA-unstable fixed points (critical points) is of measure zero. These statements are captured by Theorems 2.2 and 3.2.

2 Analysis of Gradient Descent/Ascent

In this section we analyze the local behavior (which carries over to a global characterization under Lemma 2.1 and the Center-stable manifold theorem A.1) of the GDA dynamics (3). 
In all our statements (theorems, lemmas, etc.) we work with a real-valued function f that is twice differentiable; we also assume ∇f is Lipschitz with constant L and that the stepsize satisfies 0 < α < 1/L (unless stated otherwise in the statement of a lemma/theorem).

2.1 Analyzing GDA

We need to show the following lemma in order to use the stable manifold theorem (see Theorem A.1).

Lemma 2.1 (GDA is a local diffeomorphism). Let f be twice differentiable with ∇f Lipschitz with constant L. Assume that 0 < α < 1/L. The update rule of the GDA dynamics (3) is a local diffeomorphism.

Theorem 2.2 (Measure zero for GDA). Let f be twice differentiable with ∇f Lipschitz with constant L. Assume that 0 < α < 1/L, let h be the update rule of the GDA dynamics (3), let (x∗, y∗) be a GDA-unstable critical point, and let WGDA(x∗, y∗) be its stable set, i.e.,

WGDA(x∗, y∗) = {(x0, y0) : lim_k h^k(x0, y0) = (x∗, y∗)}.

It holds that WGDA(x∗, y∗) is of Lebesgue measure zero. Moreover, if WGDA is the union of the stable sets of all GDA-unstable critical points, then WGDA also has measure zero (namely, the proof works for non-isolated critical points).

The following corollary is immediate from Theorem 2.2.

Corollary 2.3. Let (x∗, y∗) be GDA-unstable. Assume µ is a measure of the starting points (x0, y0) that is absolutely continuous with respect to the Lebesgue measure on R^{n+m}. Then it holds that

Pr[lim_t (x_t, y_t) = (x∗, y∗)] = 0.

2.2 Characterizing GDA-stability

Lemma 2.4 (Local min-max are GDA-stable). Assume that 0 < α < 1/L and let (x∗, y∗) be a local min-max critical point of f such that matrix H (see equation (5)) computed at (x∗, y∗) has real eigenvalues. It holds that (x∗, y∗) is GDA-stable.

Lemma 2.5. The converse of Lemma 2.4 is false. 
There are functions with critical points that are GDA-stable but not local min-max. An example is f(x, y) = −(1/8)x² − (1/2)y² + (6/10)xy.10

Proof. We provide an example with two variables (so that we can also give a figure). Let f(x, y) = −(1/8)x² − (1/2)y² + (6/10)xy. Computing the Jacobian of the update rule of dynamics (3) at the point (0, 0) we get

JGDA = [ 1 + (1/4)α   −(6/10)α
          (6/10)α       1 − α ]          (6)

Both eigenvalues of JGDA have magnitude less than 1 (for any 0 < α < 1/L, where L ≤ 1.34). Finally, the matrix HGDA has real eigenvalues. Therefore there exists a neighborhood U of (0, 0) so that for all (x0, y0) ∈ U we get that lim_t (x_t, y_t) = (0, 0) for the GDA dynamics (3). However it is clear that (0, 0) is not a local min-max. See also Figure 1 for a pictorial illustration of the result.

Figure 1: Function f(x, y) = −(1/8)x² − (1/2)y² + (6/10)xy and α = 0.001. The arrows point towards the next step of the Gradient Descent/Ascent dynamics. We can see that the system converges to the point (0, 0) (GDA-stable), which is not a local min-max critical point.

We end Section 2 by characterizing the case in which H has complex eigenvalues.

Lemma 2.6 (Imaginary eigenvalues). There are functions with critical points that are not GDA-stable but are local min-max when matrix H (see equation (5)) has imaginary eigenvalues.

We complete the characterization of the relation between GDA-stable critical points and local min-max with the following lemma:

Lemma 2.7 (Real part nonzero). Let (x∗, y∗) be a local min-max critical point of f such that matrix H (see equation (5)) computed at (x∗, y∗) has all its eigenvalues with nonzero real part (i.e., Assumption 1.8). 
There is a small enough step-size α > 0 so that (x∗, y∗) is GDA-stable.

10 See Figure 1.

Remark 2.8. If the critical point (x∗, y∗) is strongly local min-max then λmax(H) < 0 and hence (x∗, y∗) is attracting under GDA dynamics, i.e., it holds that Strongly Local min-max ⊂ GDA-stable.

3 Optimistic Gradient Descent/Ascent

The results of the previous section cannot carry over to Optimistic Gradient Descent/Ascent due to the fact that the dynamics has memory and is more challenging to analyze. Here we show that Optimistic Gradient Descent/Ascent avoids OGDA-unstable critical points, and we also relate the eigenvalues of the Jacobian of OGDA to the eigenvalues of the Jacobian of GDA. In particular we show that GDA-stable ⊂ OGDA-stable (inclusion strict). We begin by constructing a dynamical system that captures the dynamics of OGDA (4).

3.1 Constructing the Dynamical System

We define the function F to be F(x, y, z, w) = f(x, y) for all (x, y, z, w) ∈ X × Y × X × Y (think of the last two vector components as dummies for the function F; its value does not depend on them). Hence it is clear that ∇zF(x, y, z, w) = 0 and ∇wF(x, y, z, w) = 0. The same holds for ∇xF(z, w, x, y) = 0 and ∇yF(z, w, x, y) = 0.

We define the following function g which consists of 4 components:

g(x, y, z, w) := (g1(x, y, z, w), g2(x, y, z, w), g3(x, y, z, w), g4(x, y, z, w)),
g1(x, y, z, w) := x − 2α∇xF(x, y, z, w) + α∇zF(z, w, x, y),
g2(x, y, z, w) := y + 2α∇yF(x, y, z, w) − α∇wF(z, w, x, y),
g3(x, y, z, w) := x,
g4(x, y, z, w) := y.          (7)

It is not hard to check that (x_{t+1}, y_{t+1}, x_t, y_t) = g(x_t, y_t, x_{t−1}, y_{t−1}), so g captures exactly the dynamics of OGDA (4). 
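As a sanity check, the augmented map g can be iterated directly; below is a small sketch of our own (again assuming the bilinear objective f(x, y) = xy and α = 0.1, a case in which [7] proves last-iterate convergence of OGDA):

```python
# Iterating the augmented OGDA map g of (7) on f(x, y) = x*y (an assumed
# illustrative objective), with state (x_t, y_t, x_{t-1}, y_{t-1}).
def ogda_map(state, alpha):
    x, y, xp, yp = state
    # For f(x, y) = x*y: grad_x f = y and grad_y f = x.
    x_new = x - 2 * alpha * y + alpha * yp   # component g1
    y_new = y + 2 * alpha * x - alpha * xp   # component g2
    return (x_new, y_new, x, y)              # g3, g4 copy the current iterate

state = (1.0, 1.0, 1.0, 1.0)
for _ in range(3000):
    state = ogda_map(state, 0.1)
print(max(abs(c) for c in state) < 1e-3)  # prints True: last iterate near (0, 0, 0, 0)
```

In contrast to plain GDA, which spirals away from (0, 0) on the same objective, the spectral radius of the Jacobian of g here is strictly below 1, so the last iterate contracts toward the fixed point.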
The idea behind the construction of the dynamical system above is common in the literature on ODEs (ordinary differential equations), where in order to solve (typically, to understand the qualitative behavior of) a higher order ODE, one approach is to express it as an equivalent first-order system of ODEs.

3.2 Analyzing OGDA via system (7)

As in the case of GDA, we need to show the following key lemma in order to use the Center-stable manifold theorem.

Lemma 3.1 (OGDA is a local diffeomorphism). Let f be real-valued C², let ∇f be Lipschitz with constant L, and let 0 < α < 1/L. Under Assumption 1.7 we get that the update rule g of the OGDA dynamics (7) is a local diffeomorphism.

Again as in Section 2, we are able to prove the following measure zero argument using Lemma 3.1 and the Center-stable manifold theorem.

Theorem 3.2 (Measure zero for OGDA). Let f be twice differentiable and ∇f be Lipschitz with constant L. Suppose that Assumption 1.7 holds and 0 < α < 1/L. Let g be the update rule of the OGDA dynamics (4), let (x∗, y∗, x∗, y∗) be an OGDA-unstable critical point, and let WOGDA(x∗, y∗, x∗, y∗) be its stable set, i.e.,

WOGDA(x∗, y∗, x∗, y∗) = {(x1, y1, x0, y0) : lim_k g^k(x1, y1, x0, y0) = (x∗, y∗, x∗, y∗)}.

It holds that WOGDA(x∗, y∗, x∗, y∗) is of Lebesgue measure zero. Moreover, if WOGDA is the union of the stable sets of all OGDA-unstable critical points, then WOGDA also has measure zero (namely, the proof works for non-isolated critical points).

The following corollary is immediate from Theorem 3.2.

Corollary 3.3. Let (x∗, y∗, x∗, y∗) be OGDA-unstable. Assume µ is a measure of the starting points (x1, y1, x0, y0) that is absolutely continuous with respect to the Lebesgue measure on R^{2n+2m}. 
Then it holds that

Pr[lim_t (x_t, y_t, x_{t−1}, y_{t−1}) = (x∗, y∗, x∗, y∗)] = 0.

3.3 Characterizing OGDA-stability

In this subsection we provide an analysis of the eigenvalues of the Jacobian matrix JOGDA of the update rule g of the system (7), the equations of which can be found in the supplementary material. We begin by claiming that the set of GDA-stable critical points is a subset of the set of OGDA-stable critical points. We manage to show this by constructing a mapping between the eigenvalues of JGDA and JOGDA.

Lemma 3.4 (GDA-stable are OGDA-stable). Let f be twice differentiable and ∇f be L-Lipschitz. Assume that 0 < α < 1/(2L) and suppose (x∗, y∗) is a critical point that is GDA-stable (i.e., stable according to dynamics (3)). Then the critical point (x∗, y∗, x∗, y∗) is stable according to the OGDA dynamics (4).

We conclude the subsection with the following claim and a remark.

Lemma 3.5. There are functions with critical points that are OGDA-stable but not GDA-stable.

Remark 3.6. We would like to note that some of our results (e.g., Lemma 3.1 and Theorem 3.2) are not applicable to a generic bilinear function f(x, y) = x^T Ay, since if A is not a square matrix, the Hessian ∇²f is not invertible (they are applicable only when A is a square, invertible matrix).

4 Examples and Experiments

In this section we provide two examples/experiments, one 2-dimensional (function f : R² → R, x, y ∈ R) and one higher dimensional (f : R¹⁰ → R, x, y ∈ R⁵). The purpose of these experiments is to get better intuition about our findings. 
In the 2-dimensional example, we construct a function with local min-max, {GDA, OGDA}-unstable and {GDA, OGDA}-stable critical points. Moreover, we draw 10000 random initializations from the domain R = {(x, y) : −5 ≤ x, y ≤ 5} and we compute the probability of reaching each critical point for both GDA and OGDA dynamics. In the higher dimensional experiment, we construct a polynomial function p(x, y) of degree 3 with coefficients sampled i.i.d. from the uniform distribution with support [−1, 1] and then we plant a local min-max. Under 10000 random initializations in R, we analyze the convergence properties of GDA and OGDA (as in the two dimensional case).

4.1 A 2D example

The function f1(x, y) = −(1/8)x² − (1/2)y² + (6/10)xy has the property that the critical point (0, 0) is GDA-stable but not local min-max (see Lemma 2.5). Moreover, consider f2(x, y) = (1/2)x² + (1/2)y² + 4xy. This function has the property that the critical point (0, 0) is GDA-unstable, and it is easy to check that it is not a local min-max. We construct the polynomial function f(x, y) = f1(x, y)(x − 1)²(y − 1)² + f2(x, y)x²y². Function f has the property that around (0, 0) it behaves like f1 and around (1, 1) it behaves like f2. The GDA dynamics of f can be seen in Figure 2. However, more critical points are created. There are five critical points in R, i.e., (0, 0), (0, 1), (1, 0), (1, 1), (0.3301, 0.3357) (the last critical point is computed approximately). In Table 1 we observe that the critical point (0, 0) is stable for OGDA but unstable for GDA (essentially OGDA has more attracting critical points). Moreover, our theorems on avoiding unstable fixed points are verified by this experiment. 
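The stability labels in such experiments can be checked directly from Definition 1.1, by computing the spectral radius of the Jacobian of the update rule at a critical point. A small sketch of our own for f1 and the GDA update (3) at (0, 0), assuming α = 0.001 as in Figure 1:

```python
import numpy as np

alpha = 0.001
# Hessian entries of f1(x, y) = -(1/8)x^2 - (1/2)y^2 + (6/10)xy at (0, 0):
# f_xx = -1/4, f_yy = -1, f_xy = f_yx = 6/10.
J_gda = np.array([[1 + alpha / 4, -0.6 * alpha],
                  [0.6 * alpha,    1 - alpha  ]])  # Jacobian of the GDA update (3)
rho = max(abs(np.linalg.eigvals(J_gda)))
print(rho < 1)  # prints True: spectral radius below 1, so (0, 0) is linearly stable
```

Both eigenvalues turn out to be real and slightly below 1, so (0, 0) is GDA-stable even though it is not a local min-max, in line with Lemma 2.5 and Figure 1.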
Note that there are some initial conditions from which GDA and OGDA dynamics do not converge (3% and 9.8% respectively).

Critical point      GDA-stable  OGDA-stable  Local min-max  value of f  Prob. GDA converges  Prob. OGDA converges
(0, 0)              NO          YES          NO             0           0%                   25.8%
(0, 1)              NO          NO           NO             0           0%                   0%
(1, 0)              YES         YES          YES            0           78%                  35.4%
(1, 1)              YES         YES          NO             0           19%                  29%
(0.3301, 0.3357)    NO          NO           NO             0.109       0%                   0%

Table 1: Summary of critical points of f.

Figure 2: Construction of a function with points that are GDA-stable and local min-max, GDA-stable and not local min-max, and GDA-unstable (and hence not local min-max). The arrows point towards the next step of the Gradient Descent/Ascent dynamics.

4.2 Higher dimensional

Let f(x, y) := p(x, y) · (Σ_{i=1}^{5} (x_i³ + y_i³)) + w(x, y), where p is the random degree-3 polynomial mentioned above and w(x, y) = Σ_{i=1}^{5} (x_i² − y_i²). It is clear that f locally at (0, ..., 0) behaves like the function w (which has 0 as a local min-max critical point). We run 10000 uniformly random points in R and it turns out that 87% of the initial points converge to 0 under OGDA, as opposed to GDA, for which a 79.3% fraction converged. This experiment indicates a qualitative difference between the two methods: the region of attraction of OGDA is a bit larger.

5 Conclusion

In this paper we made a step towards understanding first order methods which are used to solve min-max optimization problems, by analyzing the local behavior of GDA and OGDA dynamics around critical points. Our paper is an indication that the important first order methods we analyze do not converge only to local min-max solutions (a standard concept in the optimization literature). Whether or not local min-max is a good solution concept is out of the scope of this paper.11 
Local min-max solutions might not all be equally good, and some may be bad, which matters in applications such as training GANs. Nevertheless, even for minimization problems, finding good local minima is a hard task that is not well understood in the literature (most first order methods guarantee convergence to some local minimum, without guarantees about its quality). A fortiori, guaranteeing good solutions in a min-max problem is a harder proposition and an important open question.

¹¹Even characterizing whether a local min-max solution is good or not is not an easy or clear-cut task.

Acknowledgments

Constantinos Daskalakis was supported by NSF awards CCF-1617730 and IIS-1741137, a Simons Investigator Award, a Google Faculty Research Award, and an MIT-IBM Watson AI Lab research grant. Ioannis Panageas was supported by SRG ISTD 2018 136. This work was done while Ioannis was a postdoctoral fellow at MIT.

References

[1] Ilan Adler. The equivalence of linear programs and zero-sum games. International Journal of Game Theory, pages 165–177, 2013.

[2] Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 214–223, 2017.

[3] David Blackwell. An analog of the minimax theorem for vector payoffs. Pacific J. Math., pages 1–8, 1956.

[4] G. W. Brown. Iterative solutions of games by fictitious play. In Activity Analysis of Production and Allocation, 1951.

[5] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[6] Ashish Cherukuri, Bahman Gharesifard, and Jorge Cortés. Saddle-point dynamics: Conditions for asymptotic stability of saddle points. SIAM J.
Control and Optimization, 55(1):486–511, 2017.

[7] Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs with optimism. ICLR, 2018.

[8] Oded Galor. Discrete Dynamical Systems. Springer, 2007.

[9] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 2672–2680, 2014.

[10] Chi Jin, Yuchen Zhang, Sivaraman Balakrishnan, Martin J. Wainwright, and Michael I. Jordan. Local maxima in the likelihood of Gaussian mixture models: Structural results and algorithmic consequences. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 4116–4124, 2016.

[11] Jason D. Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. First-order methods almost always avoid saddle points. CoRR, abs/1710.07406, 2017.

[12] L. Lessard, B. Recht, and A. Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 2016.

[13] Tung Mai, Milena Mihail, Ioannis Panageas, Will Ratcliff, Vijay V. Vazirani, and Peter Yunker. Rock-paper-scissors, differential games and biological diversity. To appear in Economics and Computation (EC), 2018.

[14] Ruta Mehta, Ioannis Panageas, and Georgios Piliouras. Natural selection as an inhibitor of genetic diversity: Multiplicative weights updates algorithm and a conjecture of haploid genetics. In Innovations in Theoretical Computer Science, ITCS, 2015.

[15] Panayotis Mertikopoulos, Christos Papadimitriou, and Georgios Piliouras.
Cycles in adversarial regularized learning. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7-10, 2018, pages 2703–2717, 2018.

[16] V. Nagarajan and J. Zico Kolter. Gradient descent GAN optimization is locally stable. In NIPS, 2017.

[17] Gerasimos Palaiopanos, Ioannis Panageas, and Georgios Piliouras. Multiplicative weights update with constant step-size in congestion games: Convergence, limit cycles and chaos. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5874–5884, 2017.

[18] Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In COLT 2013 - The 26th Annual Conference on Learning Theory, June 12-14, 2013, Princeton University, NJ, USA, pages 993–1019, 2013.

[19] Lillian J. Ratliff, Samuel A. Burden, and S. Shankar Sastry. Characterization and computation of local Nash equilibria in continuous games. In Allerton, 2013.

[20] J. Robinson. An iterative method of solving a game. Annals of Mathematics, pages 296–301, 1951.

[21] J. von Neumann. Zur Theorie der Gesellschaftsspiele. Math. Ann., pages 295–320, 1928.