{"title": "Acceleration through Optimistic No-Regret Dynamics", "book": "Advances in Neural Information Processing Systems", "page_first": 3824, "page_last": 3834, "abstract": "We consider the problem of minimizing a smooth convex function by reducing the optimization to computing the Nash equilibrium of a particular zero-sum convex-concave game. Zero-sum games can be solved using online learning dynamics, where a classical technique involves simulating two no-regret algorithms that play against each other and, after $T$ rounds, the average iterate is guaranteed to solve the original optimization problem with error decaying as $O(\\log T/T)$.\nIn this paper we show that the technique can be enhanced to a rate of $O(1/T^2)$ by extending recent work \\cite{RS13,SALS15} that leverages \\textit{optimistic learning} to speed up equilibrium computation. The resulting optimization algorithm derived from this analysis coincides \\textit{exactly} with the well-known \\NA \\cite{N83a} method, and indeed the same story allows us to recover several variants of the Nesterov's algorithm via small tweaks. We are also able to establish the accelerated linear rate for a function which is both strongly-convex and smooth. This methodology unifies a number of different iterative optimization methods: we show that the \\HB algorithm is precisely the non-optimistic variant of \\NA, and recent prior work already established a similar perspective on \\FW \\cite{AW17,ALLW18}.", "full_text": "Acceleration through Optimistic No-Regret Dynamics

Jun-Kun Wang
College of Computing
Georgia Institute of Technology
Atlanta, GA 30313
jimwang@gatech.edu

Jacob Abernethy
College of Computing
Georgia Institute of Technology
Atlanta, GA 30313
prof@gatech.edu

Abstract

We consider the problem of minimizing a smooth convex function by reducing the optimization to computing the Nash equilibrium of a particular zero-sum convex-concave game. 
Zero-sum games can be solved using online learning dynamics, where a classical technique involves simulating two no-regret algorithms that play against each other and, after T rounds, the average iterate is guaranteed to solve the original optimization problem with error decaying as O(log T / T). In this paper we show that the technique can be enhanced to a rate of O(1/T²) by extending recent work [22, 25] that leverages optimistic learning to speed up equilibrium computation. The resulting optimization algorithm derived from this analysis coincides exactly with the well-known NESTEROVACCELERATION [16] method, and indeed the same story allows us to recover several variants of Nesterov's algorithm via small tweaks. We are also able to establish the accelerated linear rate for a function which is both strongly convex and smooth. This methodology unifies a number of different iterative optimization methods: we show that the HEAVYBALL algorithm is precisely the non-optimistic variant of NESTEROVACCELERATION, and recent prior work already established a similar perspective on FRANKWOLFE [2, 1].

1 Introduction

One of the most successful and broadly useful tools recently developed within the machine learning literature is the no-regret framework, and in particular online convex optimization (OCO) [28]. In the standard OCO setup, a learner is presented with a sequence of (convex) loss functions ℓ_1(·), ℓ_2(·), . . ., and must make a sequence of decisions x_1, x_2, . . . from some set K in an online fashion, observing ℓ_t only after having committed to x_t. Assuming the sequence {ℓ_t} is chosen by an adversary, the learner aims to minimize the average regret

R̄_T := (1/T) ( ∑_{t=1}^T ℓ_t(x_t) − min_{x∈K} ∑_{t=1}^T ℓ_t(x) )

against any such loss functions. 
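As a concrete toy illustration of this protocol, the following sketch runs online gradient descent against the fixed loss sequence ℓ_t = f for a simple quadratic, and checks that the averaged iterate is nearly optimal. The particular function, step-size schedule, and all names here are our own illustrative choices, not part of the paper.

```python
import numpy as np

# Minimal sketch of the OCO protocol with online gradient descent.
# We pick the simplest loss sequence, l_t = f for a fixed smooth quadratic,
# so the average regret directly bounds the optimization error of the
# averaged iterate. All choices below are ours, for illustration only.

def f(x):            # f(x) = ||x - 1||^2, minimized at x* = (1, ..., 1)
    return np.sum((x - 1.0) ** 2)

def grad_f(x):
    return 2.0 * (x - 1.0)

T, d = 500, 5
x = np.zeros(d)
iterates = []
for t in range(1, T + 1):
    iterates.append(x.copy())
    eta = 1.0 / (2.0 * t) ** 0.5        # standard O(1/sqrt(t)) step size
    x = x - eta * grad_f(x)             # online gradient descent update

x_bar = np.mean(iterates, axis=0)       # averaged iterate
gap = f(x_bar)                          # approximation error (min f = 0 here)
print(gap)  # small; decays along with the average regret
```

The gap is bounded by the average regret R̄_T, exactly as in the reduction discussed next.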
Many simple algorithms have been developed for OCO problems—including MIRRORDESCENT, FOLLOWTHEREGULARIZEDLEADER, FOLLOWTHEPERTURBEDLEADER, etc.—and these algorithms exhibit regret guarantees that are strong even against adversarial opponents. Under very weak conditions one can achieve a regret rate of R̄_T = O(1/√T), or even R̄_T = O(log T / T) with required curvature on ℓ_t.
One can apply online learning tools to several problems, but perhaps the simplest is to find the approximate minimum of a convex function argmin_{x∈K} f(x). With a simple reduction we set ℓ_t = f, and it is easy to show that, via Jensen's inequality, the average iterate x̄_T := (x_1 + . . . + x_T)/T satisfies

f(x̄_T) ≤ (1/T) ∑_{t=1}^T f(x_t) = (1/T) ∑_{t=1}^T ℓ_t(x_t) ≤ min_{x∈K} (1/T) ∑_{t=1}^T ℓ_t(x) + R̄_T = min_{x∈K} f(x) + R̄_T,

hence R̄_T upper bounds the approximation error. But this reduction, while simple and natural, is quite limited. For example, we know that when f(·) is smooth, more sophisticated algorithms such as FRANKWOLFE and HEAVYBALL achieve convergence rates of O(1/T), whereas the now-famous NESTEROVACCELERATION algorithm achieves a rate of O(1/T²). The fast rate shown by Nesterov was quite surprising at the time, and many researchers to this day find the result quite puzzling. There has been a great deal of work aimed at providing a more natural explanation of acceleration, with a more intuitive convergence proof [27, 4, 10]. This is indeed one of the main topics of the present work, and we will soon return to this discussion.
Another application of the no-regret framework is the solution of so-called saddle-point problems, which are equivalently referred to as Nash equilibria for zero-sum games. 

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Given a function g(x, y) which is convex in x and concave in y (often called a payoff function), define V* = inf_{x∈K} sup_y g(x, y). An ε-equilibrium of g(·,·) is a pair x̂, ŷ such that

V* − ε ≤ inf_{x∈K} g(x, ŷ) ≤ V* ≤ sup_y g(x̂, y) ≤ V* + ε.   (1)

One can find an approximate saddle point of the game with the following setup: implement a no-regret learning algorithm for both the x and y players simultaneously, and after observing the actions {x_t, y_t}_{t=1...T} return the time-averaged iterates (x̂, ŷ) = ( (x_1 + . . . + x_T)/T , (y_1 + . . . + y_T)/T ). A simple proof shows that (x̂, ŷ) is an approximate equilibrium, with approximation bounded by the average regret of both players (see Theorem 1). In the case where the function g(·,·) is biaffine, the no-regret reduction guarantees a rate of O(1/√T), and it was assumed by many researchers this was the fastest possible using this framework. But one of the most surprising online learning results to emerge in recent years established that no-regret dynamics can obtain an even faster rate of O(1/T). Relying on tools developed by [8], this fact was first proved by [21] and extended by [25]. The new ingredient in this recipe is the use of optimistic learning algorithms, where the learner seeks to benefit from the predictability of slowly-changing inputs {ℓ_t}.
We will consider solving the classical convex optimization problem min_x f(x), for smooth functions f, by instead solving an associated saddle-point problem which we call the Fenchel Game. Specifically, we take the payoff function g of the game to be

g(x, y) = ⟨x, y⟩ − f*(y),   (2)

where f*(·) is the Fenchel conjugate of f(·). 
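To make the Fenchel game concrete, here is a small numerical check (our own toy example, not from the paper) that for f(x) = ½‖x‖², whose conjugate is f*(y) = ½‖y‖², the payoff g(x, y) = ⟨x, y⟩ − f*(y) satisfies sup_y g(x, y) = f(x), with the supremum attained at y = ∇f(x) = x:

```python
import numpy as np

# Sanity check of the Fenchel game payoff g(x, y) = <x, y> - f*(y)
# for the toy case f(x) = 0.5||x||^2, so that f*(y) = 0.5||y||^2.
# (Our choice of f; the paper treats general smooth convex f.)

def f(x):
    return 0.5 * np.dot(x, x)

def f_conj(y):                      # Fenchel conjugate of f
    return 0.5 * np.dot(y, y)

def g(x, y):
    return np.dot(x, y) - f_conj(y)

rng = np.random.default_rng(0)
x = np.array([0.3, -1.2])
# sup_y g(x, y) is attained at y = grad f(x) = x, recovering f(x):
ys = x + 0.01 * rng.standard_normal((100, 2))   # candidates near the maximizer
best = max(g(x, y) for y in ys)
print(abs(g(x, x) - f(x)))   # the sup over y equals f(x)
```

Since g(x, y) = −½‖y − x‖² + ½‖x‖² here, every candidate y gives a value at most g(x, x) = f(x), which is what the check confirms.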
This is an appropriate choice of payoff function since V* = min_x f(x) and sup_y g(x̂, y) = sup_y ⟨x̂, y⟩ − f*(y) = f(x̂). Therefore, by the definition of an ε-equilibrium, we have:
Lemma 1. If (x̂, ŷ) is an ε-equilibrium of the Fenchel Game (2), then f(x̂) − min_x f(x) ≤ ε.
One can imagine computing the equilibrium of the Fenchel game using no-regret dynamics, and indeed this was the result of recent work [2] establishing the FRANKWOLFE algorithm as precisely an instance of two competing learning algorithms.
In the present work we will take this approach even further.

1. We show that, by considering a notion of weighted regret, we can compute equilibria in the Fenchel game at a rate of O(1/T²) using no-regret dynamics, where the only required condition is that f is smooth. This improves upon recent work [1] on a faster FRANKWOLFE method, which required strong convexity of f (see Appendix J).

2. We show that the secret sauce for obtaining the fast rate is precisely the use of an optimistic no-regret algorithm, OPTIMISTICFTL [1], combined with an appropriate weighting scheme.

3. We show that, when viewed simply as an optimization algorithm, this method is identical to the original NESTEROVACCELERATION method. In addition, we recover several variants of NESTEROVACCELERATION (see [15, 17, 19]) using small tweaks of the framework.

4. We show that if one simply plays FOLLOWTHELEADER without optimism, the resulting algorithm is precisely HEAVYBALL. The latter is known to achieve a suboptimal rate in general, and our analysis sheds light on this difference.

5. Under the additional assumption that the function f(·) is strongly convex, we show that an accelerated linear rate can also be obtained from the game framework.

6. 
Finally, we show that the same equilibrium framework can also be extended to composite optimization and leads to a variant of the Accelerated Proximal Method.

Related works: In recent years there has been growing interest in giving new interpretations of Nesterov's accelerated algorithms. For example, [26] gives a unified analysis of some of Nesterov's accelerated algorithms [17, 18, 19], using standard techniques and analysis from the optimization literature. [13] connects the design of accelerated algorithms with dynamical systems and control theory. [7] gives a geometric interpretation of Nesterov's method for unconstrained optimization, inspired by the ellipsoid method. [10] studies Nesterov's methods and the HEAVYBALL method for quadratic non-strongly convex problems by analyzing the eigenvalues of certain linear dynamical systems. [4] proposes a variant of accelerated algorithms by mixing the updates of gradient descent and mirror descent and showing that the two updates are complementary. [24, 27] connect acceleration algorithms with differential equations. In recent years a lot of work has emerged in which learning problems are treated as repeated games [14, 3], and many researchers have been studying the relationship between game dynamics and provable convergence rates [5, 11, 9].
We would like to acknowledge George Lan for his excellent notes titled "Lectures on Optimization for Machine Learning" (unpublished). In parallel to the development of the results in this paper, we discovered that Lan had observed a similar connection between NESTEROVACCELERATION and repeated game playing (Chapter 3.4). A game interpretation was given by George Lan and Yi Zhou in Section 2.2 of [12].

2 Preliminaries
Convex functions and conjugates. A function f on R^d is L-smooth w.r.t. 
a norm ‖·‖ if f is everywhere differentiable and has Lipschitz continuous gradient, ‖∇f(u) − ∇f(v)‖_* ≤ L‖u − v‖, where ‖·‖_* denotes the dual norm. Throughout the paper, our goal will be to solve the problem of minimizing an L-smooth function f(·) over a convex set K. We also assume that the optimal solution x* := argmin_{x∈K} f(x) has finite norm. For any convex function f, its Fenchel conjugate is f*(y) := sup_{x∈dom(f)} ⟨x, y⟩ − f(x). If a function f is convex, then its conjugate f* is also convex. Furthermore, when the function f(·) is strictly convex, we have that ∇f(x) = argmax_y ⟨x, y⟩ − f*(y).
Suppose we are given a differentiable function φ(·); then the Bregman divergence V_c(x) with respect to φ(·) at a point c is defined as V_c(x) := φ(x) − ⟨∇φ(c), x − c⟩ − φ(c). Let ‖·‖ be any norm on R^d. When we have that V_c(x) ≥ (σ/2)‖c − x‖² for any x, c ∈ dom(φ), we say that φ(·) is a σ-strongly convex function with respect to ‖·‖. Throughout the paper we assume that φ(·) is 1-strongly convex.

No-regret zero-sum game dynamics. Let us now consider the process of solving a zero-sum game via repeated play by a pair of online learning strategies. The sequential procedure is described in Algorithm 1.

Algorithm 1 Computing equilibrium using no-regret algorithms
1: Input: sequence α_1, . . . , α_T > 0
2: for t = 1, 2, . . . 
, T do
3: y-player selects y_t ∈ Y = R^d by OAlg^y.
4: x-player selects x_t ∈ X by OAlg^x, possibly with knowledge of y_t.
5: y-player suffers loss ℓ_t(y_t) with weight α_t, where ℓ_t(·) = −g(x_t, ·).
6: x-player suffers loss h_t(x_t) with weight α_t, where h_t(·) = g(·, y_t).
7: end for
8: Output (x̄_T, ȳ_T) := ( (1/A_T) ∑_{s=1}^T α_s x_s , (1/A_T) ∑_{s=1}^T α_s y_s ).

In this paper, we consider the Fenchel game with weighted losses depicted in Algorithm 1, following the same setup as [1]. In this game, the y-player plays before the x-player, and the x-player sees what the y-player plays before choosing its action. The y-player receives loss function α_t ℓ_t(·) in round t, in which ℓ_t(y) := f*(y) − ⟨x_t, y⟩, while the x-player sees its loss function α_t h_t(·) in round t, in which h_t(x) := ⟨x, y_t⟩ − f*(y_t). Consequently, we can define the weighted regret of the x and y players as

α-REG^y := ∑_{t=1}^T α_t ℓ_t(y_t) − min_y ∑_{t=1}^T α_t ℓ_t(y)   (3)
α-REG^x := ∑_{t=1}^T α_t h_t(x_t) − ∑_{t=1}^T α_t h_t(x*)   (4)

Notice that the x-player's regret is computed relative to x*, the minimizer of f(·), rather than the minimizer of ∑_{t=1}^T α_t h_t(·). Although slightly non-standard, this allows us to handle the unconstrained setting while Theorem 1 still holds as desired.
At times when we want to refer to the regret on another sequence y'_1, . . . , y'_T we may refer to this as α-REG(y'_1, . . . , y'_T). We also denote A_t as the cumulative sum of the weights, A_t := ∑_{s=1}^t α_s, and the weighted average regret as ᾱ-REG := α-REG / A_T. Finally, for offline constrained optimization (i.e. 
minx\u2208K f (x)), we let the decision space of the benchmark/comparator in the weighted regret\nde\ufb01nition to be X = K; for of\ufb02ine unconstrained optimization, we let the decision space of the\nbenchmark/comparator to be a norm ball that contains the optimum solution of the of\ufb02ine problem\n(i.e. contains arg minx\u2208Rn f (x)), which means that X of the comparator is a norm ball. We let\nY = Rd be unconstrained.\nTheorem 1. [1] Assume a T -length sequence \u03b1 are given. Suppose in Algorithm 1 the online\ny\nx and \u03b1-REG\nlearning algorithms OAlgx and OAlgy have the \u03b1-weighted average regret \u03b1-REG\nrespectively. Then the output (\u00afxT , \u00afyT ) is an \u0001-equilibrium for g(\u00b7,\u00b7), with \u0001 = \u03b1-REG\n+ \u03b1-REG\n\n1, . . . , y(cid:48)\n\ny\n\n.\n\nx\n\n3 An Accelerated Solution to the Fenchel Game via Optimism\n\nWe are going to analyze more closely the use of Algorithm 1, with the help of Theorem 1, to establish\na fast method to compute an approximate equilibrium of the Fenchel Game. In particular, we will\nestablish an approximation factor of O(1/T 2) after T iterations, and we recall that this leads to a\nO(1/T 2) algorithm for our primary goal of solving minx\u2208K f (x).\n\n3.1 Analysis of the weighted regret of the y-player (i.e. the gradient player)\n\nA very natural online learning algorithm is FOLLOWTHELEADER, which always plays the point with\nthe lowest (weighted) historical loss\n\n\u03b1-REGy((cid:101)y1, . . . ,(cid:101)yT ) \u2264(cid:80)T\nt=1 \u03b4t((cid:101)yt) \u2212 \u03b4t(\u02c6yt+1).\ns=1 \u03b1s(cid:96)s(y) and also \u02dcLt(y) := \u03b1t(cid:96)t\u22121(y) +(cid:80)t\u22121\nProof. 
Let L_t(y) := ∑_{s=1}^t α_s ℓ_s(y) and also L̃_t(y) := α_t ℓ_{t−1}(y) + ∑_{s=1}^{t−1} α_s ℓ_s(y). Then

α-REG(ỹ_{1:T}) := ∑_{t=1}^T α_t ℓ_t(ỹ_t) − L_T(ŷ_{T+1})
 = ∑_{t=1}^T α_t ℓ_t(ỹ_t) − L̃_T(ŷ_{T+1}) − δ_T(ŷ_{T+1})
 ≤ ∑_{t=1}^T α_t ℓ_t(ỹ_t) − L̃_T(ỹ_T) − δ_T(ŷ_{T+1})
 = ∑_{t=1}^{T−1} α_t ℓ_t(ỹ_t) − L_{T−1}(ỹ_T) + δ_T(ỹ_T) − δ_T(ŷ_{T+1})
 ≤ ∑_{t=1}^{T−1} α_t ℓ_t(ỹ_t) − L_{T−1}(ŷ_T) + δ_T(ỹ_T) − δ_T(ŷ_{T+1})
 = α-REG(ỹ_{1:T−1}) + δ_T(ỹ_T) − δ_T(ŷ_{T+1}).

FOLLOWTHELEADER:   ŷ_t := argmin_y ∑_{s=1}^{t−1} α_s ℓ_s(y).

FOLLOWTHELEADER is known to not perform well against arbitrary loss functions, but for strongly convex ℓ_t(·) one can prove an O(log T / T) regret bound in the unweighted case. For the time being, we shall focus on a slightly different algorithm that utilizes "optimism" in selecting the next action:

OPTIMISTICFTL:   ỹ_t := argmin_y { α_t ℓ_{t−1}(y) + ∑_{s=1}^{t−1} α_s ℓ_s(y) }.

This procedure can be viewed as an optimistic variant of FOLLOWTHELEADER since the algorithm is effectively making a bet that, while ℓ_t(·) has not yet been observed, it is likely to be quite similar to ℓ_{t−1}. Within the online learning community, the origins of this trick go back to [8], although their algorithm was described in terms of a 2-step descent method. This was later expanded by [21], who coined the term optimistic mirror descent (OMD) and showed that the proposed procedure can accelerate zero-sum game dynamics when both players utilize OMD. 
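The FOLLOWTHELEADER and OPTIMISTICFTL updates above admit closed forms for the Fenchel-game losses ℓ_s(y) = f*(y) − ⟨x_s, y⟩ when f(x) = ½‖x‖² (so f*(y) = ½‖y‖² and ∇f is the identity): FTL plays the weighted average of past x's, while the optimistic variant re-weights the most recent point. The following sketch, with toy data of our own choosing, computes both:

```python
import numpy as np

# Closed forms of FTL and OptimisticFTL for the y-player's Fenchel-game
# losses l_s(y) = 0.5||y||^2 - <x_s, y>, i.e. taking f(x) = 0.5||x||^2.
# Weights a_t = t; the x-sequence is arbitrary toy data (our choices).

rng = np.random.default_rng(0)
alphas = np.arange(1, 8, dtype=float)          # weights a_1..a_7
xs = rng.standard_normal((7, 3))               # an arbitrary x-player sequence

def ftl(t):
    # argmin_y sum_{s<t} a_s (0.5||y||^2 - <x_s, y>)
    # = weighted mean of x_1..x_{t-1}  (equals grad f(xbar_{t-1}) here)
    A = alphas[:t-1].sum()
    return (alphas[:t-1, None] * xs[:t-1]).sum(axis=0) / A

def optimistic_ftl(t):
    # same objective plus l_{t-1} counted again with weight a_t (the guess);
    # equals grad f(xtilde_t), with x_{t-1} re-weighted by a_t
    A = alphas[:t].sum()
    num = (alphas[:t-1, None] * xs[:t-1]).sum(axis=0) + alphas[t-1] * xs[t-2]
    return num / A

t = 5
y_hat = ftl(t)
y_tilde = optimistic_ftl(t)
print(y_hat, y_tilde)
```

Both points are minimizers of strongly convex quadratics, so perturbing either one can only increase its objective, which gives an easy way to check the closed forms.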
OPTIMISTICFTL, defined as a "batch" procedure, was first presented in [1], and many of the tools of the present paper follow directly from that work.
For convenience, we'll define δ_t(y) := α_t(ℓ_t(y) − ℓ_{t−1}(y)). Intuitively, the regret will be small if the functions δ_t are not too big. This is formalized in the following lemma.
Lemma 2. For an arbitrary sequence {α_t, ℓ_t}_{t=1...T}, the regret of OPTIMISTICFTL satisfies

α-REG^y(ỹ_1, . . . , ỹ_T) ≤ ∑_{t=1}^T δ_t(ỹ_t) − δ_t(ŷ_{t+1}).

The bound follows by induction on T.

The result from Lemma 2 is generic, and would hold for any online learning problem. But for the Fenchel game, we have a very specific sequence of loss functions, ℓ_t(y) := −g(x_t, y) = f*(y) − ⟨x_t, y⟩. With this in mind, let us further analyze the regret of the y-player.
For the time being, let us assume that the sequence of x_t's is arbitrary. We define

x̃_t := (1/A_t)(α_t x_{t−1} + ∑_{s=1}^{t−1} α_s x_s)   and   x̄_t := (1/A_t) ∑_{s=1}^t α_s x_s.

It is critical that we have two parallel sequences of iterate averages for the x-player. Our final algorithm will output x̄_T, whereas the Fenchel game dynamics will involve computing ∇f at the reweighted averages x̃_t for each t = 1, . . . , T.
To prove the key regret bound for the y-player, we first need to state some simple technical facts:

ŷ_{t+1} = argmin_y ∑_{s=1}^t α_s (f*(y) − ⟨x_s, y⟩) = argmax_y ⟨x̄_t, y⟩ − f*(y) = ∇f(x̄_t)   (5)
ỹ_t = ∇f(x̃_t)   (following the same reasoning as above)   (6)
x̃_t − x̄_t = (α_t/A_t)(x_{t−1} − x_t).   (7)

Equations 5 and 6 follow from elementary properties of Fenchel conjugation and the Legendre transform [23]. 
Equation 7 follows from a simple algebraic calculation.
Lemma 3. Suppose f(·) is a convex function that is L-smooth with respect to the norm ‖·‖ with dual norm ‖·‖_*. Let x_1, . . . , x_T be an arbitrary sequence of points. Then, we have

α-REG^y(ỹ_1, . . . , ỹ_T) ≤ L ∑_{t=1}^T (α_t²/A_t) ‖x_{t−1} − x_t‖².   (8)

Proof. Following Lemma 2, and noting that here we have δ_t(y) = α_t⟨x_{t−1} − x_t, y⟩, we have

∑_{t=1}^T α_t ℓ_t(ỹ_t) − α_t ℓ_t(y*) ≤ ∑_{t=1}^T δ_t(ỹ_t) − δ_t(ŷ_{t+1}) = ∑_{t=1}^T α_t ⟨x_{t−1} − x_t, ỹ_t − ŷ_{t+1}⟩
 = ∑_{t=1}^T α_t ⟨x_{t−1} − x_t, ∇f(x̃_t) − ∇f(x̄_t)⟩   (Eqns. 5, 6)
 ≤ ∑_{t=1}^T α_t ‖x_{t−1} − x_t‖ ‖∇f(x̃_t) − ∇f(x̄_t)‖_*   (Hölder's Ineq.)
 ≤ L ∑_{t=1}^T α_t ‖x_{t−1} − x_t‖ ‖x̃_t − x̄_t‖   (L-smoothness of f)
 = L ∑_{t=1}^T (α_t²/A_t) ‖x_{t−1} − x_t‖²   (Eqn. 7)

as desired.

We notice that a similar bound is given in [1] for the gradient player using OPTIMISTICFTL, yet the above result is a strict improvement, as the previous work relied on the additional assumption that f(·) is strongly convex. 
The above lemma depends only on the fact that f has Lipschitz gradients.

3.2 Analysis of the weighted regret of the x-player

In the present section we are going to consider that the x-player uses MIRRORDESCENT for updating its action, which is defined as follows:

x_t := argmin_{x∈K} α_t h_t(x) + (1/γ_t) V_{x_{t−1}}(x) = argmin_{x∈K} γ_t ⟨x, α_t y_t⟩ + V_{x_{t−1}}(x),   (9)

where we recall that the Bregman divergence V_x(·) is with respect to a 1-strongly convex regularization φ. Also, we note that the x-player has an advantage in these game dynamics, since x_t is chosen with knowledge of y_t and hence with knowledge of the incoming loss h_t(·).
Lemma 4. Let the sequence of x_t's be chosen according to MIRRORDESCENT. Assume that the Bregman divergence is uniformly bounded on K, so that D = sup_{t=1,...,T} V_{x_t}(x*), where x* denotes the minimizer of f(·). Assume that the sequence {γ_t}_{t=1,2,...} is non-increasing. Then we have

α-REG^x ≤ D/γ_T − ∑_{t=1}^T (1/(2γ_t)) ‖x_{t−1} − x_t‖².

The proof of this lemma is quite standard, and we postpone it to Appendix A. We also note that the benchmark x* is always within a finite norm ball by assumption. We give an alternative to this lemma in the appendix for the case when γ_t = γ is fixed, in which case we can instead use the more natural constant D = V_{x_1}(x*).

3.3 Convergence Rate of the Fenchel Game

Theorem 2. Let us consider the output (x̄_T, ȳ_T) of Algorithm 1 under the following conditions: (a) the sequence {α_t} is positive but otherwise arbitrary, (b) OAlg^y is chosen to be OPTIMISTICFTL, (c) OAlg^x is MIRRORDESCENT with any non-increasing positive sequence {γ_t}, and (d) we have a bound V_{x_t}(x*) ≤ D for all t. 
Then the point x̄_T satisfies

f(x̄_T) − min_{x∈X} f(x) ≤ (1/A_T) [ D/γ_T + ∑_{t=1}^T ( (α_t²/A_t) L − 1/(2γ_t) ) ‖x_{t−1} − x_t‖² ].   (10)

Proof. We have already done the hard work to prove this theorem. Lemma 1 tells us we can bound the error of x̄_T by the ε error of the approximate equilibrium (x̄_T, ȳ_T). Theorem 1 tells us that the pair (x̄_T, ȳ_T) derived from Algorithm 1 is controlled by the sum of averaged regrets of both players, (1/A_T)(α-REG^x + α-REG^y). But we now have control over both of these two regret quantities, from Lemmas 3 and 4. The right hand side of (10) is the sum of these bounds.

Theorem 2 is somewhat opaque without specifying the sequence {α_t}. But what we now show is that the summation term vanishes when we can guarantee that α_t²/A_t remains constant! This is where we obtain the following fast rate.
Corollary 1. Following Theorem 2 with α_t = t and any non-increasing sequence γ_t satisfying 1/(CL) ≤ γ_t ≤ 1/(4L) for some constant C > 4, we have f(x̄_T) − min_{x∈X} f(x) ≤ 2CLD/T².

Proof. Observing A_t := t(t+1)/2, the choice of {α_t, γ_t} implies D/γ_T ≤ CDL and L α_t²/A_t = 2Lt²/(t(t+1)) ≤ 2L ≤ 1/(2γ_t), which ensures that the summation term in (10) is negative. The rest is simple algebra.

A straightforward choice for the learning rate γ_t is simply the constant sequence γ_t = 1/(4L). The corollary is stated with a changing γ_t in order to bring out a connection to the classical NESTEROVACCELERATION in the following section.
Remark: It is worth dwelling on exactly how we obtained the above result. 
A less refined analysis of the MIRRORDESCENT algorithm would have simply ignored the negative summation term in Lemma 4, and simply upper bounded it by 0. But the negative terms ‖x_t − x_{t−1}‖² in this sum happen to correspond exactly to the positive terms one obtains in the regret bound for the y-player, and this is true only as a result of using the OPTIMISTICFTL algorithm. To obtain a cancellation of these terms, we need a γ_t which is roughly constant, and hence we need to ensure that α_t²/A_t = O(1). The final bound, of course, is determined by the inverse quantity 1/A_T, and a quick inspection reveals that the best choice is α_t = Θ(t). This is not the only choice that could work, and we conjecture that there are scenarios in which better bounds are achievable for different α_t tuning. We show in Section 4.3 that a linear rate is achievable when f(·) is also strongly convex, and there we tune α_t to grow exponentially in t rather than linearly.

4 Nesterov's methods are instances of our accelerated solution to the game

Starting from 1983, Nesterov has proposed three accelerated methods for smooth convex problems (i.e. [16, 15, 17, 19]). In this section, we show that our accelerated algorithm for the Fenchel game can generate all of these methods with some simple tweaks.

4.1 Recovering Nesterov's (1983) method for unconstrained smooth convex problems [16, 15]

In this subsection, we assume that the x-player's action space is unconstrained. 
That is, K = R^n. Consider the following algorithm.

Algorithm 2 A variant of our accelerated algorithm.
1: In the weighted loss setting of Algorithm 1:
2: y-player uses OPTIMISTICFTL as OAlg^y: y_t = ∇f(x̃_t).
3: x-player uses ONLINEGRADIENTDESCENT as OAlg^x:
4: x_t = x_{t−1} − γ_t α_t ∇h_t(x) = x_{t−1} − γ_t α_t y_t = x_{t−1} − γ_t α_t ∇f(x̃_t).

Theorem 3. Let α_t = t. Assume K = R^n. Algorithm 2 is exactly the case in which the x-player uses MIRRORDESCENT. Therefore, x̄_T is an O(1/T²)-approximate optimal solution of min_x f(x) by Theorem 2 and Corollary 1.

Proof. For the unconstrained case, we can let the distance generating function of the Bregman divergence be the squared L2 norm, i.e. φ(x) := ½‖x‖₂². Then the update becomes x_t = argmin_x γ_t⟨x, α_t y_t⟩ + V_{x_{t−1}}(x) = argmin_x γ_t⟨x, α_t y_t⟩ + ½‖x‖₂² − ⟨x_{t−1}, x − x_{t−1}⟩ − ½‖x_{t−1}‖₂². Differentiating the objective w.r.t. x and setting it to zero, one gets x_t = x_{t−1} − γ_t α_t y_t.

Having shown that Algorithm 2 is indeed our accelerated algorithm for the Fenchel game, we are going to show that Algorithm 2 has a direct correspondence with Nesterov's first acceleration method (Algorithm 3) [16, 15] (see also [24]).

Algorithm 3 Nesterov Algorithm [16, 15]
1: Init: w_0 = z_0. Require: θ ≤ 1/L.
2: for t = 1, 2, . . . 
, T do
3: w_t = z_{t−1} − θ∇f(z_{t−1}).
4: z_t = w_t + ((t−1)/(t+2))(w_t − w_{t−1}).
5: end for
6: Output w_T.

To see the equivalence, let us re-write x̄_t := (1/A_t) ∑_{s=1}^t α_s x_s of Algorithm 2:

x̄_t = (A_{t−1} x̄_{t−1} + α_t x_t)/A_t = (A_{t−1} x̄_{t−1} + α_t (x_{t−1} − γ_t α_t ∇f(x̃_t)))/A_t
 = (A_{t−1} x̄_{t−1} + α_t ((A_{t−1} x̄_{t−1} − A_{t−2} x̄_{t−2})/α_{t−1} − γ_t α_t ∇f(x̃_t)))/A_t
 = x̄_{t−1} ( A_{t−1}/A_t + α_t(α_{t−1} + A_{t−2})/(A_t α_{t−1}) ) − x̄_{t−2} ( α_t A_{t−2}/(A_t α_{t−1}) ) − (γ_t α_t²/A_t) ∇f(x̃_t)
 = x̄_{t−1} − (γ_t α_t²/A_t) ∇f(x̃_t) + (α_t A_{t−2}/(A_t α_{t−1})) (x̄_{t−1} − x̄_{t−2})
 = x̄_{t−1} − (1/(4L)) ∇f(x̃_t) + ((t−2)/(t+1)) (x̄_{t−1} − x̄_{t−2}),   (11)

where α_t = t and γ_t = (t+1)/(8Lt).
Theorem 4. Algorithm 3 with θ = 1/(4L) is equivalent to Algorithm 2 with γ_t = (t+1)/(8Lt), in the sense that they generate equivalent sequences of iterates:

w_t = x̄_t   and   z_{t−1} = x̃_t,   for all t = 1, 2, . . . , T.

Let us switch to comparing the update of Algorithm 2, which is (11), with the update of the HEAVYBALL algorithm. We see that (11) has the so-called momentum term (i.e. it has a (x̄_{t−1} − x̄_{t−2}) term). But the difference is that the gradient is evaluated at x̃_t = (1/A_t)(α_t x_{t−1} + ∑_{s=1}^{t−1} α_s x_s), not x̄_{t−1} = (1/A_{t−1}) ∑_{s=1}^{t−1} α_s x_s, which is the consequence of the y-player playing OPTIMISTICFTL. 
To elaborate, let us consider a scenario (shown in Algorithm 4) in which the y-player plays FOLLOWTHELEADER instead of OPTIMISTICFTL.

Algorithm 4 HEAVYBALL algorithm
1: In the weighted loss setting of Algorithm 1:
2: y-player uses FOLLOWTHELEADER as OAlg^y: y_t = ∇f(x̄_{t−1}).
3: x-player uses ONLINEGRADIENTDESCENT as OAlg^x:
4: x_t := x_{t−1} − γ_t α_t ∇h_t(x) = x_{t−1} − γ_t α_t y_t = x_{t−1} − γ_t α_t ∇f(x̄_{t−1}).

Following what we did in (11), we can rewrite x̄_t of Algorithm 4 as

x̄_t = x̄_{t−1} − (γ_t α_t²/A_t) ∇f(x̄_{t−1}) + (α_t A_{t−2}/(A_t α_{t−1})) (x̄_{t−1} − x̄_{t−2}),   (12)

by observing that (11) still holds except that ∇f(x̃_t) is changed to ∇f(x̄_{t−1}), as the y-player now uses FOLLOWTHELEADER; this gives us the update of the Heavy Ball algorithm as (12). Moreover, by the regret analysis, we have the following theorem. The proof is in Appendix C.
Theorem 5. Let α_t = t. Assume K = R^n. Also, let γ_t = O(1/L). The output x̄_T of Algorithm 4 is an O(1/T)-approximate optimal solution of min_x f(x).

To conclude, by comparing Algorithm 2 and Algorithm 4, we see that Nesterov's (1983) method enjoys the O(1/T²) rate since it adopts OPTIMISTICFTL, while the HEAVYBALL algorithm, which adopts FTL, may not enjoy the fast rate, as the distance terms may not cancel out. 
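The contrast between Algorithms 2 and 4 can be simulated directly. The sketch below runs both sets of game dynamics on a random smooth quadratic, following the text's α_t = t weighting and our reading of the step-size tunings; the test function, dimensions, and warm-start choice for the FTL player's first round are our own toy assumptions:

```python
import numpy as np

# Game-dynamics sketch: Algorithm 2 (y-player: OptimisticFTL, i.e. Nesterov)
# vs Algorithm 4 (y-player: FTL, i.e. Heavy Ball) on f(x) = 0.5 x^T Q x.
# Weights a_t = t; step sizes follow the text's tuning (our reading).

rng = np.random.default_rng(1)
M = rng.standard_normal((20, 20))
Q = M.T @ M / 20.0                      # PSD Hessian, min f = 0 at x* = 0
L = np.linalg.eigvalsh(Q).max()         # smoothness constant
grad = lambda x: Q @ x
f = lambda x: 0.5 * x @ Q @ x

def run(T, optimistic):
    x_prev = np.ones(20)                # x_0
    S, A = np.zeros(20), 0.0            # running weighted sum / total weight
    for t in range(1, T + 1):
        a = float(t)
        if optimistic:                  # y_t = grad f(xtilde_t)
            y = grad((a * x_prev + S) / (A + a))
            gamma = (t + 1) / (8.0 * L * t)
        else:                           # y_t = grad f(xbar_{t-1}) (FTL);
            y = grad(S / A) if A > 0 else grad(x_prev)  # warm start is ours
            gamma = 1.0 / (4.0 * L)
        x_prev = x_prev - gamma * a * y  # online gradient descent step
        S, A = S + a * x_prev, A + a
    return f(S / A)                      # error of the weighted average iterate

print(run(200, True), run(200, False))
```

Consistent with Theorems 3 and 5, the optimistic run is guaranteed an O(1/T²) error here, while the FTL (Heavy Ball) run is only guaranteed O(1/T); on quadratics the empirical gap can be modest, which matches the discussion above.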
The result also conforms to empirical studies showing that HEAVYBALL does not exhibit acceleration on general smooth convex problems.

4.2 Recovering Nesterov's (1988) 1-memory method [17] and Nesterov's (2005) ∞-memory method [19]

In this subsection, we consider recovering Nesterov's (1988) 1-memory method [17] and Nesterov's (2005) ∞-memory method [19]. To be specific, we adopt the presentation of Nesterov's algorithm given in Algorithm 1 and Algorithm 3 of [26], respectively.

Algorithm 5 (A) Nesterov's 1-memory method [17] and (B) Nesterov's ∞-memory method [19]
1: Input: parameters $\beta_t = \frac{2}{t+1}$, $\gamma'_t = \frac{t}{4L}$, $\theta_t = t$, and $\eta = \frac{1}{4L}$.
2: Init: $w_0 = x_0$.
3: for $t = 1, 2, \ldots, T$ do
4:    $z_t = (1-\beta_t)w_{t-1} + \beta_t x_{t-1}$.
5:    (A) $x_t = \arg\min_{x\in K}\ \gamma'_t\langle \nabla f(z_t), x\rangle + V_{x_{t-1}}(x)$.
6:    Or, (B) $x_t = \arg\min_{x\in K}\ \sum_{s=1}^{t}\theta_s\langle x, \nabla f(z_s)\rangle + \frac{1}{\eta}R(x)$, where $R(\cdot)$ is 1-strongly convex.
7:    $w_t = (1-\beta_t)w_{t-1} + \beta_t x_t$.
8: end for
9: Output $w_T$.

Theorem 6. Let $\alpha_t = t$. Algorithm 5 with update option (A) is the case in which the y-player uses OPTIMISTICFTL and the x-player adopts MIRRORDESCENT with $\gamma_t = \frac{1}{4L}$ in the Fenchel game. Therefore, $w_T$ is an $O(\frac{1}{T^2})$-approximate optimal solution of $\min_{x\in K} f(x)$.

The proof is in Appendix D, which shows the direct correspondence of Algorithm 5 with option (A) to our accelerated solution in Section 3.

Theorem 7. Let $\alpha_t = t$. Algorithm 5 with update option (B) is the case in which the y-player uses OPTIMISTICFTL and the x-player adopts BETHEREGULARIZEDLEADER with $\eta = \frac{1}{4L}$ in the Fenchel game. Therefore, $w_T$ is an $O(\frac{1}{T^2})$-approximate optimal solution of $\min_{x\in K} f(x)$.

The proof is in Appendix E, which requires the regret bound of BETHEREGULARIZEDLEADER.

4.3 Accelerated linear rate

Nesterov observed that, when $f(\cdot)$ is both $\mu$-strongly convex and $L$-smooth, one can achieve a rate that is exponentially decaying in $T$ (see, e.g., pages 71–81 of [18]). It is natural to ask whether the zero-sum game and regret analysis in the present work also recover this faster rate in the same fashion. We answer this in the affirmative. Denote $\kappa := \frac{L}{\mu}$. A property of $f(x)$ being $\mu$-strongly convex is that the function $\tilde{f}(x) := f(x) - \frac{\mu}{2}\|x\|^2$ is still a convex function. Now we define a new game whose payoff function is $\tilde{g}(x, y) := \langle x, y\rangle - \tilde{f}^*(y) + \frac{\mu}{2}\|x\|^2$. Then the minimax value of the game is $V^* := \min_x\max_y \tilde{g}(x, y) = \min_x \tilde{f}(x) + \frac{\mu}{2}\|x\|^2 = \min_x f(x)$. Observe that, in this game, the loss of the y-player in round $t$ is $\alpha_t\ell_t(y) := \alpha_t(\tilde{f}^*(y) - \langle x_t, y\rangle)$, while the loss of the x-player in round $t$ is the strongly convex function $\alpha_t h_t(x) := \alpha_t(\langle x, y_t\rangle + \frac{\mu}{2}\|x\|^2)$. The cumulative loss function of the x-player becomes more and more strongly convex over time, which is the key to allowing the exponential growth of the total weight $A_t$ that leads to the linear rate. In this setup, we have a "warmup round" $t = 0$, and thus we denote $\tilde{A}_t := \sum_{s=0}^{t}\alpha_s$, which incorporates the additional step into the average. The proof of the following result is in Appendix H.

Theorem 8. For the game $\tilde{g}(x, y) := \langle x, y\rangle - \tilde{f}^*(y) + \frac{\mu}{2}\|x\|^2$, if the y-player plays OPTIMISTICFTL and the x-player plays BETHEREGULARIZEDLEADER, $x_t \leftarrow \arg\min_{x\in\mathcal{X}}\sum_{s=0}^{t}\alpha_s\ell_s(x)$, where $\alpha_0\ell_0(x) := \alpha_0\frac{\mu}{2}\|x\|^2$, then the weighted average points $(\bar{x}_T, \bar{y}_T)$ form an $O(\exp(-\frac{T}{\sqrt{\kappa}}))$-approximate equilibrium of the game, where the weights $\alpha_0, \alpha_1, \ldots$ are chosen to satisfy $\frac{\alpha_t}{\tilde{A}_t} = \frac{1}{\sqrt{6\kappa}}$. This implies that $f(\bar{x}_T) - \min_{x\in\mathcal{X}} f(x) = O(\exp(-\frac{T}{\sqrt{\kappa}}))$.

5 Accelerated Proximal Method

In this section, we consider solving composite optimization problems $\min_{x\in\mathbb{R}^n} f(x) + \psi(x)$, where $f(\cdot)$ is smooth convex but $\psi(\cdot)$ is possibly non-differentiable convex (e.g., $\|\cdot\|_1$). We want to show that the game analysis still applies to this problem; we just need to change the payoff function $g$ to account for $\psi(x)$. Specifically, we consider the following two-player zero-sum game, $\min_x\max_y\{\langle x, y\rangle - f^*(y) + \psi(x)\}$. Notice that the minimax value of the game is $\min_x f(x) + \psi(x)$, which is exactly the optimum value of the composite optimization problem. Let us denote the proximal operator as $\mathrm{prox}_{\lambda\psi}(v) := \arg\min_x\big(\psi(x) + \frac{1}{2\lambda}\|x - v\|^2\big)$.$^1$
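For a concrete instance (ours, not from the paper): when $\psi = \|\cdot\|_1$, the proximal operator just defined has the familiar coordinate-wise soft-thresholding closed form. The specific inputs `v` and `lam` below, and the grid-based sanity check, are illustrative assumptions.

```python
# Sketch (assumptions ours): for psi(x) = ||x||_1, the proximal operator
# prox_{lam*psi}(v) = argmin_x psi(x) + ||x - v||^2 / (2*lam)
# is solved coordinate-wise by "soft-thresholding".

def soft_threshold(v, lam):
    # shrink each coordinate of v toward 0 by lam, zeroing small entries
    return [(1.0 if vi > 0 else -1.0) * max(abs(vi) - lam, 0.0) for vi in v]

def prox_objective(x, v, lam):
    # the objective defining the proximal operator, with psi the l1 norm
    return (sum(abs(xi) for xi in x)
            + sum((xi - vi) ** 2 for xi, vi in zip(x, v)) / (2 * lam))

v, lam = [1.5, -0.3, 0.0], 0.5
p = soft_threshold(v, lam)

# brute-force sanity check on the first coordinate: the closed form should
# match the minimizer of the scalar prox objective over a fine grid
grid = [i / 100.0 - 2.0 for i in range(401)]
best = min(grid, key=lambda x: prox_objective([x], [v[0]], lam))
```

Here `p` works out to $[1.0, 0.0, 0.0]$: the coordinate $1.5$ is shrunk by $\lambda = 0.5$, and the coordinate $-0.3$ is within the threshold and set to zero.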
Algorithm 6 Accelerated Proximal Method
1: In the weighted loss setting of Algorithm 1 (let $\alpha_t = t$ and $\gamma_t = \frac{1}{4L}$):
2:    y-player uses OPTIMISTICFTL as OAlg$^y$: $y_t = \nabla f(\tilde{x}_t)$.
3:    x-player uses MIRRORDESCENT with distance-generating function $\frac{1}{2}\|x\|_2^2$ in the Bregman divergence as OAlg$^x$:
4:    $x_t = \arg\min_x \gamma_t(\alpha_t h_t(x)) + V_{x_{t-1}}(x) = \arg\min_x \gamma_t\alpha_t\{\langle x, y_t\rangle + \psi(x)\} + V_{x_{t-1}}(x)$
5:    $\quad = \arg\min_x \psi(x) + \frac{1}{2\alpha_t\gamma_t}\big(\|x\|_2^2 + 2\langle \alpha_t\gamma_t y_t - x_{t-1}, x\rangle\big) = \mathrm{prox}_{\alpha_t\gamma_t\psi}\big(x_{t-1} - \alpha_t\gamma_t\nabla f(\tilde{x}_t)\big)$.

We note that the loss function of the x-player here, $\alpha_t h_t(x) = \alpha_t(\langle x, y_t\rangle + \psi(x))$, is possibly nonlinear. Yet, we can slightly adapt the analysis in Section 3 to show that the weighted average $\bar{x}_T$ is still an $O(1/T^2)$-approximate optimal solution of the offline problem. Please see Appendix I for details. One can view Algorithm 6 as a variant of the so-called "Accelerated Proximal Gradient" in [6]. Yet, the design and analysis of our algorithm are simpler than those of [6].

Acknowledgements: We would like to thank Kevin Lai and Kfir Levy for helpful discussions leading up to the results in this paper. This work was supported by funding from the Division of Computer Science and Engineering at the University of Michigan, from the College of Computing at the Georgia Institute of Technology, NSF TRIPODS award 1740776, and NSF CAREER award 1453304.

$^1$It is known that for some $\psi(\cdot)$, the corresponding proximal operations have closed-form solutions (see, e.g., [20] for details).

References

[1] Jacob Abernethy, Kfir Levy, Kevin Lai, and Jun-Kun Wang. Faster rates for convex-concave games. COLT, 2018.

[2] Jacob Abernethy and Jun-Kun Wang. Frank-Wolfe and equilibrium computation. NIPS, 2017.

[3] Jacob Abernethy, Manfred K. Warmuth, and Joel Yellin. Optimal strategies from random walks. In Proceedings of the 21st Annual Conference on Learning Theory, pages 437–446, 2008.

[4] Zeyuan Allen-Zhu and Lorenzo Orecchia. Linear coupling: An ultimate unification of gradient and mirror descent. ITCS, 2017.

[5] David Balduzzi, Sebastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, and Thore Graepel. The mechanics of n-player differentiable games. arXiv preprint arXiv:1802.05642, 2018.

[6] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2009.

[7] Sebastien Bubeck, Yin Tat Lee, and Mohit Singh. A geometric alternative to Nesterov's accelerated gradient descent. 2015.

[8] Chao-Kai Chiang, Tianbao Yang, Chia-Jung Lee, Mehrdad Mahdavi, Chi-Jen Lu, Rong Jin, and Shenghuo Zhu. Online optimization with gradual variations. 2012.

[9] Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs with optimism. arXiv preprint arXiv:1711.00141, 2017.

[10] Nicolas Flammarion and Francis Bach. From averaging to acceleration, there is only a step-size. COLT, 2015.

[11] Gauthier Gidel, Reyhane Askari Hemmat, Mohammad Pezeshki, Gabriel Huang, Remi Lepriol, Simon Lacoste-Julien, and Ioannis Mitliagkas. Negative momentum for improved game dynamics. arXiv preprint arXiv:1807.04740, 2018.

[12] Guanghui Lan and Yi Zhou. An optimal randomized incremental gradient method. Mathematical Programming, 2017.

[13] Laurent Lessard, Benjamin Recht, and Andrew Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 2016.

[14] Brendan McMahan and Jacob Abernethy. Minimax optimal algorithms for unconstrained linear optimization. In Advances in Neural Information Processing Systems 26, pages 2724–2732. Curran Associates, Inc., 2013.

[15] Yuri Nesterov. A method for unconstrained convex minimization problem with the rate of convergence $O(1/k^2)$. Doklady AN USSR, 1983.

[16] Yuri Nesterov. A method of solving a convex programming problem with convergence rate $O(1/k^2)$. Soviet Mathematics Doklady, 27:372–376, 1983.

[17] Yuri Nesterov. On an approach to the construction of optimal methods of minimization of smooth convex functions. Ekonom. i Mat. Metody, 24:509–517, 1988.

[18] Yuri Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2004.

[19] Yuri Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 2005.

[20] Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in Optimization, 2014.

[21] Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. COLT, 2013.

[22] Alexander Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable sequences. NIPS, 2013.

[23] Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1996.

[24] Weijie Su, Stephen Boyd, and Emmanuel Candes. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. NIPS, 2014.

[25] Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E. Schapire. Fast convergence of regularized learning in games. NIPS, 2015.

[26] Paul Tseng. On accelerated proximal gradient methods for convex-concave optimization. 2008.

[27] Andre Wibisono, Ashia C. Wilson, and Michael I. Jordan. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–E7358, 2016.

[28] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.