{"title": "On Frank-Wolfe and Equilibrium Computation", "book": "Advances in Neural Information Processing Systems", "page_first": 6584, "page_last": 6593, "abstract": "We consider the Frank-Wolfe (FW) method for constrained convex optimization, and we show that this classical technique can be interpreted from a different perspective: FW emerges as the computation of an equilibrium (saddle point) of a special convex-concave zero sum game. This saddle-point trick relies on the existence of no-regret online learning to both generate a sequence of iterates but also to provide a proof of convergence through vanishing regret. We show that our stated equivalence has several nice properties, as it exhibits a modularity that gives rise to various old and new algorithms. We explore a few such resulting methods, and provide experimental results to demonstrate correctness and efficiency.", "full_text": "On Frank-Wolfe and Equilibrium Computation\n\nJacob Abernethy\n\nGeorgia Institute of Technology\n\nprof@gatech.edu\n\nJun-Kun Wang\n\nGeorgia Institute of Technology\n\njimwang@gatech.edu\n\nAbstract\n\nWe consider the Frank-Wolfe (FW) method for constrained convex optimization,\nand we show that this classical technique can be interpreted from a different\nperspective: FW emerges as the computation of an equilibrium (saddle point) of\na special convex-concave zero sum game. This saddle-point trick relies on the\nexistence of no-regret online learning to both generate a sequence of iterates but\nalso to provide a proof of convergence through vanishing regret. We show that our\nstated equivalence has several nice properties, as it exhibits a modularity that gives\nrise to various old and new algorithms. 
We explore a few such resulting methods, and provide experimental results to demonstrate correctness and efficiency.\n\n1\n\nIntroduction\n\nThere has been a burst of interest in a technique known as the Frank-Wolfe method (FW) [10], also known as conditional gradient, for solving constrained optimization problems. FW is entirely a first-order method, does not require any projection operation, and instead relies on access to a linear optimization oracle. Given a compact and convex constraint set X \u2282 Rd, we require the ability to (quickly) answer queries of the form O(v) := arg minx\u2208X x\u22a4v, for any vector v \u2208 Rd. Other techniques such as gradient descent methods require repeated projections into the constraint set, which can be prohibitively expensive. Interior point algorithms, such as Newton path-following schemes [1], require computing a Hessian inverse at each iteration, which generally does not scale well with the dimension.\nIn the present paper we aim to give a new perspective on the Frank-Wolfe method by showing that, in a broad sense, it can be viewed as a special case of equilibrium computation via online learning. Indeed, when the optimization objective is cast as a particular convex-concave payoff function, then we are able to extract the desired optimal point via the equilibrium of the associated zero-sum game. Within Machine Learning there has been a lot of attention paid to the computation of optimal strategies for zero-sum games using online learning techniques. An amazing result, attributed to [12] yet now practically folklore in the literature, says that we can compute the optimal equilibrium in a zero-sum game by pitting two online learning strategies against each other: as long as they achieve the desired regret-minimization guarantee, the long-run empirical average of their actions (strategy choices) must converge to the optimal equilibrium. 
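The folklore result admits a very short demonstration. The following sketch (our own illustration, not taken from the paper) pits two multiplicative-weights learners against each other in the zero-sum game of matching pennies, starting from deliberately skewed strategies; the payoff matrix, step size, and horizon are all illustrative choices:

```python
import math

# Matching pennies: the row player minimizes x^T A y, the column player maximizes it.
A = [[1.0, -1.0], [-1.0, 1.0]]

def mw_selfplay(T=5000, eta=0.05):
    w = [1.0, 2.0]                      # deliberately skewed starting weights
    v = [3.0, 1.0]
    avg_x, avg_y = [0.0, 0.0], [0.0, 0.0]
    for _ in range(T):
        x = [wi / sum(w) for wi in w]   # current mixed strategies
        y = [vi / sum(v) for vi in v]
        for i in range(2):
            avg_x[i] += x[i] / T
            avg_y[i] += y[i] / T
        loss_x = [A[i][0] * y[0] + A[i][1] * y[1] for i in range(2)]  # row's losses
        gain_y = [A[0][j] * x[0] + A[1][j] * x[1] for j in range(2)]  # column's gains
        w = [w[i] * math.exp(-eta * loss_x[i]) for i in range(2)]     # multiplicative update
        v = [v[j] * math.exp(eta * gain_y[j]) for j in range(2)]
    return avg_x, avg_y

avg_x, avg_y = mw_selfplay()
```

Neither player's last iterate need converge; it is the empirical averages avg_x and avg_y that approach the uniform equilibrium (1/2, 1/2), mirroring the vanishing-regret argument.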
This trick is both beautiful and extremely useful: it was in some sense the core of early work in Boosting [11], it has been shown to generalize many linear programming techniques [3], it serves as the key tool for recent advances in flow optimization problems [8], and it has been instrumental in understanding differential privacy [9].\nWe begin in Section 2 by reviewing the method of proving a generalized minimax theorem using regret minimization, and we show how this proof is actually constructive and gives rise to a generic meta-algorithm. This meta-algorithm is especially modular, and allows for the substitution of various algorithmic tools that achieve, up to convergence rates, essentially the same core result. We then show that the original Frank-Wolfe algorithm is simply one instantiation of this meta-algorithm, yet where the convergence rate follows as a trivial consequence of the main theorem, albeit with an additional O(log T ) factor.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fWe build upon this by showing that a number of variants of Frank-Wolfe are also simple instantiations of our meta-algorithm, with a convergence rate that follows easily. For example, we propose the cumulative gradient variant of Frank-Wolfe and prove that the same guarantee holds, yet relies on a potentially more stable optimization oracle. 
We show that the techniques of [31], which use stochastic smoothing, correspond to implementing a Follow-the-Perturbed-Leader variant of our meta-algorithm. And finally, we use our framework to prove an entirely new result, showing that one obtains an O(log T /T ) convergence rate even when the objective f (\u00b7) is not smooth, but instead the constraint set satisfies strong convexity.\nThe results laid out in this paper provide value not only in proving rates and establishing new and existing algorithms but also in setting forth a perspective on Frank-Wolfe-style methods that can leverage the wealth of results we have available from online learning and online convex optimization. At present, the possibilities and limits of various online learning problems have been thoroughly worked out [20, 7] with incredibly tight bounds. Using the connections we put forth, many of these results can provide a stronger theoretical framework towards understanding projection-free conditional gradient methods.\nRelated work on projection-free algorithms\n[25] gives an analysis of FW for smooth objectives, and shows that FW converges at an O(1/T ) rate even when the linear oracle is solved approximately, under certain conditions. [30] develops a block-wise update strategy for FW on the dual objective of structural SVM, where only a subset of dual variables are updated at each iteration. In the algorithm, a smaller oracle is called due to the block-wise update, which reduces the computational time per iteration and leads to an overall speedup. [37] proposes updating multiple blocks at a time. [34] proposes using various measures to select a block for update.\nIn another direction, some results have aimed at obtaining improved convergence rates. [14] shows that for strongly convex and smooth objective functions, FW can achieve an O(1/T^2) convergence rate over a strongly convex set. 
[13, 15] first show that one can achieve linear convergence for strongly convex and smooth objectives over polytopes using a projection-free algorithm. The algorithm constructs a stronger oracle that can be efficiently implemented for certain polytopes, such as the simplex. [29] shows that some variants of FW such as away-step FW [38] or pairwise FW enjoy an exponential convergence rate when the feasible set is a polytope. [5] provides a refined analysis for the away-step FW. [17] extends [29] to some saddle-point optimization problems, where the constraint set is assumed to be a polytope and the objective is required to be strongly convex in one variable and strongly concave in the other. A drawback of away-step FW [38] is that it requires storing the previous outputs from the oracle. Very recently, [16] develops a new variant that avoids this issue for specific polytopes and also enjoys exponential convergence for strongly convex and smooth objectives. Note that all of the exponential convergence results depend on some geometric properties of the underlying polytope.\nOther works include variants for the stochastic setting [23], the online learning setting [22], minimizing some structural norms [19, 39], or reducing the number of gradient evaluations [32]. There is also a connection between subgradient descent and FW; Bach [4] shows that for certain types of objectives, subgradient descent applied to the primal domain is equivalent to FW applied to the dual domain.\n\nPreliminaries and Notation\nDefinition 1: A convex set Y \u2286 Rm is an \u03b1-strongly convex set w.r.t. a norm \u2225\u00b7\u2225 if for any u, v \u2208 Y and any \u03b8 \u2208 [0, 1], the \u2225\u00b7\u2225 ball centered at \u03b8u + (1 \u2212 \u03b8)v with radius \u03b8(1 \u2212 \u03b8)(\u03b1/2)\u2225u \u2212 v\u2225^2 is contained in Y . 
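As a quick numerical check of Definition 1 (our own sketch: the Euclidean unit ball is the standard example and is 1-strongly convex w.r.t. \u2225\u00b7\u22252), one can sample pairs u, v from the ball and verify that the prescribed ball around \u03b8u + (1 \u2212 \u03b8)v stays inside:

```python
import random, math

def norm(x):
    return math.sqrt(sum(xi * xi for xi in x))

def containment_margin(u, v, theta, alpha=1.0):
    # radius prescribed by Definition 1, and how far the enclosing ball
    # pokes outside the unit ball (<= 0 means it is contained)
    c = [theta * u[i] + (1 - theta) * v[i] for i in range(len(u))]
    d = [u[i] - v[i] for i in range(len(u))]
    r = theta * (1 - theta) * (alpha / 2.0) * norm(d) ** 2
    return norm(c) + r - 1.0

random.seed(0)
worst = -1.0
for _ in range(1000):
    u = [random.uniform(-1, 1) for _ in range(3)]
    v = [random.uniform(-1, 1) for _ in range(3)]
    for p in (u, v):            # project the samples onto the unit ball Y
        n = norm(p)
        if n > 1:
            for i in range(3):
                p[i] /= n
    worst = max(worst, containment_margin(u, v, random.random()))
```

The margin is never positive, which is a consequence of the identity \u2225\u03b8u + (1\u2212\u03b8)v\u2225^2 = \u03b8\u2225u\u2225^2 + (1\u2212\u03b8)\u2225v\u2225^2 \u2212 \u03b8(1\u2212\u03b8)\u2225u\u2212v\u2225^2.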
Please see [14] for examples of strongly convex sets.\nDefinition 2: A function f is \u03b2-strongly smooth w.r.t. a norm \u2225\u00b7\u2225 if f is everywhere differentiable and f (u) \u2264 f (v) + \u2207f (v)\u22a4(u \u2212 v) + (\u03b2/2)\u2225u \u2212 v\u2225^2. A function is \u03b2-strongly convex w.r.t. a norm \u2225\u00b7\u2225 if f (u) \u2265 f (v) + \u2207f (v)\u22a4(u \u2212 v) + (\u03b2/2)\u2225u \u2212 v\u2225^2.\nDefinition 3: For a convex function f (\u00b7), its Fenchel conjugate is f\u2217(x) := supy \u27e8x, y\u27e9 \u2212 f (y). Note that if f is convex then so is its conjugate f\u2217, since it is defined as the maximum over linear functions of x [6]. Furthermore, the biconjugate f\u2217\u2217 equals f if and only if f is closed and convex. It is known that f is \u03b2-strongly convex w.r.t. \u2225\u00b7\u2225 if and only if f\u2217 is 1/\u03b2-strongly smooth w.r.t. the dual norm \u2225\u00b7\u2225\u2217 [26], assuming that f is a closed and convex function.\n\n2\n\n\f2 Minimax Duality via No-Regret Learning\n\n2.1 Brief review of online learning\n\nIn the task of online convex optimization, we assume a learner is provided with a compact and convex set K \u2282 Rn known as the decision set. Then, in an online fashion, the learner is presented with a sequence of T loss functions \u21131(\u00b7), \u21132(\u00b7), . . . , \u2113T (\u00b7) : K \u2192 R. On each round t, the learner must select a point xt \u2208 K, and is then \u201ccharged\u201d a loss of \u2113t(xt) for this choice. Typically it is assumed that, when the learner selects xt on round t, she has observed all loss functions \u21131(\u00b7), . . . , \u2113t\u22121(\u00b7) up to, but not including, time t. However, we will also consider learners that are prescient, i.e. that can choose xt with knowledge of the loss functions up to and including time t.\nThe objective of interest in most of the online learning literature is the learner\u2019s regret, defined as RT := \u2211_{t=1}^{T} \u2113t(xt) \u2212 minx\u2208K \u2211_{t=1}^{T} \u2113t(x). Oftentimes we will want to refer to the average regret, or the regret normalized by the time horizon T, which we will call \u00afRT := RT /T. What has become a cornerstone of online learning research has been the existence of no-regret algorithms, i.e. learning strategies that guarantee \u00afRT \u2192 0 as T \u2192 \u221e.\nLet us consider three very simple learning strategies, and we note the available guarantees for each.\n(FollowTheLeader) Perhaps the most natural algorithm one might think of is to simply select xt as the best point in hindsight. That is, the learner can choose xt = arg minx\u2208K \u2211_{s=1}^{t\u22121} \u2113s(x).\nLemma 1 ([21]). If each \u2113t(\u00b7) is 1-Lipschitz and 1-strongly convex, then FollowTheLeader achieves \u00afRT \u2264 c log T / T for some constant c.\n(BeTheLeader) When the learner is prescient, then we can do slightly better than FollowTheLeader by incorporating the current loss function: xt = arg minx\u2208K \u2211_{s=1}^{t} \u2113s(x). This algorithm was named BeTheLeader by [28], who also proved that it actually guarantees non-positive regret!\nLemma 2 ([28]). For any sequence of loss functions, BeTheLeader achieves \u00afRT \u2264 0.\n(BestResponse) But perhaps the most trivial strategy for a prescient learner is to ignore the history of the \u2113s\u2019s, and simply play the best choice of xt on the current round. We call this algorithm BestResponse, defined as xt = arg minx\u2208K \u2113t(x). A quick inspection reveals that BestResponse satisfies \u00afRT \u2264 0.\n\n2.2 Minimax Duality\n\nThe celebrated minimax theorem for zero-sum games, first discovered by John von Neumann in the 1920s [36, 33], is certainly a foundational result in the theory of games. 
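The regret behavior of the three strategies above is easy to see empirically. The sketch below (our own illustration: one-dimensional losses \u2113t(x) = 0.5(x \u2212 zt)^2 over K = [\u22121, 1], which are strongly convex, with synthetic zt) compares FollowTheLeader against the prescient BestResponse:

```python
import random

# Strongly convex losses l_t(x) = 0.5 * (x - z_t)^2 on K = [-1, 1];
# the z_t sequence is synthetic data chosen purely for illustration.
random.seed(0)
T = 2000
zs = [random.uniform(-1, 1) for _ in range(T)]

def clip(x):
    return max(-1.0, min(1.0, x))

def total_loss(x):
    return sum(0.5 * (x - z) ** 2 for z in zs)

# FollowTheLeader: x_t minimizes the sum of past losses (the running mean, clipped).
ftl_loss = 0.0
for t in range(T):
    x_t = clip(sum(zs[:t]) / t) if t > 0 else 0.0   # arbitrary first play
    ftl_loss += 0.5 * (x_t - zs[t]) ** 2

# BestResponse (prescient): x_t minimizes the current loss, so it pays zero.
br_loss = 0.0

best = total_loss(clip(sum(zs) / T))                # best fixed point in hindsight
avg_regret_ftl = (ftl_loss - best) / T
avg_regret_br = (br_loss - best) / T
```

FollowTheLeader's average regret decays toward zero at the log T / T pace promised by Lemma 1, while BestResponse's average regret is non-positive.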
It states that two players, playing a game with zero-sum payoffs, each have an optimal randomized strategy that can be played obliviously \u2013 that is, even announcing their strategy in advance to an optimal opponent would not damage their own respective payoff, in expectation.\nIn this paper we will focus on a more general minimax result, establishing duality for a class of convex/concave games, and we will show how this theorem can be proved without the need for Brouwer\u2019s Fixed Point Theorem [27]. The key inequality can be established through the use of no-regret learning strategies in online convex optimization, which we detail in the following section. The theorem below can be proved as well using Sion\u2019s Minimax Theorem [35].\nTheorem 1. Let X, Y be compact convex subsets of Rn and Rm respectively. Let g : X \u00d7 Y \u2192 R be convex in its first argument and concave in its second. Then we have that\n\nminx\u2208X maxy\u2208Y g(x, y) = maxy\u2208Y minx\u2208X g(x, y)\n\n(1)\n\nWe want to emphasize that a meta-algorithm (Algorithm 1) actually emerges from our proof of Theorem 1; please see the supplementary material for details. It is important to point out that the meta-algorithm, as a routine for computing equilibria, is certainly not a novel technique; it has served implicitly as the underpinning of many works, including those already mentioned [11, 9, 8].\nWe close this section by summarizing the approximate equilibrium computation guarantee that follows from the above algorithm. This result is classical, and we explore it in great detail in the Appendix.\n\n3\n\n\fAlgorithm 1 Meta Algorithm for equilibrium computation\n1: Input: convex-concave payoff g : X \u00d7 Y \u2192 R, algorithms OAlgX and OAlgY\n2: for t = 1, 2, . . . , T do\n3: xt := OAlgX (g(\u00b7, y1), . . . , g(\u00b7, yt\u22121))\n4: yt := OAlgY (g(x1, \u00b7), . . . , g(xt\u22121, \u00b7), g(xt, \u00b7))\n5: end for\n6: Output: \u00afxT := (1/T) \u2211_{t=1}^{T} xt and \u00afyT := (1/T) \u2211_{t=1}^{T} yt\n\nWe let \u00afxT := (1/T) \u2211_{t=1}^{T} xt and \u00afyT := (1/T) \u2211_{t=1}^{T} yt, and let V \u2217 be the value of the game, which is the quantity in (1).\nTheorem 2. Algorithm 1 outputs \u00afxT and \u00afyT satisfying\n\nmaxy\u2208Y g(\u00afxT , y) \u2264 V \u2217 + \u03b5T + \u03b4T and minx\u2208X g(x, \u00afyT ) \u2265 V \u2217 \u2212 (\u03b5T + \u03b4T )\n\n(2)\n\nas long as OAlgX and OAlgY guarantee average regret bounded by \u03b5T and \u03b4T , respectively.\n\n3 Relation to the Frank-Wolfe Method\n\nWe now return our attention to the problem of constrained optimization, and we review the standard Frank-Wolfe algorithm. We then use the technologies presented in the previous section to recast Frank-Wolfe as an equilibrium computation, and we show that indeed the vanilla algorithm is an instantiation of our meta-algorithm (Alg. 1). We then proceed to show that the modularity of the minimax duality perspective allows us to immediately reproduce existing variants of Frank-Wolfe, as well as construct new algorithms, with convergence rates provided immediately by Theorem 2.\nTo begin, let us assume that we have a compact set Y \u2282 Rn and a convex function f : Y \u2192 R. Our primary goal is to solve the objective\n\nminy\u2208Y f (y).\n\n(3)\n\nWe say that y0 is an \u03b5-approximate solution as long as f (y0) \u2212 miny\u2208Y f (y) \u2264 \u03b5.\n\n3.1 A Brief Overview of Frank-Wolfe\n\nAlgorithm 2 Standard Frank-Wolfe algorithm\n1: Input: obj. f : Y \u2192 R, oracle O(\u00b7), learning rate {\u03b3t \u2208 [0, 1]}t=1,2,..., init. w0 \u2208 Y\n2: for t = 1, 2, 3 . . . 
, T do\n3: vt \u2190 O(\u2207f (wt\u22121)) = arg minv\u2208Y \u27e8v, \u2207f (wt\u22121)\u27e9\n4: wt \u2190 (1 \u2212 \u03b3t)wt\u22121 + \u03b3tvt.\n5: end for\n6: Output: wT\n\nThe standard Frank-Wolfe algorithm (Algorithm 2) consists of making repeated calls to a linear optimization oracle (line 3), followed by a convex averaging step of the current iterate and the oracle\u2019s output (line 4). It initializes at a point w0 in the constraint set Y . Due to the convex combination step, the iterate wt is always within the constraint set, which is the reason why the method is called projection free. We restate a proposition from [10], who established the convergence rate of their algorithm.\nTheorem 3 ([10]). Assume that f (\u00b7) is 1-strongly smooth. If Algorithm 2 is run for T rounds, then there exists a sequence {\u03b3t} such that the output wT is an O(1/T )-approximate solution to (3).\nIt is worth noting that the typical learning rate used throughout the literature is \u03b3t = 2/(2 + t) [31, 25]. This emerges as the result of a recursive inequality.\n\n4\n\n\f3.2 Frank-Wolfe via the Meta-Algorithm\n\nWe now show that the meta-algorithm generalizes Frank-Wolfe, and provides a much more modular framework for producing similar algorithms. We will develop some of these novel methods and establish their convergence via Theorem 2.\nIn order to utilize minimax duality, we have to define decision sets for two players, and we must produce a convex-concave payoff function. First we will assume, for convenience, that f (y) := \u221e for any y /\u2208 Y . That is, it takes the value \u221e outside of the convex/compact set Y , which ensures that f is lower semi-continuous and convex. Now, let the x-player be given the set X := {\u2207f (y) : y \u2208 Y }. One can check that the closure of the set X is a convex set. Please see Appendix 2 for the proof.\nTheorem 4. 
The closure of the (sub-)gradient space {\u2202f (y) | y \u2208 Y } is a convex set.\n\nThe y-player\u2019s decision set will be Y , the constraint set of the primary objective (3). The payoff g(\u00b7, \u00b7) will be defined as\n\ng(x, y) := \u2212x\u22a4y + f\u2217(x).\n\n(4)\nThe function f\u2217(\u00b7) is the Fenchel conjugate of f. We observe that g(x, y) is indeed linear, and hence concave, in y, and it is also convex in x.\nLet\u2019s notice a few things about this particular game. Looking at the max min expression,\n\nmaxy\u2208Y minx\u2208X g(x, y) = maxy\u2208Y ( \u2212 maxx\u2208X { x\u22a4y \u2212 f\u2217(x) } ) = \u2212( miny\u2208Y f (y) ) = V \u2217,\n\n(5)\n\nwhich follows by the fact that f\u2217\u2217 = f.1 Note, crucially, that the last term above corresponds to the objective we want to solve up to a minus sign. Any \u00afy which is an \u03b5-approximate equilibrium strategy for the y-player will also be an \u03b5-approximate solution to (3).\nWe now present the main result of this section, which is the connection between Frank-Wolfe (Alg. 2) and Alg. 1.\nTheorem 5. When both are run for exactly T rounds, the output \u00afyT of Algorithm 1 is identically the output wT of Algorithm 2 as long as: (I) Init. x1 in Alg. 1 equals \u2207f (w0) in Alg. 2; (II) Alg. 2 uses learning rate \u03b3t := 1/t; (III) Alg. 1 receives g(\u00b7, \u00b7) defined in (4); (IV) Alg. 1 sets OAlgX := FollowTheLeader; (V) Alg. 1 sets OAlgY := BestResponse.\n\nProof. We will prove that the following three equalities are maintained throughout both algorithms. We emphasize that the objects on the left correspond to Alg. 1 and those on the right to Alg. 2.\n\nxt = \u2207f (wt\u22121)\n\n(6)\n\nyt = vt\n\n(7)\n\n\u00afyt = wt.\n\n(8)\n\nWe first note that the first condition of the theorem ensures that (6) holds for t = 1. 
Second, the choice of learning rate \u03b3t = 1/t already guarantees that (7) implies (8), since this choice of rate ensures that wt is always a uniform average of the updates vt. It remains to establish (6) and (7) via induction. We begin with the former.\nRecall that xt is selected via FollowTheLeader against the sequence of loss functions \u2113t(\u00b7) := g(\u00b7, yt). To write precisely what this means,\n\nxt := arg minx\u2208X { (1/(t\u22121)) \u2211_{s=1}^{t\u22121} \u2113s(x) } = arg minx\u2208X { (1/(t\u22121)) \u2211_{s=1}^{t\u22121} (\u2212ys\u22a4x + f\u2217(x)) } = arg maxx\u2208X { \u00afyt\u22121\u22a4x \u2212 f\u2217(x) } = \u2207f (\u00afyt\u22121).\n\nThe final line follows as a result of the Legendre transform [6]. Of course, by induction, we have that \u00afyt\u22121 = wt\u22121, and hence we have established (6).\n\n1It was important how we defined X here, as the Fenchel conjugate takes the value of \u221e at any point x /\u2208 {\u2207f (y) : y \u2208 Y }, hence the unconstrained supremum is the same as maxx\u2208X (\u00b7)\n\n5\n\n\fFinally, let us consider how yt is chosen according to BestResponse. Recall that the sequence of loss functions presented to the y-player is ht(\u00b7) := \u2212g(xt, \u00b7). Utilizing BestResponse for this sequence implies that\n\nyt = arg miny\u2208Y ht(y) = arg miny\u2208Y (xt\u22a4y \u2212 f\u2217(xt)) = arg miny\u2208Y xt\u22a4y = arg miny\u2208Y \u2207f (\u00afyt\u22121)\u22a4y ((6) by induc.) = arg miny\u2208Y \u2207f (wt\u22121)\u22a4y (which is vt),\n\nwhere the last equality follows by induction via (8). This completes the proof.\nNote that the algorithm does not need to compute the conjugate, f\u2217. 
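The equivalence in Theorem 5 is also easy to verify numerically. The sketch below (our own toy instance: f(y) = 0.5\u2225y \u2212 c\u2225^2 over the box [0, 1]^2, with c and the horizon chosen arbitrarily) runs Algorithm 2 with \u03b3t = 1/t next to the meta-algorithm instantiated with FollowTheLeader and BestResponse, and the two iterate sequences coincide:

```python
# Toy instance (our own choice): f(y) = 0.5 * ||y - c||^2 over the box Y = [0,1]^2.
C = (0.7, 0.3)

def grad(w):
    # gradient of f at w
    return [w[i] - C[i] for i in range(2)]

def oracle(g):
    # linear minimization over the box: pick 1 where the gradient is negative;
    # the small tolerance resolves floating-point ties identically in both runs
    return [1.0 if gi < -1e-9 else 0.0 for gi in g]

def frank_wolfe(w0, T):
    # Algorithm 2 with learning rate gamma_t = 1/t
    w, iterates = list(w0), []
    for t in range(1, T + 1):
        v = oracle(grad(w))
        w = [(1 - 1.0 / t) * w[i] + (1.0 / t) * v[i] for i in range(2)]
        iterates.append(list(w))
    return iterates

def meta_algorithm(w0, T):
    # Algorithm 1: the x-player's FollowTheLeader collapses to the gradient at the
    # running average (as in the proof of Theorem 5); the y-player plays BestResponse.
    ys, averages = [], []
    ybar = list(w0)
    for t in range(1, T + 1):
        x = grad(ybar)          # x_t = grad f(ybar_{t-1})
        y = oracle(x)           # best response to the linear loss x^T y
        ys.append(y)
        ybar = [sum(p[i] for p in ys) / len(ys) for i in range(2)]
        averages.append(list(ybar))
    return averages

w0 = (0.5, 0.5)
fw = frank_wolfe(w0, 200)
meta = meta_algorithm(w0, 200)
```

Every wt equals \u0233t up to floating-point error, and both tracks converge to the constrained minimizer c.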
While the Frank-Wolfe algorithm can be viewed as implicitly operating on the conjugate, it does so only through the expression arg maxx\u2208X { \u00afyt\u22121\u22a4x \u2212 f\u2217(x) }. Yet, this operation does not need to be computed in the naive way (i.e. by first computing f\u2217 and then doing the maximization). Instead, the expression actually boils down to \u2207f (\u00afyt\u22121), which is just a gradient computation!\nThe equivalence we just established has several nice features. But it does not provide a convergence rate for Algorithm 2. This should perhaps not be surprising, as nowhere did we use the smoothness of f in the equivalence. Instead, the rate actually follows via a key application of Theorem 2, utilizing the fact that f\u2217 is strongly convex on the interior of the set X,2 granting FollowTheLeader a logarithmic regret rate.\nCorollary 1. Assume that f (\u00b7) is 1-strongly smooth. Then Algorithm 2, with learning rate \u03b3t := 1/t, outputs wT with approximation error O(log T / T).\n\nProof. As a result of Theorem 5, we have established that Alg. 2 is a special case of Alg. 1, with the parameters laid out in the previous theorem. As a result of Theorem 2, the approximation error of wT is precisely the error \u03b5T + \u03b4T of the point \u00afyT when generated via Alg. 1 with subroutines OAlgX := FollowTheLeader and OAlgY := BestResponse, assuming that these two learning algorithms guarantee average regret no more than \u03b5T and \u03b4T , respectively. We noted that BestResponse does not suffer regret, so \u03b4T = 0.\nTo bound the regret of FollowTheLeader on the sequence of functions g(\u00b7, y1), . . . , g(\u00b7, yT ), we observe that the smoothness of f implies that f\u2217 is 1-strongly convex, which in turn implies that g(x, yt) = \u2212x\u22a4yt + f\u2217(x) is also 1-strongly convex (in x). 
Hence Lemma 1 guarantees that FollowTheLeader has average regret \u03b5T := O(log T / T), which completes the proof.\n\nWe emphasize that the above result leans entirely on existing work on regret bounds for online learning, and these tools are doing the heavy lifting. We explore this further in the following section.\n\n4 Frank-Wolfe-style Algs, New and Old\n\nWe now have a factory for generating new algorithms using the approach laid out in Section 3. Theorem 5 shows that the standard Frank-Wolfe algorithm (with a particular learning rate) is obtained via the meta-algorithm using two particular online learning algorithms OAlgX , OAlgY . But we have full discretion to choose these two algorithms, as long as they provide the appropriate regret guarantees to ensure convergence.\n\n4.1 Cumulative Gradients\n\nWe begin with one simple variant, which we call Cumulative-Gradient Frank-Wolfe, laid out in Algorithm 3. The one significant difference from vanilla Frank-Wolfe is that the linear optimization oracle receives as input the average of the gradients obtained thus far, as opposed to the last one.\n\n2 We only need to assume f is \"smooth on the interior of Y \" to get the result. (That f is technically not smooth outside of Y is not particularly relevant.) The result that f\u2217 is strongly convex on the interior of the set X is essentially proven by [26] in their appendix. This argument has been made elsewhere in various forms in the literature (e.g. [18]).\n\n6\n\n\fAlgorithm 3 Cumulative-Gradient Frank-Wolfe\n1: Initialize: any w0 \u2208 Y .\n2: for t = 1, 2, 3 . . . , T do\n3: vt \u2190 arg minv\u2208Y \u27e8v, (1/(t\u22121)) \u2211_{s=1}^{t\u22121} \u2207f (ws)\u27e9\n4: wt \u2190 (1 \u2212 \u03b3t)wt\u22121 + \u03b3tvt.\n5: end for\n6: Output: wT\n\nThe proof of convergence requires little effort.\nCorollary 2. Assume that f (\u00b7) is 1-strongly smooth. 
Then Algorithm 3, with learning rate \u03b3t := 1/t, outputs wT with approximation error O(log T / T).\n\nProof. The result follows almost identically to Corollary 1. It requires a quick inspection to verify that the new linear optimization subroutine corresponds to implementing BeTheLeader as OAlgY instead of BestResponse. However, both BestResponse and BeTheLeader have non-positive regret (\u03b4T \u2264 0) (Lemma 2 in the supplementary), and thus they achieve the same convergence.\n\nWe note that a similar algorithm to the above can be found in [31], although in their results they consider more general weighted averages over the gradients.\n\n4.2 Perturbation Methods and Stochastic Smoothing\n\nLooking carefully at the proof of Corollary 1, the fact that FollowTheLeader was suitable for the vanilla FW analysis relies heavily on the strong convexity of the functions \u2113t(\u00b7) := g(\u00b7, yt), which in turn results from the smoothness of f (\u00b7). But what happens when f (\u00b7) is not smooth? Is there an alternative algorithm available?\nWe observe that one of the nice techniques to grow out of the online learning community is the use of perturbations as a type of regularization to obtain vanishing regret guarantees [28]; their method is known as Follow the Perturbed Leader (FTPL). The main idea is to solve an optimization problem that has a random linear function added to the input, and to select3 as xt the expectation of the arg min under this perturbation. 
More precisely,\n\nxt := EZ [ arg minx\u2208X { Z\u22a4x + \u2211_{s=1}^{t\u22121} \u2113s(x) } ].\n\nHere Z is some random vector drawn according to an appropriately-chosen distribution and \u2113s(x) is the loss function of the x-player on round s; with the definition of the payoff function g, \u2113s(x) is \u2212x\u22a4ys + f\u2217(x) (4).\nOne can show that, as long as Z is chosen from the right distribution, this algorithm guarantees average regret on the order of O(1/\u221aT), although obtaining the correct dimension dependence relies on careful probabilistic analysis. Recent work of [2] shows that the analysis of perturbation-style algorithms reduces to curvature properties of a stochastically-smoothed Fenchel conjugate.\nWhat is intriguing about this perturbation approach is that it ends up being equivalent to an existing method proposed by [31] (Section 3.3), who also uses a stochastically smoothed objective function. We note that\n\nEZ [ arg minx\u2208X { Z\u22a4x + \u2211_{s=1}^{t\u22121} \u2113s(x) } ] = EZ [ arg maxx\u2208X { (\u00afyt\u22121 + Z/(t \u2212 1))\u22a4x \u2212 f\u2217(x) } ] = EZ [\u2207f (\u00afyt\u22121 + Z/(t \u2212 1))] = \u2207 \u02dcft\u22121(\u00afyt\u22121)\n\n(9)\n\nwhere \u02dcf\u03b1(x) := E[f (x + Z/\u03b1)]. [31] suggests using precisely this modified \u02dcf, and they prove a rate on the order of O(1/\u221aT). As discussed, the same would follow from vanishing regret of FTPL.\n\n3Technically speaking, the results of [28] only considered linear loss functions and hence their analysis did not require taking averages over the input perturbation. 
While we will not address computational issues here due to space, actually computing the average arg min is indeed non-trivial.\n\n7\n\n\f4.3 Boundary Frank-Wolfe\n\nAlgorithm 4 Modified meta-algorithm, swapped roles\n1: Input: convex-concave payoff g : X \u00d7 Y \u2192 R, algorithms OAlgX and OAlgY\n2: for t = 1, 2, . . . , T do\n3: yt := OAlgY (g(x1, \u00b7), . . . , g(xt\u22121, \u00b7))\n4: xt := OAlgX (g(\u00b7, y1), . . . , g(\u00b7, yt\u22121), g(\u00b7, yt))\n5: end for\n6: Output: \u00afxT := (1/T) \u2211_{t=1}^{T} xt and \u00afyT := (1/T) \u2211_{t=1}^{T} yt\n\nWe observe that the meta-algorithm previously discussed assumed that the x-player was first to act, followed by the y-player who was allowed to be prescient. Here we reverse their roles, and we instead allow the x-player to be prescient. The new meta-algorithm is described in Algorithm 4. We are going to show that this framework leads to a new projection-free algorithm that works for non-smooth objective functions. Specifically, if the constraint set is strongly convex, then this exhibits a novel projection-free algorithm that grants O(log T /T ) convergence even for non-smooth objective functions. The result relies on very recent work showing that FollowTheLeader for strongly convex sets [24] grants an O(log T ) regret rate. Prior work has considered strongly convex decision sets [14], yet with the additional assumption that the objective is smooth and strongly convex, leading to O(1/T^2) convergence. Boundary Frank-Wolfe requires neither smoothness nor strong convexity of the objective. What we have shown, essentially, is that a strongly convex boundary of the constraint set can be used in place of smoothness of f (\u00b7) in order to achieve O(1/T ) convergence.\n\nAlgorithm 5 Boundary Frank-Wolfe\n1: Input: objective f : Y \u2192 R, oracle O(\u00b7) for Y , init. y1 \u2208 Y .\n2: for t = 2, 3 . . . 
, T do\n3: yt \u2190 arg miny\u2208Y (1/(t\u22121)) \u2211_{s=1}^{t\u22121} \u27e8y, \u2202f (ys)\u27e9\n4: end for\n5: Output: \u00afyT = (1/T) \u2211_{t=1}^{T} yt\n\nWe may now prove a result about Algorithm 5 using the same techniques laid out in Theorem 5.\nTheorem 6. Algorithm 5 is an instance of Algorithm 4 if (I) Init. y1 in Alg. 5 equals y1 in Alg. 4; (II) Alg. 4 sets OAlgY := FollowTheLeader; and (III) Alg. 4 sets OAlgX := BestResponse.\n\nFurthermore, when Y is strongly convex, and \u2211_{s=1}^{t} \u2202f (ys) has non-zero norm for every t, then\n\nf (\u00afyT ) \u2212 miny\u2208Y f (y) = O( M log T / (\u03b1 LT T) )\n\nwhere M := supy\u2208Y \u2225\u2202f (y)\u2225, LT := min1\u2264t\u2264T \u2225\u0398t\u2225, and \u0398t = (1/t) \u2211_{s=1}^{t} \u2202f (ys).\n\nProof. Please see Appendix 3 for the proof.\n\nNote that the rate depends crucially on LT , which is the smallest averaged-gradient norm computed during the optimization. Depending on the underlying optimization problem, LT can be as small as O(1/\u221aT) or can even be 0. Now let us discuss when Boundary FW works; namely, the condition that keeps the cumulative gradient nonzero. If a linear combination of gradients is 0, then clearly 0 is in the convex hull of the subgradients \u2202f (x) for boundary points x. Since the closure of {\u2207f (x) | x \u2208 Y } is convex, according to Theorem 4, this implies that 0 is in the closure of {\u2207f (x) | x \u2208 Y }. If we know in advance that 0 /\u2208 cl({\u2207f (x) | x \u2208 Y }) we are assured that the cumulative gradient will not be 0. Hence, the proposed algorithm may only be useful when it is known, a priori, that the solution y\u2217 will occur not in the interior but on the boundary of Y . It is indeed an odd condition, but it does hold in many typical scenarios. One may add a perturbed vector to the gradient and show that with high probability, LT is a non-zero number. 
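A direct transcription of Algorithm 5 makes the behavior concrete. The sketch below (our own toy instance: the non-smooth objective f(y) = max(y1, y2) over the unit Euclidean ball, a strongly convex set for which 0 lies outside the closure of the subgradient space) recovers the boundary minimizer \u2212(1/\u221a2, 1/\u221a2):

```python
import math

# Toy non-smooth objective f(y) = max(y[0], y[1]) over the unit Euclidean ball
# (a strongly convex set); both choices are illustrative, not from the paper.
def subgrad(y):
    # a subgradient of max(y0, y1): the indicator of a maximizing coordinate
    return [1.0, 0.0] if y[0] >= y[1] else [0.0, 1.0]

def oracle(g):
    # linear minimization over the unit ball has the closed form -g / ||g||
    n = math.sqrt(g[0] ** 2 + g[1] ** 2)
    return [-g[0] / n, -g[1] / n]

def boundary_fw(T=500):
    ys = [[1.0, 0.0]]                   # y1: arbitrary initialization in Y
    gs = [subgrad(ys[0])]
    for t in range(2, T + 1):
        avg_g = [sum(g[i] for g in gs) / len(gs) for i in range(2)]
        y = oracle(avg_g)               # y_t = argmin_{y in Y} <y, averaged subgradient>
        ys.append(y)
        gs.append(subgrad(y))
    return [sum(y[i] for y in ys) / T for i in range(2)]   # output: average iterate

ybar = boundary_fw()
```

Here the averaged subgradients stay in the probability simplex, so their norm (the quantity LT above) is bounded away from zero, and the average iterate \u0233T lands close to the boundary optimum with f(\u0233T) near \u22121/\u221a2.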
The downside of this perturbation approach is that it would generally yield a slower convergence rate; it cannot achieve log(T)/T, as the inclusion of the perturbation requires managing an additional trade-off.

References
[1] Jacob Abernethy and Elad Hazan. Faster convex optimization: Simulated annealing with an efficient universal barrier. In Proceedings of The 33rd International Conference on Machine Learning, pages 2520–2528, 2016.
[2] Jacob Abernethy, Chansoo Lee, Abhinav Sinha, and Ambuj Tewari. Online linear optimization via smoothing. In COLT, pages 807–823, 2014.
[3] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.
[4] Francis Bach. Duality between subgradient and conditional gradient methods. SIAM Journal on Optimization, 2015.
[5] Amir Beck and Shimrit Shtern. Linearly convergent away-step conditional gradient for non-strongly convex functions. Mathematical Programming, 2016.
[6] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[7] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[8] Paul Christiano, Jonathan A. Kelner, Aleksander Madry, Daniel A. Spielman, and Shang-Hua Teng. Electrical flows, laplacian systems, and faster approximation of maximum flow in undirected graphs. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, pages 273–282. ACM, 2011.
[9] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.
[10] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1–2):95–110, 1956.
[11] Yoav Freund and Robert E. Schapire.
Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pages 325–332. ACM, 1996.
[12] Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1–2):79–103, 1999.
[13] Dan Garber and Elad Hazan. Playing non-linear games with linear oracles. FOCS, 2013.
[14] Dan Garber and Elad Hazan. Faster rates for the Frank-Wolfe method over strongly-convex sets. ICML, 2015.
[15] Dan Garber and Elad Hazan. A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. SIAM Journal on Optimization, 2016.
[16] Dan Garber and Ofer Meshi. Linear-memory and decomposition-invariant linearly convergent conditional gradient algorithm for structured polytopes. NIPS, 2016.
[17] G. Gidel, T. Jebara, and S. Lacoste-Julien. Frank-Wolfe algorithms for saddle point problems. AISTATS, 2016.
[18] Gianluca Gorni. Conjugation and second-order properties of convex functions. Journal of Mathematical Analysis and Applications, 1991.
[19] Zaid Harchaoui, Anatoli Juditsky, and Arkadi Nemirovski. Conditional gradient algorithms for norm-regularized smooth convex optimization. Math. Prog., Series A, 2013.
[20] Elad Hazan. Introduction to online convex optimization. 2014.
[21] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2–3):169–192, 2007.
[22] Elad Hazan and Satyen Kale. Projection-free online learning. ICML, 2012.
[23] Elad Hazan and Haipeng Luo. Variance-reduced and projection-free stochastic optimization. ICML, 2016.
[24] Ruitong Huang, Tor Lattimore, András György, and Csaba Szepesvári. Following the leader and fast rates in linear prediction: Curved constraint sets and other regularities. NIPS, 2016.
[25] Martin Jaggi.
Revisiting Frank-Wolfe: Projection-free sparse convex optimization. ICML, 2013.
[26] Sham M. Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization. 2009.
[27] Shizuo Kakutani. A generalization of Brouwer's fixed point theorem. 1941.
[28] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.
[29] Simon Lacoste-Julien and Martin Jaggi. On the global linear convergence of Frank-Wolfe optimization variants. NIPS, 2015.
[30] Simon Lacoste-Julien, Martin Jaggi, Mark Schmidt, and Patrick Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. ICML, 2013.
[31] Guanghui Lan. The complexity of large-scale convex programming under a linear optimization oracle. https://arxiv.org/abs/1309.5550, 2013.
[32] Guanghui Lan and Yi Zhou. Conditional gradient sliding for convex optimization. SIAM Journal on Optimization, 2014.
[33] J. von Neumann, Oskar Morgenstern, et al. Theory of Games and Economic Behavior, 1944.
[34] Anton Osokin, Jean-Baptiste Alayrac, Isabella Lukasewitz, Puneet K. Dokania, and Simon Lacoste-Julien. Minding the gaps for block Frank-Wolfe optimization for structural SVMs. ICML, 2016.
[35] Maurice Sion. On general minimax theorems. Pacific J. Math, 8(1):171–176, 1958.
[36] J. v. Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100(1):295–320, 1928.
[37] Yu-Xiang Wang, Veeranjaneyulu Sadhanala, Wei Dai, Willie Neiswanger, Suvrit Sra, and Eric Xing. Parallel and distributed block-coordinate Frank-Wolfe algorithms. ICML, 2016.
[38] P. Wolfe. Convergence theory in nonlinear programming. Integer and Nonlinear Programming, 1970.
[39] Y. Yu, X. Zhang, and D. Schuurmans.
Generalized conditional gradient for structured estimation. arXiv:1410.4828, 2014.