{"title": "Sharpness, Restart and Acceleration", "book": "Advances in Neural Information Processing Systems", "page_first": 1119, "page_last": 1129, "abstract": "The {\\L}ojasiewicz inequality shows that H\\\"olderian error bounds on the minimum of convex optimization problems hold almost generically. Here, we clarify results of \\citet{Nemi85} who show that H\\\"olderian error bounds directly controls the performance of restart schemes. The constants quantifying error bounds are of course unobservable, but we show that optimal restart strategies are robust, and searching for the best scheme only increases the complexity by a logarithmic factor compared to the optimal bound. Overall then, restart schemes generically accelerate accelerated methods.", "full_text": "Sharpness, Restart and Acceleration\n\nVincent Roulet\nINRIA, ENS\nParis France\n\nvincent.roulet@inria.fr\n\nAbstract\n\nAlexandre d\u2019Aspremont\n\nCNRS, ENS\nParis France\n\naspremon@ens.fr\n\nThe \u0141ojasiewicz inequality shows that sharpness bounds on the minimum of convex\noptimization problems hold almost generically. Sharpness directly controls the\nperformance of restart schemes, as observed by Nemirovskii and Nesterov [1985].\nThe constants quantifying error bounds are of course unobservable, but we show\nthat optimal restart strategies are robust, and searching for the best scheme only\nincreases the complexity by a logarithmic factor compared to the optimal bound.\nOverall then, restart schemes generically accelerate accelerated methods.\n\nIntroduction\n\nWe study convex optimization problems of the form\nminimize\n\nf (x)\n\n(P)\nwhere f is a convex function de\ufb01ned on Rn. The complexity of these problems using \ufb01rst order\nmethods is generically controlled by smoothness assumptions on f such as Lipschitz continuity of its\ngradient. 
Additional assumptions such as strong convexity or uniform convexity provide respectively linear [Nesterov, 2013b] and faster polynomial [Juditski and Nesterov, 2014] rates of convergence. However, these assumptions are often too restrictive in practice. Here, we make a much weaker and generic assumption that describes the sharpness of the function around its minimizers by constants µ ≥ 0 and r ≥ 1 such that

    (µ/r) d(x, X*)^r ≤ f(x) − f*,    for every x ∈ K,    (Sharp)

where f* is the minimum of f, K ⊂ R^n is a compact set, and d(x, X*) = min_{y ∈ X*} ‖x − y‖ is the distance from x to the set X* ⊂ K of minimizers of f¹ for the Euclidean norm ‖·‖. This defines a lower bound on the function around its minimizers: for r = 1, f shows a kink around its minimizers, and the larger r is, the flatter the function is around its minimizers. We exploit this property through restart schemes for classical convex optimization algorithms.

The sharpness assumption (Sharp) is better known as a Hölderian error bound on the distance to the set of minimizers. Hoffman [1952] first introduced error bounds to study systems of linear inequalities. Natural extensions were then developed for convex optimization [Robinson, 1975; Mangasarian, 1985; Auslender and Crouzeix, 1988], notably through the concept of sharp minima [Polyak, 1979; Burke and Ferris, 1993; Burke and Deng, 2002]. But the most striking discovery was made by Łojasiewicz [1963, 1993], who proved inequality (Sharp) for real analytic and subanalytic functions. It was then extended to non-smooth subanalytic convex functions by Bolte et al. [2007]. Overall, since (Sharp) essentially measures the sharpness of minimizers, it holds somewhat generically.
On the other hand, this inequality is purely descriptive: we have no hope of ever observing either r or µ, and deriving adaptive schemes is crucial to ensure practical relevance.

¹We assume the problem is feasible, i.e. X* ≠ ∅.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Łojasiewicz inequalities, either in the form of (Sharp) or as gradient dominated properties [Polyak, 1979], led to new simple convergence results [Karimi et al., 2016], in particular for alternating and splitting methods [Attouch et al., 2010; Frankel et al., 2015], even in the non-convex case [Bolte et al., 2014]. Here we focus on Hölderian error bounds, as they offer a simple explanation of the accelerated rates of restart schemes.

Restart schemes were already studied for strongly or uniformly convex functions [Nemirovskii and Nesterov, 1985; Nesterov, 2013a; Juditski and Nesterov, 2014; Lin and Xiao, 2014]. In particular, Nemirovskii and Nesterov [1985] link a "strict minimum" condition akin to (Sharp) with faster convergence rates using restart schemes, which form the basis of our results, but do not study the cost of adaptation and do not tackle the non-smooth case. In a similar spirit, weaker versions of this strict minimum condition were used more recently to study the performance of restart schemes in [Renegar, 2014; Freund and Lu, 2015; Roulet et al., 2015]. The fundamental question of a restart scheme is naturally when an algorithm must be stopped and relaunched. Several heuristics [O'Donoghue and Candes, 2015; Su et al., 2014; Giselsson and Boyd, 2014] studied adaptive restart schemes to speed up the convergence of optimal methods. The robustness of restart schemes was then theoretically studied by Fercoq and Qu [2016] for quadratic error bounds, i.e. (Sharp) with r = 2, which the LASSO problem satisfies for example.
Fercoq and Qu [2017] recently extended their work to produce adaptive restarts with theoretical guarantees of optimal performance, still for quadratic error bounds. The previous references focus on smooth problems, but error bounds also appear for non-smooth ones: Gilpin et al. [2012] prove for example linear convergence of restart schemes in bilinear matrix games where the minimum is sharp, i.e. (Sharp) with r = 1.

Our contribution here is to derive optimal scheduled restart schemes for general convex optimization problems, for smooth, non-smooth or Hölder smooth functions satisfying the sharpness assumption. We then show that for smooth functions these schemes can be made adaptive with nearly optimal complexity (up to a squared log term) for a wide array of sharpness assumptions. We also analyze a restart criterion based on a sufficient decrease of the gap to the minimum value of the problem, when the latter is known in advance. In that case, restart schemes are shown to be optimal without requiring any additional information on the function.

1 Problem assumptions

1.1 Smoothness

Convex optimization problems (P) are generally divided in two classes: smooth problems, for which f has Lipschitz continuous gradients, and non-smooth problems, for which f is not differentiable. Nesterov [2015] proposed to unify both points of view by assuming generally that there exist constants 1 ≤ s ≤ 2 and L > 0 such that

    ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖^{s−1},    for all x, y ∈ R^n,    (Smooth)

where ∇f(x) is any sub-gradient of f at x if s = 1 (otherwise this implies differentiability of f). For s = 2, we retrieve the classical definition of smoothness [Nesterov, 2013b]. For s = 1, we get a classical assumption made in non-smooth convex optimization, i.e., that sub-gradients of the function are bounded.
For 1 < s < 2, this assumes the gradient of f to be Hölder continuous. In a first step, we will analyze restart schemes for smooth convex optimization problems, then generalize to the general smoothness assumption (Smooth) using appropriate accelerated algorithms developed by Nesterov [2015].

1.2 Error bounds

In general, an error bound is an inequality of the form

    d(x, X*) ≤ ω(f(x) − f*),

where ω is an increasing function at 0, called the residual function, and x may evolve either in the whole space or in a bounded set; see Bolte et al. [2015] for more details. We focus on Hölderian error bounds (Sharp), as they are the most common in practice. They are notably satisfied by analytic and subanalytic functions, but the proof (see e.g. Bierstone and Milman [1988, Theorem 6.4]) relies on topological arguments that are far from constructive. Hence, outside of some particular cases (e.g. strong convexity), we cannot assume that the constants in (Sharp) are known, even approximately.

Error bounds can generically be linked to the Łojasiewicz inequality, which upper bounds the magnitude of the gradient by values of the function [Bolte et al., 2015]. This property paved the way to many recent results in optimization [Attouch et al., 2010; Frankel et al., 2015; Bolte et al., 2014]. Here we will see that (Sharp) is sufficient to accelerate convex optimization algorithms by restarting them. Note finally that in most cases error bounds are local properties, hence the convergence results that follow will generally be local.

1.3 Sharpness and smoothness

Let f be a convex function on R^n satisfying (Smooth) with parameters (s, L). This property ensures that f(x) ≤ f* + (L/s)‖x − y‖^s, for given x ∈ R^n and y ∈ X*.
Setting y to be the projection of x onto X*, this yields the following upper bound on suboptimality:

    f(x) − f* ≤ (L/s) d(x, X*)^s.    (1)

Now, assume that f satisfies the error bound (Sharp) on a set K with parameters (r, µ). Combining (1) and (Sharp) leads to, for every x ∈ K,

    sµ/(rL) ≤ d(x, X*)^{s−r}.

This means that necessarily s ≤ r, by taking x → X*. Moreover, if s < r, this last inequality can only be valid on a bounded set, i.e. either smoothness or the error bound or both are valid only on a bounded set. In the following, we write

    κ ≜ L^{2/s}/µ^{2/r}    and    τ ≜ 1 − s/r,    (2)

respectively a generalized condition number for the function f and a condition number based on the ratio of the powers in inequalities (Smooth) and (Sharp). If r = s = 2, κ matches the classical condition number of the function.

2 Scheduled restarts for smooth convex problems

In this section, we seek to solve (P) assuming that the function f is smooth, i.e. satisfies (Smooth) with s = 2 and L > 0. Without further assumptions on f, an optimal algorithm to solve the smooth convex optimization problem (P) is Nesterov's accelerated gradient method [Nesterov, 1983]. Given an initial point x0, this algorithm outputs, after t iterations, a point x = A(x0, t) such that

    f(x) − f* ≤ (cL/t²) d(x0, X*)²,    (3)

where c > 0 denotes a universal constant (whose value will be allowed to vary in what follows, with c = 4 here). We assume without loss of generality that f(x) ≤ f(x0).
More details about Nesterov's algorithm are given in the Supplementary Material.

In what follows, we will also assume that f satisfies (Sharp) with parameters (r, µ) on a set K ⊇ X*, which means

    (µ/r) d(x, X*)^r ≤ f(x) − f*,    for every x ∈ K.    (Sharp)

As mentioned before, if r > s = 2, this property is necessarily local, i.e. K is bounded. We then assume that, given a starting point x0 ∈ R^n, sharpness is satisfied on the sublevel set {x | f(x) ≤ f(x0)}. Remark that if this property is valid on an open set K ⊃ X*, it will also be valid on any compact set K' ⊃ K with the same exponent r but a potentially lower constant µ. The scheduled restart schemes we present here rely on a global sharpness hypothesis on the sublevel set defined by the initial point and are not adaptive to the constant µ on smaller sublevel sets. On the other hand, the restart on criterion that we present in Section 4, assuming that f* is known, adapts to the value of µ. We now describe a restart scheme exploiting this extra regularity assumption to improve the computational complexity of solving problem (P) using accelerated methods.

2.1 Scheduled restarts

Here, we schedule the number of iterations tk made by Nesterov's algorithm between restarts, with tk the number of (inner) iterations at the kth algorithm run (outer iteration). Our scheme is described in Algorithm 1 below.

Algorithm 1 Scheduled restarts for smooth convex minimization
  Inputs: x0 ∈ R^n and a sequence tk for k = 1, . . . , R.
  for k = 1, . . . , R do
    xk := A(xk−1, tk)
  end for
  Output: x̂ := xR

The analysis of this scheme and the following ones relies on two steps. We first choose schedules that ensure linear convergence of the iterates xk at a given rate.
We then adjust this linear rate to minimize the complexity in terms of the total number of iterations. We begin with a technical lemma which assumes linear convergence holds, and connects the growth of tk, the precision reached and the total number of inner iterations N.

Lemma 2.1. Let xk be a sequence whose kth iterate is generated from the previous one by an algorithm that runs tk iterations, and write N = Σ_{k=1}^R tk the total number of iterations to output a point xR. Suppose that setting tk = Ce^{αk}, k = 1, . . . , R, for some C > 0 and α ≥ 0 ensures that outer iterations satisfy

    f(xk) − f* ≤ νe^{−γk},    (4)

for all k ≥ 0, with ν ≥ 0 and γ ≥ 0. Then the precision at the output is given by

    f(xR) − f* ≤ ν exp(−γN/C),    when α = 0,

and

    f(xR) − f* ≤ ν / (αe^{−α}C^{−1}N + 1)^{γ/α},    when α > 0.

Proof. When α = 0, N = RC, and inserting this in (4) at the last point xR yields the desired result. On the other hand, when α > 0, we have N = Σ_{k=1}^R Ce^{αk} = Ce^α (e^{αR} − 1)/(e^α − 1), which gives R = log((e^α − 1)N/(e^α C) + 1)/α. Inserting this in (4) at the last point, we get

    f(xR) − f* ≤ ν exp(−(γ/α) log((e^α − 1)N/(e^α C) + 1)) ≤ ν / (αe^{−α}C^{−1}N + 1)^{γ/α},

where we used e^x − 1 ≥ x. This yields the second part of the result.

The last approximation in the case α > 0 simplifies the analysis that follows without significantly affecting the bounds. We also show in the Supplementary Material that using t̃k = ⌈tk⌉ does not significantly affect the bounds above.
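As a concrete illustration, the restart loop of Algorithm 1 with the geometric schedule tk = ⌈Ce^{αk}⌉ from Lemma 2.1 can be sketched as below. The inner solver is only a stand-in for Nesterov's method A(x0, t): for simplicity of this sketch we use plain fixed-step gradient descent on a toy quadratic, which is an assumption of the example, not the paper's implementation.

```python
import math

def gradient_inner(x0, t, grad, step):
    # Stand-in for Nesterov's accelerated method A(x0, t):
    # run t fixed-step gradient iterations from x0.
    x = x0
    for _ in range(t):
        x = x - step * grad(x)
    return x

def scheduled_restart(x0, C, alpha, R, inner):
    # Algorithm 1: restart the inner method R times with the
    # schedule t_k = ceil(C * exp(alpha * k)), k = 1..R.
    x, total = x0, 0
    for k in range(1, R + 1):
        t_k = math.ceil(C * math.exp(alpha * k))
        x = inner(x, t_k)
        total += t_k
    return x, total  # output x_hat and N = sum of the t_k

# Toy instance: f(x) = x^2 is smooth and sharp with r = s = 2,
# so tau = 0 and a constant schedule (alpha = 0) suffices.
f = lambda x: x * x
grad = lambda x: 2 * x
inner = lambda x, t: gradient_inner(x, t, grad, step=0.4)
x_hat, N = scheduled_restart(5.0, C=10, alpha=0.0, R=8, inner=inner)
```

With τ > 0, one would instead pass alpha > 0 so that the inner iteration counts grow geometrically across cycles, as Proposition 2.2 prescribes.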
Remark that convergence bounds are generally linear or polynomial, so that we can extract a subsequence that converges linearly. Our approach therefore does not restrict the analysis of our scheme; it simplifies it, and can be used for other algorithms such as gradient descent, as detailed in the Supplementary Material.

We now analyze restart schedules tk that ensure linear convergence. Our choice of tk will heavily depend on the ratio between r and s (with s = 2 for smooth functions here), incorporated in the parameter τ = 1 − s/r defined in (2). Below, we show that if τ = 0, a constant schedule is sufficient to ensure linear convergence. When τ > 0, we need a geometrically increasing number of iterations for each cycle.

Proposition 2.2. Let f be a smooth convex function satisfying (Smooth) with parameters (2, L) and (Sharp) with parameters (r, µ) on a set K. Assume that we are given x0 ∈ R^n such that {x | f(x) ≤ f(x0)} ⊂ K. Run Algorithm 1 from x0 with iteration schedule tk = C*_{κ,τ} e^{τk}, for k = 1, . . . , R, where

    C*_{κ,τ} ≜ e^{1−τ}(cκ)^{1/2}(f(x0) − f*)^{−τ/2},    (5)

with κ and τ defined in (2) and c = 4e^{2/e} here. The precision reached at the last point x̂ is given by

    f(x̂) − f* ≤ exp(−2e^{−1}(cκ)^{−1/2}N)(f(x0) − f*) = O(exp(−κ^{−1/2}N)),    when τ = 0,    (6)

while

    f(x̂) − f* ≤ (f(x0) − f*) / (τe^{−1}(f(x0) − f*)^{τ/2}(cκ)^{−1/2}N + 1)^{2/τ} = O(N^{−2/τ}),    when τ > 0,    (7)

where N = Σ_{k=1}^R tk is the total number of iterations.

Proof.
Our strategy is to choose tk such that the objective is linearly decreasing, i.e.

    f(xk) − f* ≤ e^{−γk}(f(x0) − f*),    (8)

for some γ ≥ 0 depending on the choice of tk. This directly holds for k = 0 and any γ ≥ 0. Combining (Sharp) with the complexity bound in (3), we get

    f(xk) − f* ≤ (cκ/tk²)(f(xk−1) − f*)^{2/r},

where c = 4e^{2/e}, using that r^{2/r} ≤ e^{2/e}. Assuming recursively that (8) is satisfied at iteration k − 1 for a given γ, we have

    f(xk) − f* ≤ (cκ/tk²) e^{−γ(2/r)(k−1)}(f(x0) − f*)^{2/r},

and to ensure (8) at iteration k, we impose

    (cκ/tk²) e^{−γ(2/r)(k−1)}(f(x0) − f*)^{2/r} ≤ e^{−γk}(f(x0) − f*).

Rearranging terms in this last inequality, using τ defined in (2), we get

    tk ≥ e^{γ(1−τ)/2}(cκ)^{1/2}(f(x0) − f*)^{−τ/2} e^{τγk/2}.

For a given γ ≥ 0, we can therefore set tk = Ce^{αk} where

    C = e^{γ(1−τ)/2}(cκ)^{1/2}(f(x0) − f*)^{−τ/2}    and    α = τγ/2,    (9)

and Lemma 2.1 then yields

    f(x̂) − f* ≤ exp(−γe^{−γ/2}(cκ)^{−1/2}N)(f(x0) − f*),

when τ = 0, while

    f(x̂) − f* ≤ (f(x0) − f*) / ((τ/2)γe^{−γ/2}(cκ)^{−1/2}(f(x0) − f*)^{τ/2}N + 1)^{2/τ},    (10)

when τ > 0. These bounds are minimal for γ = 2, which yields the desired result.

When τ = 0, bound (6) matches the classical complexity bound for smooth strongly convex functions [Nesterov, 2013b].
When τ > 0 on the other hand, bound (7) highlights a much faster convergence rate than accelerated gradient methods. The sharper the function (i.e. the smaller r), the faster the convergence. This matches the lower bounds for optimizing smooth and sharp functions [Arjevani and Shamir, 2016; Nemirovskii and Nesterov, 1985, Page 6] up to constant factors. Also, setting tk = C*_{κ,τ} e^{τk} yields continuous bounds on precision: when τ → 0, bound (7) converges to bound (6), which also shows that for τ near zero, constant restart schemes are almost optimal.

2.2 Adaptive scheduled restart

The previous restart schedules depend on the sharpness parameters (r, µ) in (Sharp). In general of course, these values are neither observed nor known a priori. Making our restart scheme adaptive is thus crucial to its practical performance. Fortunately, we show below that a simple logarithmic grid search strategy on these parameters is enough to guarantee nearly optimal performance. We run several schemes with a fixed number of inner iterations N to perform a log-scale grid search on τ and κ. We define these schemes as follows:

    Si,0 : Algorithm 1 with tk = Ci,
    Si,j : Algorithm 1 with tk = Ci e^{τj k},    (11)

where Ci = 2^i and τj = 2^{−j}. We stop these schemes when the total number of inner algorithm iterations has exceeded N, i.e. at the smallest R such that Σ_{k=1}^R tk ≥ N. The size of the grid search in Ci is naturally bounded, as we cannot restart the algorithm after more than N total inner iterations, so i ∈ [1, . . . , ⌊log2 N⌋]. We will also show that when τ is smaller than 1/N, a constant schedule performs as well as the optimal geometrically increasing schedule, which crucially means we can also choose j ∈ [1, . . . , ⌈log2 N⌉], limiting the cost of the grid search.
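The bookkeeping behind the grid of schemes Si,j in (11) can be sketched as follows; the code only enumerates the candidate schedules (Ci = 2^i, τj = 2^{−j}, truncated once the inner-iteration budget N is spent) and does not run any solver, so it is a structural illustration rather than the paper's implementation.

```python
import math

def schedule(C_i, tau_j, N):
    # Inner-iteration schedule t_k = ceil(C_i * e^{tau_j * k}),
    # truncated at the smallest R with sum_k t_k >= N.
    ts, total, k = [], 0, 1
    while total < N:
        t_k = math.ceil(C_i * math.exp(tau_j * k))
        ts.append(t_k)
        total += t_k
        k += 1
    return ts

def grid(N):
    # Schemes S_{i,j} of (11): C_i = 2^i for i in 1..floor(log2 N),
    # tau_j = 2^{-j} for j in 1..ceil(log2 N), plus tau = 0 (S_{i,0}).
    runs = {}
    for i in range(1, int(math.log2(N)) + 1):
        runs[(i, 0)] = schedule(2 ** i, 0.0, N)  # constant schedule
        for j in range(1, math.ceil(math.log2(N)) + 1):
            runs[(i, j)] = schedule(2 ** i, 2.0 ** (-j), N)
    return runs

runs = grid(128)
```

The size of the grid, roughly ⌊log2 N⌋ × (⌈log2 N⌉ + 1) runs of N inner iterations each, is exactly the (log2 N)² overhead over the oracle scheme stated after Proposition 2.3.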
The following result details the convergence of this method; its notations are the same as in Proposition 2.2, and its technical proof can be found in the Supplementary Material.

Proposition 2.3. Let f be a smooth convex function satisfying (Smooth) with parameters (2, L) and (Sharp) with parameters (r, µ) on a set K. Assume that we are given x0 ∈ R^n such that {x | f(x) ≤ f(x0)} ⊂ K, and denote by N a given number of iterations. Run the schemes Si,j defined in (11) to solve (P) for i ∈ [1, . . . , ⌊log2 N⌋] and j ∈ [0, . . . , ⌈log2 N⌉], stopping each time after N total inner algorithm iterations, i.e. for R such that Σ_{k=1}^R tk ≥ N. Assume N is large enough, so N ≥ 2C*_{κ,τ}, and if 1/N > τ > 0, C*_{κ,τ} > 1.

If τ = 0, there exists i ∈ [1, . . . , ⌊log2 N⌋] such that scheme Si,0 achieves a precision given by

    f(x̂) − f* ≤ exp(−e^{−1}(cκ)^{−1/2}N)(f(x0) − f*).

If τ > 0, there exist i ∈ [1, . . . , ⌊log2 N⌋] and j ∈ [1, . . . , ⌈log2 N⌉] such that scheme Si,j achieves a precision given by

    f(x̂) − f* ≤ (f(x0) − f*) / (τe^{−1}(cκ)^{−1/2}(f(x0) − f*)^{τ/2}(N − 1)/4 + 1)^{2/τ}.

Overall, running the logarithmic grid search has a complexity (log2 N)² times higher than running N iterations using the optimal (oracle) scheme.

As shown in the Supplementary Material, scheduled restart schemes are theoretically efficient only if the algorithm itself makes a sufficient number of iterations to decrease the objective value. We therefore need N large enough to ensure the efficiency of the adaptive method.
If τ = 0, we naturally have C*_{κ,0} ≥ 1; therefore, if 1/N > τ > 0 and N is large, assuming C*_{κ,τ} ≈ C*_{κ,0}, we get C*_{κ,τ} ≥ 1. This adaptive bound is similar to the one of Nesterov [2013b] for optimizing smooth strongly convex functions, in the sense that we lose approximately a log factor of the condition number of the function. However, our assumptions are weaker and we are able to tackle all regimes of the sharpness property, i.e. any exponent r ∈ [2, +∞], not just the strongly convex case.

In the Supplementary Material we also analyze the simple gradient descent method under the sharpness assumption (Sharp). It shows that simple gradient descent achieves a O(ε^{−τ}) complexity for a given accuracy ε. Restarting accelerated gradient methods therefore reduces the complexity to O(ε^{−τ/2}) compared to simple gradient descent, a result analogous to the classical acceleration of gradient descent. We now extend this restart scheme to solve non-smooth or Hölder smooth convex optimization problems under the sharpness assumption.

3 Universal scheduled restarts for convex problems

In this section, we use the framework introduced by Nesterov [2015] to describe the smoothness of a convex function f: namely, we assume that there exist s ∈ [1, 2] and L > 0 such that, on a set J ⊂ R^n,

    ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖^{s−1},    for every x, y ∈ J.

Without further assumptions on f, the optimal rate of convergence for this class of functions is bounded as O(1/N^ρ), where N is the total number of iterations and

    ρ = 3s/2 − 1,    (12)

which gives ρ = 2 for smooth functions and ρ = 1/2 for non-smooth functions. The universal fast gradient method [Nesterov, 2015] achieves this rate by requiring only a target accuracy ε and a starting point x0.
It outputs after t iterations a point x ≜ U(x0, ε, t) such that

    f(x) − f* ≤ ε/2 + (cL^{2/s} d(x0, X*)² / (ε^{2/s} t^{2ρ/s})) ε/2,    (13)

where c is a constant (c = 2^{(4s−2)/s}). More details about the universal fast gradient method are given in the Supplementary Material.

We will again assume that f is sharp with parameters (r, µ) on a set K ⊇ X*, i.e.

    (µ/r) d(x, X*)^r ≤ f(x) − f*,    for every x ∈ K.    (Sharp)

As mentioned in Section 1.2, if r > s, smoothness or sharpness are local properties, i.e. either J or K or both are bounded, and our analysis is therefore local. In the following we assume for simplicity, given an initial point x0, that smoothness and sharpness are satisfied simultaneously on the sublevel set {x | f(x) ≤ f(x0)}. The key difference with the smooth case described in the previous section is that here we schedule both the target accuracy εk used by the algorithm and the number of iterations tk made at the kth run of the algorithm. Our scheme is described in Algorithm 2.

Algorithm 2 Universal scheduled restarts for convex minimization
  Inputs: x0 ∈ R^n, ε0 ≥ f(x0) − f*, γ ≥ 0 and a sequence tk for k = 1, . . . , R.
  for k = 1, . . . , R do
    εk := e^{−γ}εk−1,    xk := U(xk−1, εk, tk)
  end for
  Output: x̂ := xR

Our strategy is to choose a sequence tk that ensures

    f(xk) − f* ≤ εk,

for the geometrically decreasing sequence εk. The overall complexity of our method will then depend on the growth of tk, as described in Lemma 2.1. The proof is similar to the smooth case and can be found in the Supplementary Material.

Proposition 3.1. Let f be a convex function satisfying (Smooth) with parameters (s, L) on a set J and (Sharp) with parameters (r, µ) on a set K.
Given x0 ∈ R^n, assume that {x | f(x) ≤ f(x0)} ⊂ J ∩ K. Run Algorithm 2 from x0 for a given ε0 ≥ f(x0) − f* with

    γ = ρ,    tk = C*_{κ,τ,ρ} e^{τk},    where    C*_{κ,τ,ρ} ≜ e^{1−τ}(cκ)^{s/(2ρ)} ε0^{−τ/ρ},

where ρ is defined in (12), κ and τ are defined in (2) and c = 8e^{2/e} here. The precision reached at the last point x̂ is given by

    f(x̂) − f* ≤ exp(−ρe^{−1}(cκ)^{−s/(2ρ)}N) ε0 = O(exp(−κ^{−s/(2ρ)}N)),    when τ = 0,

while

    f(x̂) − f* ≤ ε0 / (τe^{−1}(cκ)^{−s/(2ρ)} ε0^{τ/ρ} N + 1)^{ρ/τ} = O(κ^{s/(2τ)} N^{−ρ/τ}),    when τ > 0,

where N = Σ_{k=1}^R tk is the total number of iterations.

This bound matches the lower bounds for optimizing smooth and sharp functions [Nemirovskii and Nesterov, 1985, Page 6] up to constant factors. Notice that, compared to Nemirovskii and Nesterov [1985], we can tackle non-smooth convex optimization by using the universal fast gradient algorithm of Nesterov [2015]. The rate of convergence in Proposition 3.1 is controlled by the ratio between τ and ρ. If these are unknown, a log-scale grid search will not be able to reach the optimal rate, even if ρ is known, since we will miss the optimal rate by a constant factor. If both are known, in the case of non-smooth strongly convex functions for example, a grid search on C recovers nearly the optimal bound.
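The joint scheduling of target accuracies and inner iterations in Algorithm 2 can be sketched as pure bookkeeping, without running any inner solver; the parameter values below (ε0, C, τ, R) are illustrative assumptions, not values from the paper.

```python
import math

def universal_restart_schedule(eps0, gamma, C, tau, R):
    # Algorithm 2 bookkeeping: target accuracies eps_k = e^{-gamma} eps_{k-1}
    # and inner iteration counts t_k = ceil(C * e^{tau * k}).
    eps, plan = eps0, []
    for k in range(1, R + 1):
        eps *= math.exp(-gamma)
        t_k = math.ceil(C * math.exp(tau * k))
        plan.append((eps, t_k))
    return plan

# Non-smooth setting: s = 1 gives rho = 3*s/2 - 1 = 1/2, and the
# proposition takes gamma = rho.
rho = 3 * 1 / 2 - 1
plan = universal_restart_schedule(eps0=1.0, gamma=rho, C=4.0, tau=0.5, R=6)
```

Each pair (εk, tk) is what would be passed to the universal fast gradient method U(xk−1, εk, tk): accuracies decrease geometrically while iteration counts grow geometrically when τ > 0.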
Now we will see that if f* is known, restart produces adaptive optimal rates.

4 Restart with termination criterion

Here, we assume that we know the optimum f* of (P), or have an exact termination criterion. This is the case for example in zero-sum matrix game problems, or non-degenerate least-squares problems without regularization. We assume again that f satisfies (Smooth) with parameters (s, L) on a set J and (Sharp) with parameters (r, µ) on a set K. Given an initial point x0, we assume that smoothness and sharpness are satisfied simultaneously on the sublevel set {x | f(x) ≤ f(x0)}. We use again the universal fast gradient method U. Here however, we can stop the algorithm when it reaches the target accuracy, as we know the optimum f*: we stop after t_ε inner iterations such that x = U(x0, ε, t_ε) satisfies f(x) − f* ≤ ε, and write x ≜ C(x0, ε) the output of this method. Here we simply restart this method and decrease the target accuracy by a constant factor after each restart. Our scheme is described in Algorithm 3.

Algorithm 3 Restart on criterion
  Inputs: x0 ∈ R^n, f*, γ ≥ 0, ε0 = f(x0) − f*
  for k = 1, . . . , R do
    εk := e^{−γ}εk−1,    xk := C(xk−1, εk)
  end for
  Output: x̂ := xR

The following result describes the convergence of this method. It relies on the idea that it cannot do more iterations than the best scheduled restart to achieve the target accuracy at each restart. Its proof can be found in the Supplementary Material.

Proposition 4.1. Let f be a convex function satisfying (Smooth) with parameters (s, L) on a set J and (Sharp) with parameters (r, µ) on a set K. Given x0 ∈ R^n, assume that {x | f(x) ≤ f(x0)} ⊂ J ∩ K. Run Algorithm 3 from x0 with parameter γ = ρ.
The precision reached at the last point x̂ is given by

    f(x̂) − f* ≤ exp(−ρe^{−1}(cκ)^{−s/(2ρ)}N)(f(x0) − f*) = O(exp(−κ^{−s/(2ρ)}N)),    when τ = 0,

while

    f(x̂) − f* ≤ (f(x0) − f*) / (τe^{−1}(cκ)^{−s/(2ρ)}(f(x0) − f*)^{τ/ρ} N + 1)^{ρ/τ} = O(κ^{s/(2τ)} N^{−ρ/τ}),    when τ > 0,

where N is the total number of iterations, ρ is defined in (12), κ and τ are defined in (2) and c = 8e^{2/e} here.

Therefore, if f* is known, this method is adaptive, contrary to the general case in Proposition 3.1. It can even adapt to the local values of L or µ, as we use a criterion instead of a preset schedule. Here, stopping using f(xk) − f* implicitly yields optimal choices of C and τ. A closer look at the proof shows that the dependency on γ of this restart scheme is a factor h(γ) = γe^{−γ/ρ} of the number of iterations. Taking γ = 1 then leads to a suboptimal constant factor of at most h(ρ)/h(1) ≤ e/2 ≈ 1.36 for ρ ∈ [1/2, 2], so running this scheme with γ = 1 makes it parameter-free while getting nearly optimal bounds.

5 Numerical Results

We illustrate our results by testing our adaptive restart methods, denoted Adap and Crit, introduced respectively in Sections 2.2 and 4, on several problems, and compare them against simple gradient descent (Grad), accelerated gradient methods (Acc), and the restart heuristic enforcing monotonicity (Mono in [O'Donoghue and Candes, 2015]). For Adap we plot the convergence of the best method found by grid search to compare with the restart heuristic.
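The criterion-based restart used by Crit (Algorithm 3) can be sketched as below. The inner routine is a hypothetical stand-in for C(x0, ε): for this illustration it runs plain gradient descent until the known-optimum criterion f(x) − f* ≤ ε is met, whereas the paper uses the universal fast gradient method; the toy function and step size are assumptions of the sketch.

```python
import math

def run_until(x0, eps, f, grad, fstar, step):
    # Stand-in for C(x0, eps): iterate until f(x) - f* <= eps.
    # (Gradient descent replaces the universal fast gradient method.)
    x, t = x0, 0
    while f(x) - fstar > eps:
        x = x - step * grad(x)
        t += 1
    return x, t

def restart_on_criterion(x0, f, grad, fstar, gamma, R, step=0.4):
    # Algorithm 3: decrease the target accuracy by e^{-gamma}
    # after each restart, stopping each run via the f* criterion.
    eps = f(x0) - fstar
    x, total = x0, 0
    for _ in range(R):
        eps *= math.exp(-gamma)
        x, t = run_until(x, eps, f, grad, fstar, step)
        total += t
    return x, eps, total

f = lambda x: x * x
grad = lambda x: 2 * x
x_hat, eps_R, N = restart_on_criterion(5.0, f, grad, fstar=0.0, gamma=1.0, R=10)
```

Note the choice gamma=1.0, following the observation above that γ = 1 makes the scheme parameter-free at the cost of a small constant factor.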
This implicitly assumes that the grid search is run in parallel with enough servers. For Crit we use the optimal f∗ found by another solver. This gives an overview of its performance, with a view to potentially approximating f∗ along the iterations in future work, as done with Polyak steps [Polyak, 1987]. All restart schemes were run using the accelerated gradient method with backtracking line search detailed in the Supplementary Material, with large dots representing restart iterations.
The results focus on unconstrained problems, but our approach directly extends to composite problems by using the proximal variants of the gradient, accelerated gradient and universal fast gradient methods [Nesterov, 2015], as detailed in the Supplementary Material. This includes constrained optimization as a particular case, by adding the indicator function of the constraint set to the objective (as in the SVM example below).
In Figure 1, we solve classification problems with various losses on the UCI Sonar data set [Asuncion and Newman, 2007]. For the least squares loss on the Sonar data set, we observe much faster convergence of the restart schemes compared to the accelerated method. These results were already observed by O'Donoghue and Candes [2015]. For the logistic loss, we observe that restart does not provide much improvement; the backtracking line search on the Lipschitz constant may be sufficient to capture the geometry of the problem. For the hinge loss, we regularize by a squared norm and optimize the dual, which means solving a quadratic problem with box constraints. We observe here that the scheduled restart scheme converges much faster, while restart heuristics may be activated too late. We observe similar results for the LASSO problem. In general, Crit ensures the theoretical accelerated rate but Adap exhibits more consistent behavior. This highlights the benefits of a sharpness assumption for these last two problems.
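As a concrete illustration of the Crit scheme, Algorithm 3 can be sketched in a few lines of Python. This is a minimal sketch, not the implementation behind the experiments: plain gradient descent with a fixed step stands in for the universal gradient method U, the termination test directly checks f(x) − f∗ ≤ ε, and the names inner_method and restart_on_criterion are ours, for illustration only.

```python
import numpy as np

def inner_method(f, grad, x0, f_star, eps, step, max_iter=10000):
    """Stand-in for C(x0, eps): gradient descent stopped once f(x) - f* <= eps."""
    x = x0
    for _ in range(max_iter):
        if f(x) - f_star <= eps:
            break
        x = x - step * grad(x)
    return x

def restart_on_criterion(f, grad, x0, f_star, gamma=1.0, n_restarts=20, step=0.1):
    """Algorithm 3: shrink the target accuracy by a factor e^{-gamma} per restart."""
    eps = f(x0) - f_star
    x = x0
    for _ in range(n_restarts):
        eps *= np.exp(-gamma)              # eps_k := e^{-gamma} * eps_{k-1}
        x = inner_method(f, grad, x, f_star, eps, step)
    return x

# usage: f(x) = ||x||^2 / 2, whose minimum f* = 0 is known exactly
x_hat = restart_on_criterion(lambda x: 0.5 * x @ x, lambda x: x,
                             np.ones(5), f_star=0.0)
```

Note that the choice γ = 1 here matches the parameter-free variant discussed in Section 4: the schedule of target accuracies is fixed, while the number of inner iterations per restart adapts through the termination criterion.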
Precisely quantifying sharpness from the data or problem structure is a key open problem.

Figure 1: From left to right: least squares loss, logistic loss, dual SVM problem and LASSO. We use adaptive restarts (Adap), gradient descent (Grad), accelerated gradient (Acc) and the restart heuristic enforcing monotonicity (Mono). Large dots represent the restart iterations. Regularization parameters for dual SVM and LASSO were set to one.

Acknowledgments

The authors would like to acknowledge support from the chaire Économie des nouvelles données with the data science joint research initiative with the fonds AXA pour la recherche, a gift from Société Générale Cross Asset Quantitative Research and an AMX fellowship. The authors are affiliated with PSL Research University, Paris, France.

References

Arjevani, Y. and Shamir, O. [2016], On the iteration complexity of oblivious first-order optimization algorithms, in 'International Conference on Machine Learning', pp. 908–916.

Asuncion, A. and Newman, D. [2007], 'UCI machine learning repository'.

Attouch, H., Bolte, J., Redont, P. and Soubeyran, A. [2010], 'Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality', Mathematics of Operations Research 35(2), 438–457.

Auslender, A. and Crouzeix, J.-P. [1988], 'Global regularity theorems', Mathematics of Operations Research 13(2), 243–253.

Bierstone, E. and Milman, P. D. [1988], 'Semianalytic and subanalytic sets', Publications Mathématiques de l'IHÉS 67, 5–42.

Bolte, J., Daniilidis, A. and Lewis, A.
[2007], 'The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems', SIAM Journal on Optimization 17(4), 1205–1223.

Bolte, J., Nguyen, T. P., Peypouquet, J. and Suter, B. W. [2015], 'From error bounds to the complexity of first-order descent methods for convex functions', Mathematical Programming pp. 1–37.

Bolte, J., Sabach, S. and Teboulle, M. [2014], 'Proximal alternating linearized minimization for nonconvex and nonsmooth problems', Mathematical Programming 146(1-2), 459–494.

Burke, J. and Deng, S. [2002], 'Weak sharp minima revisited, part I: basic theory', Control and Cybernetics 31, 439–469.

Burke, J. and Ferris, M. C. [1993], 'Weak sharp minima in mathematical programming', SIAM Journal on Control and Optimization 31(5), 1340–1359.

Fercoq, O. and Qu, Z. [2016], 'Restarting accelerated gradient methods with a rough strong convexity estimate', arXiv preprint arXiv:1609.07358.

Fercoq, O. and Qu, Z. [2017], 'Adaptive restart of accelerated gradient methods under local quadratic growth condition', arXiv preprint arXiv:1709.02300.

Frankel, P., Garrigos, G. and Peypouquet, J. [2015], 'Splitting methods with variable metric for Kurdyka–Łojasiewicz functions and general convergence rates', Journal of Optimization Theory and Applications 165(3), 874–900.

Freund, R. M. and Lu, H. [2015], 'New computational guarantees for solving convex optimization problems with first order methods, via a function growth condition measure', arXiv preprint arXiv:1511.02974.

Gilpin, A., Pena, J. and Sandholm, T. [2012], 'First-order algorithm with O(log 1/ε) convergence for ε-equilibrium in two-person zero-sum games', Mathematical Programming 133(1-2), 279–298.

Giselsson, P. and Boyd, S.
[2014], Monotonicity and restart in fast gradient methods, in '53rd IEEE Conference on Decision and Control', IEEE, pp. 5058–5063.

Hoffman, A. J. [1952], 'On approximate solutions of systems of linear inequalities', Journal of Research of the National Bureau of Standards 49(4).

Juditski, A. and Nesterov, Y. [2014], 'Primal-dual subgradient methods for minimizing uniformly convex functions', arXiv preprint arXiv:1401.1792.

Karimi, H., Nutini, J. and Schmidt, M. [2016], Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition, in 'Joint European Conference on Machine Learning and Knowledge Discovery in Databases', Springer, pp. 795–811.

Lin, Q. and Xiao, L. [2014], An adaptive accelerated proximal gradient method and its homotopy continuation for sparse optimization, in 'ICML', pp. 73–81.

Łojasiewicz, S. [1963], 'Une propriété topologique des sous-ensembles analytiques réels', Les équations aux dérivées partielles pp. 87–89.

Łojasiewicz, S. [1993], 'Sur la géométrie semi- et sous-analytique', Annales de l'institut Fourier 43(5), 1575–1595.

Mangasarian, O. L. [1985], 'A condition number for differentiable convex inequalities', Mathematics of Operations Research 10(2), 175–179.

Nemirovskii, A. and Nesterov, Y. [1985], 'Optimal methods of smooth convex minimization', USSR Computational Mathematics and Mathematical Physics 25(2), 21–30.

Nesterov, Y. [1983], 'A method of solving a convex programming problem with convergence rate O(1/k^2)', Soviet Mathematics Doklady 27(2), 372–376.

Nesterov, Y. [2013a], 'Gradient methods for minimizing composite functions', Mathematical Programming 140(1), 125–161.

Nesterov, Y.
[2013b], Introductory lectures on convex optimization: A basic course, Vol. 87, Springer Science & Business Media.

Nesterov, Y. [2015], 'Universal gradient methods for convex optimization problems', Mathematical Programming 152(1-2), 381–404.

O'Donoghue, B. and Candes, E. [2015], 'Adaptive restart for accelerated gradient schemes', Foundations of Computational Mathematics 15(3), 715–732.

Polyak, B. [1979], Sharp minima, Institute of Control Sciences lecture notes, Moscow, USSR, in 'IIASA Workshop on Generalized Lagrangians and their Applications, IIASA, Laxenburg, Austria'.

Polyak, B. [1987], Introduction to optimization, Optimization Software.

Renegar, J. [2014], 'Efficient first-order methods for linear programming and semidefinite programming', arXiv preprint arXiv:1409.5832.

Robinson, S. M. [1975], 'An application of error bounds for convex programming in a linear space', SIAM Journal on Control 13(2), 271–273.

Roulet, V., Boumal, N. and d'Aspremont, A. [2015], 'Renegar's condition number, sharpness and compressed sensing performance', arXiv preprint arXiv:1506.03295.

Su, W., Boyd, S. and Candes, E. [2014], A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights, in 'Advances in Neural Information Processing Systems', pp. 2510–2518.