{"title": "Online Adaptive Methods, Universality and Acceleration", "book": "Advances in Neural Information Processing Systems", "page_first": 6500, "page_last": 6509, "abstract": "We present a novel method for convex unconstrained optimization that, without any modifications, ensures: (1) accelerated convergence rate for smooth objectives, (2) standard convergence rate in the general (non-smooth) setting, and (3) standard convergence rate in the stochastic optimization setting.\nTo the best of our knowledge, this is the first method that simultaneously applies to all of the above settings.\nAt the heart of our method is an adaptive learning rate rule that employs importance weights, in the spirit of adaptive online learning algorithms [duchi2011adaptive,levy2017online], combined with an update that linearly couples two sequences, in the spirit of [AllenOrecchia2017]. An empirical examination of our method demonstrates its applicability to the above mentioned scenarios and corroborates our theoretical findings.", "full_text": "Online Adaptive Methods, Universality and Acceleration\n\nKfir Y. Levy (ETH Zurich), yehuda.levy@inf.ethz.ch\nAlp Yurtsever (EPFL), alp.yurtsever@epfl.ch\nVolkan Cevher (EPFL), volkan.cevher@epfl.ch\n\nAbstract\n\nWe present a novel method for convex unconstrained optimization that, without any modifications, ensures: (i) accelerated convergence rate for smooth objectives, (ii) standard convergence rate in the general (non-smooth) setting, and (iii) standard convergence rate in the stochastic optimization setting. To the best of our knowledge, this is the first method that simultaneously applies to all of the above settings.\nAt the heart of our method is an adaptive learning rate rule that employs importance weights, in the spirit of adaptive online learning algorithms [12, 20], combined with an update that linearly couples two sequences, in the spirit of [2]. 
An empirical examination of our method demonstrates its applicability to the above-mentioned scenarios and corroborates our theoretical findings.\n\n1 Introduction\n\nThe accelerated gradient method of Nesterov [23] is one of the cornerstones of modern optimization. Due to its appeal as a computationally efficient and fast method, it has found use in numerous applications, including imaging [8], compressed sensing [14], and deep learning [31], among others.\nDespite these merits, accelerated methods are less prevalent in Machine Learning due to two major issues: (i) acceleration is inappropriate for handling noisy feedback, and (ii) acceleration requires knowledge of the objective\u2019s smoothness. While these issues were resolved separately, in [17, 16, 33] and in [25] respectively, it was unknown whether there exists an accelerated method that addresses both. In this work we propose such a method.\nConcretely, Nesterov [25] devises a method that obtains an accelerated convergence rate of $O(1/T^2)$ for smooth convex objectives, and a standard rate of $O(1/\\sqrt{T})$ for non-smooth convex objectives, over $T$ iterations. This is done without any prior knowledge of the smoothness parameter, and the method is therefore referred to as universal (see footnote 1). Nonetheless, this method uses a line search technique in every round, and is therefore inappropriate for handling noisy feedback. On the other hand, Lan [17], Hu et al. [16], and Xiao [33] devise accelerated methods that are able to handle noisy feedback and obtain a convergence rate of $O(1/T^2 + \\sigma/\\sqrt{T})$, where $\\sigma^2$ bounds the variance of the gradients. However, these methods are not universal, since they require knowledge of both $\\sigma$ and the smoothness.\nConversely, adaptive first order methods are very popular in Machine Learning, with AdaGrad [12] being the most prominent method in this class. 
AdaGrad is an online learning algorithm which adapts its learning rate using the feedback (gradients) received through the optimization process, and is known to successfully handle noisy feedback. This renders AdaGrad the method of choice in various learning applications. Note, however, that AdaGrad (probably) cannot ensure acceleration. Moreover, it was so far unknown whether AdaGrad is at all able to exploit smoothness in order to converge faster.\n\nFootnote 1: Following Nesterov\u2019s paper [25], we say that an algorithm is universal if it does not need to know in advance whether the objective is smooth or not. Note that universality does not mean that the algorithm is parameter free. Specifically, Nesterov\u2019s universal methods [25], as well as ours, are not parameter free.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nIn this work we investigate unconstrained convex optimization. We suggest AcceleGrad (Alg. 2), a novel universal method which employs an accelerated-gradient-like update rule together with an adaptive learning rate \u00e0 la AdaGrad. Our contributions are as follows:\n\n\u2022 We show that AcceleGrad obtains an accelerated rate of $O(1/T^2)$ in the smooth case and $\\tilde{O}(1/\\sqrt{T})$ in the general case, without any prior information on the objective\u2019s smoothness.\n\u2022 We show that, without any modifications, AcceleGrad ensures a convergence rate of $\\tilde{O}(1/\\sqrt{T})$ in the general stochastic convex case.\n\u2022 We also present a new result regarding the AdaGrad algorithm. We show that in the case of stochastic optimization with a smooth expected loss, AdaGrad ensures an $O(1/T + \\sigma/\\sqrt{T})$ convergence rate, where $\\sigma^2$ bounds the variance of the gradients. 
AdaGrad does not require knowledge of the smoothness, hence this result establishes the universality of AdaGrad (though without acceleration).\n\nOn the technical side, our algorithm employs three simultaneous mechanisms: learning rate adaptation in conjunction with importance weighting, in the spirit of adaptive online learning algorithms [12, 20], combined with an update rule that linearly couples two sequences, in the spirit of [2].\nThis paper is organized as follows. In Section 2 we present our setup and review relevant background. Our results and analysis for the offline setting are presented in Section 3, and Section 4 presents our results for the stochastic setting. In Section 5 we present our empirical study, and Section 6 concludes.\n\nRelated Work: In his pioneering work, Nesterov [23] establishes an accelerated rate for smooth convex optimization. This was later generalized in [24, 6] to allow for general metrics and line search.\nIn recent years there has been renewed interest in accelerated methods, with efforts being made to understand acceleration as well as to extend it beyond the standard offline optimization setting. An extension of acceleration to handle stochastic feedback was developed in [17, 16, 33, 9]. Acceleration for modern variance reduction optimization methods is explored in [29, 1], and generic templates for accelerating variance reduction algorithms are developed in [21, 15]. Scieur et al. [28] derive a scheme that enables hindsight acceleration of non-accelerated methods. In [34], the authors devise a universal accelerated method for primal dual problems. The connection between acceleration and ODEs is investigated in [30, 32, 13, 19, 5, 4]. Universal accelerated schemes are explored in [25, 18, 26], yet these works do not apply to the stochastic setting. 
Alternative accelerated methods and interpretations are explored in [3, 7, 11].\nCuriously, Allen-Zhu and Orecchia [2] interpret acceleration as a linear coupling between gradient descent and mirror descent; our work builds on their ideas. Our method also relies on ideas from [20], where universal (non-accelerated) procedures are derived through a conversion scheme of online learning algorithms.\n\n2 Setting and Preliminaries\n\nWe discuss the optimization of a convex function $f : \\mathbb{R}^d \\mapsto \\mathbb{R}$. Our goal is to (approximately) solve the following unconstrained optimization problem:\n$$\\min_{x \\in \\mathbb{R}^d} f(x).$$\nWe focus on first order methods, i.e., methods that only require gradient information, and consider both smooth and non-smooth objectives. The former is defined below.\nDefinition 1 ($\\beta$-smoothness). A function $f : \\mathbb{R}^d \\mapsto \\mathbb{R}$ is $\\beta$-smooth if\n$$f(y) \\le f(x) + \\nabla f(x) \\cdot (y - x) + \\frac{\\beta}{2}\\|x - y\\|^2, \\quad \\forall x, y \\in \\mathbb{R}^d.$$\n\nAlgorithm 1 Adaptive Gradient Method (AdaGrad)\nInput: #Iterations $T$, $x_1 \\in K$, set $K$\nfor $t = 1 \\ldots T$ do\n    Calculate: $g_t = \\nabla f(x_t)$, and update $\\eta_t = D \\left(2 \\sum_{\\tau=1}^{t} \\|g_\\tau\\|^2\\right)^{-1/2}$\n    Update: $x_{t+1} = \\Pi_K(x_t - \\eta_t g_t)$\nend for\nOutput: $\\bar{x}_T = \\frac{1}{T} \\sum_{t=1}^{T} x_t$\n\nIt is well known that with knowledge of the smoothness parameter, $\\beta$, one may obtain fast convergence rates by an appropriate adaptation of the update rule. In this work we do not assume any such knowledge; instead we assume that we are given a bound on the distance between some initial point, $x_0$, and a global minimizer of the objective.\nThis is formalized as follows: we are given a compact convex set $K$ that contains a global minimum of $f$, i.e., $\\exists z \\in K$ such that $z \\in \\arg\\min_{x \\in \\mathbb{R}^d} f(x)$. 
Thus, for any initial point, $x_0 \\in K$, its distance from the global optimum is bounded by the diameter of the set, $D := \\max_{x, y \\in K} \\|x - y\\|$. Note that we allow the algorithm to choose points outside $K$. We also assume that the objective $f$ is $G$-Lipschitz, which translates to a bound of $G$ on the magnitudes of the (sub)-gradients.\nAccess to the exact gradients of the objective is not always possible, and in many scenarios we may only access an oracle which provides noisy and unbiased gradient estimates. This Stochastic Optimization setting is prevalent in Machine Learning, and we discuss it more formally in Section 4.\n\nThe AdaGrad Algorithm: The adaptive method presented in this paper is inspired by AdaGrad (Alg. 1), a well known online optimization method which employs an adaptive learning rate. The following theorem states AdaGrad\u2019s guarantees (see footnote 2) [12].\nTheorem 2.1. Let $K$ be a convex set with diameter $D$. Let $f$ be a convex function. Then Algorithm 1 guarantees the following error:\n$$f(\\bar{x}_T) - \\min_{x \\in K} f(x) \\le \\sqrt{2 D^2 \\sum_{t=1}^{T} \\|g_t\\|^2} \\Big/ T.$$\n\nNotation: We denote the Euclidean norm by $\\|\\cdot\\|$. Given a compact convex set $K$ we denote by $\\Pi_K(\\cdot)$ the projection onto $K$, i.e., $\\forall x \\in \\mathbb{R}^d$, $\\Pi_K(x) = \\arg\\min_{y \\in K} \\|y - x\\|^2$.\n\n3 Offline Setting\n\nThis section discusses the offline optimization setting where we have access to the exact gradients of the objective. We present our method in Algorithm 2, and substantiate its universality by providing an $O(1/T^2)$ rate in the smooth case (Thm. 3.1), and a rate of $O(\\sqrt{\\log T / T})$ in the general convex case (Thm. 3.2). The analysis for the smooth case appears in Section 3.1 and we defer the proof of the non-smooth case to the Appendix.\nAcceleGrad is summarized in Algorithm 2. 
Inspired by [2], our method linearly couples two sequences $\\{z_t\\}_t, \\{y_t\\}_t$ into a sequence $\\{x_{t+1}\\}_t$. Using the gradient $g_t = \\nabla f(x_{t+1})$, these sequences are then updated with the same learning rate, $\\eta_t$, yet with different reference points and gradient magnitudes. Concretely, $y_{t+1}$ takes a gradient step starting at $x_{t+1}$. Conversely, for $z_{t+1}$ we scale the gradient by a factor of $\\alpha_t$ and then take a projected gradient step starting at $z_t$. Our method finally outputs a weighted average of the $\\{y_{t+1}\\}_t$ sequence.\n\nAlgorithm 2 Accelerated Adaptive Gradient Method (AcceleGrad)\nInput: #Iterations $T$, $x_0 \\in K$, diameter $D$, weights $\\{\\alpha_t\\}_{t \\in [T]}$, learning rate $\\{\\eta_t\\}_{t \\in [T]}$\nSet: $y_0 = z_0 = x_0$\nfor $t = 0 \\ldots T$ do\n    Set $\\tau_t = 1/\\alpha_t$\n    Update:\n        $x_{t+1} = \\tau_t z_t + (1 - \\tau_t) y_t$, and define $g_t := \\nabla f(x_{t+1})$\n        $z_{t+1} = \\Pi_K(z_t - \\alpha_t \\eta_t g_t)$\n        $y_{t+1} = x_{t+1} - \\eta_t g_t$\nend for\nOutput: $\\bar{y}_T \\propto \\sum_{t=0}^{T-1} \\alpha_t y_{t+1}$\n\nFootnote 2: AdaGrad is in fact well known to ensure regret guarantees in the online setting. For concreteness, Thm. 2.1 provides error guarantees in the offline setting.\n\nOur algorithm coincides with the method of [2] upon taking $\\eta_t = 1/\\beta$ and outputting the last iterate, $y_T$, rather than a weighted average; yet this method is not universal. Below we present our $\\beta$-independent choice of learning rate and weights:\n$$\\eta_t = \\frac{2D}{\\left(G^2 + \\sum_{\\tau=0}^{t} \\alpha_\\tau^2 \\|g_\\tau\\|^2\\right)^{1/2}} \\quad \\text{and} \\quad \\alpha_t = \\begin{cases} 1 & 0 \\le t \\le 2 \\\\\\\\ \\frac{1}{4}(t+1) & t \\ge 3 \\end{cases} \\qquad (1)$$\nThe learning rate that we suggest adapts similarly to AdaGrad. Differently from AdaGrad, we consider the importance weights, $\\alpha_t$, inside the learning rate rule; an idea that we borrow from [20]. 
The weights that we employ are increasing with $t$, which in turn emphasizes recent queries.\nNext we state the guarantees of AcceleGrad for the smooth and non-smooth cases.\nTheorem 3.1. Assume that $f$ is convex and $\\beta$-smooth. Let $K$ be a convex set with bounded diameter $D$, and assume there exists a global minimizer for $f$ in $K$. Then Algorithm 2 with weights and learning rate as in Equation (1) ensures\n$$f(\\bar{y}_T) - \\min_{x \\in \\mathbb{R}^d} f(x) \\le O\\left(\\frac{DG + \\beta D^2 \\log(\\beta D / G)}{T^2}\\right).$$\nRemark: In the smooth case we do not actually need a bound on the Lipschitz constant, i.e., $G$ is only required when the objective is non-smooth. Concretely, if we know that $f$ is smooth then we may use $\\eta_t = 2D \\left(\\sum_{\\tau=0}^{t} \\alpha_\\tau^2 \\|g_\\tau\\|^2\\right)^{-1/2}$, which yields a rate of $O\\left(\\beta D^2 \\log(\\beta D / \\|g_0\\|) / T^2\\right)$.\n\nNext we show that exactly the same algorithm provides guarantees in the general convex case.\nTheorem 3.2. Assume that $f$ is convex and $G$-Lipschitz. Let $K$ be a convex set with bounded diameter $D$, and assume there exists a global minimizer for $f$ in $K$. Then Algorithm 2 with weights and learning rate as in Equation (1) ensures\n$$f(\\bar{y}_T) - \\min_{x \\in \\mathbb{R}^d} f(x) \\le O\\left(GD \\sqrt{\\log T} / \\sqrt{T}\\right).$$\nRemark: For non-smooth objectives, we can modify AcceleGrad and provide guarantees for the constrained setting. Concretely, when using Alg. 2 with a projection step for the $y_t$\u2019s, i.e., $y_{t+1} = \\Pi_K(x_{t+1} - \\eta_t g_t)$, we can bound its error by $f(\\bar{y}_T) - \\min_{x \\in K} f(x) \\le O\\left(GD \\sqrt{\\log T} / \\sqrt{T}\\right)$. This holds even in the case where the minimizer over $K$ is not a global one.\n\n3.1 Analysis of the Smooth Case\n\nHere we provide a proof sketch for Theorem 3.1. 
For brevity, we will use $z \\in K$ to denote a global minimizer of $f$ which belongs to $K$.\nRecall that Algorithm 2 outputs a weighted average of the queries. Consequently, we may employ Jensen\u2019s inequality to bound its error as follows:\n$$f(\\bar{y}_T) - f(z) \\le \\frac{1}{\\sum_{t=0}^{T-1} \\alpha_t} \\sum_{t=0}^{T-1} \\alpha_t \\left(f(y_{t+1}) - f(z)\\right). \\qquad (2)$$\nCombining this with $\\sum_{t=0}^{T-1} \\alpha_t \\ge \\Omega(T^2)$ implies that, in order to substantiate the proof, it is sufficient to show that $\\sum_{t=0}^{T-1} \\alpha_t (f(y_{t+1}) - f(z))$ is bounded by a constant. This is the bulk of the analysis.\n\nWe start with the following lemma, which provides us with a bound on $\\alpha_t (f(y_{t+1}) - f(z))$.\nLemma 3.1. Assume that $f$ is convex and $\\beta$-smooth. Then for any sequence of non-negative weights $\\{\\alpha_t\\}_{t \\ge 0}$ and learning rates $\\{\\eta_t\\}_{t \\ge 0}$, Algorithm 2 ensures that the following holds:\n$$\\alpha_t (f(y_{t+1}) - f(z)) \\le (\\alpha_t^2 - \\alpha_t)(f(y_t) - f(y_{t+1})) + \\frac{\\alpha_t^2}{2}\\left(\\beta - \\frac{1}{\\eta_t}\\right)\\|y_{t+1} - x_{t+1}\\|^2 + \\frac{1}{2\\eta_t}\\left(\\|z_t - z\\|^2 - \\|z_{t+1} - z\\|^2\\right).$$\nInterestingly, choosing $\\eta_t \\le 1/\\beta$ implies that the above term, $\\frac{\\alpha_t^2}{2}\\left(\\beta - \\frac{1}{\\eta_t}\\right)\\|y_{t+1} - x_{t+1}\\|^2$, is non-positive and can be dropped from the sum. We can show that this choice facilitates a concise analysis establishing an error of $O(\\beta D^2 / T^2)$ for $\\bar{y}_T$ (see footnote 3).\nNote, however, that our learning rate does not depend on $\\beta$, and therefore the mentioned term is not necessarily negative. This issue is one of the main challenges in our analysis. Next we provide a proof sketch of Theorem 3.1. The full proof is deferred to the Appendix.\n\nProof Sketch of Theorem 3.1. 
Lemma 3.1 enables us to decompose $\\sum_{t=0}^{T-1} \\alpha_t (f(y_{t+1}) - f(z))$:\n$$\\sum_{t=0}^{T-1} \\alpha_t (f(y_{t+1}) - f(z)) \\le \\underbrace{\\sum_{t=0}^{T-1} \\frac{1}{2\\eta_t}\\left(\\|z_t - z\\|^2 - \\|z_{t+1} - z\\|^2\\right)}_{(A)} + \\underbrace{\\sum_{t=0}^{T-1} (\\alpha_t^2 - \\alpha_t)(f(y_t) - f(y_{t+1}))}_{(B)} + \\underbrace{\\sum_{t=0}^{T-1} \\frac{\\alpha_t^2}{2}\\left(\\beta - \\frac{1}{\\eta_t}\\right)\\|y_{t+1} - x_{t+1}\\|^2}_{(C)} \\qquad (3)$$\nNext we separately bound each of the above terms.\n(a) Bounding (A): Using the fact that $\\{1/\\eta_t\\}_{t \\in [T]}$ is monotonically increasing allows us to show\n$$\\sum_{t=0}^{T-1} \\frac{1}{2\\eta_t}\\left(\\|z_t - z\\|^2 - \\|z_{t+1} - z\\|^2\\right) \\le \\frac{\\|z_0 - z\\|^2}{2\\eta_0} + \\frac{1}{2}\\sum_{t=1}^{T-1} \\|z_t - z\\|^2 \\left(\\frac{1}{\\eta_t} - \\frac{1}{\\eta_{t-1}}\\right) \\le \\frac{D^2}{2\\eta_{T-1}}, \\qquad (4)$$\nwhere we used $\\|z_t - z\\| \\le D$.\n(b) Bounding (B): We will require the following property of the weights that we choose (Eq. (1)):\n$$(\\alpha_t^2 - \\alpha_t) - (\\alpha_{t-1}^2 - \\alpha_{t-1}) \\le \\alpha_{t-1}/2. \\qquad (5)$$\nNow recall that $z := \\arg\\min_{x \\in \\mathbb{R}^d} f(x)$, and let us denote the sub-optimality of $y_t$ by $\\delta_t$, i.e., $\\delta_t = f(y_t) - f(z)$. 
Noting that $\\delta_t \\ge 0$ we may show the following:\n$$\\sum_{t=0}^{T-1} (\\alpha_t^2 - \\alpha_t)(f(y_t) - f(y_{t+1})) = \\sum_{t=0}^{T-1} (\\alpha_t^2 - \\alpha_t)(\\delta_t - \\delta_{t+1}) \\le \\sum_{t=1}^{T-1} \\left((\\alpha_t^2 - \\alpha_t) - (\\alpha_{t-1}^2 - \\alpha_{t-1})\\right)\\delta_t \\le \\frac{1}{2}\\sum_{t=0}^{T-1} \\alpha_t (f(y_{t+1}) - f(z)), \\qquad (6)$$\nwhere the last inequality uses Equation (5) (see full proof for the complete derivation).\n\nFootnote 3: While we do not spell out this analysis, it is a simplified version of our proof for Thm. 3.1.\n\n(c) Bounding (C): Let us denote $\\tau_\\star := \\max\\{t \\in \\{0, \\ldots, T-1\\} : 2\\beta \\ge 1/\\eta_t\\}$. We may now split the term (C) according to $\\tau_\\star$:\n$$\\begin{aligned} (C) &= \\sum_{t=0}^{\\tau_\\star} \\frac{\\alpha_t^2}{2}\\left(\\beta - \\frac{1}{\\eta_t}\\right)\\|y_{t+1} - x_{t+1}\\|^2 + \\sum_{t=\\tau_\\star+1}^{T-1} \\frac{\\alpha_t^2}{2}\\left(\\beta - \\frac{1}{\\eta_t}\\right)\\|y_{t+1} - x_{t+1}\\|^2 \\\\\\\\ &\\le \\frac{\\beta}{2}\\sum_{t=0}^{\\tau_\\star} \\alpha_t^2 \\|y_{t+1} - x_{t+1}\\|^2 - \\frac{1}{4}\\sum_{t=\\tau_\\star+1}^{T-1} \\frac{\\alpha_t^2}{\\eta_t}\\|y_{t+1} - x_{t+1}\\|^2 \\\\\\\\ &= \\frac{\\beta}{2}\\sum_{t=0}^{\\tau_\\star} \\eta_t^2 \\alpha_t^2 \\|g_t\\|^2 - \\frac{1}{4}\\sum_{t=\\tau_\\star+1}^{T-1} \\eta_t \\alpha_t^2 \\|g_t\\|^2 \\end{aligned} \\qquad (7)$$\nwhere in the second line we use $2\\beta \\le 1/\\eta_t$, which holds for $t > \\tau_\\star$, implying that $\\beta - \\frac{1}{\\eta_t} \\le -\\frac{1}{2\\eta_t}$; in the last line we use $\\|y_{t+1} - x_{t+1}\\| = \\eta_t \\|g_t\\|$.\nFinal Bound: Combining the bounds in Equations (4), (6), (7) into Eq. (3), and re-arranging gives\n$$\\frac{1}{2}\\sum_{t=0}^{T-1} \\alpha_t (f(y_{t+1}) - f(z)) \\le \\underbrace{\\frac{D^2}{2\\eta_{T-1}} - \\frac{1}{4}\\sum_{t=\\tau_\\star+1}^{T-1} \\eta_t \\alpha_t^2 \\|g_t\\|^2}_{(*)} + \\underbrace{\\frac{\\beta}{2}\\sum_{t=0}^{\\tau_\\star} \\eta_t^2 \\alpha_t^2 \\|g_t\\|^2}_{(**)} \\qquad (8)$$\nWe are now in the intricate part of the proof, where we need to show that the above is bounded by a constant. As we show next, this crucially depends on our choice of the learning rate. To simplify the proof sketch we assume that we use $\\eta_t = 2D \\left(\\sum_{\\tau=0}^{t} \\alpha_\\tau^2 \\|g_\\tau\\|^2\\right)^{-1/2}$, i.e., taking $G = 0$ in the learning rate. We will require the following lemma before we go on.\nLemma. For any non-negative numbers $a_1, \\ldots, a_n$ the following holds:\n$$\\sqrt{\\sum_{i=1}^{n} a_i} \\le \\sum_{i=1}^{n} \\frac{a_i}{\\sqrt{\\sum_{j=1}^{i} a_j}} \\le 2\\sqrt{\\sum_{i=1}^{n} a_i}.$$\nEquipped with the above lemma, and using $\\eta_t$ explicitly, we can bound $(*)$:\n$$\\begin{aligned} (*) &= \\frac{D}{4}\\left(\\sum_{t=0}^{T-1} \\alpha_t^2 \\|g_t\\|^2\\right)^{1/2} - \\frac{D}{2}\\sum_{t=\\tau_\\star+1}^{T-1} \\frac{\\alpha_t^2 \\|g_t\\|^2}{\\left(\\sum_{\\tau=0}^{t} \\alpha_\\tau^2 \\|g_\\tau\\|^2\\right)^{1/2}} \\\\\\\\ &\\le \\frac{D}{4}\\sum_{t=0}^{T-1} \\frac{\\alpha_t^2 \\|g_t\\|^2}{\\left(\\sum_{\\tau=0}^{t} \\alpha_\\tau^2 \\|g_\\tau\\|^2\\right)^{1/2}} - \\frac{D}{2}\\sum_{t=\\tau_\\star+1}^{T-1} \\frac{\\alpha_t^2 \\|g_t\\|^2}{\\left(\\sum_{\\tau=0}^{t} \\alpha_\\tau^2 \\|g_\\tau\\|^2\\right)^{1/2}} \\\\\\\\ &\\le \\frac{D}{4}\\sum_{t=0}^{\\tau_\\star} \\frac{\\alpha_t^2 \\|g_t\\|^2}{\\left(\\sum_{\\tau=0}^{t} \\alpha_\\tau^2 \\|g_\\tau\\|^2\\right)^{1/2}} \\\\\\\\ &\\le \\frac{D}{2}\\left(\\sum_{t=0}^{\\tau_\\star} \\alpha_t^2 \\|g_t\\|^2\\right)^{1/2} = \\frac{D^2}{\\eta_{\\tau_\\star}} \\le 2\\beta D^2, \\end{aligned}$$\nwhere in the last inequality we have used the definition of $\\tau_\\star$, which implies that $1/\\eta_{\\tau_\\star} \\le 2\\beta$.\nA similar argument allows us to bound the term $(**)$ by $O(\\beta D^2 \\log(\\beta D / \\|g_0\\|))$. Plugging these bounds back into Eq. (8) we get\n$$\\sum_{t=0}^{T-1} \\alpha_t (f(y_{t+1}) - f(z)) \\le O(\\beta D^2 \\log(\\beta D / \\|g_0\\|)).$$\nCombining this with Eq. (2) and noting that $\\sum_{t=0}^{T-1} \\alpha_t \\ge T^2/32$ concludes the proof.\n\n4 Stochastic Setting\n\nThis section discusses the stochastic optimization setup which is prevalent in Machine Learning scenarios. We formally describe this setup and prove that Algorithm 2, without any modification, is ensured to converge in this setting (Thm. 4.1). Conversely, the universal gradient methods presented in [25] rely on a line search procedure, which requires exact gradients and function values, and are therefore inappropriate for stochastic optimization.\nAs a related result we show that the AdaGrad algorithm (Alg. 
1) is universal and is able to exploit small variance in order to ensure fast rates in the case of stochastic optimization with smooth expected loss (Thm. 4.2). We emphasize that AdaGrad requires neither the smoothness nor a bound on the variance. Conversely, previous works with this type of guarantee, [33, 17], require knowledge of both of these parameters.\nSetup: We consider the problem of minimizing a convex function $f : \\mathbb{R}^d \\mapsto \\mathbb{R}$. We assume that optimization lasts for $T$ rounds; on each round $t = 1, \\ldots, T$, we may query a point $x_t \\in \\mathbb{R}^d$, and receive feedback. After the last round, we choose $\\bar{x}_T \\in \\mathbb{R}^d$, and our performance measure is the expected excess loss, defined as\n$$\\mathbb{E}[f(\\bar{x}_T)] - \\min_{x \\in \\mathbb{R}^d} f(x).$$\nHere we assume that our feedback is a first order noisy oracle such that upon querying this oracle with a point $x$, we receive a bounded and unbiased gradient estimate, $\\tilde{g}$, such that\n$$\\mathbb{E}[\\tilde{g} \\mid x] = \\nabla f(x) \\quad \\text{and} \\quad \\|\\tilde{g}\\| \\le G. \\qquad (9)$$\nWe also assume that the internal coin tosses (randomizations) of the oracle are independent. It is well known that variants of Stochastic Gradient Descent (SGD) are ensured to output an estimate $\\bar{x}_T$ such that the excess loss is bounded by $O(1/\\sqrt{T})$ for the setups of stochastic convex optimization [22]. Similarly to the offline setting we assume that we are given a set $K$ with bounded diameter $D$, such that there exists a global optimum of $f$ in $K$.\n\nThe next theorem substantiates the guarantees of Algorithm 2 in the stochastic case.\nTheorem 4.1. Assume that $f$ is convex and $G$-Lipschitz. Let $K$ be a convex set with bounded diameter $D$, and assume there exists a global minimizer for $f$ in $K$. Assume that we invoke Algorithm 2 but provide it with noisy gradient estimates (see Eq. (9)) rather than the exact ones. 
Then Algorithm 2 with weights and learning rate as in Equation (1) ensures\n$$\\mathbb{E}[f(\\bar{y}_T)] - \\min_{x \\in \\mathbb{R}^d} f(x) \\le O\\left(GD \\sqrt{\\log T} / \\sqrt{T}\\right).$$\nThe analysis of Theorem 4.1 goes along similar lines to the proof of its offline counterpart (Thm. 3.2).\nIt is well known that AdaGrad (Alg. 1) enjoys the standard rate of $O(GD/\\sqrt{T})$ in the stochastic setting. The next theorem demonstrates that: (i) AdaGrad is universal, and (ii) AdaGrad implicitly makes use of smoothness and small variance in the stochastic setting.\nTheorem 4.2. Assume that $f$ is convex and $\\beta$-smooth. Let $K$ be a convex set with bounded diameter $D$, and assume there exists a global minimizer for $f$ in $K$. Assume that we invoke AdaGrad (Alg. 1) but provide it with noisy gradient estimates (see Eq. (9)) rather than the exact ones. Then,\n$$\\mathbb{E}[f(\\bar{x}_T)] - \\min_{x \\in \\mathbb{R}^d} f(x) \\le O\\left(\\frac{\\beta D^2}{T} + \\frac{\\sigma D}{\\sqrt{T}}\\right),$$\nwhere $\\sigma^2$ is a bound on the variance of the noisy gradients, i.e., $\\forall x \\in \\mathbb{R}^d$: $\\mathbb{E}\\left[\\|\\tilde{g} - \\nabla f(x)\\|^2 \\mid x\\right] \\le \\sigma^2$.\n\n5 Experiments\n\nIn this section we compare AcceleGrad against AdaGrad (Alg. 1) and universal gradient methods [25], focusing on the effect of tuning parameters and the level of adaptivity.\nWe consider smooth ($p = 2$) and non-smooth ($p = 1$) regression problems of the form\n$$\\min_{x \\in \\mathbb{R}^d} F(x) := \\|Ax - b\\|_p^p.$$\n\nFigure 1: Comparison of universal methods on a smooth (top) and a non-smooth (bottom) problem.\n\nWe synthetically generate a matrix $A \\in \\mathbb{R}^{n \\times d}$ and a point of interest $x^\\natural \\in \\mathbb{R}^d$ randomly, with entries independently drawn from the standard Gaussian distribution. Then, we generate $b = Ax^\\natural + \\omega$, with Gaussian noise $\\omega \\sim \\mathcal{N}(0, \\sigma^2)$ and $\\sigma^2 = 10^{-2}$. 
We fix $n = 2000$ and $d = 500$.\nFigure 1 presents the results for the offline optimization setting, where we provide the exact gradients of $F$. All methods are initialized at the origin, and we choose $K$ as the $\\ell_2$ norm ball of diameter $D$.\nUniversal gradient methods are based on an inexact line-search technique that requires an input parameter $\\epsilon$. Moreover, these methods have convergence guarantees only up to $\\epsilon/2$-suboptimality. For smooth problems, these methods perform better with smaller $\\epsilon$. In stark contrast, for the non-smooth problems, small $\\epsilon$ causes late adaptation, and large $\\epsilon$ ends up with early saturation. Tuning is a major problem for these methods, since it requires rough knowledge of the optimal value.\nThe universal gradient method (and also its fast version) provably requires two line-search iterations on average at each outer iteration. Consequently, it performs two data passes at each iteration (four for the fast version), while AdaGrad and AcceleGrad require only a single data pass.\nThe parameter $\\rho$ denotes the ratio between $D/2$ and the distance between the initial point and the solution. The parameter $D$ plays a major role in the step-size of AdaGrad and AcceleGrad. Overestimating $D$ causes an overshoot in the first iterations. AcceleGrad consistently outperforms AdaGrad in the deterministic setting. As a final note, the iterates $y_t$ of AcceleGrad empirically converge faster than the averaged sequence $\\bar{y}_T$. Note that for AcceleGrad we always take $G = 0$, i.e., use $\\eta_t = 2D \\left(\\sum_{\\tau=0}^{t} \\alpha_\\tau^2 \\|g_\\tau\\|^2\\right)^{-1/2}$.\n\nWe also study the stochastic setup (see Appendix), where we provide noisy gradients of $F$ based on minibatches. 
As expected, universal line search methods fail in this case, while AcceleGrad converges and performs similarly to AdaGrad.\nLarge batches: In the appendix we show results on a real dataset which demonstrate the appeal of AcceleGrad in the large-minibatch regime. We show that as the batch size increases, the performance of AcceleGrad versus the number of gradient calculations does not degrade and might even improve. This is beneficial when we would like to parallelize a stochastic optimization problem. Conversely, for AdaGrad we see a clear degradation of the performance as we increase the batch size.\n\n[Figure 1 panels: universal gradient method, AdaGrad, universal fast gradient method, AcceleGrad ($\\bar{y}_t$), AcceleGrad ($y_t$). Axes: objective residual versus iteration. Curves are indexed by $\\epsilon$ for the universal methods and by $\\rho$ for AdaGrad and AcceleGrad.]\n\n6 Conclusion and Future Work\n\nWe have presented a novel universal method that may exploit smoothness in order to accelerate while still being able to successfully handle noisy feedback. Our current analysis only applies to unconstrained optimization problems. Extending our work to the constrained setting is a natural future direction. Another direction is to implicitly adapt the parameter $D$; this might be possible using ideas in the spirit of scale-free online algorithms [27, 10].\n\nAcknowledgments\n\nThe authors would like to thank Zal\u00e1n Borsos for his insightful comments on the manuscript. This project has received funding from the European Research Council (ERC) under the European Union\u2019s Horizon 2020 research and innovation programme (grant agreement no 725594 - time-data). K.Y.L. is supported by the ETH Zurich Postdoctoral Fellowship and Marie Curie Actions for People COFUND program.\n\nReferences\n\n[1] Z. Allen-Zhu. Katyusha: The First Direct Acceleration of Stochastic Gradient Methods. In STOC, 2017. Full version available at http://arxiv.org/abs/1603.05953.\n\n[2] Z. Allen-Zhu and L. Orecchia. Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent. In Proceedings of the 8th Innovations in Theoretical Computer Science, ITCS \u201917, 2017. Full version available at http://arxiv.org/abs/1407.1537.\n\n[3] Y. Arjevani, S. Shalev-Shwartz, and O. Shamir. On lower and upper bounds in smooth and strongly convex optimization. The Journal of Machine Learning Research, 17(1):4303\u20134353, 2016.\n\n[4] H. Attouch and Z. Chbani. Fast inertial dynamics and FISTA algorithms in convex optimization. Perturbation aspects. arXiv preprint arXiv:1507.01367, 2015.\n\n[5] J. Aujol and C. Dossal. Optimal rate of convergence of an ODE associated to the fast gradient descent schemes for b > 0. 2017.\n\n[6] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183\u2013202, 2009.\n\n[7] S. Bubeck, Y. T. Lee, and M. Singh. 
A geometric alternative to nesterov\u2019s accelerated gradient\n\ndescent. arXiv preprint arXiv:1506.08187, 2015.\n\n[8] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with\n\napplications to imaging. Journal of mathematical imaging and vision, 40(1):120\u2013145, 2011.\n\n[9] M. B. Cohen, J. Diakonikolas, and L. Orecchia. On acceleration with noise-corrupted gradients.\n\narXiv preprint arXiv:1805.12591, 2018.\n\n[10] A. Cutkosky and F. Orabona. Black-box reductions for parameter-free online learning in banach\n\nspaces. arXiv preprint arXiv:1802.06293, 2018.\n\n[11] J. Diakonikolas and L. Orecchia. Accelerated extra-gradient descent: A novel accelerated\n\nfirst-order method. arXiv preprint arXiv:1706.04680, 2017.\n\n[12] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and\n\nstochastic optimization. Journal of Machine Learning Research, 12(Jul):2121\u20132159, 2011.\n\n[13] N. Flammarion and F. Bach. From averaging to acceleration, there is only a step-size. In\n\nConference on Learning Theory, pages 658\u2013695, 2015.\n\n[14] S. Foucart and H. Rauhut. A mathematical introduction to compressive sensing, volume 1.\n\nBirkh\u00e4user Basel, 2013.\n\n[15] R. Frostig, R. Ge, S. Kakade, and A. Sidford. Un-regularizing: approximate proximal point and\nfaster stochastic algorithms for empirical risk minimization. In International Conference on\nMachine Learning, pages 2540\u20132548, 2015.\n\n[16] C. Hu, W. Pan, and J. T. Kwok. Accelerated gradient methods for stochastic optimization and\nonline learning. In Advances in Neural Information Processing Systems, pages 781\u2013789, 2009.\n\n[17] G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming,\n\n133(1-2):365\u2013397, 2012.\n\n[18] G. Lan. Bundle-level type methods uniformly optimal for smooth and nonsmooth convex\n\noptimization. Mathematical Programming, 149(1-2):1\u201345, 2015.\n\n[19] L. 
Lessard, B. Recht, and A. Packard. Analysis and design of optimization algorithms via\n\nintegral quadratic constraints. SIAM Journal on Optimization, 26(1):57\u201395, 2016.\n\n[20] K. Levy. Online to offline conversions, universality and adaptive minibatch sizes. In Advances\n\nin Neural Information Processing Systems, pages 1612\u20131621, 2017.\n\n[21] H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In\n\nAdvances in Neural Information Processing Systems, pages 3384\u20133392, 2015.\n\n[22] A. Nemirovskii, D. B. Yudin, and E. Dawson. Problem complexity and method efficiency in\n\noptimization. 1983.\n\n[23] Y. Nesterov. A method of solving a convex programming problem with convergence rate o\n\n(1/k2). In Soviet Mathematics Doklady, volume 27, pages 372\u2013376, 1983.\n\n[24] Y. Nesterov. Introductory lectures on convex optimization. 2004, 2003.\n\n[25] Y. Nesterov. Universal gradient methods for convex optimization problems. Mathematical\n\nProgramming, 152(1-2):381\u2013404, 2015.\n\n[26] A. Neumaier. Osga: a fast subgradient algorithm with optimal complexity. Mathematical\n\nProgramming, 158(1-2):1\u201321, 2016.\n\n[27] F. Orabona and D. P\u00e1l. Scale-free algorithms for online linear optimization. In International\n\nConference on Algorithmic Learning Theory, pages 287\u2013301. Springer, 2015.\n\n[28] D. Scieur, A. d\u2019Aspremont, and F. Bach. Regularized nonlinear acceleration. In Advances In\n\nNeural Information Processing Systems, pages 712\u2013720, 2016.\n\n[29] S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for\nregularized loss minimization. In International Conference on Machine Learning, pages 64\u201372,\n2014.\n\n[30] W. Su, S. Boyd, and E. Candes. A differential equation for modeling nesterov\u2019s accelerated\ngradient method: Theory and insights. In Advances in Neural Information Processing Systems,\npages 2510\u20132518, 2014.\n\n[31] I. 
Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and\nmomentum in deep learning. In International conference on machine learning, pages 1139\u2013\n1147, 2013.\n\n[32] A. Wibisono, A. C. Wilson, and M. I. Jordan. A variational perspective on accelerated methods\nin optimization. Proceedings of the National Academy of Sciences, 113(47):E7351\u2013E7358,\n2016.\n\n[33] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization.\n\nJournal of Machine Learning Research, 11(Oct):2543\u20132596, 2010.\n\n[34] A. Yurtsever, Q. T. Dinh, and V. Cevher. A universal primal-dual convex optimization framework.\n\nIn Advances in Neural Information Processing Systems, pages 3150\u20133158, 2015.\n\n", "award": [], "sourceid": 3218, "authors": [{"given_name": "Kfir Y.", "family_name": "Levy", "institution": "ETH"}, {"given_name": "Alp", "family_name": "Yurtsever", "institution": "EPFL"}, {"given_name": "Volkan", "family_name": "Cevher", "institution": "EPFL"}]}