{"title": "UniXGrad: A Universal, Adaptive Algorithm with Optimal Guarantees for Constrained Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 6260, "page_last": 6269, "abstract": "We propose a novel adaptive, accelerated algorithm for the stochastic constrained convex optimization setting. Our method, which is inspired by the Mirror-Prox method, simultaneously achieves the optimal rates for smooth/non-smooth problems with either deterministic/stochastic first-order oracles. This is done without any prior knowledge of either the smoothness or the noise properties of the problem. To the best of our knowledge, this is the first adaptive, unified algorithm that achieves the optimal rates in the constrained setting. We demonstrate the practical performance of our framework through extensive numerical experiments.", "full_text": "UniXGrad: A Universal, Adaptive Algorithm with Optimal Guarantees for Constrained Optimization\n\nAli Kavis\u2217\nEPFL\n\nali.kavis@epfl.ch\n\nKfir Y. Levy\u2217\nTechnion\n\nkfirylevy@technion.ac.il\n\nFrancis Bach\n\nINRIA\n\nfrancis.bach@inria.fr\n\nVolkan Cevher\n\nEPFL\n\nvolkan.cevher@epfl.ch\n\nAbstract\n\nWe propose a novel adaptive, accelerated algorithm for the stochastic constrained convex optimization setting. Our method, which is inspired by the Mirror-Prox method, simultaneously achieves the optimal rates for smooth/non-smooth problems with either deterministic/stochastic first-order oracles. This is done without any prior knowledge of either the smoothness or the noise properties of the problem. To the best of our knowledge, this is the first adaptive, unified algorithm that achieves the optimal rates in the constrained setting. 
We demonstrate the practical performance of our framework through extensive numerical experiments.\n\n1 Introduction\n\nStochastic constrained optimization with first-order oracles (SCO) is critical in machine learning. Indeed, the scalability of classical machine learning tasks, such as support vector machines (SVMs), linear/logistic regression and Lasso, relies on efficient stochastic optimization methods. Importantly, generalization guarantees for such tasks often rely on constraining the set of possible solutions. The latter induces simple solutions in the form of low norm or low entropy, which in turn enables one to establish generalization guarantees.\nIn the SCO setting, the optimal convergence rates for the cases of non-smooth and smooth objectives are given by $O(GD/\\sqrt{T})$ and $O(LD^2/T^2 + \\sigma D/\\sqrt{T})$, respectively; where $T$ is the total number of (noisy) gradient queries, $L$ is the smoothness constant of the objective, $\\sigma^2$ is the variance of the stochastic gradient estimates, $D$ is the effective diameter of the decision set, and $G$ is a bound on the magnitude of gradient estimates. These rates cannot be improved without additional assumptions.\nThe optimal rate for the non-smooth case may be obtained by the current state-of-the-art optimization algorithms, such as Stochastic Gradient Descent (SGD), AdaGrad [Duchi et al., 2011], Adam [Kingma and Ba, 2014], and AmsGrad [Reddi et al., 2018]. However, in order to obtain the optimal rate for the smooth case, one is required to use more involved accelerated methods such as [Hu et al., 2009, Lan, 2012, Xiao, 2010, Diakonikolas and Orecchia, 2017, Cohen et al., 2018, Deng et al., 2018]. Unfortunately, all of these accelerated methods require a-priori knowledge of the smoothness parameter $L$, as well as the variance of the gradients $\\sigma^2$, creating a setup barrier for their use in practice. 
As a result, accelerated methods are not very popular in machine learning tasks.\nThis work develops a new universal method for SCO that obtains the optimal rates in both smooth and non-smooth cases, without any prior knowledge regarding either the smoothness constant $L$ of the problem or the noise magnitude $\\sigma$. Such universal methods that implicitly adapt to the properties of the learning objective may be very beneficial in practical large-scale problems where these properties are usually unknown. To our knowledge, this is the first work that achieves this desideratum in the constrained SCO setting.\n\n\u2217Equal contribution\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nOur contributions in the context of related work. For the unconstrained setting, Levy et al. [2018] and Cutkosky [2019] have recently presented universal schemes that obtain (almost) optimal rates for both smooth and non-smooth cases.\nMore specifically, Levy et al. [2018] design AcceleGrad, a method that obtains respective rates of $O(L \\log L D^2/T + \\sigma D \\sqrt{\\log T}/\\sqrt{T})$ and $O(GD \\sqrt{\\log T}/\\sqrt{T})$. Unfortunately, this result only holds for the unconstrained setting, and the authors leave the constrained case as an open problem.\nImportant progress towards this open problem was achieved only recently by Cutkosky [2019], who proves suboptimal respective rates of $O(D^2 L/T^{3/2} + \\sigma D/\\sqrt{T})$ and $O(1/\\sqrt{T})$ for SCO in the constrained setting.\nOur work completely resolves the open problem in Levy et al. [2018], Cutkosky [2019], and proposes the first universal method that obtains respective optimal rates of $O(D^2 L/T^2 + \\sigma D/\\sqrt{T})$ and $O(GD/\\sqrt{T})$ for the constrained setting. When applied to the unconstrained setting, our analysis tightens the rate characterizations by removing the unnecessary logarithmic factors appearing in [Levy et al., 2018, Cutkosky, 2019].\nOur method is inspired by the Mirror-Prox method [Nemirovski, 2004, Rakhlin and Sridharan, 2013, Diakonikolas and Orecchia, 2017, Bach and Levy, 2019], and builds on top of it using additional techniques from the online learning literature. Among these are an adaptive learning rate rule [Duchi et al., 2011, Rakhlin and Sridharan, 2013], as well as recent online-to-batch conversion techniques [Levy, 2017, Cutkosky, 2019].\nThe paper is organized as follows. In the next section, we specify the problem setup, and give the necessary definitions and background information. In Section 3, we motivate our framework and explain the general mechanism. We also introduce the convergence theorems with proof sketches to highlight the technical novelties. We share numerical results in comparison with other adaptive methods and baselines for different machine learning tasks in Section 4, followed by conclusions.\n\n2 Setting and preliminaries\n\nPreliminaries. Let $\\|\\cdot\\|$ be a general norm and $\\|\\cdot\\|_*$ be its dual norm. A function $f : K \\to \\mathbb{R}$ is $\\mu$-strongly convex over a convex set $K$ if, for any subgradient $\\nabla f(y)$ of $f$ at $y$,\n\n$$f(x) - f(y) - \\langle \\nabla f(y), x - y \\rangle \\geq \\frac{\\mu}{2} \\|x - y\\|^2, \\quad \\forall x, y \\in K. \\tag{1}$$\n\nA function $f : K \\to \\mathbb{R}$ is $L$-smooth over $K$ if it has $L$-Lipschitz continuous gradient, i.e.,\n\n$$\\|\\nabla f(x) - \\nabla f(y)\\|_* \\leq L \\|x - y\\|, \\quad \\forall x, y \\in K. \\tag{2}$$\n\nConsider a 1-strongly convex differentiable function $R : K \\to \\mathbb{R}$. 
The Bregman divergence with respect to a distance-generating function $R$ is defined as follows: $\\forall x, y \\in K$,\n\n$$D_R(x, y) = R(x) - R(y) - \\langle \\nabla R(y), x - y \\rangle. \\tag{3}$$\n\nAn important property of Bregman divergence is that $D_R(x, y) \\geq \\frac{1}{2} \\|x - y\\|^2$ for all $x, y \\in K$, due to the strong convexity of $R$.\n\nSetting. This paper focuses on (approximately) solving the following constrained problem,\n\n$$\\min_{x \\in K} f(x), \\tag{4}$$\n\nwhere $f : K \\to \\mathbb{R}$ is a convex function, and $K \\subset \\mathbb{R}^d$ is a compact convex set.\nWe assume the availability of a first order oracle for $f(\\cdot)$, and consider two settings: a deterministic setting where we may access exact gradients, and a stochastic setting where we may only access unbiased (noisy) gradient estimates. Concretely, we assume that by querying this oracle with a point $x \\in K$, we receive $\\tilde{\\nabla} f(x) \\in \\mathbb{R}^d$ such that\n\n$$\\mathbb{E}[\\tilde{\\nabla} f(x) \\mid x] = \\nabla f(x). \\tag{5}$$\n\nThroughout this paper we also assume the norm of the (sub)-gradient estimates is bounded by $G$, i.e.,\n\n$$\\|\\tilde{\\nabla} f(x)\\|_* \\leq G, \\quad \\forall x \\in K.$$\n\n3 The algorithm\n\nIn this section, we present and analyze our Universal eXtra Gradient (UniXGrad) method. We first discuss the Mirror-Prox (MP) algorithm of [Nemirovski, 2004], and the related Optimistic Mirror Descent (OMD) algorithm of [Rakhlin and Sridharan, 2013]. Later we present our algorithm which builds on top of the Optimistic Mirror Descent (OMD) scheme. 
Then in Sections 3.1 and 3.2, we present and analyze the guarantees of our method in non-smooth and smooth settings, respectively.\nOur goal is to optimize a convex function $f$ over a compact domain $K$, and Algorithm 1 offers a framework for solving this template, which is inspired by the Mirror-Prox (MP) algorithm of [Nemirovski, 2004] and the Optimistic Mirror Descent (OMD) algorithm of [Rakhlin and Sridharan, 2013]. Let us motivate this particular template. Basically, the algorithm takes a step from $y_{t-1}$ to $x_t$, using first order information based on $y_{t-1}$. Then, it goes back to $y_{t-1}$ and takes another step, but this time, gradient information relies on $x_t$. Each step is a generalized projection with respect to Bregman divergence $D_R(\\cdot, \\cdot)$.\n\nAlgorithm 1 Mirror-Prox Template\nInput: Number of iterations $T$, $y_0 \\in K$, learning rate $\\{\\eta_t\\}_{t \\in [T]}$\n1: for $t = 1, ..., T$ do\n2:   $x_t = \\arg\\min_{x \\in K} \\langle x, M_t \\rangle + \\frac{1}{\\eta_t} D_R(x, y_{t-1})$\n3:   $y_t = \\arg\\min_{y \\in K} \\langle y, g_t \\rangle + \\frac{1}{\\eta_t} D_R(y, y_{t-1})$\n4: end for\n\nNow, let us explain the salient differences between UniXGrad and MP as well as OMD using the particular choices of $M_t$, $g_t$ and the distance-generating function $R$.\nOptimistic Mirror Descent takes $g_t = \\nabla f(x_t)$ and computes $M_t = \\nabla f(x_{t-1})$, i.e., based on gradient information from previous iterates. This vector is available at the beginning of each iteration and the \u201coptimism\u201d arises in the case where $M_t \\approx g_t$. When $M_t = \\nabla f(y_{t-1})$ and $g_t = \\nabla f(x_t)$, the template is known as the famous Mirror-Prox algorithm. One special case of Mirror-Prox is the Extra-Gradient scheme [Korpelevich, 1976] where the projections are with respect to the Euclidean norm, i.e. $R(x) = \\frac{1}{2} \\|x\\|_2^2$, instead of general Bregman divergences.\nMP has been well-studied, especially in the context of variational inequalities and convex-concave saddle point problems. It achieves a fast convergence rate of $O(1/T)$ for this class of problems; however, in the context of smooth convex optimization, this is the standard slow rate [Nesterov, 2003]. To date, MP is not known to enjoy the accelerated rate of $O(1/T^2)$ for smooth convex minimization.\nWe propose three modifications to this template, which are the precise choice of $g_t$ and $M_t$, the adaptive learning rate and the gradient weighting scheme.\nThe notion of averaging: In different interpretations of acceleration [Nesterov, 1983, Tseng, 2008, Allen Zhu and Orecchia, 2014], the notion of averaging is always central and we incorporate this notion via gradients taken at weighted averages of iterates. Let us define the weight $\\alpha_t = t$ and the following quantities\n\n$$\\bar{x}_t = \\frac{\\alpha_t x_t + \\sum_{i=1}^{t-1} \\alpha_i x_i}{\\sum_{i=1}^{t} \\alpha_i}, \\qquad \\tilde{z}_t = \\frac{\\alpha_t y_{t-1} + \\sum_{i=1}^{t-1} \\alpha_i x_i}{\\sum_{i=1}^{t} \\alpha_i}. \\tag{6}$$\n\nThen, the UniXGrad algorithm takes $g_t = \\nabla f(\\bar{x}_t)$ and $M_t = \\nabla f(\\tilde{z}_t)$, which provides a naive interpretation of averaging. Our choice of $g_t$ and $M_t$ coincides with that of the accelerated Extra-Gradient scheme of Diakonikolas and Orecchia [2017]. While their decision relies on implicit Euler discretization of an accelerated dynamics, we arrive at the same conclusion as a direct consequence of our convergence analysis.\nAdaptive learning rate: A key ingredient of our algorithm is the choice of adaptive learning rate $\\eta_t$. 
In light of Rakhlin and Sridharan [2013], we define our lag-one-behind learning rate as\n\n$$\\eta_t = \\frac{2D}{\\sqrt{1 + \\sum_{i=1}^{t-1} \\alpha_i^2 \\|g_i - M_i\\|_*^2}}, \\tag{7}$$\n\nwhere $D^2 = \\sup_{x, y \\in K} D_R(x, y)$ is the diameter of the compact set $K$ with respect to Bregman divergences. Algorithm 2 summarizes our framework.\nGradient weighting scheme: We introduce the weights $\\alpha_t$ in the sequence updates. One can interpret this as separating step size into learning rate and the scaling factors. It is necessary that $\\alpha_t = \\Theta(t)$ in order to achieve optimal rates; in fact, we precisely choose $\\alpha_t = t$. Also notice that they appear in the learning rate, compatible with the update rule.\n\nAlgorithm 2 UniXGrad\nInput: # of iterations $T$, $y_0 \\in K$, diameter $D$, weight $\\alpha_t = t$, learning rate $\\{\\eta_t\\}_{t \\in [T]}$\n1: for $t = 1, ..., T$ do\n2:   $x_t = \\arg\\min_{x \\in K} \\alpha_t \\langle x, M_t \\rangle + \\frac{1}{\\eta_t} D_R(x, y_{t-1})$   ($M_t = \\nabla f(\\tilde{z}_t)$)\n3:   $y_t = \\arg\\min_{y \\in K} \\alpha_t \\langle y, g_t \\rangle + \\frac{1}{\\eta_t} D_R(y, y_{t-1})$   ($g_t = \\nabla f(\\bar{x}_t)$)\n4: end for\n5: return $\\bar{x}_T$\n\nIn the remainder of this section, we will present our convergence theorems and provide proof sketches to emphasize the fundamental aspects and novelties. With the purpose of simplifying the analysis, we borrow classical tools in the online learning literature and perform the convergence analysis in the sense of bounding \u201cweighted regret\u201d. Then, we use a simple yet essential conversion strategy which enables us to directly translate our weighted regret bounds to convergence rates. Before we proceed, we will present the conversion scheme from weighted regret to convergence rate, deferring the proof to the Appendix. 
In a concurrent work, [Cutkosky, 2019] proves a similar online-to-offline conversion bound.\n\nLemma 1. Consider weighted average $\\bar{x}_t$ as in Eq. (6). Let $R_T(x^*) = \\sum_{t=1}^{T} \\alpha_t \\langle x_t - x^*, g_t \\rangle$ denote the weighted regret after $T$ iterations, $\\alpha_t = t$ and $g_t = \\nabla f(\\bar{x}_t)$. Then,\n\n$$f(\\bar{x}_T) - f(x^*) \\leq \\frac{2 R_T(x^*)}{T^2}.$$\n\n3.1 Non-smooth setting\n\nDeterministic setting: First, we will focus on the convergence analysis in the case of non-smooth objective functions with deterministic/stochastic first-order oracles. We will follow the regret analysis as in Rakhlin and Sridharan [2013] with essential adjustments that suit our weighted scheme and particular choice of adaptive learning rate.\nRemark 1. It is important to point out that we do not completely exploit the precise definitions of $g_t$ and $M_t$ in the presence of non-smooth objectives. As far as the regret analysis is concerned, it suffices that these quantities are functions of $\\nabla f(\\cdot)$ and that, as a corollary, their dual norm is upper bounded. However, in order to bridge the gap between weighted regret and the objective sub-optimality, i.e. $f(\\bar{x}_T) - f(x^*)$, we require $g_t = \\nabla f(\\bar{x}_t)$.\nNow, we can exhibit our convergence bounds for the case of deterministic oracles.\n\nTheorem 1. Consider the constrained optimization setting in Problem (4), where $f : K \\to \\mathbb{R}$ is a proper, convex and $G$-Lipschitz function defined over compact, convex set $K$. Let $x^* \\in \\arg\\min_{x \\in K} f(x)$. Then, Algorithm 2 guarantees\n\n$$f(\\bar{x}_T) - \\min_{x \\in K} f(x) \\leq \\frac{7D \\sqrt{1 + \\sum_{t=1}^{T} \\alpha_t^2 \\|g_t - M_t\\|_*^2} - D}{T^2} \\leq \\frac{6D}{T^2} + \\frac{14GD}{\\sqrt{T}}. \\tag{8}$$\n\nWe establish the basis of our analysis through Lemma 1 and Corollary 2 of Rakhlin and Sridharan [2013]. 
Then, we build upon this base by exploiting the structure of the adaptive learning rate, the weights $\\alpha_t$ and the bound on gradient norms to give adaptive convergence bounds.\nStochastic setting: Now, we further consider the case of stochastic gradients. We assume that the first-order oracles are unbiased (see Eq. (5)). We want to emphasize that our stochastic setting is not restricted to the notion of additive noise, i.e. gradients corrupted with zero-mean noise. It essentially includes any estimate that recovers the full gradient in expectation, e.g. estimating gradient using mini batches. Additionally, we propagate the bounded gradient norm assumption to the stochastic oracles, such that $\\|\\tilde{\\nabla} f(x)\\|_* \\leq G$, $\\forall x \\in K$.\n\nTheorem 2. Consider the optimization setting in Problem (4), where $f$ is non-smooth, convex and $G$-Lipschitz. Let $\\{x_t\\}_{t=1,..,T}$ be a sequence generated by Algorithm 2 such that $g_t = \\tilde{\\nabla} f(\\bar{x}_t)$ and $M_t = \\tilde{\\nabla} f(\\tilde{z}_t)$. With $\\alpha_t = t$ and learning rate as in Eq. (7), it holds that\n\n$$\\mathbb{E}[f(\\bar{x}_T)] - \\min_{x \\in K} f(x) \\leq \\frac{6D}{T^2} + \\frac{14GD}{\\sqrt{T}}.$$\n\nThe analysis in the stochastic setting is similar to deterministic setting. The difference is up to replacing $g_t \\leftrightarrow \\tilde{g}_t$ and $M_t \\leftrightarrow \\tilde{M}_t$. With the bound on stochastic gradients, the same rate is achieved.\n\n3.2 Smooth setting\n\nDeterministic setting: In terms of theoretical contributions and novelty, the case of $L$-smooth objective is of greater interest. We will first start with the deterministic oracle scheme and then introduce the convergence theorem for the noisy setting.\n\nTheorem 3. Consider the constrained optimization setting in Problem (4), where $f : K \\to \\mathbb{R}$ is a proper, convex and $L$-smooth function defined over compact, convex set $K$. 
Let $x^* \\in \\arg\\min_{x \\in K} f(x)$. Then, Algorithm 2 ensures the following\n\n$$f(\\bar{x}_T) - \\min_{x \\in K} f(x) \\leq \\frac{20\\sqrt{7} D^2 L}{T^2}. \\tag{9}$$\n\nRemark 2. In the non-smooth setting, we assume that gradients have bounded norms. Our algorithm does not need to know this information, but it is necessary for the analysis in that case. However, when the function is smooth, neither the algorithm nor the analysis requires bounded gradients.\n\nProof Sketch (Theorem 3). We follow the proof of Theorem 1 until the point where we obtain\n\n$$\\sum_{t=1}^{T} \\alpha_t \\langle x_t - x^*, g_t \\rangle \\leq \\frac{1}{2} \\sum_{t=1}^{T} \\eta_{t+1} \\alpha_t^2 \\|g_t - M_t\\|_*^2 - \\frac{1}{2} \\sum_{t=1}^{T} \\frac{1}{\\eta_{t+1}} \\|x_t - y_{t-1}\\|^2 + D^2 \\left( \\frac{3}{\\eta_{T+1}} + \\frac{1}{\\eta_1} \\right).$$\n\nBy smoothness of the objective function, we have $\\|g_t - M_t\\|_* \\leq L \\|\\bar{x}_t - \\tilde{z}_t\\|$, which implies $-\\frac{1}{\\eta_{t+1}} \\|x_t - y_{t-1}\\|^2 \\leq -\\frac{\\alpha_t^2}{4L^2 \\eta_{t+1}} \\|g_t - M_t\\|_*^2$. Hence,\n\n$$\\sum_{t=1}^{T} \\alpha_t \\langle x_t - x^*, g_t \\rangle \\leq \\frac{1}{2} \\sum_{t=1}^{T} \\left( \\eta_{t+1} - \\frac{1}{4L^2 \\eta_{t+1}} \\right) \\alpha_t^2 \\|g_t - M_t\\|_*^2 + D^2 \\left( \\frac{3}{\\eta_{T+1}} + \\frac{1}{\\eta_1} \\right).$$\n\nNow we will introduce a time variable to characterize the growth of the learning rate. Define $\\tau^* = \\max \\left\\{ t \\in \\{1, ..., T\\} : \\frac{1}{\\eta_{t+1}^2} \\leq 7L^2 \\right\\}$ such that $\\forall t > \\tau^*$, $\\eta_{t+1} - \\frac{1}{4L^2 \\eta_{t+1}} \\leq -\\frac{3}{4} \\eta_{t+1}$. Then,\n\n$$\\sum_{t=1}^{T} \\alpha_t \\langle x_t - x^*, g_t \\rangle \\leq \\underbrace{D \\sum_{t=1}^{\\tau^*} \\frac{\\alpha_t^2 \\|g_t - M_t\\|_*^2}{\\sqrt{1 + \\sum_{i=1}^{t} \\alpha_i^2 \\|g_i - M_i\\|_*^2}}}_{(A)} + \\frac{D}{2} + \\underbrace{\\frac{3D}{2} \\left( \\sqrt{1 + \\sum_{t=1}^{T} \\alpha_t^2 \\|g_t - M_t\\|_*^2} - \\sum_{t=\\tau^*+1}^{T} \\frac{\\alpha_t^2 \\|g_t - M_t\\|_*^2}{\\sqrt{1 + \\sum_{i=1}^{t} \\alpha_i^2 \\|g_i - M_i\\|_*^2}} \\right)}_{(B)},$$\n\nwhere we wrote $\\eta_{t+1}$ in open form and used the definition of $\\tau^*$. To complete the proof, we will need the following lemma.\n\nLemma 2. Let $\\{a_i\\}_{i=1,...,n}$ be a sequence of non-negative numbers. Then, it holds that\n\n$$\\sqrt{\\sum_{i=1}^{n} a_i} \\leq \\sum_{i=1}^{n} \\frac{a_i}{\\sqrt{\\sum_{j=1}^{i} a_j}} \\leq 2 \\sqrt{\\sum_{i=1}^{n} a_i}.$$\n\nPlease refer to [McMahan and Streeter, 2010, Levy et al., 2018] for the proof. We jointly use Lemma 2 and the bound on $\\eta_{\\tau^*+1}$ to upper bound terms (A) and (B) with $4\\sqrt{7} D^2 L$ and $6\\sqrt{7} D^2 L$, respectively. Lemma 1 immediately establishes the convergence bound.\n\nStochastic setting: Next, we will present our results for the stochastic extension. In addition to unbiasedness and boundedness, we will introduce another classical assumption: bounded variance,\n\n$$\\mathbb{E}[\\|\\nabla f(x) - \\tilde{\\nabla} f(x)\\|_*^2 \\mid x] \\leq \\sigma^2, \\quad \\forall x \\in K. \\tag{10}$$\n\nThe analysis proceeds along similar lines as its deterministic counterpart. However, we execute the analysis using auxiliary terms and attain the optimal accelerated rate without the log factors.\n\nTheorem 4. Consider the optimization setting in Problem (4), where $f$ is $L$-smooth and convex. 
Let $\\{x_t\\}_{t=1,..,T}$ be a sequence generated by Algorithm 2 such that $g_t = \\tilde{\\nabla} f(\\bar{x}_t)$ and $M_t = \\tilde{\\nabla} f(\\tilde{z}_t)$. With $\\alpha_t = t$ and learning rate as in Eq. (7), it holds that\n\n$$\\mathbb{E}[f(\\bar{x}_T)] - \\min_{x \\in K} f(x) \\leq \\frac{224\\sqrt{14} D^2 L}{T^2} + \\frac{14\\sqrt{2} \\sigma D}{\\sqrt{T}}.$$\n\nProof Sketch (Theorem 4). We start in the same spirit as the stochastic, non-smooth setting,\n\n$$\\sum_{t=1}^{T} \\alpha_t \\langle x_t - x^*, g_t \\rangle \\leq \\underbrace{\\sum_{t=1}^{T} \\alpha_t \\langle x_t - x^*, \\tilde{g}_t \\rangle}_{(A)} + \\underbrace{\\sum_{t=1}^{T} \\alpha_t \\langle x_t - x^*, g_t - \\tilde{g}_t \\rangle}_{(B)}.$$\n\nRecall that term (B) is zero in expectation given $\\bar{x}_t$. Then, we follow the proof steps of Theorem 1,\n\n$$\\sum_{t=1}^{T} \\alpha_t \\langle x_t - x^*, g_t \\rangle \\leq \\frac{7D}{2} \\sqrt{1 + \\sum_{t=1}^{T} \\alpha_t^2 \\|\\tilde{g}_t - \\tilde{M}_t\\|_*^2} - \\frac{1}{2} \\sum_{t=1}^{T} \\frac{1}{\\eta_{t+1}} \\|x_t - y_{t-1}\\|^2. \\tag{11}$$\n\nWe will obtain $\\|g_t - M_t\\|_*^2$ from $\\|x_t - y_{t-1}\\|^2$ due to smoothness, and the challenge is to handle $\\|\\tilde{g}_t - \\tilde{M}_t\\|_*^2$ and $\\|g_t - M_t\\|_*^2$ together. So let us denote $B_t^2 := \\min\\{\\|g_t - M_t\\|_*^2, \\|\\tilde{g}_t - \\tilde{M}_t\\|_*^2\\}$. Using this definition, we could declare an auxiliary learning rate which we will only use for the analysis,\n\n$$\\tilde{\\eta}_t = \\frac{2D}{\\sqrt{1 + \\sum_{i=1}^{t-1} \\alpha_i^2 B_i^2}}.$$\n\nClearly, for any $t \\in [T]$ we have $-\\frac{1}{\\eta_{t+1}} \\|g_t - M_t\\|_*^2 \\leq -\\frac{1}{\\tilde{\\eta}_{t+1}} B_t^2$. Also, we can write,\n\n$$\\|\\tilde{g}_t - \\tilde{M}_t\\|_*^2 \\leq 2 \\|g_t - M_t\\|_*^2 + 2 \\|\\xi_t\\|_*^2, \\tag{12}$$\n\nand,\n\n$$\\|\\tilde{g}_t - \\tilde{M}_t\\|_*^2 \\leq 2 B_t^2 + 2 \\|\\xi_t\\|_*^2. \\tag{13}$$\n\nTherefore, we could rewrite Eq. (11) as,\n\n$$\\sum_{t=1}^{T} \\alpha_t \\langle x_t - x^*, g_t \\rangle \\leq \\underbrace{\\frac{7}{2} \\sum_{t=1}^{T} \\left( \\tilde{\\eta}_{t+1} - \\frac{1}{28 L^2 \\tilde{\\eta}_{t+1}} \\right) \\alpha_t^2 B_t^2}_{(A)} + \\frac{7D}{2} + \\underbrace{\\frac{7D}{\\sqrt{2}} \\sqrt{\\sum_{t=1}^{T} \\alpha_t^2 \\|\\xi_t\\|_*^2}}_{(B)}.$$\n\nUsing Lemma 2 and defining a time variable $\\tau^*$ in the sense of Theorem 3 (with correct constants), term (A) is upper bounded by $112\\sqrt{14} D^2 L$. By taking expectation conditioned on $\\bar{x}_t$ and using Jensen\u2019s inequality, we could upper bound term (B) as $14 \\sigma D T^{3/2}/\\sqrt{2}$, which leads us to the optimal rate of $224\\sqrt{14} D^2 L/T^2 + 14\\sqrt{2} \\sigma D/\\sqrt{T}$ through Lemma 1.\n\n4 Experiments\n\nWe compare the performance of our algorithm for two different tasks against adaptive methods of various characteristics, such as AdaGrad, AMSGrad and AcceleGrad, along with a recent non-adaptive method AXGD. We consider a synthetic setting where we analyze the convergence behavior, as well as an SVM classification task on a LIBSVM dataset. In all the setups, we tuned the hyper-parameters of each algorithm by grid search. In order to compare the adaptive methods on equal grounds, AdaGrad is implemented with a scalar step size based on the template given by Levy [2017]. We implement AMSGrad exactly as it is described by Reddi et al. 
[2018].\n\n4.1 Convergence behavior\n\nWe take the least squares problem with L2-norm ball constraint\ni.e.,\n2, where A \u2208 Rn\u00d7d, A \u223c N (0, \u03c32I) and b = Ax(cid:92) + \u0001 such that \u0001 is a\nmin(cid:107)x(cid:107)2