{"title": "A unified variance-reduced accelerated gradient method for convex optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 10462, "page_last": 10472, "abstract": "We propose a novel randomized incremental gradient algorithm, namely, VAriance-Reduced Accelerated Gradient (Varag), for finite-sum optimization. Equipped with a unified step-size policy that adjusts itself to the value of the conditional number, Varag exhibits the unified optimal rates of convergence for solving smooth convex finite-sum problems directly regardless of their strong convexity. Moreover, Varag is the first accelerated randomized incremental gradient method that benefits from the strong convexity of the data-fidelity term to achieve the optimal linear convergence. It also establishes an optimal linear rate of convergence for solving a wide class of problems only satisfying a certain error bound condition rather than strong convexity. Varag can also be extended to solve stochastic finite-sum problems.", "full_text": "A uni\ufb01ed variance-reduced accelerated gradient\n\nmethod for convex optimization\n\nH. Milton Stewart School of Industrial & Systems Engineering\n\nGuanghui Lan\n\nGeorgia Institute of Technology\n\nAtlanta, GA 30332\n\ngeorge.lan@isye.gatech.edu\n\nInstitute for Interdisciplinary Information Sciences\n\nIBM Almaden Research Center\n\nZhize Li\n\nTsinghua University\nBeijing 100084, China\n\nzz-li14@mails.tsinghua.edu.cn\n\nYi Zhou\n\nSan Jose, CA 95120\nyi.zhou@ibm.com\n\nAbstract\n\nWe propose a novel randomized incremental gradient algorithm, namely, VAriance-\nReduced Accelerated Gradient (Varag ), for \ufb01nite-sum optimization. Equipped\nwith a uni\ufb01ed step-size policy that adjusts itself to the value of the condition\nnumber, Varag exhibits the uni\ufb01ed optimal rates of convergence for solving smooth\nconvex \ufb01nite-sum problems directly regardless of their strong convexity. 
Moreover, Varag is the first accelerated randomized incremental gradient method that benefits from the strong convexity of the data-fidelity term to achieve the optimal linear convergence. It also establishes an optimal linear rate of convergence for solving a wide class of problems only satisfying a certain error bound condition rather than strong convexity. Varag can also be extended to solve stochastic finite-sum problems.

1 Introduction

The problem of interest in this paper is the convex programming (CP) problem given in the form of

    ψ* := min_{x∈X} { ψ(x) := (1/m) Σ_{i=1}^m f_i(x) + h(x) }.    (1.1)

Here, X ⊆ R^n is a closed convex set, the component functions f_i : X → R, i = 1, . . . , m, are smooth and convex with L_i-Lipschitz continuous gradients over X, i.e., there exist L_i ≥ 0 such that

    ‖∇f_i(x_1) − ∇f_i(x_2)‖_* ≤ L_i ‖x_1 − x_2‖, ∀ x_1, x_2 ∈ X,    (1.2)

and h : X → R is a relatively simple but possibly nonsmooth convex function. For notational convenience, we denote f(x) := (1/m) Σ_{i=1}^m f_i(x) and L := (1/m) Σ_{i=1}^m L_i. It is easy to see that f has L-Lipschitz continuous gradients, i.e., for some L_f ≥ 0, ‖∇f(x_1) − ∇f(x_2)‖_* ≤ L_f ‖x_1 − x_2‖ ≤ L ‖x_1 − x_2‖, ∀ x_1, x_2 ∈ X. It should be pointed out that it is not necessary to assume h being strongly convex.
Instead, we assume that f is possibly strongly convex with modulus μ ≥ 0.

We also consider a class of stochastic finite-sum optimization problems given by

    ψ* := min_{x∈X} { ψ(x) := (1/m) Σ_{i=1}^m E_{ξ_i}[F_i(x, ξ_i)] + h(x) },    (1.3)

where the ξ_i's are random variables with support Ξ_i ⊆ R^d.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

It can be easily seen that (1.3) is a special case of (1.1) with f_i = E_{ξ_i}[F_i(x, ξ_i)], i = 1, . . . , m. However, different from deterministic finite-sum optimization problems, only noisy gradient information of each component function f_i can be accessed for the stochastic finite-sum optimization problem in (1.3). In particular, (1.3) models generalization risk minimization in distributed machine learning problems.

Finite-sum optimization given in the form of (1.1) or (1.3) has recently found a wide range of applications in machine learning (ML), statistical inference, and image processing, and hence has become the subject of intensive studies during the past few years. In centralized ML, f_i usually denotes the loss generated by a single data point, while in distributed ML, it may correspond to the loss function of an agent i, which is connected to other agents in a distributed network.

Recently, randomized incremental gradient (RIG) methods have emerged as an important class of first-order methods for finite-sum optimization (e.g., [5, 14, 27, 9, 24, 18, 1, 2, 13, 20, 19]). In an important work, [24] (see [5] for a precursor) showed that by incorporating new gradient estimators into stochastic gradient descent (SGD) one can possibly achieve a linear rate of convergence for smooth and strongly convex finite-sum optimization.
Inspired by this work, [14] proposed the stochastic variance reduced gradient (SVRG) method, which incorporates a novel stochastic estimator of ∇f(x_{t−1}). More specifically, each epoch of SVRG starts with the computation of the exact gradient g̃ = ∇f(x̃) for a given x̃ ∈ R^n and then runs SGD for a fixed number of steps using the gradient estimator

    G_t = (∇f_{i_t}(x_{t−1}) − ∇f_{i_t}(x̃)) + g̃,

where i_t is a random variable with support on {1, . . . , m}. They show that the variance of G_t vanishes as the algorithm proceeds, and hence SVRG exhibits an improved linear rate of convergence, i.e., O{(m + L/μ) log(1/ε)}, for smooth and strongly convex finite-sum problems. See [27, 9] for the same complexity result. Moreover, [2] show that by doubling the epoch length SVRG obtains an O{m log(1/ε) + L/ε} complexity bound for smooth convex finite-sum optimization.

Observe that the aforementioned variance reduction methods are not accelerated, and hence they are not optimal even when the number of components m = 1. Therefore, much recent research effort has been devoted to the design of optimal RIG methods. In fact, [18] established a lower complexity bound for RIG methods by showing that whenever the dimension is large enough, the number of gradient evaluations required by any RIG method to find an ε-solution of a smooth and strongly convex finite-sum problem, i.e., a point x̄ ∈ X s.t.
E[‖x̄ − x*‖²₂] ≤ ε, cannot be smaller than

    Ω( (m + √(mL/μ)) log(1/ε) ).    (1.4)

As can be seen from Table 1, existing accelerated RIG methods are optimal for solving smooth and strongly convex finite-sum problems, since their complexity matches the lower bound in (1.4).

Notwithstanding these recent progresses, there still remain a few significant issues in the development of accelerated RIG methods. Firstly, as pointed out by [25], existing RIG methods can only establish accelerated linear convergence based on the assumption that the regularizer h is strongly convex, and fail to benefit from the strong convexity of the data-fidelity term [26]. This restrictive assumption does not apply to many important applications (e.g., Lasso models) where the loss function, rather than the regularization term, may be strongly convex. Specifically, when dealing with the case that only f is strongly convex but not h, one may not be able to shift the strong convexity of f, by subtracting and adding a strongly convex term, to construct a simple strongly convex term h in the objective function. In fact, even if f is strongly convex, some of the component functions f_i may only be convex, and hence these f_i's may become nonconvex after subtracting a strongly convex term. Secondly, if the strong convexity modulus μ becomes very small, the complexity bounds of all existing RIG methods will go to +∞ (see column 2 of Table 1), indicating that they are not robust against problem ill-conditioning. Thirdly, for solving smooth problems without strong convexity, one has to add a strongly convex perturbation to the objective function in order to gain up to a factor of √m over Nesterov's accelerated gradient method in terms of gradient computation (see column 3 of Table 1).
One significant difficulty of this indirect approach is that we do not know how to choose the perturbation parameter properly, especially for problems with an unbounded feasible region (see [2] for a discussion about a similar issue related to SVRG applied to non-strongly convex problems). However, if one chose not to add the strongly convex perturbation term, the best-known complexity would be given by Katyusha^ns [1], which is not more advantageous than Nesterov's original method. In other words, it does not gain much from randomization in terms of computational complexity.

Finally, it should be pointed out that only a few existing RIG methods, e.g., RGEM [19] and [16], can be applied to solve stochastic finite-sum optimization problems, where one can only access the stochastic gradient of f_i via a stochastic first-order oracle (SFO).

Table 1: Summary of the recent results on accelerated RIG methods

  Algorithm        | Deterministic smooth strongly convex | Deterministic smooth convex
  RPDG [18]        | O{(m + √(mL/μ)) log(1/ε)}            | O{(m + √(mL/ε)) log(1/ε)} ¹
  Catalyst [20]    | O{(m + √(mL/μ)) log(1/ε)}            | O{(m + √(mL/ε)) log²(1/ε)} ¹
  Katyusha [1]     | O{(m + √(mL/μ)) log(1/ε)}            | O{m log(1/ε) + √(mL/ε)} ¹
  Katyusha^ns [1]  | NA                                   | O{m log(1/ε) + m√(L/ε)}
  RGEM [19]        | O{(m + √(mL/μ)) log(1/ε)}            | NA

Our contributions. In this paper, we propose a novel accelerated variance reduction type method, namely the variance-reduced accelerated gradient (Varag) method, to solve smooth finite-sum optimization problems given in the form of (1.1).
Table 2 summarizes the main convergence results achieved by our Varag algorithm.

Table 2: Summary of the main convergence results for Varag

  Problem: smooth optimization problems (1.1), with or without strong convexity

  Relations of m, 1/ε and L/μ   | Unified results
  m ≥ D0/ε ² or m ≥ 3L/(4μ)     | O{m log(D0/ε)}
  m < D0/ε ≤ 3L/(4μ)            | O{m log m + √(mD0/ε)}
  m < 3L/(4μ) ≤ D0/ε            | O{m log m + √(mL/μ) log((D0/ε)/(3L/(4μ)))} ³

Firstly, for smooth convex finite-sum optimization, our proposed method exploits a direct acceleration scheme instead of employing any perturbation or restarting techniques to obtain the desired optimal convergence results. As shown in the first two rows of Table 2, Varag achieves the optimal rate of convergence if the number of component functions m is relatively small and/or the required accuracy is high, while it exhibits a fast linear rate of convergence when the number of component functions m is relatively large and/or the required accuracy is low, without requiring any strong convexity assumptions. To the best of our knowledge, this is the first time that these complexity bounds have been obtained through a direct acceleration scheme for smooth convex finite-sum optimization in the literature. In comparison with existing methods using perturbation techniques, Varag does not need to know the target accuracy or the diameter of the feasible region a priori, and thus can be used to solve a much wider class of smooth convex problems, e.g., those with unbounded feasible sets.

Secondly, we equip Varag with a unified step-size policy for smooth convex optimization no matter whether (1.1) is strongly convex or not, i.e., the strong convexity modulus μ ≥ 0.
With this step-size policy, Varag can adjust to different classes of problems to achieve the best convergence results, without knowing the target accuracy and/or fixing the number of epochs. In particular, as shown in the last column of Table 2, when μ is relatively large, Varag achieves the well-known optimal linear rate of convergence. If μ is relatively small, e.g., μ < ε, it obtains an accelerated convergence rate that is independent of the condition number L/μ. Therefore, Varag is robust against ill-conditioning of problem (1.1). Moreover, our assumptions on the objective function are more general compared to those used by other RIG methods, such as RPDG and Katyusha. Specifically, Varag does not require keeping a strongly convex regularization term in the projection, and so we can assume that the strong convexity is associated with the smooth function f instead of the simple proximal function h(·). Some other advantages of Varag over existing accelerated SVRG methods, e.g., Katyusha, include that it only requires the solution of one, rather than two, subproblems, and that it allows the application of non-Euclidean Bregman distances for solving all different classes of problems.

¹ These complexity bounds are obtained via indirect approaches, i.e., by adding strongly convex perturbation.
² D0 = 2[ψ(x0) − ψ(x*)] + 3L V(x0, x*), where x0 is the initial point, x* is the optimal solution of (1.1), and V is defined in (1.5).
³ Note that this term is less than O{√(mL/μ) log(1/ε)}.

Finally, we extend Varag to solve two more general classes of finite-sum optimization problems. We demonstrate that Varag is the first randomized method that achieves the accelerated linear rate of convergence when solving the class of problems that satisfies a certain error-bound condition rather than strong convexity.
We then show that Varag can also be applied to solve stochastic smooth finite-sum optimization problems, resulting in a sublinear rate of convergence.

This paper is organized as follows. In Section 2, we present our proposed algorithm Varag and its convergence results for solving (1.1) under different problem settings. In Section 3 we provide extensive experimental results to demonstrate the advantages of Varag over several state-of-the-art methods for solving some well-known ML models, e.g., logistic regression, Lasso, etc. We defer the proofs of the main results to Appendix A.

Notation and terminology. We use ‖·‖ to denote a general norm in R^n without specific mention, and ‖·‖_* to denote the conjugate norm of ‖·‖. For any p ≥ 1, ‖·‖_p denotes the standard p-norm in R^n, i.e., ‖x‖_p^p = Σ_{i=1}^n |x_i|^p, for any x ∈ R^n. For a given strongly convex function w : X → R with modulus 1 w.r.t. an arbitrary norm ‖·‖, we define a prox-function associated with w as

    V(x0, x) ≡ V_w(x0, x) := w(x) − [w(x0) + ⟨w'(x0), x − x0⟩],    (1.5)

where w'(x0) ∈ ∂w(x0) is any subgradient of w at x0. By the strong convexity of w, we have

    V(x0, x) ≥ (1/2) ‖x − x0‖², ∀ x, x0 ∈ X.    (1.6)

Notice that V(·,·) described above is different from the standard definition of the Bregman distance [6, 3, 4, 15, 7] in the sense that w is not necessarily differentiable. Throughout this paper, we assume that the prox-mapping associated with X and h, given by

    argmin_{x∈X} { γ [⟨g, x⟩ + h(x) + μ V(x̄, x)] + V(x0, x) },    (1.7)

can be easily computed for any x̄, x0 ∈ X, g ∈ R^n, μ ≥ 0, γ > 0. We denote the logarithm with base 2 as log.
For any real number r, ⌈r⌉ and ⌊r⌋ denote the ceiling and floor of r, respectively.

2 Algorithms and main results

This section contains two subsections. We first present in Subsection 2.1 a unified optimal Varag for solving the finite-sum problem given in (1.1), as well as its optimal convergence results. Subsection 2.2 is devoted to the discussion of several extensions of Varag. Throughout this section, we assume that each component function f_i is smooth with L_i-Lipschitz continuous gradients over X, i.e., (1.2) holds for all component functions. Moreover, we assume that the objective function ψ(x) is possibly strongly convex; in particular, for f(x) = (1/m) Σ_{i=1}^m f_i(x), there exists μ ≥ 0 s.t.

    f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + μ V(x, y), ∀ x, y ∈ X.    (2.1)

Note that we assume the strong convexity of ψ comes from f, and the simple function h is not necessarily strongly convex. Clearly the strong convexity of h, if any, can be shifted to f, since h is assumed to be simple and its structural information is transparent to us. Also observe that (2.1) is defined based on a generalized Bregman distance, and together with (1.6) they imply the standard definition of strong convexity w.r.t. the Euclidean norm.

2.1 Varag for convex finite-sum optimization

The basic scheme of Varag is formally described in Algorithm 1. In each epoch (or outer loop), it first computes the full gradient ∇f(x̃) at the point x̃ (cf. Line 3), which will then be repeatedly used to define a gradient estimator G_t at each iteration of the inner loop (cf. Line 8). This is the well-known variance reduction technique employed by many algorithms (e.g., [14, 27, 1, 13]). The inner loop has a similar algorithmic scheme to the accelerated stochastic approximation algorithm [17, 11, 12] with a constant step-size policy.
Indeed, the parameters used in the inner loop, i.e., {γ_s}, {α_s}, and {p_s}, only depend on the index of the epoch s. Each iteration of the inner loop requires the gradient information of only one randomly selected component function f_{i_t}, and maintains three primal sequences, {x̲_t}, {x_t} and {x̄_t}, which play an important role in the acceleration scheme.

Algorithm 1 The variance-reduced accelerated gradient (Varag) method

Input: x0 ∈ X, {T_s}, {γ_s}, {α_s}, {p_s}, {θ_t}, and a probability distribution Q = {q_1, . . . , q_m} on {1, . . . , m}.
1: Set x̃^0 = x0.
2: for s = 1, 2, . . . do
3:   Set x̃ = x̃^{s−1} and g̃ = ∇f(x̃).
4:   Set x_0 = x^{s−1}, x̄_0 = x̃ and T = T_s.
5:   for t = 1, 2, . . . , T do
6:     Pick i_t ∈ {1, . . . , m} randomly according to Q.
7:     x̲_t = [(1 + μγ_s)(1 − α_s − p_s) x̄_{t−1} + α_s x_{t−1} + (1 + μγ_s) p_s x̃] / [1 + μγ_s(1 − α_s)].
8:     G_t = (∇f_{i_t}(x̲_t) − ∇f_{i_t}(x̃)) / (q_{i_t} m) + g̃.
9:     x_t = argmin_{x∈X} { γ_s [⟨G_t, x⟩ + h(x) + μ V(x̲_t, x)] + V(x_{t−1}, x) }.
10:    x̄_t = (1 − α_s − p_s) x̄_{t−1} + α_s x_t + p_s x̃.
11:  end for
12:  Set x^s = x_T and x̃^s = Σ_{t=1}^T (θ_t x̄_t) / Σ_{t=1}^T θ_t.
13: end for

Note that Varag is closely related to the stochastic mirror descent method [22, 23] and SVRG [14, 27]. By setting α_s = 1 and p_s = 0, Algorithm 1 simply combines the variance reduction technique with stochastic mirror descent.
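The epoch structure above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: it takes the unconstrained Euclidean case (h = 0, X = R^n, so Line 9 reduces to a gradient step), fixes μ = 0, uses uniform sampling, equal averaging weights θ_t (which agrees with (2.2) when α_s + p_s = 1), and fixed parameters on synthetic least-squares data rather than the schedules of the theorems below:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative smooth finite sum: f_i(x) = 0.5 * (a_i^T x - b_i)^2, with a
# consistent system so that psi* = 0 (synthetic data, not from the paper).
m, n = 50, 10
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n)

def grad_i(i, x):                 # gradient of the i-th component f_i
    return A[i] * (A[i] @ x - b[i])

def full_grad(x):                 # gradient of f = (1/m) sum_i f_i
    return A.T @ (A @ x - b) / m

def psi(x):                       # objective value (h = 0 here)
    return 0.5 * np.mean((A @ x - b) ** 2)

L = np.sum(A * A, axis=1).max()   # crude bound on the component Lipschitz constants

def varag_epoch(x_prev, x_tilde, T, alpha, gamma, p):
    """One epoch of Algorithm 1 with mu = 0, h = 0, X = R^n (Euclidean prox)."""
    g_tilde = full_grad(x_tilde)                  # Line 3: exact gradient at x~
    x, x_bar = x_prev.copy(), x_tilde.copy()      # Line 4
    bar_sum = np.zeros_like(x_prev)
    for _ in range(T):
        i = rng.integers(m)                       # Line 6 (uniform Q, q_i = 1/m)
        x_under = (1 - alpha - p) * x_bar + alpha * x + p * x_tilde  # Line 7, mu = 0
        G = grad_i(i, x_under) - grad_i(i, x_tilde) + g_tilde        # Line 8, q_i*m = 1
        x = x - gamma * G                         # Line 9: Euclidean prox step
        x_bar = (1 - alpha - p) * x_bar + alpha * x + p * x_tilde    # Line 10
        bar_sum += x_bar
    return x, bar_sum / T                         # Line 12, equal weights theta_t

x = x_tilde = np.zeros(n)
alpha = p = 0.5
gamma = 1.0 / (3 * L * alpha)
for _ in range(30):
    x, x_tilde = varag_epoch(x, x_tilde, T=2 * m, alpha=alpha, gamma=gamma, p=p)
```

Since α_s + p_s = 1 here, the averaging weights in (2.2) are constant, so the plain average of the x̄_t is consistent with Line 12; the doubling epoch lengths and decaying α_s of the theorems below are omitted for brevity.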
In this case, the algorithm only maintains one primal sequence {x_t} and possesses the non-accelerated rate of convergence O{(m + L/μ) log(1/ε)} for solving (1.1). Interestingly, if we use the Euclidean distance instead of the prox-function V(·,·) to update x_t and set X = R^n, Algorithm 1 further reduces to prox-SVRG proposed in [27].

It is also interesting to observe the difference between Varag and Katyusha [1], because both are accelerated variance reduction methods. Firstly, while Katyusha needs to assume that the strongly convex term is specified in the form of a simple proximal function, e.g., an ℓ1/ℓ2-regularizer, Varag assumes that f is possibly strongly convex, which solves an open issue of the existing accelerated RIG methods pointed out by [25]. Therefore, the momentum steps in Lines 7 and 10 are different from Katyusha. Secondly, Varag has a less computationally expensive algorithmic scheme. In particular, Varag only needs to solve one proximal mapping (cf. Line 9) per iteration even if f is strongly convex, while Katyusha requires solving two proximal mappings per iteration. Thirdly, Varag incorporates the prox-function V defined in (1.5), rather than the Euclidean distance, in the proximal mapping to update x_t. This allows the algorithm to take advantage of the geometry of the constraint set X when performing projections. However, Katyusha cannot be fully adapted to the non-Euclidean setting because its second proximal mapping must be defined using the Euclidean distance regardless of the strong convexity of ψ. Finally, we will show in this section that Varag can achieve a much better rate of convergence than Katyusha for smooth convex finite-sum optimization by using a novel approach to specify the step-size and to schedule the epoch length.

We first discuss the case when f is not necessarily strongly convex, i.e., μ = 0 in (2.1).
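The role of the prox-function V in the proximal mapping can be made concrete with a small example of our own (not from the paper): over the probability simplex, choosing w as the negative entropy makes V the KL divergence, and the resulting proximal step has a closed-form multiplicative update that respects the geometry of the set, while the plain Euclidean step does not:

```python
import numpy as np

def euclidean_prox(g, x0, gamma):
    # argmin_x { gamma*<g, x> + 0.5*||x - x0||^2 } over R^n: a plain gradient step
    return x0 - gamma * g

def entropy_prox(g, x0, gamma):
    # argmin over the probability simplex of gamma*<g, x> + KL(x || x0):
    # the classic closed-form multiplicative (mirror-descent) update
    y = x0 * np.exp(-gamma * g)
    return y / y.sum()

x0 = np.full(4, 0.25)                 # start at the center of the simplex
g = np.array([1.0, 0.0, 0.0, 0.0])    # gradient penalizing the first coordinate

x_ent = entropy_prox(g, x0, 0.5)      # stays on the simplex by construction
x_euc = euclidean_prox(g, x0, 0.5)    # leaves the simplex (x_euc[0] < 0), so an
                                      # extra projection onto X would be required
```

This is exactly the kind of geometry-aware projection that the update of x_t in Line 9 can exploit when V is non-Euclidean.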
In Theorem 1, we suggest one way to specify the algorithmic parameters, including {q_i}, {θ_t}, {α_s}, {γ_s}, {p_s} and {T_s}, for Varag to solve smooth convex problems given in the form of (1.1), and discuss the convergence properties of the resulting algorithm. We defer the proof of this result to Appendix A.1.

Theorem 1 (Smooth finite-sum optimization) Suppose that the probabilities q_i are set to L_i / Σ_{i=1}^m L_i for i = 1, . . . , m, and the weights {θ_t} are set as

    θ_t = (γ_s/α_s)(α_s + p_s)  for 1 ≤ t ≤ T_s − 1,   θ_t = γ_s/α_s  for t = T_s.    (2.2)

Moreover, let us denote s0 := ⌊log m⌋ + 1 and set the parameters {T_s}, {γ_s} and {p_s} as

    T_s = 2^{s−1} if s ≤ s0,  T_s = T_{s0} if s > s0;   γ_s = 1/(3Lα_s);   p_s = 1/2,    (2.3)

with

    α_s = 1/2 if s ≤ s0,   α_s = 2/(s − s0 + 4) if s > s0.    (2.4)

Then the total number of gradient evaluations of f_i performed by Algorithm 1 to find a stochastic ε-solution of (1.1), i.e., a point x̄ ∈ X s.t. E[ψ(x̄) − ψ*] ≤ ε, can be bounded by

    N̄ := O{m log(D0/ε)}            if m ≥ D0/ε,
    N̄ := O{m log m + √(mD0/ε)}     if m < D0/ε,    (2.5)

where D0 is defined as

    D0 := 2[ψ(x0) − ψ(x*)] + 3L V(x0, x*).    (2.6)

We now make a few observations regarding the results obtained in Theorem 1. Firstly, as mentioned earlier, whenever the required accuracy ε is low and/or the number of components m is large, Varag can achieve a fast linear rate of convergence even under the assumption that the objective function is not strongly convex.
Otherwise, Varag achieves an optimal sublinear rate of convergence with complexity bounded by O{√(mD0/ε) + m log m}. Secondly, whenever √(mD0/ε) is dominating in the second case of (2.5), Varag can save up to O(√m) gradient evaluations of f_i over the optimal deterministic first-order methods for solving (1.1). To the best of our knowledge, Varag is the first accelerated RIG method in the literature to obtain such convergence results by directly solving (1.1). Other existing accelerated RIG methods, such as RPDG [18] and Katyusha [1], require the application of perturbation and restarting techniques to obtain such convergence results. Thirdly, Varag also supports a mini-batch approach, where the component function f_i is associated with a mini-batch of data samples instead of a single data sample. In a more general case, for a given mini-batch size b, we assume that the component functions can be split into subsets where each subset contains exactly b component functions. Therefore, one can replace Line 8 in Algorithm 1 by G_t = (1/b) Σ_{i_t ∈ S_b} (∇f_{i_t}(x̲_t) − ∇f_{i_t}(x̃)) / (q_{i_t} m) + g̃, with S_b being the selected subset and |S_b| = b, and adjust the appropriate parameters to obtain the mini-batch version of Varag. The mini-batch Varag can obtain parallel linear speedup of factor b whenever the mini-batch size b ≤ √m.

Next we consider the case when f is possibly strongly convex, including the situation when the problem is almost not strongly convex, i.e., μ ≈ 0. In the latter case, the term √(mL/μ) log(1/ε) will be dominating in the complexity of existing accelerated RIG methods (e.g., [18, 19, 1, 20]) and will tend to ∞ as μ decreases. Therefore, these complexity bounds are significantly worse than (2.5), which is obtained by simply treating (1.1) as a smooth convex problem. Moreover, μ ≈ 0 is very common in ML applications.
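The schedule of Theorem 1 is explicit enough to transcribe directly. The helper below is a sketch of (2.3)–(2.4): epoch lengths double until s0 = ⌊log m⌋ + 1 (recall the paper takes log to base 2) and are then capped, while α_s decays afterwards; the weights θ_t in (2.2) follow from these values:

```python
import math

def varag_params(s, m, L):
    """Parameter schedule (2.3)-(2.4) of Theorem 1 (the mu = 0 case)."""
    s0 = math.floor(math.log2(m)) + 1                 # s0 = floor(log m) + 1, log base 2
    T = 2 ** (s - 1) if s <= s0 else 2 ** (s0 - 1)    # epoch length doubles, then is capped
    alpha = 0.5 if s <= s0 else 2.0 / (s - s0 + 4)    # momentum weight decays after s0
    gamma = 1.0 / (3 * L * alpha)                     # step-size tied to alpha_s
    p = 0.5
    return T, alpha, gamma, p
```

For example, with m = 8 one has s0 = 4, so epochs 1–4 have lengths 1, 2, 4, 8 with α_s = 1/2, and from epoch 5 on the length stays at 8 while α_s shrinks.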
In Theorem 2, we provide a unified step-size policy which allows Varag to achieve the optimal rate of convergence for finite-sum optimization in (1.1) regardless of its strong convexity, and hence it can achieve a stronger rate of convergence than existing accelerated RIG methods if the condition number L/μ is very large. The proof of this result can be found in Appendix A.2.

Theorem 2 (A unified result for convex finite-sum optimization) Suppose that the probabilities q_i are set to L_i / Σ_{i=1}^m L_i for i = 1, . . . , m. Moreover, let us denote s0 := ⌊log m⌋ + 1 and assume that the weights {θ_t} are set to (2.2) if 1 ≤ s ≤ s0, or if s0 < s ≤ s0 + √(12L/(mμ)) − 4 and m < 3L/(4μ). Otherwise, they are set to

    θ_t = Γ_{t−1} − (1 − α_s − p_s) Γ_t  for 1 ≤ t ≤ T_s − 1,   θ_t = Γ_{t−1}  for t = T_s,    (2.7)

where Γ_t = (1 + μγ_s)^t. If the parameters {T_s}, {γ_s} and {p_s} are set to (2.3) with

    α_s = 1/2 if s ≤ s0,   α_s = max{ 2/(s − s0 + 4), min{ √(mμ/(3L)), 1/2 } } if s > s0,    (2.8)

then the total number of gradient evaluations of f_i performed by Algorithm 1 to find a stochastic ε-solution of (1.1) can be bounded by

    N̄ := O{m log(D0/ε)}                                  if m ≥ D0/ε or m ≥ 3L/(4μ),
    N̄ := O{m log m + √(mD0/ε)}                           if m < D0/ε ≤ 3L/(4μ),
    N̄ := O{m log m + √(mL/μ) log((D0/ε)/(3L/(4μ)))}      if m < 3L/(4μ) ≤ D0/ε,    (2.9)

where D0 is defined as in (2.6).

Observe that the complexity bound (2.9) is a unified convergence result for Varag to solve deterministic smooth convex finite-sum
optimization problems (1.1). When the strong convexity modulus μ of the objective function is large enough, i.e., 3L/(4μ) ≤ D0/ε, Varag exhibits an optimal linear rate of convergence, since the third case of (2.9) matches the lower bound (1.4) for RIG methods. If μ is relatively small, Varag treats the finite-sum problem (1.1) as a smooth problem without strong convexity, which leads to the same complexity bounds as in Theorem 1. It should be pointed out that the parameter setting proposed in Theorem 2 does not require the values of ε and D0 to be given a priori.

2.2 Generalization of Varag

In this subsection, we extend Varag to solve two more general classes of finite-sum optimization problems, as well as establish its convergence properties for these problems.

Finite-sum problems under error bound condition. We investigate a class of weakly strongly convex problems, i.e., ψ(x) is smooth convex and satisfies the error bound condition given by

    V(x, X*) ≤ (1/μ̄) (ψ(x) − ψ*), ∀ x ∈ X,    (2.10)

where X* denotes the set of optimal solutions of (1.1). Many optimization problems satisfy (2.10), for instance, linear systems, quadratic programs, linear matrix inequalities and composite problems (outer: strongly convex, inner: polyhedral functions); see [8] and Section 6 of [21] for more examples. Although these problems are not strongly convex, by properly restarting Varag we can solve them with an accelerated linear rate of convergence, the best-known complexity result for this class of problems so far. We formally present the result in Theorem 3, whose proof is given in Appendix A.3.

Theorem 3 (Convex finite-sum optimization under error bound) Assume that the probabilities q_i are set to L_i / Σ_{i=1}^m L_i for i = 1, . . . , m, and θ_t are defined as (2.2).
Moreover, let us set the parameters {γ_s}, {p_s} and {α_s} as in (2.3) and (2.4), with {T_s} being set as

    T_s = T_1 2^{s−1} if s ≤ 4,   T_s = 8 T_1 if s > 4,    (2.11)

where T_1 = min{m, L/μ̄}. Then under condition (2.10), for any x* ∈ X* and s = 4 + 4√(L/(μ̄m)),

    E[ψ(x̃^s) − ψ(x*)] ≤ (5/16) [ψ(x0) − ψ(x*)].    (2.12)

Moreover, if we restart Varag every time it runs s iterations, for k = log((ψ(x0) − ψ(x*))/ε) times, the total number of gradient evaluations of f_i to find a stochastic ε-solution of (1.1) can be bounded by

    N̄ := k (Σ_s (m + T_s)) = O{ (m + √(mL/μ̄)) log((ψ(x0) − ψ(x*))/ε) }.    (2.13)

Remark 1 Note that Varag can also be extended to obtain a unified result, as in Theorem 2, for solving finite-sum problems under the error bound condition. In particular, if the condition number is very large, i.e., s = O{L/(μ̄m)} ≈ ∞, Varag will never be restarted, and the resulting complexity bounds reduce to the case of solving smooth convex problems provided in Theorem 1.

Stochastic finite-sum optimization. We now consider stochastic smooth convex finite-sum optimization and online learning problems defined as in (1.3), where only noisy gradient information of f_i can be accessed via an SFO oracle. In particular, for any x ∈ X, the SFO oracle outputs a vector G_i(x, ξ_j) such that

    E_{ξ_j}[G_i(x, ξ_j)] = ∇f_i(x),   E_{ξ_j}[‖G_i(x, ξ_j) − ∇f_i(x)‖²_*] ≤ σ², i = 1, . . .
, m.    (2.14), (2.15)

We present the variant of Varag for stochastic finite-sum optimization in Algorithm 2, as well as its convergence results in Theorem 4, whose proof can be found in Appendix B.

Algorithm 2 Stochastic accelerated variance-reduced stochastic gradient descent (Stochastic Varag)

This algorithm is the same as Algorithm 1 except that, for given batch-size parameters B_s and b_s, Line 3 is replaced by x̃ = x̃^{s−1} and

    g̃ = (1/m) Σ_{i=1}^m G_i(x̃),  with  G_i(x̃) := (1/B_s) Σ_{j=1}^{B_s} G_i(x̃, ξ^s_j),    (2.16)

and Line 8 is replaced by

    G_t = (1/(q_{i_t} m b_s)) Σ_{k=1}^{b_s} ( G_{i_t}(x̲_t, ξ^s_k) − G_{i_t}(x̃) ) + g̃.    (2.17)

Theorem 4 (Stochastic smooth finite-sum optimization) Assume that θ_t are defined as in (2.2), C := Σ_{i=1}^m 1/(q_i m²), and the probabilities q_i are set to L_i / Σ_{i=1}^m L_i for i = 1, . . . , m. Moreover, let us denote s0 := ⌊log m⌋ + 1 and set T_s, α_s, γ_s and p_s as in (2.3) and (2.4). Then the number of calls to the SFO oracle required by Algorithm 2 to find a stochastic ε-solution of (1.1) can be bounded by

    N_SFO = Σ_s (m B_s + T_s b_s) = O{ mCσ²/(Lε) }   if m ≥ D0/ε,
    N_SFO = Σ_s (m B_s + T_s b_s) = O{ Cσ²D0/(Lε²) } if m < D0/ε,    (2.18)

where D0 is given in (2.6).

Remark 2 Note that the constant C in (2.18) can be easily upper bounded by L/min_i{L_i}, and C = 1 if L_i = L for all i. To the best of our knowledge, among the few existing RIG methods that can be applied to solve the class of stochastic finite-sum problems, Varag is the first to achieve such complexity results as in (2.18) for smooth convex problems.
RGEM [19] obtains a nearly optimal rate of convergence for the strongly convex case, but cannot solve stochastic smooth problems directly, and [16] requires a specific initial point, i.e., an exact solution to a proximal mapping depending on the variance $\sigma^2$, to achieve an $O\{m \log m + \sigma^2/\epsilon^2\}$ rate of convergence for smooth convex problems.

3 Numerical experiments

In this section, we demonstrate the advantages of our proposed algorithm, Varag, over several state-of-the-art algorithms, e.g., SVRG++ [2] and Katyusha [1], by solving several well-known machine learning models. For all experiments, we use public real datasets downloaded from the UCI Machine Learning Repository [10] and a uniform sampling strategy to select $f_i$. Indeed, the theory suggests that the sampling distribution should be non-uniform, i.e., $q_i = L_i/\sum_{i=1}^m L_i$, which results in the optimal constant $L$ appearing in the convergence results. However, a uniform sampling strategy only leads to a constant factor slightly larger than $L = \frac{1}{m}\sum_{i=1}^m L_i$. Moreover, it is computationally efficient to estimate $L_i$ by computing the maximum singular value of the Hessian, since only a rough estimate suffices.

Unconstrained smooth convex problems. We first investigate unconstrained logistic models, which cannot be solved via the perturbation approach due to the unboundedness of the feasible set. More specifically, we applied Varag, SVRG++ and Katyusha$^{\text{ns}}$ to solve the logistic regression problem
$$\min_{x \in \mathbb{R}^n} \big\{\psi(x) := \tfrac{1}{m}\textstyle\sum_{i=1}^m f_i(x)\big\} \quad \text{where } f_i(x) := \log(1 + \exp(-b_i a_i^T x)). \tag{3.1}$$
Here $(a_i, b_i) \in \mathbb{R}^n \times \{-1, 1\}$ is a training data point and $m$ is the sample size; hence each $f_i$ corresponds to the loss generated by a single training data point.
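As an illustration of the finite-sum objective (3.1) and the non-uniform sampling weights discussed above, the following self-contained Python sketch computes the per-sample logistic loss and gradient, and the probabilities $q_i = L_i/\sum_j L_j$ using the standard smoothness bound $L_i = \|a_i\|^2/4$ for the logistic loss (this bound is well known but not stated in the text; the function names are ours).

```python
import math
from typing import List, Sequence, Tuple


def logistic_loss_and_grad(
    x: Sequence[float], a_i: Sequence[float], b_i: float
) -> Tuple[float, List[float]]:
    """Loss and gradient of f_i(x) = log(1 + exp(-b_i <a_i, x>)), as in (3.1)."""
    z = -b_i * sum(aj * xj for aj, xj in zip(a_i, x))
    loss = math.log1p(math.exp(z))          # log(1 + e^z), stable for small e^z
    sig = 1.0 / (1.0 + math.exp(-z))        # sigmoid(z) = e^z / (1 + e^z)
    grad = [-b_i * sig * aj for aj in a_i]  # chain rule through z = -b_i a_i^T x
    return loss, grad


def sampling_probabilities(A: Sequence[Sequence[float]]) -> List[float]:
    """Non-uniform weights q_i = L_i / sum_j L_j, with L_i = ||a_i||^2 / 4,
    the Lipschitz constant of the gradient of the logistic loss for sample a_i."""
    L = [sum(aj * aj for aj in a_i) / 4.0 for a_i in A]
    total = sum(L)
    return [Li / total for Li in L]
```

In an experiment like the one reported here, one would either sample $i$ with these probabilities (yielding the constant $L$ in the bounds) or uniformly, at the cost of the slightly larger constant factor mentioned above.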
As we can see from Figure 1, Varag converges much faster than SVRG++ and Katyusha$^{\text{ns}}$ in terms of training loss.

Figure 1: Training loss on Diabetes ($m = 1151$) and Breast Cancer Wisconsin ($m = 683$) for unconstrained logistic regression. The algorithmic parameters for SVRG++ and Katyusha$^{\text{ns}}$ are set according to [2] and [1], respectively, and those for Varag are set as in Theorem 1.

Strongly convex loss with simple convex regularizer. We now study the class of Lasso regression problems with $\lambda$ as the regularization coefficient, given in the following form:
$$\min_{x \in \mathbb{R}^n} \big\{\psi(x) := \tfrac{1}{m}\textstyle\sum_{i=1}^m f_i(x) + h(x)\big\} \quad \text{where } f_i(x) := \tfrac{1}{2}(a_i^T x - b_i)^2, \; h(x) := \lambda\|x\|_1. \tag{3.2}$$
Due to the assumption SVRG++ and Katyusha impose on the objective function, namely that strong convexity can only be associated with the regularizer, these methods always treat Lasso as a smooth problem [25], while Varag can treat Lasso as a strongly convex problem. As can be seen from Figure 2, Varag outperforms SVRG++ and Katyusha$^{\text{ns}}$ in terms of training loss.

Figure 2: Training loss on Diabetes ($m = 1151$) and Breast Cancer Wisconsin ($m = 683$) for Lasso with $\lambda = 0.001$. The algorithmic parameters for SVRG++ and Katyusha$^{\text{ns}}$ are set according to [2] and [1], respectively, and those for Varag are set as in Theorem 2.

Weakly strongly convex problems satisfying the error bound condition. Let us consider a special class of finite-sum convex quadratic problems given in the following form:
$$\min_{x \in \mathbb{R}^n} \big\{\psi(x) := \tfrac{1}{m}\textstyle\sum_{i=1}^m f_i(x)\big\} \quad \text{where } f_i(x) := \tfrac{1}{2}x^T Q_i x + q_i^T x. \tag{3.3}$$
Here $q_i = -Q_i x_s$ and $x_s$ is a solution to the symmetric linear system $Q_i x + q_i = 0$ with $Q_i \succeq 0$. [8, Section 6] and [21, Section 6.1] proved that (3.3) belongs to the class of weakly strongly convex problems satisfying the error bound condition (2.10). For a given solution $x_s$, we use the following real datasets to generate $Q_i$ and $q_i$. We then compare the performance of Varag with the fast gradient method (FGM) proposed in [21]. As shown in Figure 3, Varag outperforms FGM in all cases, and as the number of component functions $m$ increases, Varag demonstrates more advantages over FGM. These numerical results are consistent with the theoretical complexity bound (2.13), which suggests that Varag can save up to $O\{\sqrt{m}\}$ gradient computations compared with deterministic algorithms such as FGM.

Figure 3: Results on Diabetes ($m = 1151$) and Parkinsons Telemonitoring ($m = 5875$). The algorithmic parameters for FGM and Varag are set according to [21] and Theorem 3, respectively.

More numerical experiments on another problem class, strongly convex problems with a small strong convexity modulus, can be found in Appendix C.

References

[1] Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. ArXiv e-prints, abs/1603.05953, 2016.

[2] Zeyuan Allen-Zhu and Yang Yuan. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. In International Conference on Machine Learning, pages 1080–1089, 2016.

[3] A. Auslender and M. Teboulle. Interior gradient and proximal methods for convex and conic optimization. SIAM Journal on Optimization, 16:697–725, 2006.

[4] H.H. Bauschke, J.M. Borwein, and P.L. Combettes. Bregman monotone optimization algorithms. SIAM Journal on Control and Optimization, 42:596–636, 2003.

[5] D. Blatt, A. Hero, and H. Gauchman. A convergent incremental gradient method with a constant step size. SIAM Journal on Optimization, 18(1):29–51, 2007.

[6] L.M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7:200–217, 1967.

[7] Yair Censor and Arnold Lent. An iterative row-action method for interval convex programming. Journal of Optimization Theory and Applications, 34(3):321–353, 1981.

[8] Cong D. Dang, Guanghui Lan, and Zaiwen Wen. Linearly convergent first-order algorithms for semidefinite programming. Journal of Computational Mathematics, 35(4):452–468, 2017.

[9] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. Advances in Neural Information Processing Systems (NIPS), 27, 2014.

[10] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.

[11] S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, I: A generic algorithmic framework. SIAM Journal on Optimization, 22:1469–1492, 2012.

[12] S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: Shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23:2061–2089, 2013.

[13] Elad Hazan and Haipeng Luo. Variance-reduced and projection-free stochastic optimization. CoRR, abs/1602.02101, 2016.

[14] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems (NIPS), 26:315–323, 2013.

[15] K.C. Kiwiel. Proximal minimization methods with generalized Bregman functions. SIAM Journal on Control and Optimization, 35:1142–1168, 1997.

[16] Andrei Kulunchakov and Julien Mairal. Estimate sequences for stochastic composite optimization: Variance reduction, acceleration, and robustness to noise. arXiv preprint arXiv:1901.08788, 2019.

[17] Guanghui Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2):365–397, 2012.

[18] Guanghui Lan and Yi Zhou. An optimal randomized incremental gradient method. Mathematical Programming, pages 1–49, 2017.

[19] Guanghui Lan and Yi Zhou. Random gradient extrapolation for distributed and stochastic optimization. SIAM Journal on Optimization, 28(4):2753–2782, 2018.

[20] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pages 3384–3392, 2015.

[21] Ion Necoara, Yu. Nesterov, and Francois Glineur. Linear convergence of first order methods for non-strongly convex optimization. Mathematical Programming, pages 1–39, 2018.

[22] A.S. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19:1574–1609, 2009.

[23] A.S. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience Series in Discrete Mathematics. John Wiley, XV, 1983.

[24] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.

[25] Junqi Tang, Mohammad Golbabaee, Francis Bach, et al. Rest-Katyusha: Exploiting the solution's structure via scheduled restart schemes. In Advances in Neural Information Processing Systems, pages 429–440, 2018.

[26] Jialei Wang and Lin Xiao. Exploiting strong convexity from data with primal-dual first-order algorithms. In Proceedings of the 34th International Conference on Machine Learning, pages 3694–3702. JMLR.org, 2017.

[27] Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.