{"title": "Non-convex Finite-Sum Optimization Via SCSG Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 2348, "page_last": 2358, "abstract": "We develop a class of algorithms, as variants of the stochastically controlled stochastic gradient (SCSG) methods , for the smooth nonconvex finite-sum optimization problem. Only assuming the smoothness of each component, the complexity of SCSG to reach a stationary point with $E \\|\\nabla f(x)\\|^{2}\\le \\epsilon$ is $O(\\min\\{\\epsilon^{-5/3}, \\epsilon^{-1}n^{2/3}\\})$, which strictly outperforms the stochastic gradient descent. Moreover, SCSG is never worse than the state-of-the-art methods based on variance reduction and it significantly outperforms them when the target accuracy is low. A similar acceleration is also achieved when the functions satisfy the Polyak-Lojasiewicz condition. Empirical experiments demonstrate that SCSG outperforms stochastic gradient methods on training multi-layers neural networks in terms of both training and validation loss.", "full_text": "Non-Convex Finite-Sum Optimization\n\nVia SCSG Methods\n\nLihua Lei\nUC Berkeley\n\nlihua.lei@berkeley.edu\n\nCheng Ju\nUC Berkeley\n\ncju@berkeley.edu\n\nJianbo Chen\nUC Berkeley\n\njianbochen@berkeley.edu\n\nMichael I. Jordan\n\nUC Berkeley\n\njordan@stat.berkeley.edu\n\nAbstract\n\nWe develop a class of algorithms, as variants of the stochastically controlled\nstochastic gradient (SCSG) methods [21], for the smooth non-convex \ufb01nite-\nsum optimization problem. Assuming the smoothness of each component,\nthe complexity of SCSG to reach a stationary point with E(cid:107)\u2207f (x)(cid:107)2 \u2264 \u03b5 is\n\nO(cid:0)min{\u03b5\u22125/3, \u03b5\u22121n2/3}(cid:1), which strictly outperforms the stochastic gradient de-\n\nscent. Moreover, SCSG is never worse than the state-of-the-art methods based\non variance reduction and it signi\ufb01cantly outperforms them when the target ac-\ncuracy is low. 
A similar acceleration is also achieved when the functions satisfy the Polyak-Lojasiewicz condition. Empirical experiments demonstrate that SCSG outperforms stochastic gradient methods on training multi-layer neural networks in terms of both training and validation loss.

1 Introduction

We study smooth non-convex finite-sum optimization problems of the form

min_{x∈R^d} f(x) = (1/n) Σ^n_{i=1} f_i(x)   (1)

where each component f_i(x) is possibly non-convex with Lipschitz gradients. This generic form captures numerous statistical learning problems, ranging from generalized linear models [22] to deep neural networks [19].

In contrast to the convex case, the non-convex case is comparatively under-studied. Early work focused on the asymptotic performance of algorithms [11, 7, 29], with non-asymptotic complexity bounds emerging more recently [24]. In recent years, complexity results have been derived for both gradient methods [13, 2, 8, 9] and stochastic gradient methods [12, 13, 6, 4, 26, 27, 3]. Unlike in the convex case, in the non-convex case one cannot expect a gradient-based algorithm to converge to the global minimum if only smoothness is assumed. As a consequence, instead of measuring function-value suboptimality Ef(x) − inf_x f(x) as in the convex case, convergence is generally measured in terms of the squared norm of the gradient, i.e., E‖∇f(x)‖². We summarize the best available rates¹ in Table 1. We also list the rates for Polyak-Lojasiewicz (P-L) functions, which will be defined in Section 2. The accuracy for minimizing P-L functions is measured by Ef(x) − inf_x f(x).

¹It is also common to use E‖∇f(x)‖ to measure convergence; see, e.g., [2, 8, 9, 3].
Our results can be readily transferred to this alternative measure by using the Cauchy-Schwarz inequality, E‖∇f(x)‖ ≤ √(E‖∇f(x)‖²), although not vice versa. The rates under this alternative can be made comparable to ours by replacing ε by √ε.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Table 1: Computation complexity of gradient methods and stochastic gradient methods for the finite-sum non-convex optimization problem (1). The second and third columns summarize the rates in the smooth and P-L cases respectively. μ is the P-L constant and H* is the variance of a stochastic gradient. These quantities are defined in Section 2. The final column gives additional required assumptions beyond smoothness or the P-L condition. The symbol ∧ denotes a minimum and Õ(·) is the usual Landau big-O notation with logarithmic terms hidden.

Gradient Methods
  GD:             smooth: O(n/ε) [24, 13];          P-L: Õ(n/μ) [25, 17];            additional cond.: -
  Best available: smooth: Õ(n/ε^{7/8}) [9];         P-L: -;                          additional cond.: smooth gradient
  Best available: smooth: Õ(n/ε^{5/6}) [9];         P-L: -;                          additional cond.: smooth Hessian

Stochastic Gradient Methods
  SGD:            smooth: O(1/ε²) [24, 26];         P-L: O(1/(μ²ε)) [17];            additional cond.: H* = O(1)
  Best available: smooth: Õ(n + n^{2/3}/ε) [26, 27]; P-L: Õ(n + n^{2/3}/μ) [26, 27]; additional cond.: -
  SCSG:           smooth: Õ(1/ε^{5/3} ∧ n^{2/3}/ε);  P-L: Õ((1/(με) ∧ n) + (1/μ)(1/(με) ∧ n)^{2/3}); additional cond.: H* = O(1)

As in the convex case, gradient methods have better dependence on ε in the non-convex case but worse dependence on n. This is due to the requirement of computing a full gradient.
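To make the comparison in Table 1 concrete, the following back-of-the-envelope sketch (ours, not from the paper: all constants and log factors are suppressed and the regime values are purely illustrative) evaluates the smooth-case bounds at a large n and a moderate ε:

```python
# Illustrative evaluation of the smooth-case complexity bounds in Table 1.
# Constants and logarithmic factors are dropped; n and eps are chosen only
# to illustrate the "large n, moderate accuracy" regime discussed in the text.

def gd(n, eps):    # gradient descent: O(n / eps)
    return n / eps

def sgd(eps):      # stochastic gradient descent: O(1 / eps^2)
    return 1 / eps ** 2

def svrg(n, eps):  # best variance-reduced rate: O(n + n^(2/3) / eps)
    return n + n ** (2 / 3) / eps

def scsg(n, eps):  # SCSG: O(min{eps^(-5/3), n^(2/3) / eps})
    return min(eps ** (-5 / 3), n ** (2 / 3) / eps)

n, eps = 10 ** 6, 10 ** -2  # large data set, moderate target accuracy
for name, cost in [("GD", gd(n, eps)), ("SGD", sgd(eps)),
                   ("SVRG", svrg(n, eps)), ("SCSG", scsg(n, eps))]:
    print(f"{name:>4}: {cost:.3g} IFO calls (up to constants)")
```

In this regime the SCSG bound is the smallest of the four, matching the claim that SCSG simultaneously dominates SGD and the variance-reduction bounds.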
Comparing the complexity of SGD and the best achievable rate for stochastic gradient methods, achieved via variance-reduction methods, the dependence on ε is significantly improved in the latter case. However, unless ε ≪ n^{−1/2}, SGD has similar or even better theoretical complexity than gradient methods and existing variance-reduction methods. In practice, it is often the case that n is very large (10^5 ∼ 10^9) while the target accuracy is moderate (10^{−1} ∼ 10^{−3}). In this case, SGD has a meaningful advantage over other methods, deriving from the fact that it does not require a full gradient computation. This motivates the following research question: Is there an algorithm that

• achieves/beats the theoretical complexity of SGD in the regime of modest target accuracy;
• and achieves/beats the theoretical complexity of existing variance-reduction methods in the regime of high target accuracy?

The question has been partially answered in the convex case by [21] in their formulation of the stochastically controlled stochastic gradient (SCSG) methods. When the target accuracy is low, SCSG has the same O(ε^{−2}) rate as SGD but with a much smaller data-dependent constant factor (which does not even require bounded gradients). When the target accuracy is high, SCSG achieves the same rate as the best non-accelerated methods, O(n/ε). Despite the gap between this and the optimal rate, SCSG is the first known algorithm that provably achieves the desired performance in both regimes.

In this paper, we generalize SCSG to the non-convex setting which, surprisingly, provides a completely affirmative answer to the question raised before.
By only assuming smoothness of each component, as in almost all other work, SCSG is always O(ε^{−1/3}) faster than SGD and is never worse than recently developed stochastic gradient methods that achieve the best rate. When ε ≫ 1/n, SCSG is at least O((εn)^{2/3}) faster than the best SVRG-type algorithms. Comparing with the gradient methods, SCSG has a better convergence rate provided ε ≫ n^{−6/5}, which is the common setting in practice. Interestingly, there is a parallel to recent advances in gradient methods; [9] improved the classical O(ε^{−1}) rate of gradient descent to O(ε^{−5/6}); this parallels the improvement of SCSG over SGD from O(ε^{−2}) to O(ε^{−5/3}).

Beyond the theoretical advantages of SCSG, we also show that SCSG yields good empirical performance for the training of multi-layer neural networks. It is worth emphasizing that the mechanism by which SCSG achieves acceleration (variance reduction) is qualitatively different from other speed-up techniques, including momentum [28] and adaptive stepsizes [18]. It will be of interest in future work to explore combinations of these various approaches in the training of deep neural networks.

The rest of the paper is organized as follows: In Section 2 we discuss our notation and assumptions and we state the basic SCSG algorithm. We present the theoretical convergence analysis in Section 3. Experimental results are presented in Section 4. All the technical proofs are relegated to the Appendices. Our code is available at https://github.com/Jianbo-Lab/SCSG.

2 Notation, Assumptions and Algorithm

We use ‖·‖ to denote the Euclidean norm and write min{a, b} as a ∧ b for brevity throughout the paper.
The notation Õ, which hides logarithmic terms, will only be used to maximize readability in our presentation but will not be used in the formal analysis.

We define computation cost using the IFO framework of [1], which assumes that sampling an index i and accessing the pair (∇f_i(x), f_i(x)) incur a unit of cost. For brevity, we write ∇f_I(x) for (1/|I|) Σ_{i∈I} ∇f_i(x). Note that calculating ∇f_I(x) incurs |I| units of computational cost. x is called an ε-accurate solution iff E‖∇f(x)‖² ≤ ε. The minimum IFO complexity to reach an ε-accurate solution is denoted by C_comp(ε).

Recall that a random variable N has a geometric distribution, N ∼ Geom(γ), if N is supported on the non-negative integers² with

P(N = k) = γ^k (1 − γ),  ∀k = 0, 1, . . .

An elementary calculation shows that

E_{N∼Geom(γ)} N = γ/(1 − γ).   (2)

To formulate our complexity bounds, we define

f* = inf_x f(x),  Δf = f(x̃_0) − f*.

Further we define H* as an upper bound on the variance of the stochastic gradients, i.e.,

H* = sup_x (1/n) Σ^n_{i=1} ‖∇f_i(x) − ∇f(x)‖².   (3)

The assumption A1 on the smoothness of individual functions will be made throughout this paper.

A1  f_i is differentiable with
‖∇f_i(x) − ∇f_i(y)‖ ≤ L‖x − y‖
for some L < ∞ and all i ∈ {1, . . . , n}.

As a direct consequence of Assumption A1, it holds for any x, y ∈ R^d that

−(L/2)‖x − y‖² ≤ f_i(x) − f_i(y) − ⟨∇f_i(y), x − y⟩ ≤ (L/2)‖x − y‖².   (4)

In this paper, we also consider the following Polyak-Lojasiewicz (P-L) condition [25].
It is weaker than strong convexity as well as other popular conditions that appear in the optimization literature; see [17] for an extensive discussion.

A2  f(x) satisfies the P-L condition with μ > 0 if
‖∇f(x)‖² ≥ 2μ(f(x) − f(x*))
where x* is the global minimum of f.

²Here we allow N to be zero to facilitate the analysis.

2.1 Generic form of SCSG methods

The algorithm we propose in this paper is similar to that of [14] except (critically) the number of inner loops is a geometric random variable. This is an essential component in the analysis of SCSG, and, as we will show below, it is key in allowing us to extend the complexity analysis for SCSG to the non-convex case. Moreover, the algorithm that we present here employs a mini-batch procedure in the inner loop and outputs a random sample instead of an average of the iterates. The pseudo-code is shown in Algorithm 1.

Algorithm 1 (Mini-Batch) Stochastically Controlled Stochastic Gradient (SCSG) method for smooth non-convex finite-sum objectives
Inputs: Number of stages T, initial iterate x̃_0, stepsizes (η_j)^T_{j=1}, batch sizes (B_j)^T_{j=1}, mini-batch sizes (b_j)^T_{j=1}.
Procedure
1: for j = 1, 2, ..., T do
2:   Uniformly sample a batch I_j ⊂ {1, ..., n} with |I_j| = B_j;
3:   g_j ← ∇f_{I_j}(x̃_{j−1});
4:   x^{(j)}_0 ← x̃_{j−1};
5:   Generate N_j ∼ Geom(B_j/(B_j + b_j));
6:   for k = 1, 2, ..., N_j do
7:     Randomly pick Ĩ_{k−1} ⊂ [n] with |Ĩ_{k−1}| = b_j;
8:     ν^{(j)}_{k−1} ← ∇f_{Ĩ_{k−1}}(x^{(j)}_{k−1}) − ∇f_{Ĩ_{k−1}}(x^{(j)}_0) + g_j;
9:     x^{(j)}_k ← x^{(j)}_{k−1} − η_j ν^{(j)}_{k−1};
10:  end for
11:  x̃_j ← x^{(j)}_{N_j};
12: end for
Output: (Smooth case) Sample x̃*_T from (x̃_j)^T_{j=1} with P(x̃*_T = x̃_j) ∝ η_j B_j/b_j; (P-L case) x̃_T.

As seen in the pseudo-code, the SCSG method consists of multiple epochs. In the j-th epoch, a mini-batch of size B_j is drawn uniformly from the data and a sequence of mini-batch SVRG-type updates are implemented, with the total number of updates being randomly generated from a geometric distribution, with mean equal to the batch size. Finally it outputs a random sample from (x̃_j)^T_{j=1}. This is the standard way, proposed by [23], as opposed to computing arg min_{j≤T} ‖∇f(x̃_j)‖, which requires additional overhead. By (2), the average total cost is

Σ^T_{j=1} (B_j + b_j · EN_j) = Σ^T_{j=1} (B_j + b_j · B_j/b_j) = 2 Σ^T_{j=1} B_j.   (5)

Define T(ε) as the minimum number of epochs such that all outputs afterwards are ε-accurate solutions, i.e.

T(ε) = min{T : E‖∇f(x̃*_{T'})‖² ≤ ε for all T' ≥ T}.

Recalling the definition of C_comp(ε) at the beginning of this section, the average IFO complexity to reach an ε-accurate solution is

E C_comp(ε) ≤ 2 Σ^{T(ε)}_{j=1} B_j.

2.2 Parameter settings

The generic form (Algorithm 1) allows for flexibility in both the stepsize, η_j, and the batch/mini-batch sizes (B_j, b_j). In order to minimize the amount of tuning needed in practice, we provide several default settings which have theoretical support. The settings and the corresponding complexity results are summarized in Table 2. Note that all settings fix b_j = 1 since this yields the best rate, as will be shown in Section 3.
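To make the control flow of Algorithm 1 concrete, here is a minimal NumPy sketch (our own illustration, not the authors' released implementation; the helper `grad_fi`, the toy least-squares components, and all parameter values are ours rather than the tuned settings of Section 2.2):

```python
import numpy as np

def scsg(grad_fi, n, x0, T, eta, B, b, rng):
    """Sketch of Algorithm 1: grad_fi(x, idx) returns the average
    gradient of the components indexed by idx, evaluated at x."""
    x_tilde = x0.copy()
    snapshots, weights = [], []
    for j in range(T):
        I = rng.choice(n, size=B, replace=False)       # line 2: batch I_j
        g = grad_fi(x_tilde, I)                        # line 3: anchor gradient g_j
        x = x_tilde.copy()                             # line 4
        # line 5: N_j ~ Geom(B/(B+b)) on {0,1,...}; numpy's geometric counts
        # trials (support {1,2,...}), so shift by one.  E[N_j] = B/b.
        N = rng.geometric(1 - B / (B + b)) - 1
        for _ in range(N):
            It = rng.choice(n, size=b, replace=False)  # line 7: mini-batch
            nu = grad_fi(x, It) - grad_fi(x_tilde, It) + g  # line 8: VR update
            x = x - eta * nu                           # line 9
        snapshots.append(x)                            # line 11: x_tilde_j
        weights.append(eta * B / b)                    # output weight eta_j B_j / b_j
        x_tilde = x
    p = np.array(weights) / sum(weights)               # sample prop. to eta_j B_j / b_j
    return snapshots[rng.choice(T, p=p)]

# toy smooth finite-sum objective: f_i(x) = 0.5 * (a_i^T x - y_i)^2
rng = np.random.default_rng(0)
n, d = 512, 5
A, y = rng.standard_normal((n, d)), rng.standard_normal(n)
grad = lambda x, idx: A[idx].T @ (A[idx] @ x - y[idx]) / len(idx)
x_out = scsg(grad, n, np.zeros(d), T=30, eta=0.05, B=64, b=8, rng=rng)
print(np.linalg.norm(A.T @ (A @ x_out - y) / n))       # full-gradient norm at output
```

Note that the full gradient is never computed inside the loop; the anchor g_j is only a batch gradient, which is exactly what distinguishes SCSG from SVRG.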
However, in practice a reasonably large mini-batch size b_j might be favorable due to the acceleration that can be achieved by vectorization; see Section 4 for more discussion on this point.

Table 2: Parameter settings analyzed in this paper.

  Version 1: η_j = 1/(6LB^{2/3}),   B_j = O(1/ε ∧ n),    b_j = 1; objective: smooth;             E C_comp(ε) = O(1/ε^{5/3} ∧ n^{2/3}/ε)
  Version 2: η_j = 1/(6LB_j^{2/3}), B_j = j^{3/2} ∧ n,   b_j = 1; objective: smooth;             E C_comp(ε) = Õ(1/ε^{5/3} ∧ n^{2/3}/ε)
  Version 3: η_j = 1/(6LB^{2/3}),   B_j = O(1/(με) ∧ n), b_j = 1; objective: Polyak-Lojasiewicz; E C_comp(ε) = Õ((1/(με) ∧ n) + (1/μ)(1/(με) ∧ n)^{2/3})

3 Convergence Analysis

3.1 One-epoch analysis

First we present the analysis for a single epoch. Given j, we define

e_j = ∇f_{I_j}(x̃_{j−1}) − ∇f(x̃_{j−1}).   (6)

As shown in [14], the gradient update ν^{(j)}_k is a biased estimate of the gradient ∇f(x^{(j)}_k) conditioning on the current random index i_k. Specifically, within the j-th epoch,

E_{Ĩ_k} ν^{(j)}_k = ∇f(x^{(j)}_k) + ∇f_{I_j}(x^{(j)}_0) − ∇f(x^{(j)}_0) = ∇f(x^{(j)}_k) + e_j.

This reveals the basic qualitative difference between SVRG and SCSG. Most of the novelty in our analysis lies in dealing with the extra term e_j. Unlike [14], we do not assume ‖x^{(j)}_k − x*‖ to be bounded since this is invalid in unconstrained problems, even in convex cases.

By careful analysis of primal and dual gaps [cf. 5], we find that the stepsize η_j should scale as (B_j/b_j)^{−2/3}. The same phenomenon has also been observed in [26, 27, 4] when b_j = 1 and B_j = n.

Theorem 3.1 Let η_j L = γ(B_j/b_j)^{−2/3}. Suppose γ ≤ 1/3 and B_j ≥ 8b_j for all j. Then under Assumption A1,

E‖∇f(x̃_j)‖² ≤ (5L/γ) (b_j/B_j)^{1/3} E(f(x̃_{j−1}) − f(x̃_j)) + (6 I(B_j < n)/B_j) H*.   (7)

The proof is presented in Appendix B. It is not surprising that a large mini-batch size will increase the theoretical complexity, as in the analysis of mini-batch SGD. For this reason we restrict most of our subsequent analysis to b_j ≡ 1.

3.2 Convergence analysis for smooth non-convex objectives

When only assuming smoothness, the output x̃*_T is a random element from (x̃_j)^T_{j=1}. Telescoping (7) over all epochs, we easily obtain the following result.

Theorem 3.2 Under the specifications of Theorem 3.1 and Assumption A1,

E‖∇f(x̃*_T)‖² ≤ [ (5L/γ) Δf + 6 ( Σ^T_{j=1} b_j^{−1/3} B_j^{−2/3} I(B_j < n) ) H* ] / Σ^T_{j=1} b_j^{−1/3} B_j^{1/3}.

This theorem covers many existing results. When B_j = n and b_j = 1, Theorem 3.2 implies that E‖∇f(x̃*_T)‖² = O(LΔf/(T n^{1/3})) and hence T(ε) = O(1 + LΔf/(ε n^{1/3})). This yields the same complexity bound E C_comp(ε) = O(n + n^{2/3} LΔf/ε) as SVRG [26]. On the other hand, when b_j = B_j ≡ B for some B < n, Theorem 3.2 implies that E‖∇f(x̃*_T)‖² = O(LΔf/T + H*/B). The second term can be made O(ε) by setting B = O(H*/ε). Under this setting, T(ε) = O(LΔf/ε) and E C_comp(ε) = O(LΔf H*/ε²). This is the same rate as in [26] for SGD.

However, both of the above settings are suboptimal since they either set the batch sizes B_j too large or set the mini-batch sizes b_j too large. By Theorem 3.2, SCSG can be regarded as an interpolation between SGD and SVRG. By leveraging these two parameters, SCSG is able to outperform both methods.

We start by considering a constant batch/mini-batch size B_j ≡ B, b_j ≡ 1. Similar to SGD, B should be at least O(H*/ε). In applications like the training of neural networks, the required accuracy is moderate and hence a small batch size suffices. This is particularly important since the gradient can be computed without communication overhead, which is the bottleneck of SVRG-type algorithms. As shown in Corollary 3.3 below, the complexity of SCSG beats both SGD and SVRG.

Corollary 3.3 (Constant batch sizes) Set

b_j ≡ 1,  B_j ≡ B = min{12H*/ε, n},  η_j ≡ η = 1/(6LB^{2/3}).

Then it holds that

E C_comp(ε) = O( (H*/ε ∧ n) + (LΔf/ε) (H*/ε ∧ n)^{2/3} ).

Assuming that LΔf, H* = O(1), the above bound can be simplified to

E C_comp(ε) = O( (1/ε ∧ n) + (1/ε)(1/ε ∧ n)^{2/3} ) = O( 1/ε^{5/3} ∧ n^{2/3}/ε ).

When the target accuracy is high, one might consider a sequence of increasing batch sizes. Heuristically, a large batch is wasteful at the early stages when the iterates are inaccurate. Fixing the batch size to be n as in SVRG is obviously suboptimal.
Via an involved analysis, we find that B_j ∼ j^{3/2} gives the best complexity among the class of SCSG algorithms.

Corollary 3.4 (Time-varying batch sizes) Set

b_j ≡ 1,  B_j = min{⌈j^{3/2}⌉, n},  η_j = 1/(6LB_j^{2/3}).

Then it holds that

E C_comp(ε) = O( min{ (1/ε^{5/3}) [ (LΔf)^{5/3} + (H*)^{5/3} log^5(H*/ε) ], n^{5/3} } + (n^{2/3}/ε)(LΔf + H* log n) ).   (8)

The proofs of both Corollary 3.3 and Corollary 3.4 are presented in Appendix C. To simplify the bound (8), we assume that LΔf, H* = O(1) in order to highlight the dependence on ε and n. Then (8) can be simplified to

E C_comp(ε) = O( (1/ε^{5/3}) log^5(1/ε) ∧ n^{5/3} + (n^{2/3}/ε) log n ) = Õ( 1/ε^{5/3} ∧ n^{2/3}/ε ).

The log-factor log^5(1/ε) is purely an artifact of our proof. It can be reduced to log^{3/2+μ}(1/ε) for any μ > 0 by setting B_j ∼ j^{3/2}(log j)^{3/2+μ}; see Remark 1 in Appendix C.

3.3 Convergence analysis for P-L objectives

When the component f_i(x) satisfies the P-L condition, it is known that the global minimum can be found efficiently by SGD [17] and SVRG-type algorithms [26, 4]. Similarly, SCSG can also achieve this. As in the last subsection, we start from a generic result to bound E(f(x̃_T) − f*) and then consider specific settings of the parameters as well as their complexity bounds.

Theorem 3.5 Let λ_j = 5L b_j^{1/3} / (μγ B_j^{1/3} + 5L b_j^{1/3}). Then under the same settings as Theorem 3.2,

E(f(x̃_T) − f*) ≤ λ_T λ_{T−1} ··· λ_1 · Δf + 6γH* · Σ^T_{j=1} [ λ_T λ_{T−1} ··· λ_{j+1} · I(B_j < n) / (μγB_j + 5L b_j^{1/3} B_j^{2/3}) ].

The proofs and additional discussion are presented in Appendix D. Again, Theorem 3.5 covers existing complexity bounds for both SGD and SVRG. In fact, when B_j = b_j ≡ B as in SGD, via some calculation, we obtain that

E(f(x̃_T) − f*) = O( (L/(μ + L))^T · Δf + H*/(μB) ).

The second term can be made O(ε) by setting B = O(H*/(με)), in which case T(ε) = O((L/μ) log(Δf/ε)). As a result, the average cost to reach an ε-accurate solution is E C_comp(ε) = O(LH*/(μ²ε)), which is the same as [17]. On the other hand, when B_j ≡ n and b_j ≡ 1 as in SVRG, Theorem 3.5 implies that

E(f(x̃_T) − f*) = O( (L/(μn^{1/3} + L))^T · Δf ).

This entails that T(ε) = O((1 + 1/(μn^{1/3})) log(1/ε)) and hence E C_comp(ε) = O((n + n^{2/3}/μ) log(1/ε)), which is the same as [26].

By leveraging the batch and mini-batch sizes, we obtain a counterpart of Corollary 3.3 as below.

Corollary 3.6 Set

b_j ≡ 1,  B_j ≡ B = min{12H*/(με), n},  η_j ≡ η = 1/(6LB^{2/3}).

Then it holds that

E C_comp(ε) = O( [ (H*/(με) ∧ n) + (1/μ)(H*/(με) ∧ n)^{2/3} ] log(Δf/ε) ).

Recalling the results from Table 1, SCSG is O(1/μ ∧ 1/(με)^{1/3}) faster than SGD and is never worse than SVRG. When both μ and ε are moderate, the acceleration of SCSG over SVRG is significant. Unlike the smooth case, we do not find any possible choice of setting that can achieve a better rate than Corollary 3.6.

4 Experiments

We evaluate SCSG and mini-batch SGD on the MNIST dataset with (1) a three-layer fully-connected neural network with 512 neurons in each layer (FCN for short) and (2) a standard convolutional neural network LeNet [20] (CNN for short), which has two convolutional layers with 32 and 64 filters of size 5 × 5 respectively, followed by two fully-connected layers with output sizes 1024 and 10. Max pooling is applied after each convolutional layer. The MNIST dataset of handwritten digits has 50,000 training examples and 10,000 test examples. The digits have been size-normalized and centered in a fixed-size image. Each image is 28 pixels by 28 pixels.
All experiments were carried out on an Amazon p2.xlarge node with a NVIDIA GK210 GPU, with algorithms implemented in TensorFlow 1.0.

Due to memory issues, sampling a chunk of data is costly. We avoid this by modifying the inner loop: instead of sampling mini-batches from the whole dataset, we split the batch I_j into B_j/b_j mini-batches and run SVRG-type updates sequentially on each. Despite the theoretical advantage of setting b_j = 1, we consider practical settings b_j > 1 to take advantage of the acceleration obtained by vectorization. We initialized parameters by TensorFlow's default Xavier uniform initializer. In all experiments below, we show the results corresponding to the best-tuned stepsizes.

We consider three algorithms: (1) SGD with a fixed batch size B ∈ {512, 1024}; (2) SCSG with a fixed batch size B ∈ {512, 1024} and a fixed mini-batch size b = 32; (3) SCSG with time-varying batch sizes B_j = ⌈j^{3/2} ∧ n⌉ and b_j = ⌈B_j/32⌉. To be clear, given T epochs, the IFO complexities of the three algorithms are TB, 2TB and 2Σ^T_{j=1} B_j, respectively. We run each algorithm with 20 passes of data. It is worth mentioning that the largest batch size in Algorithm 3 is ⌈275^{1.5}⌉ = 4561, which is relatively small compared to the sample size 50000.

We plot in Figure 1 the training and the validation loss against the IFO complexity, i.e., the number of passes of data, for fair comparison. In all cases, both versions of SCSG outperform SGD, especially in terms of training loss. SCSG with time-varying batch sizes always has the best performance and it is more stable than SCSG with a fixed batch size. For the latter, the acceleration is more significant after increasing the batch size to 1024.
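The time-varying schedule above is easy to reproduce. The following sketch (our own accounting, charging 2B_j IFO calls per epoch as in (5); the function name is ours) simulates 20 passes over n = 50,000 examples and recovers the largest batch size quoted above:

```python
import math

def scsg_schedule(n, passes):
    """Simulate the time-varying SCSG schedule B_j = ceil(j^1.5) ∧ n,
    charging 2 * B_j IFO calls per epoch: B_j for the anchor gradient
    plus b_j * E[N_j] = B_j for the inner loop (equation (5))."""
    budget, ifo, batches = passes * n, 0, []
    j = 0
    while ifo < budget:
        j += 1
        B = min(math.ceil(j ** 1.5), n)
        batches.append(B)
        ifo += 2 * B
    return batches

batches = scsg_schedule(n=50_000, passes=20)
print(len(batches), max(batches))  # number of epochs and largest batch size
```

With this accounting the run lasts 275 epochs and the largest batch is ⌈275^{1.5}⌉ = 4561, small relative to n = 50,000, as noted above.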
Both versions of SCSG provide strong evidence that variance\nreduction can be achieved ef\ufb01ciently without evaluating the full gradient.\n\nFigure 1: Comparison between two versions of SCSG and mini-batch SGD of training loss (top row)\nand validation loss (bottom row) against the number of IFO calls. The loss is plotted on a log-scale.\nEach column represents an experiment with the setup printed on the top.\n\nFigure 2: Comparison between SCSG and mini-batch SGD of training loss and validation loss with a\nCNN loss, against wall clock time. The loss is plotted on a log-scale.\n\nGiven 2B IFO calls, SGD implements updates on two fresh batches while SCSG replaces the second\nbatch by a sequence of variance reduced updates. Thus, Figure 1 shows that the gain due to variance\nreduction is signi\ufb01cant when the batch size is \ufb01xed. To further explore this, we compare SCSG with\ntime-varying batch sizes to SGD with the same sequence of batch sizes. The results corresponding to\nthe best-tuned constant stepsizes are plotted in Figure 3a. It is clear that the bene\ufb01t from variance\nreduction is more signi\ufb01cant when using time-varying batch sizes.\nWe also compare the performance of SGD with that of SCSG with time-varying batch sizes against\nwall clock time, when both algorithms are implemented in TensorFlow and run on a Amazon p2.xlarge\nnode with a NVIDIA GK210 GPU. Due to the cost of computing variance reduction terms in SCSG,\neach update of SCSG is slower per iteration compared to SGD. 
However, SCSG makes faster progress in terms of both training loss and validation loss compared to SGD in wall clock time. The results are shown in Figure 2.

Figure 3: (a) SCSG and SGD with increasing batch sizes. (b) SCSG with different B_j/b_j.

Finally, we examine the effect of B_j/b_j, namely the number of mini-batches within an iteration, since it affects the efficiency in practice where the computation time is not proportional to the batch size. Figure 3b shows the results for SCSG with B_j = ⌈j^{3/2} ∧ n⌉ and ⌈B_j/b_j⌉ ∈ {2, 5, 10, 16, 32}. In general, larger B_j/b_j yields better performance. It would be interesting to explore the tradeoff between computational efficiency and this ratio on different platforms.

5 Conclusion and Discussion

We have presented the SCSG method for smooth, non-convex, finite-sum optimization problems. SCSG is the first algorithm that achieves a uniformly better rate than SGD and is never worse than SVRG-type algorithms. When the target accuracy is low, SCSG significantly outperforms the SVRG-type algorithms.
Unlike various other variants of SVRG, SCSG is clean in terms of both implementation and analysis. Empirically, SCSG outperforms SGD in the training of multi-layer neural networks.

Although we only consider the finite-sum objective in this paper, it is straightforward to extend SCSG to general stochastic optimization problems where the objective can be written as E_{ξ∼F} f(x; ξ): at the beginning of the j-th epoch a batch of i.i.d. samples (ξ_1, . . . , ξ_{B_j}) is drawn from the distribution F and

g_j = (1/B_j) Σ^{B_j}_{i=1} ∇f(x̃_{j−1}; ξ_i)   (see line 3 of Algorithm 1);

at the k-th step, a fresh sample (ξ̃^{(k)}_1, . . . , ξ̃^{(k)}_{b_j}) is drawn from the distribution F and

ν^{(j)}_{k−1} = (1/b_j) Σ^{b_j}_{i=1} ∇f(x^{(j)}_{k−1}; ξ̃^{(k)}_i) − (1/b_j) Σ^{b_j}_{i=1} ∇f(x^{(j)}_0; ξ̃^{(k)}_i) + g_j   (see line 8 of Algorithm 1).

Our proof directly carries over to this case, by simply suppressing the term I(B_j < n), and yields the bound Õ(ε^{−5/3}) for smooth non-convex objectives and the bound Õ(μ^{−1}ε^{−1} ∧ μ^{−5/3}ε^{−2/3}) for P-L objectives. These bounds are simply obtained by setting n = ∞ in our convergence analysis.

Compared to momentum-based methods [28] and methods with adaptive stepsizes [10, 18], the mechanism whereby SCSG achieves acceleration is qualitatively different: while momentum aims at balancing primal and dual gaps [5], adaptive stepsizes aim at balancing the scale of each coordinate, and variance reduction aims at removing the noise.
We believe that an algorithm that combines these three techniques is worthy of further study, especially in the training of deep neural networks where the target accuracy is modest.

Acknowledgments

The authors thank Zeyuan Allen-Zhu, Chi Jin, Nilesh Tripuraneni, Yi Xu, Tianbao Yang, Shenyi Zhao and the anonymous reviewers for helpful discussions.

References

[1] Alekh Agarwal and Leon Bottou. A lower bound for the optimization of finite sums. arXiv preprint arXiv:1410.0723, 2014.

[2] Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima for nonconvex optimization in linear time. arXiv preprint arXiv:1611.01146, 2016.

[3] Zeyuan Allen-Zhu. Natasha: Faster stochastic non-convex optimization via strongly non-convex parameter. arXiv preprint arXiv:1702.00763, 2017.

[4] Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. arXiv preprint arXiv:1603.05643, 2016.

[5] Zeyuan Allen-Zhu and Lorenzo Orecchia. Linear coupling: An ultimate unification of gradient and mirror descent. arXiv preprint arXiv:1407.1537, 2014.

[6] Zeyuan Allen-Zhu and Yang Yuan. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. arXiv preprint arXiv:1506.01972, 2015.

[7] Dimitri P Bertsekas. A new class of incremental gradient methods for least squares problems. SIAM Journal on Optimization, 7(4):913–926, 1997.

[8] Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for non-convex optimization.
arXiv preprint arXiv:1611.00756, 2016.

[9] Yair Carmon, Oliver Hinder, John C Duchi, and Aaron Sidford. "Convex until proven guilty": Dimension-free acceleration of gradient descent on non-convex functions. arXiv preprint arXiv:1705.02766, 2017.

[10] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

[11] Alexei A Gaivoronski. Convergence properties of backpropagation for neural nets via theory of stochastic gradient methods. Part 1. Optimization Methods and Software, 4(2):117–134, 1994.

[12] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

[13] Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1-2):59–99, 2016.

[14] Reza Harikandeh, Mohamed Osama Ahmed, Alim Virani, Mark Schmidt, Jakub Konečný, and Scott Sallinen. Stop wasting my gradients: Practical SVRG. In Advances in Neural Information Processing Systems, pages 2242–2250, 2015.

[15] Matthew D Hoffman, David M Blei, Chong Wang, and John William Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.

[16] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[17] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016.

[18] Diederik Kingma and Jimmy Ba.
Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[19] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[20] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[21] Lihua Lei and Michael I Jordan. Less than a single pass: Stochastically controlled stochastic gradient method. arXiv preprint arXiv:1609.03261, 2016.

[22] Peter McCullagh and John A Nelder. Generalized Linear Models. CRC Press, 1989.

[23] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

[24] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, Massachusetts, 2004.

[25] Boris Teodorovich Polyak. Gradient methods for minimizing functionals. Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, 3(4):643–653, 1963.

[26] Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. arXiv preprint arXiv:1603.06160, 2016.

[27] Sashank J Reddi, Suvrit Sra, Barnabás Póczos, and Alex Smola. Fast incremental method for nonconvex optimization. arXiv preprint arXiv:1603.06159, 2016.

[28] Ilya Sutskever, James Martens, George E Dahl, and Geoffrey E Hinton. On the importance of initialization and momentum in deep learning. ICML (3), 28:1139–1147, 2013.

[29] Paul Tseng. An incremental gradient (-projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization, 8(2):506–531, 1998.

[30] Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families, and variational inference.
Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.