{"title": "SSRGD: Simple Stochastic Recursive Gradient Descent for Escaping Saddle Points", "book": "Advances in Neural Information Processing Systems", "page_first": 1523, "page_last": 1533, "abstract": "We analyze stochastic gradient algorithms for optimizing nonconvex problems.\nIn particular, our goal is to find local minima (second-order stationary points) instead of just finding first-order stationary points which may be some bad unstable saddle points.\nWe show that a simple perturbed version of stochastic recursive gradient descent algorithm (called SSRGD) can find an $(\\epsilon,\\delta)$-second-order stationary point with $\\widetilde{O}(\\sqrt{n}/\\epsilon^2 + \\sqrt{n}/\\delta^4 + n/\\delta^3)$ stochastic gradient complexity for nonconvex finite-sum problems.\nAs a by-product, SSRGD finds an $\\epsilon$-first-order stationary point with $O(n+\\sqrt{n}/\\epsilon^2)$ stochastic gradients. These results are almost optimal since Fang et al. [2018] provided a lower bound $\\Omega(\\sqrt{n}/\\epsilon^2)$ for finding even just an $\\epsilon$-first-order stationary point.\nWe emphasize that SSRGD algorithm for finding second-order stationary points is as simple as for finding first-order stationary points just by adding a uniform perturbation sometimes, while all other algorithms for finding second-order stationary points with similar gradient complexity need to combine with a negative-curvature search subroutine (e.g., Neon2 [Allen-Zhu and Li, 2018]).\nMoreover, the simple SSRGD algorithm gets a simpler analysis.\nBesides, we also extend our results from nonconvex finite-sum problems to nonconvex online (expectation) problems, and prove the corresponding convergence results.", "full_text": "SSRGD: Simple Stochastic Recursive Gradient\n\nDescent for Escaping Saddle Points\n\nZhize Li\n\nTsinghua University, China and KAUST, Saudi Arabia\n\nzhizeli.thu@gmail.com\n\nAbstract\n\n\u221a\n\nn/\u00012 +\n\nwith (cid:101)O(\n\nWe analyze stochastic gradient algorithms 
for optimizing nonconvex problems. In\nparticular, our goal is to \ufb01nd local minima (second-order stationary points) instead\nof just \ufb01nding \ufb01rst-order stationary points which may be some bad unstable saddle\npoints. We show that a simple perturbed version of stochastic recursive gradient\ndescent algorithm (called SSRGD) can \ufb01nd an (\u0001, \u03b4)-second-order stationary point\n\u221a\nn/\u03b44 + n/\u03b43) stochastic gradient complexity for nonconvex\n\u221a\n\ufb01nite-sum problems. As a by-product, SSRGD \ufb01nds an \u0001-\ufb01rst-order stationary\nn/\u00012) stochastic gradients. These results are almost optimal\npoint with O(n +\nsince Fang et al. [11] provided a lower bound \u2126(\nn/\u00012) for \ufb01nding even just an\n\u0001-\ufb01rst-order stationary point. We emphasize that SSRGD algorithm for \ufb01nding\nsecond-order stationary points is as simple as for \ufb01nding \ufb01rst-order stationary\npoints just by adding a uniform perturbation sometimes, while all other algorithms\nfor \ufb01nding second-order stationary points with similar gradient complexity need to\ncombine with a negative-curvature search subroutine (e.g., Neon2 [4]). Moreover,\nthe simple SSRGD algorithm gets a simpler analysis. Besides, we also extend our\nresults from nonconvex \ufb01nite-sum problems to nonconvex online (expectation)\nproblems, and prove the corresponding convergence results.\n\n\u221a\n\n1\n\nIntroduction\n\nNonconvex optimization is ubiquitous in machine learning applications especially for deep neural\nnetworks. For convex optimization, every local minimum is a global minimum and it can be achieved\nby any \ufb01rst-order stationary point, i.e., \u2207f (x) = 0. However, for nonconvex problems, the point\nwith zero gradient can be a local minimum, a local maximum or a saddle point. 
To avoid converging to bad saddle points (including local maxima), we want to find a second-order stationary point, i.e., a point with ∇f(x) = 0 and ∇²f(x) ⪰ 0 (a necessary condition for x to be a local minimum). All second-order stationary points are indeed local minima if the function f satisfies the strict saddle property [12]. Note that finding the global minimum of a nonconvex problem is NP-hard in general. Also note that it has been shown that all local minima are also global minima for some nonconvex problems, e.g., matrix sensing [5], matrix completion [13], and some neural networks [14]. Thus, our goal in this paper is to find an approximate second-order stationary point (local minimum) with provable convergence.

There has been extensive research on finding an ε-first-order stationary point (i.e., a point with ‖∇f(x)‖ ≤ ε), e.g., GD, SGD and SVRG; see Table 1 for an overview. Xu et al. [33] and Allen-Zhu and Li [4] independently proposed reduction algorithms Neon/Neon2 that can be combined with previous ε-first-order stationary point finding algorithms to find an (ε, δ)-second-order stationary point (i.e., a point with ‖∇f(x)‖ ≤ ε and λmin(∇²f(x)) ≥ −δ). However, algorithms obtained by this reduction are very complicated in practice, and they need to extract negative curvature directions from the Hessian to escape saddle points by using a negative curvature search subroutine: given a point x, find an approximate smallest eigenvector of ∇²f(x). This also involves a more complicated analysis. Note that in practice, standard first-order stationary point finding algorithms can often work (escape bad saddle points) in the nonconvex setting without a negative curvature search subroutine.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
The reason may be that the saddle points are usually not very stable. So there is a natural question: "Is there any simple modification to allow first-order stationary point finding algorithms to get a theoretical second-order guarantee?"

Table 1: Gradient complexity of optimization algorithms for the nonconvex finite-sum problem (1)

Algorithm | Stochastic gradient complexity | Guarantee | Negative-curvature search subroutine
GD [24] | O(n/ε²) | 1st-order | No
SVRG [28, 3], SCSG [22], SVRG+ [23] | O(n + n^{2/3}/ε²) | 1st-order | No
SNVRG [35], SPIDER [11], SpiderBoost [32], SARAH [27] | O(n + n^{1/2}/ε²) | 1st-order | No
SSRGD (this paper) | O(n + n^{1/2}/ε²) | 1st-order | No
PGD [18] | Õ(n/ε² + n/δ⁴) | 2nd-order | No
Neon2+FastCubic/CDHS [1, 6] | Õ(n/ε^{1.5} + n^{3/4}/ε^{1.75}) | 2nd-order | Needed
Neon2+SVRG [4] | Õ(n^{2/3}/ε² + n/δ³ + n^{3/4}/δ^{3.5}) | 2nd-order | Needed
Stabilized SVRG [15] | Õ(n^{2/3}/ε² + n/δ³ + n^{2/3}/δ⁴) | 2nd-order | No
SNVRG⁺+Neon2 [34] | Õ(n^{1/2}/ε² + n/δ³ + n^{3/4}/δ^{3.5}) | 2nd-order | Needed
SPIDER-SFO⁺(+Neon2) [11] | Õ(n^{1/2}/ε² + n^{1/2}/(εδ²) + 1/(εδ³) + 1/δ⁵) | 2nd-order | Needed
SSRGD (this paper) | Õ(n^{1/2}/ε² + n^{1/2}/δ⁴ + n/δ³) | 2nd-order | No

Table 2: Gradient complexity of optimization algorithms for the nonconvex online problem (2)

Algorithm | Stochastic gradient complexity | Guarantee | Negative-curvature search subroutine
SGD [16] | O(1/ε⁴) | 1st-order | No
SCSG [22]; SVRG+ [23] | O(1/ε^{3.5}) | 1st-order | No
SNVRG [35]; SPIDER [11]; SpiderBoost [32]; SARAH [27] | O(1/ε³) | 1st-order | No
SSRGD (this paper) | O(1/ε³) | 1st-order | No
Perturbed SGD [12] | poly(d, 1/ε, 1/δ) | 2nd-order | No
CNC-SGD [8] | Õ(1/ε⁴ + 1/δ¹⁰) | 2nd-order | No
Neon2+SCSG [4] | Õ(1/ε^{10/3} + 1/(ε²δ³) + 1/δ⁵) | 2nd-order | Needed
Neon2+Natasha2 [2] | Õ(1/ε^{3.25} + 1/(ε³δ) + 1/δ⁵) | 2nd-order | Needed
SNVRG⁺+Neon2 [34] | Õ(1/ε³ + 1/(ε²δ³) + 1/δ⁵) | 2nd-order | Needed
SPIDER-SFO⁺(+Neon2) [11] | Õ(1/ε³ + 1/(ε²δ²) + 1/δ⁵) | 2nd-order | Needed
SSRGD (this paper) | Õ(1/ε³ + 1/(ε²δ³) + 1/(εδ⁴)) | 2nd-order | No

Note: 1. Guarantee (see Definition 1): ε-first-order stationary point ‖∇f(x)‖ ≤ ε; (ε, δ)-second-order stationary point ‖∇f(x)‖ ≤ ε and λmin(∇²f(x)) ≥ −δ.
2. In the classical setting where δ = O(√ε) [25, 18], our simple SSRGD is always (no matter what n and ε are) not worse than all other algorithms (in both Tables 1 and 2) except FastCubic/CDHS (which needs to compute Hessian-vector products) and SPIDER-SFO⁺. Moreover, our simple SSRGD is not worse than FastCubic/CDHS if n ≥ 1/ε, and is better than SPIDER-SFO⁺ if δ is very small (e.g., δ ≤ 1/√n) in Table 1.

Algorithm 1 Simple Stochastic Recursive Gradient Descent (SSRGD)
Input: initial point x0, epoch length m, minibatch size b, step size η, perturbation radius r, threshold gradient g_thres
1: for s = 0, 1, 2, . . . do
2:   if not currently in a super epoch and ‖∇f(x_sm)‖ ≤ g_thres then
3:     x_sm ← x_sm + ξ, where ξ uniformly ∼ B0(r); start a super epoch
       // we use a super epoch since we do not want to add the perturbation too often near a saddle point
4:   end if
5:   v_sm ← ∇f(x_sm)
6:   for k = 1, 2, . . . , m do
7:     t ← sm + k
8:     x_t ← x_{t−1} − η v_{t−1}
9:     v_t ← (1/b) Σ_{i∈I_b} (∇f_i(x_t) − ∇f_i(x_{t−1})) + v_{t−1}   // I_b are i.i.d. uniform samples with |I_b| = b
10:    if meet stop condition then stop super epoch
11:  end for
12: end for

For gradient descent (GD), Jin et al. [18] showed that a simple perturbation step is enough to escape saddle points and find a second-order stationary point, and such a perturbation is necessary [10]. Very recently, Ge et al. [15] showed that a simple perturbation step is also enough to find a second-order stationary point with the SVRG algorithm [23]. Moreover, Ge et al. [15] also developed a stabilization trick to further improve the dependency on the Hessian Lipschitz parameter.

1.1 Our Contributions

In this paper, we propose the simple SSRGD algorithm (described in Algorithm 1) and show that a simple perturbation step is enough to find a second-order stationary point with the stochastic recursive gradient descent algorithm. Our results and previous results are summarized in Tables 1 and 2. We would like to highlight the following points:

• We improve the result in [15] to an almost optimal one (i.e., from n^{2/3}/ε² to n^{1/2}/ε²), since Fang et al. [11] provided a lower bound Ω(√n/ε²) for finding even just an ε-first-order stationary point. Note that the other two n^{1/2} algorithms (i.e., SNVRG⁺ and SPIDER-SFO⁺) both need the negative curvature search subroutine (e.g.
Neon2) and thus are more complicated, both in practice and in analysis, than their first-order-guarantee counterparts (SNVRG and SPIDER), while our SSRGD is as simple as its first-order-guarantee counterpart; it just adds a uniform perturbation sometimes.

• For the more general nonconvex online (expectation) problems (2), we obtain the first algorithm for finding a second-order stationary point that is as simple as one for finding first-order stationary points, with a similar state-of-the-art convergence result. See the last row of Table 2.

• Our simple SSRGD algorithm admits a simpler analysis. Also, the result for finding a first-order stationary point is a by-product of our analysis. In Section 5.1 we give a clear interpretation of why our analysis of SSRGD improves the original SVRG bound from n^{2/3} to n^{1/2}. We believe this is very useful for a better understanding of these two algorithms.

2 Preliminaries

Notation: Let [n] denote the set {1, 2, · · · , n} and ‖·‖ denote the Euclidean norm for a vector and the spectral norm for a matrix. Let ⟨u, v⟩ denote the inner product of two vectors u and v. Let λmin(A) denote the smallest eigenvalue of a symmetric matrix A. Let B_x(r) denote the Euclidean ball with center x and radius r. We use O(·) to hide constants and Õ(·) to hide polylogarithmic factors.

In this paper, we consider two types of nonconvex problems. The finite-sum problem has the form

  min_{x∈R^d} f(x) := (1/n) Σ_{i=1}^n f_i(x),    (1)

where f(x) and all individual f_i(x) are possibly nonconvex. This form usually models empirical risk minimization in machine learning problems.

The online (expectation) problem has the form

  min_{x∈R^d} f(x) := E_{ζ∼D}[F(x, ζ)],    (2)

where f(x) and F(x, ζ) are possibly nonconvex.
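To make the finite-sum template (1) concrete, here is a minimal, self-contained sketch of such an objective together with its component (stochastic) and full gradients. The particular f_i (a least-squares loss plus a bounded nonconvex term) and all data are hypothetical, chosen only for illustration; they are not from the paper.

```python
import numpy as np

# Toy finite-sum objective f(x) = (1/n) sum_i f_i(x), with
# f_i(x) = 0.5*(a_i^T x - y_i)^2 + sum_j x_j^2/(1+x_j^2)  (nonconvex second term).
rng = np.random.default_rng(0)
n, d = 100, 5
A = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def f(x):
    """Full objective value f(x)."""
    return np.mean(0.5 * (A @ x - y) ** 2) + np.sum(x ** 2 / (1 + x ** 2))

def grad_fi(x, i):
    """Stochastic gradient: gradient of one component f_i at x."""
    residual = A[i] @ x - y[i]
    nonconvex = 2 * x / (1 + x ** 2) ** 2   # derivative of x^2/(1+x^2)
    return residual * A[i] + nonconvex

def grad_f(x):
    """Full gradient: the average of the n component gradients."""
    return np.mean([grad_fi(x, i) for i in range(n)], axis=0)

x = rng.standard_normal(d)
g_full = grad_f(x)                                           # exact gradient
g_mini = np.mean([grad_fi(x, i) for i in rng.integers(0, n, size=10)], axis=0)
```

A minibatch gradient such as `g_mini` is an unbiased but noisy estimate of `g_full`; the algorithms in Tables 1 and 2 differ mainly in how they reduce this noise.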
This form usually models population risk minimization in machine learning problems.
Now, we make standard smoothness assumptions for these two problems.

Assumption 1 (Gradient Lipschitz)
1. For the finite-sum problem (1), each f_i(x) is differentiable and has an L-Lipschitz continuous gradient, i.e.,
  ‖∇f_i(x₁) − ∇f_i(x₂)‖ ≤ L‖x₁ − x₂‖,  ∀x₁, x₂ ∈ R^d.    (3)
2. For the online problem (2), F(x, ζ) is differentiable and has an L-Lipschitz continuous gradient, i.e.,
  ‖∇F(x₁, ζ) − ∇F(x₂, ζ)‖ ≤ L‖x₁ − x₂‖,  ∀x₁, x₂ ∈ R^d.    (4)

Assumption 2 (Hessian Lipschitz)
1. For the finite-sum problem (1), each f_i(x) is twice-differentiable and has a ρ-Lipschitz continuous Hessian, i.e.,
  ‖∇²f_i(x₁) − ∇²f_i(x₂)‖ ≤ ρ‖x₁ − x₂‖,  ∀x₁, x₂ ∈ R^d.    (5)
2. For the online problem (2), F(x, ζ) is twice-differentiable and has a ρ-Lipschitz continuous Hessian, i.e.,
  ‖∇²F(x₁, ζ) − ∇²F(x₂, ζ)‖ ≤ ρ‖x₁ − x₂‖,  ∀x₁, x₂ ∈ R^d.    (6)

These two assumptions are standard for all algorithms in both Tables 1 and 2: Assumption 1 for finding first-order stationary points, and Assumptions 1 and 2 for finding second-order stationary points.
Now we define approximate first-order stationary points and approximate second-order stationary points.

Definition 1 x is an ε-first-order stationary point for a differentiable function f if
  ‖∇f(x)‖ ≤ ε.    (7)
x is an (ε, δ)-second-order stationary point for a twice-differentiable function f if
  ‖∇f(x)‖ ≤ ε and λmin(∇²f(x)) ≥ −δ.    (8)

The definition of an (ε, δ)-second-order stationary point is the same as in [4, 8, 34, 11], and it generalizes the classical version with δ = √(ρε) used in [25, 18, 15].

3 Simple Stochastic Recursive Gradient Descent

In this section, we propose the simple stochastic recursive gradient descent algorithm called SSRGD. A high-level description (which omits the stop-condition details in Line 10) is given in Algorithm 1, and the full algorithm (containing the stop conditions) is described in Algorithm 2. Compared with the high-level Algorithm 1, the only difference is that Algorithm 2 contains the stop condition of a super epoch (Lines 13–14 of Algorithm 2) and the random stop of an epoch (Lines 15–16 of Algorithm 2). Note that we call each outer loop an epoch, i.e., iterations t from sm to (s + 1)m for an epoch s.
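Definition 1 can be checked numerically on small problems. The sketch below (an illustration, not part of the paper) does so for the classic strict-saddle function f(x) = x₀² − x₁², whose origin has zero gradient but a negative Hessian eigenvalue, so it is a first-order stationary point without being a second-order one:

```python
import numpy as np

# f(x) = x0^2 - x1^2 has a strict saddle at the origin.
def f_grad(x):
    return np.array([2 * x[0], -2 * x[1]])

def f_hess(x):
    return np.diag([2.0, -2.0])

def is_first_order(x, eps):
    # Definition 1, (7): ||grad f(x)|| <= eps
    return np.linalg.norm(f_grad(x)) <= eps

def is_second_order(x, eps, delta):
    # Definition 1, (8): additionally lambda_min(Hessian) >= -delta
    lam_min = np.linalg.eigvalsh(f_hess(x)).min()
    return is_first_order(x, eps) and lam_min >= -delta

origin = np.zeros(2)
print(is_first_order(origin, 0.1))        # True: gradient is exactly zero
print(is_second_order(origin, 0.1, 0.5))  # False: lambda_min = -2 < -0.5
```

For large d one would not form the full Hessian; Neon2-style subroutines approximate the smallest eigenvector instead, which is exactly the machinery SSRGD avoids.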
We call the iterations between the beginning and the end of a perturbation a super epoch.

Algorithm 2 Simple Stochastic Recursive Gradient Descent (SSRGD)
Input: initial point x0, epoch length m, minibatch size b, step size η, perturbation radius r, threshold gradient g_thres, threshold function value f_thres, super epoch length t_thres
1: super_epoch ← 0
2: for s = 0, 1, 2, . . . do
3:   if super_epoch = 0 and ‖∇f(x_sm)‖ ≤ g_thres then
4:     super_epoch ← 1   // start a super epoch near a saddle point
5:     x̃ ← x_sm, t_init ← sm
6:     x_sm ← x̃ + ξ, where ξ uniformly ∼ B0(r)
7:   end if
8:   v_sm ← ∇f(x_sm)
9:   for k = 1, 2, . . . , m do
10:    t ← sm + k
11:    x_t ← x_{t−1} − η v_{t−1}
12:    v_t ← (1/b) Σ_{i∈I_b} (∇f_i(x_t) − ∇f_i(x_{t−1})) + v_{t−1}   // I_b are i.i.d. uniform samples with |I_b| = b
13:    if super_epoch = 1 and (f(x̃) − f(x_t) ≥ f_thres or t − t_init ≥ t_thres) then
14:      super_epoch ← 0; break
15:    else if super_epoch = 0 then
16:      break with probability 1/(m − k + 1)
         // we use a random stop since we want to choose a uniformly random point as the starting point of the next epoch
17:    end if
18:  end for
19:  x_{(s+1)m} ← x_t
20: end for

The SSRGD algorithm is based on stochastic recursive gradient descent, which was introduced in [26] for convex optimization. In particular, Nguyen et al. [26] wanted to avoid storing past gradients, as SAGA [9] does, by using the recursive gradient.
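The control flow of Algorithm 2 can be sketched compactly in code. The sketch below is illustrative only: `grad_f` and `grad_fi` are assumed callables for ∇f and ∇f_i, and the parameter values a caller passes are placeholders, not the tuned choices of Theorems 1 and 2.

```python
import numpy as np

def sample_ball(d, r, rng):
    """Uniform sample from the Euclidean ball B0(r) in R^d."""
    u = rng.standard_normal(d)
    return r * rng.random() ** (1.0 / d) * u / np.linalg.norm(u)

def ssrgd(x0, grad_f, grad_fi, n, m, b, eta, r, g_thres, f_thres, t_thres,
          f=None, epochs=50, rng=np.random.default_rng(0)):
    """Sketch of Algorithm 2 (SSRGD); line numbers refer to the pseudocode."""
    x, super_epoch, x_tilde, t_init, t = x0.copy(), False, None, 0, 0
    for s in range(epochs):
        if not super_epoch and np.linalg.norm(grad_f(x)) <= g_thres:
            x_tilde, t_init, super_epoch = x.copy(), t, True   # Lines 4-5
            x = x + sample_ball(x0.size, r, rng)               # perturbation, Line 6
        v = grad_f(x)                                          # snapshot gradient, Line 8
        for k in range(1, m + 1):
            t += 1
            x_prev, x = x, x - eta * v                         # Line 11
            I = rng.integers(0, n, size=b)
            # recursive estimator (Line 12): v_t = mean_i(grad_fi(x_t)-grad_fi(x_{t-1})) + v_{t-1}
            v = np.mean([grad_fi(x, i) - grad_fi(x_prev, i) for i in I], axis=0) + v
            if super_epoch and (t - t_init >= t_thres or
                                (f is not None and f(x_tilde) - f(x) >= f_thres)):
                super_epoch = False; break                     # stop super epoch, Lines 13-14
            if not super_epoch and rng.random() < 1.0 / (m - k + 1):
                break                                          # random epoch stop, Lines 15-16
    return x
```

Note that at k = m the random stop fires with probability 1, so each epoch ends after at most m inner iterations, and the point reached becomes the next epoch's starting point (Line 19).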
This stochastic recursive gradient descent is now widely used in recent work on nonconvex optimization, such as SPIDER [11], SpiderBoost [32] and some variants of SARAH (e.g., ProxSARAH [27]).
Recall that in the well-known SVRG algorithm, Johnson and Zhang [19] reuse a fixed snapshot full gradient ∇f(x̃) (computed at the beginning of each epoch) in the gradient estimator:

  v_t ← (1/b) Σ_{i∈I_b} (∇f_i(x_t) − ∇f_i(x̃)) + ∇f(x̃),    (9)

while stochastic recursive gradient descent uses a recursive update form (a more timely update):

  v_t ← (1/b) Σ_{i∈I_b} (∇f_i(x_t) − ∇f_i(x_{t−1})) + v_{t−1}.    (10)

4 Convergence Results

Similar to perturbed GD [18] and perturbed SVRG [15], we add simple perturbations to the stochastic recursive gradient descent algorithm to escape saddle points efficiently. Besides, we also consider the more general online case. In the following theorems, we provide the convergence results of SSRGD for finding an ε-first-order stationary point and an (ε, δ)-second-order stationary point for both the nonconvex finite-sum problem (1) and the online problem (2). The detailed proofs are provided in Appendix C. We give an overview of the proofs in Section 5.

4.1 Nonconvex Finite-sum Problem

Theorem 1 Under Assumption 1 (i.e., (3)), let Δf := f(x₀) − f*, where x₀ is the initial point and f* is the optimal value of f. By letting step size η ≤ (√5 − 1)/(2L), epoch length m = √n and minibatch size b = √n, SSRGD will find an ε-first-order stationary point in expectation using

  O(n + LΔf√n/ε²)

stochastic gradients for the nonconvex finite-sum problem (1).

Theorem 2 Under Assumptions 1 and 2 (i.e.
(3) and (5)), let Δf := f(x₀) − f*, where x₀ is the initial point and f* is the optimal value of f. By letting step size η = Õ(1/L), epoch length m = √n, minibatch size b = √n, perturbation radius r = Õ(min(δ³/(ρ²ε), δ^{3/2}/ρ)), threshold gradient g_thres = ε, threshold function value f_thres = Õ(δ³/ρ²) and super epoch length t_thres = Õ(1/(ηδ)), SSRGD will at least once get to an (ε, δ)-second-order stationary point with high probability using

  Õ( LΔf√n/ε² + Lρ²Δf√n/δ⁴ + ρ²Δf n/δ³ )

stochastic gradients for the nonconvex finite-sum problem (1).

4.2 Nonconvex Online (Expectation) Problem

For the nonconvex online problem (2), one usually needs the following bounded-variance assumption. For notational convenience, we also write this online case in finite-sum form by letting ∇f_i(x) := ∇F(x, ζ_i) and thinking of n as infinity (infinitely many data samples). Although we write it in finite-sum form, the convergence analysis of optimization methods in the online case is a little different from the finite-sum case.

Assumption 3 (Bounded Variance) For ∀x ∈ R^d, E_i[‖∇f_i(x) − ∇f(x)‖²] := E_{ζ_i}[‖∇F(x, ζ_i) − ∇f(x)‖²] ≤ σ², where σ > 0 is a constant.

Note that this assumption is standard and necessary for the online case, since full gradients are not available (see e.g., [16, 22, 23, 21, 20, 35, 11, 32, 27]).
Moreover, we need to modify the full gradient computation step at the beginning of each epoch to a large-batch stochastic gradient computation step (similar to [22, 23]), i.e., change v_sm ← ∇f(x_sm) (Line 8 of Algorithm 2) to

  v_sm ← (1/B) Σ_{j∈I_B} ∇f_j(x_sm),    (11)

where I_B are i.i.d. samples with |I_B| = B. We call B the batch size and b the minibatch size. Also, we need to change ‖∇f(x_sm)‖ ≤ g_thres (Line 3 of Algorithm 2) to ‖v_sm‖ ≤ g_thres.

Theorem 3 Under Assumption 1 (i.e., (4)) and Assumption 3, let Δf := f(x₀) − f*, where x₀ is the initial point and f* is the optimal value of f. By letting step size η ≤ (√5 − 1)/(2L), batch size B = 4σ²/ε², minibatch size b = √B = σ/ε and epoch length m = b, SSRGD will find an ε-first-order stationary point in expectation using

  O( σ²/ε² + LΔf σ/ε³ )

stochastic gradients for the nonconvex online problem (2).

For achieving a high-probability result for finding second-order stationary points in the online case (i.e., Theorem 4), we need a stronger version of Assumption 3, stated as the following Assumption 4.

Assumption 4 (Bounded Variance) For ∀i, x, ‖∇f_i(x) − ∇f(x)‖² := ‖∇F(x, ζ_i) − ∇f(x)‖² ≤ σ², where σ > 0 is a constant.

We want to point out that Assumption 4 can be relaxed so that ‖∇f_i(x) − ∇f(x)‖ has a sub-Gaussian tail, i.e., E[exp(λ‖∇f_i(x) − ∇f(x)‖)] ≤ exp(λ²σ²/2) for ∀λ ∈ R. Then it suffices for us to get a high-probability bound by using the Hoeffding bound on these sub-Gaussian variables.
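The role of the batch size B in (11) is exactly variance reduction by averaging: under Assumption 3, averaging B i.i.d. stochastic gradients shrinks the expected squared error by a factor of B, which is why B = O(σ²/ε²) makes the snapshot estimate ε-accurate. A quick numerical illustration (the data model below is a stand-in, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 4, 2.0
true_grad = np.ones(d)   # pretend this is the exact gradient at some point x

def stochastic_grad(size):
    """i.i.d. stochastic gradients: true gradient plus noise of per-coordinate std sigma."""
    return true_grad + sigma * rng.standard_normal((size, d))

# B = 1: raw stochastic gradients.
single = stochastic_grad(5000)
# B = 100: averages of 100 i.i.d. stochastic gradients, as in (11).
batched = stochastic_grad(5000 * 100).reshape(5000, 100, d).mean(axis=1)

var_single = np.mean(np.sum((single - true_grad) ** 2, axis=1))    # about d*sigma^2 = 16
var_batched = np.mean(np.sum((batched - true_grad) ** 2, axis=1))  # about d*sigma^2/100
```

The same calculation motivates b = √B for the minibatch size: it balances the cost of the snapshot step against the m = b inner steps of an epoch.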
Note that Assumption 4 (or the relaxed sub-Gaussian version) is also standard for second-order stationary point finding algorithms in the online case (see e.g., [4, 34, 11]).

Theorem 4 Under Assumptions 1 and 2 (i.e., (4) and (6)) and Assumption 4, let Δf := f(x₀) − f*, where x₀ is the initial point and f* is the optimal value of f. By letting step size η = Õ(1/L), batch size B = Õ(σ²/g²_thres) = Õ(σ²/ε²), minibatch size b = √B = Õ(σ/ε), epoch length m = b, perturbation radius r = Õ(min(δ³/(ρ²ε), δ^{3/2}/ρ)), threshold gradient g_thres = ε ≤ δ²/ρ, threshold function value f_thres = Õ(δ³/ρ²) and super epoch length t_thres = Õ(1/(ηδ)), SSRGD will at least once get to an (ε, δ)-second-order stationary point with high probability using

  Õ( LΔf σ/ε³ + ρ²Δf σ²/(ε²δ³) + Lρ²Δf σ/(εδ⁴) )

stochastic gradients for the nonconvex online problem (2).

5 Overview of the Proofs

5.1 Finding First-order Stationary Points

In this section, we first show why the SSRGD algorithm can improve previous SVRG-type algorithms (see e.g., [23, 15]) from n^{2/3}/ε² to n^{1/2}/ε².
Then we give a simple high-level proof achieving the n^{1/2}/ε² convergence result (i.e., Theorem 1).

Why it can be improved from n^{2/3}/ε² to n^{1/2}/ε²: First, we need a key relation between f(x_t) and f(x_{t−1}), where x_t := x_{t−1} − η v_{t−1}:

  f(x_t) ≤ f(x_{t−1}) − (η/2)‖∇f(x_{t−1})‖² − (1/(2η) − L/2)‖x_t − x_{t−1}‖² + (η/2)‖∇f(x_{t−1}) − v_{t−1}‖²,    (12)

where (12) holds since f has an L-Lipschitz continuous gradient (Assumption 1). The details for obtaining (12) can be found in Appendix C.1 (see (27)).

Note that (12) is very meaningful and also very important for the proofs. The first term −(η/2)‖∇f(x_{t−1})‖² indicates that the function value decreases a lot if the gradient ∇f(x_{t−1}) is large. The second term −(1/(2η) − L/2)‖x_t − x_{t−1}‖² indicates that the function value also decreases a lot if the moving distance x_t − x_{t−1} is large (note that here we require the step size η ≤ 1/L). The additional third term +(η/2)‖∇f(x_{t−1}) − v_{t−1}‖² exists since we use v_{t−1} as an estimator of the actual gradient ∇f(x_{t−1}) (i.e., x_t := x_{t−1} − η v_{t−1}); it may increase the function value if v_{t−1} is a bad direction in this step.
To get an ε-first-order stationary point, we want to cancel the last two terms in (12). First, we want to bound the last variance term.
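For intuition, here is a sketch of how (12) follows from Assumption 1 in a few lines (the full argument is Equation (27) in Appendix C.1):

```latex
% L-smoothness gives the quadratic upper bound
f(x_t) \le f(x_{t-1}) + \langle \nabla f(x_{t-1}),\, x_t - x_{t-1} \rangle
          + \tfrac{L}{2}\|x_t - x_{t-1}\|^2 .
% Substitute x_t - x_{t-1} = -\eta v_{t-1} and apply the identity
% \langle a, b \rangle = \tfrac12\left(\|a\|^2 + \|b\|^2 - \|a-b\|^2\right)
% with a = \nabla f(x_{t-1}) and b = v_{t-1}:
\langle \nabla f(x_{t-1}),\, -\eta v_{t-1} \rangle
  = -\tfrac{\eta}{2}\|\nabla f(x_{t-1})\|^2
    - \tfrac{1}{2\eta}\|x_t - x_{t-1}\|^2
    + \tfrac{\eta}{2}\|\nabla f(x_{t-1}) - v_{t-1}\|^2 ,
% where we used \tfrac{\eta}{2}\|v_{t-1}\|^2 = \tfrac{1}{2\eta}\|x_t - x_{t-1}\|^2.
% Combining the two displays and collecting the \|x_t - x_{t-1}\|^2 terms
% gives exactly (12), with coefficient -(\tfrac{1}{2\eta} - \tfrac{L}{2}).
```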
Recall the variance bound (see Equation (29) in [23]) for the SVRG algorithm, i.e., estimator (9):

  E[‖∇f(x_{t−1}) − v_{t−1}‖²] ≤ (L²/b) E[‖x_{t−1} − x̃‖²].    (13)

In order to connect the last two terms in (12), we use Young's inequality on the second term ‖x_t − x_{t−1}‖², i.e., −‖x_t − x_{t−1}‖² ≤ (1/α)‖x_{t−1} − x̃‖² − (1/(1+α))‖x_t − x̃‖² (for any α > 0). By plugging this Young's inequality and (13) into (12), we can cancel the last two terms in (12) by summing up (12) over each epoch, i.e., for each epoch s (i.e., iterations sm + 1 ≤ t ≤ sm + m), we have (see Equation (35) in [23])

  E[f(x_{(s+1)m})] ≤ E[f(x_sm)] − (η/2) Σ_{j=sm+1}^{sm+m} E[‖∇f(x_{j−1})‖²].    (14)

However, due to the Young's inequality, we need b ≥ m² to cancel the last two terms in (12) and obtain (14), where b denotes the minibatch size and m the epoch length. According to (14), it is not hard to see that x̂ is an ε-first-order stationary point in expectation (i.e., E[‖∇f(x̂)‖] ≤ ε) if x̂ is chosen uniformly at random from {x_{t−1}}_{t∈[T]} and the number of iterations is T = Sm = 2(f(x₀) − f*)/(ηε²). Note that for each iteration we need to compute b + n/m stochastic gradients, where we amortize the full gradient computation at the beginning of each epoch (n stochastic gradients) over the iterations of its epoch (i.e., n/m per iteration) for simplicity of presentation. Thus, the convergence result is T(b + n/m) ≥ n^{2/3}/ε², since b ≥ m², where equality holds if b = m² = n^{2/3}.
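The difference between the constraints b ≥ m² and b ≥ m is exactly what moves the per-iteration cost from order n^{2/3} to order n^{1/2}; a quick numeric sanity check (illustrative arithmetic only):

```python
# Per-iteration stochastic gradient cost is b + n/m; the number of
# iterations T ~ 1/eps^2 is the same in both settings.
n = 10 ** 6

# SVRG-type setting: the constraint b >= m^2 is tightest at b = m^2 = n^{2/3}.
m_svrg = round(n ** (1 / 3))           # m = n^{1/3} = 100
cost_svrg = m_svrg ** 2 + n // m_svrg  # b + n/m = 10^4 + 10^4

# SSRGD setting: the constraint b >= m is tightest at b = m = n^{1/2}.
m_ssrgd = round(n ** 0.5)              # m = n^{1/2} = 1000
cost_ssrgd = m_ssrgd + n // m_ssrgd    # b + n/m = 10^3 + 10^3

print(cost_svrg, cost_ssrgd)  # 20000 2000
```

In both cases the optimum balances the minibatch cost b against the amortized snapshot cost n/m; the weaker constraint simply allows a much smaller balanced value.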
Note that here we ignore the factors of f(x₀) − f* and η = O(1/L).

However, for the stochastic recursive gradient estimator (10), we can bound the last variance term in (12) as (see Equation (33) in Appendix C.1):

  E[‖∇f(x_{t−1}) − v_{t−1}‖²] ≤ (L²/b) Σ_{j=sm+1}^{t−1} E[‖x_j − x_{j−1}‖²].    (15)

The advantage of (15) over (13) is that it is already connected to the second term in (12), i.e., the moving distances {‖x_t − x_{t−1}‖²}_t. Thus we do not need an additional Young's inequality to transform the second term, as before. This makes the function value decrease bound tighter. Similarly, we plug (15) into (12) and sum over each epoch to cancel the last two terms in (12), i.e., for each epoch s we have (see Equation (35) in Appendix C.1)

  E[f(x_{(s+1)m})] ≤ E[f(x_sm)] − (η/2) Σ_{j=sm+1}^{sm+m} E[‖∇f(x_{j−1})‖²].    (16)

Compared with (14) (which requires b ≥ m²), here (16) only requires b ≥ m, due to the tighter function value decrease bound that avoids the additional Young's inequality.

High-level proof achieving the n^{1/2}/ε² result: According to (16), we can use the same SVRG-style argument as above to show the n^{1/2}/ε² convergence result of SSRGD: x̂ is an ε-first-order stationary point in expectation (i.e., E[‖∇f(x̂)‖] ≤ ε) if x̂ is chosen uniformly at random from {x_{t−1}}_{t∈[T]} and the number of iterations is T = Sm = 2(f(x₀) − f*)/(ηε²). Also, each iteration computes b + n/m stochastic gradients. The only difference is that now the convergence result is T(b + n/m) = O(LΔf√n/ε²), since b ≥ m (rather than b ≥ m²), where we let b = m = n^{1/2}, η = O(1/L) and Δf := f(x₀) − f*.
Moreover, it is optimal since it matches the lower bound $\Omega\big(\frac{\sqrt{n}\,L\Delta f}{\epsilon^2}\big)$ provided by [11].

5.2 Finding Second-order Stationary Points

In this section, due to the space limit, we only discuss some high-level proof ideas for finding a second-order stationary point with high probability; a more detailed proof sketch is provided in Appendix A. We discussed the difference between the first-order guarantee analyses of estimator (9) and estimator (10) in Section 5.1. For the second-order analysis, since the estimator (10) in our SSRGD is more correlated than (9), we use martingales to handle it. Moreover, different estimators incur more differences in the detailed proofs of the second-order guarantee analysis than in the first-order one.

We divide the proof into two situations: large gradients and around saddle points. According to (16), a natural way to prove the convergence result is to show that the function value decreases at a desired rate with high probability. Note that the total amount of function value decrease is at most $\Delta f := f(x^0) - f^*$.

Large gradients: $\|\nabla f(x)\| \ge g_{\mathrm{thres}}$.
In this situation, due to the large gradients, it is sufficient to adjust the first-order analysis to show that the function value decreases a lot within an epoch with high probability. Concretely, we show that the function value decrease bound (16) holds with high probability by using the Azuma-Hoeffding inequality to bound the variance term (15) with high probability. Then, according to (16), it is not hard to see that the desired rate of function value decrease is $O(\eta g_{\mathrm{thres}}^2) = \widetilde{O}(\epsilon^2/L)$ per iteration in this situation (recall the parameters $g_{\mathrm{thres}} = \epsilon$ and $\eta = \widetilde{O}(1/L)$ in our Theorem 2). Also note that we compute $b + n/m = 2\sqrt{n}$ stochastic gradients at each iteration (recall $m = b = \sqrt{n}$ in our Theorem 2). Here we amortize the full gradient computation at the beginning point of each epoch ($n$ stochastic gradients) over the iterations of that epoch (i.e., $n/m$ per iteration) for simple presentation (we analyze this more rigorously in the detailed proofs in the appendices). Thus the number of stochastic gradient computations is at most $\widetilde{O}\big(\frac{\sqrt{n}\,\Delta f}{\epsilon^2/L}\big) = \widetilde{O}\big(\frac{\sqrt{n}\,L\Delta f}{\epsilon^2}\big)$ for the large gradients situation.

Note that (16) only guarantees a function value decrease when the summation of gradients in the epoch is large. However, in order to connect the guarantees between the first situation (large gradients) and the second situation (around saddle points), we need guarantees related to the gradient of the starting point of each epoch (see Line 3 of Algorithm 2). Similar to [15], we achieve this by stopping the epoch at a uniformly random point (see Line 16 of Algorithm 2). Then we know that either the function value already decreases a lot in this epoch $s$, or the starting point of the next epoch $x_{(s+1)m}$ is around a saddle point (or $x_{(s+1)m}$ is already a second-order stationary point).

Around saddle points: $\|\nabla f(\widetilde{x})\| \le g_{\mathrm{thres}}$ and $\lambda_{\min}(\nabla^2 f(\widetilde{x})) \le -\delta$ at the initial point $\widetilde{x}$ of a super epoch.
In this situation, we want to show that the function value decreases a lot within a super epoch (instead of an epoch as in the first situation) with high probability, by adding a random perturbation at the initial point $\widetilde{x}$. To simplify the presentation, we use $x^0 := \widetilde{x} + \xi$ to denote the starting point of the super epoch after the perturbation, where $\xi$ is sampled uniformly from $B_0(r)$ (see Line 6 in Algorithm 2).
First, we show that if the function value does not decrease a lot, then all iteration points stay close to the starting point with high probability (localization). Concretely, we have

$$\forall t,\quad \|x_t - x^0\| \;\le\; \sqrt{\frac{4t\,(f(x^0) - f(x_t))}{C'L}}, \qquad (17)$$

where $C' = \widetilde{O}(1)$. Then we show that the stuck region is relatively small within the random perturbation ball, i.e., $x_t$ goes far away from the perturbed starting point $x^0$ with high probability (small stuck region). Concretely, we have

$$\exists\, t \le t_{\mathrm{thres}},\quad \|x_t - x^0\| \;\ge\; \frac{\delta}{C_1\rho}, \qquad (18)$$

where $C_1 = \widetilde{O}(1)$. Based on (17) and (18), we can prove that

$$\exists\, t \le t_{\mathrm{thres}},\quad f(\widetilde{x}) - f(x_t) \;\ge\; f_{\mathrm{thres}}$$

holds with high probability. Now we can obtain that the desired rate of function value decrease in this situation is $f_{\mathrm{thres}}/t_{\mathrm{thres}} = \widetilde{O}\big(\frac{\delta^3/\rho^2}{1/(\eta\delta)}\big) = \widetilde{O}\big(\frac{\delta^4}{L\rho^2}\big)$ per iteration (recall the parameters $f_{\mathrm{thres}} = \widetilde{O}(\delta^3/\rho^2)$, $t_{\mathrm{thres}} = \widetilde{O}(1/(\eta\delta))$ and $\eta = \widetilde{O}(1/L)$ in our Theorem 2).
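The two situations are handled by one simple control flow: run recursive-gradient epochs, stop each epoch at a uniformly random step, and add a uniform perturbation whenever the gradient at an epoch's starting point is small. Below is a hedged, runnable sketch of that control flow (our illustrative simplification, not the paper's implementation of Algorithm 2; `ssrgd_sketch`, `grad_diff`, and the threshold values are assumed names and placeholders).

```python
import numpy as np

rng = np.random.default_rng(1)

def ssrgd_sketch(full_grad, grad_diff, x, n, eta=0.05,
                 g_thres=1e-3, r=1e-4, epochs=40):
    """Sketch of SSRGD's control flow: perturb near suspected saddle points,
    otherwise run plain recursive-gradient epochs of random length."""
    m = b = max(int(np.sqrt(n)), 1)
    for _ in range(epochs):
        if np.linalg.norm(full_grad(x)) <= g_thres:
            # Small gradient at the starting point: possibly near a saddle,
            # so add xi drawn uniformly from the ball B_0(r) and let the same
            # epochs escape (no Hessian, no negative-curvature subroutine).
            xi = rng.standard_normal(x.shape)
            xi *= r * rng.random() ** (1.0 / x.size) / np.linalg.norm(xi)
            x = x + xi
        v = full_grad(x)                       # snapshot gradient
        stop = int(rng.integers(1, m + 1))     # stop epoch at a uniformly random step
        for _ in range(stop):
            x_new = x - eta * v
            idx = rng.integers(0, n, size=b)
            v = grad_diff(x_new, x, idx) + v   # recursive estimator update
            x = x_new
    return x

# Toy finite-sum least-squares instance, just to exercise the loop
# (convex, so the perturbation branch is rarely triggered here).
n, d = 64, 5
A = rng.standard_normal((n, d))
y = rng.standard_normal(n)
fg = lambda z: A.T @ (A @ z - y) / n
gd = lambda xn, xo, idx: A[idx].T @ (A[idx] @ (xn - xo)) / len(idx)

x = ssrgd_sketch(fg, gd, np.zeros(d), n)
print(float(np.linalg.norm(fg(x))))
```

The point of the sketch is the simplicity emphasized in the paper: the saddle-escaping variant only adds the perturbation branch on top of the first-order algorithm.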
Same as before, we compute $b + n/m = 2\sqrt{n}$ stochastic gradients at each iteration (recall $m = b = \sqrt{n}$ in our Theorem 2). Thus the number of stochastic gradient computations is at most $\widetilde{O}\big(\frac{\sqrt{n}\,\Delta f}{\delta^4/(L\rho^2)}\big) = \widetilde{O}\big(\frac{\sqrt{n}\,L\rho^2\Delta f}{\delta^4}\big)$ for the around saddle points situation.

In sum, the number of stochastic gradient computations is at most $\widetilde{O}\big(\frac{\sqrt{n}\,L\Delta f}{\epsilon^2}\big)$ for the large gradients situation and at most $\widetilde{O}\big(\frac{\sqrt{n}\,L\rho^2\Delta f}{\delta^4}\big)$ for the around saddle points situation. Moreover, for the classical version where $\delta = \sqrt{\rho\epsilon}$ [25, 18], we have $\widetilde{O}\big(\frac{\sqrt{n}\,L\rho^2\Delta f}{\delta^4}\big) = \widetilde{O}\big(\frac{\sqrt{n}\,L\Delta f}{\epsilon^2}\big)$, i.e., both situations yield the same stochastic gradient complexity. This also matches the convergence result for finding first-order stationary points (see our Theorem 1) if we ignore the logarithmic factor. More importantly, it also almost matches the lower bound $\Omega\big(\frac{\sqrt{n}\,L\Delta f}{\epsilon^2}\big)$ provided by [11] for finding even just an $\epsilon$-first-order stationary point.

Finally, we point out that Theorem 2 contains an extra term $\frac{\rho^2\Delta f\, n}{\delta^3}$ beyond the two terms obtained from the above two situations. The reason is that here we amortize the full gradient computation at the beginning point of each epoch ($n$ stochastic gradients) over the iterations of that epoch (i.e., $n/m$ per iteration) for simple presentation. We analyze this more rigorously in the appendices, and that analysis incurs the term $\frac{\rho^2\Delta f\, n}{\delta^3}$. For the more general online problem (2), the high-level proofs are almost the same as for the finite-sum problem (1). The difference is that the detailed proofs need more concentration bounds, since full gradients are not available in the online case.

Acknowledgments

This work was supported by the Office of Sponsored Research of KAUST, through the Baseline Research Fund of Prof. Peter Richtárik.
The author would like to thank Rong Ge (Duke), Jian Li (Tsinghua) and the anonymous reviewers for their useful discussions and suggestions.

References

[1] Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima for nonconvex optimization in linear time. arXiv preprint arXiv:1611.01146, 2016.

[2] Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. In Advances in Neural Information Processing Systems, pages 2680-2691, 2018.

[3] Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. In International Conference on Machine Learning, pages 699-707, 2016.

[4] Zeyuan Allen-Zhu and Yuanzhi Li. Neon2: Finding local minima via first-order oracles. In Advances in Neural Information Processing Systems, pages 3720-3730, 2018.

[5] Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems, pages 3873-3881, 2016.

[6] Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for non-convex optimization. arXiv preprint arXiv:1611.00756, 2016.

[7] Fan Chung and Linyuan Lu. Concentration inequalities and martingale inequalities: a survey. Internet Mathematics, 3(1):79-127, 2006.

[8] Hadi Daneshmand, Jonas Kohler, Aurelien Lucchi, and Thomas Hofmann. Escaping saddles with stochastic gradients. arXiv preprint arXiv:1803.05999, 2018.

[9] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646-1654, 2014.

[10] Simon S Du, Chi Jin, Jason D Lee, Michael I Jordan, Aarti Singh, and Barnabas Poczos. Gradient descent can take exponential time to escape saddle points.
In Advances in Neural Information Processing Systems, pages 1067-1077, 2017.

[11] Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pages 687-697, 2018.

[12] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points: online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797-842, 2015.

[13] Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973-2981, 2016.

[14] Rong Ge, Jason D Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017.

[15] Rong Ge, Zhize Li, Weiyao Wang, and Xiang Wang. Stabilized SVRG: Simple variance reduction for nonconvex optimization. In Conference on Learning Theory, 2019.

[16] Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267-305, 2016.

[17] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13-30, 1963.

[18] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning, pages 1724-1732, 2017.

[19] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315-323, 2013.

[20] Guanghui Lan, Zhize Li, and Yi Zhou.
A unified variance-reduced accelerated gradient method for convex optimization. In Advances in Neural Information Processing Systems, 2019.

[21] Guanghui Lan and Yi Zhou. Random gradient extrapolation for distributed and stochastic optimization. SIAM Journal on Optimization, 28(4):2753-2782, 2018.

[22] Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I Jordan. Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems, pages 2345-2355, 2017.

[23] Zhize Li and Jian Li. A simple proximal stochastic gradient method for nonsmooth nonconvex optimization. In Advances in Neural Information Processing Systems, pages 5569-5579, 2018.

[24] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.

[25] Yurii Nesterov and Boris T Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177-205, 2006.

[26] Lam M Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning, pages 2613-2621, 2017.

[27] Nhan H Pham, Lam M Nguyen, Dzung T Phan, and Quoc Tran-Dinh. ProxSARAH: An efficient algorithmic framework for stochastic composite nonconvex optimization. arXiv preprint arXiv:1902.05679, 2019.

[28] Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pages 314-323, 2016.

[29] Terence Tao and Van Vu. Random matrices: Universality of local spectral statistics of non-Hermitian matrices. The Annals of Probability, 43(2):782-874, 2015.

[30] Joel A Tropp. User-friendly tail bounds for matrix martingales.
Technical report, California Institute of Technology, Pasadena, 2011.

[31] Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389-434, 2012.

[32] Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang, and Vahid Tarokh. SpiderBoost: A class of faster variance-reduced algorithms for nonconvex optimization. arXiv preprint arXiv:1810.10690, 2018.

[33] Yi Xu, Jing Rong, and Tianbao Yang. First-order stochastic algorithms for escaping from saddle points in almost linear time. In Advances in Neural Information Processing Systems, pages 5535-5545, 2018.

[34] Dongruo Zhou, Pan Xu, and Quanquan Gu. Finding local minima via stochastic nested variance reduction. arXiv preprint arXiv:1806.08782, 2018.

[35] Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic nested variance reduction for nonconvex optimization. arXiv preprint arXiv:1806.07811, 2018.