{"title": "Gradient Descent Can Take Exponential Time to Escape Saddle Points", "book": "Advances in Neural Information Processing Systems", "page_first": 1067, "page_last": 1077, "abstract": "Although gradient descent (GD) almost always escapes saddle points asymptotically [Lee et al., 2016], this paper shows that even with fairly natural random initialization schemes and non-pathological functions, GD can be significantly slowed down by saddle points, taking exponential time to escape. On the other hand, gradient descent with perturbations [Ge et al., 2015, Jin et al., 2017] is not slowed down by saddle points\u2014it can find an approximate local minimizer in polynomial time. This result implies that GD is inherently slower than perturbed GD, and justifies the importance of adding perturbations for efficient non-convex optimization. While our focus is theoretical, we also present experiments that illustrate our theoretical findings.", "full_text": "Gradient Descent Can Take Exponential Time to\n\nEscape Saddle Points\n\nSimon S. Du\n\nCarnegie Mellon University\n\nssdu@cs.cmu.edu\n\nChi Jin\n\nUniversity of California, Berkeley\n\nchijin@berkeley.edu\n\nJason D. Lee\n\nUniversity of Southern California\n\njasonlee@marshall.usc.edu\n\nMichael I. Jordan\n\nUniversity of California, Berkeley\njordan@cs.berkeley.edu\n\nBarnab\u00e1s P\u00f3czos\n\nCarnegie Mellon University\nbapoczos@cs.cmu.edu\n\nAarti Singh\n\nCarnegie Mellon University\naartisingh@cmu.edu\n\nAbstract\n\nAlthough gradient descent (GD) almost always escapes saddle points asymptot-\nically [Lee et al., 2016], this paper shows that even with fairly natural random\ninitialization schemes and non-pathological functions, GD can be signi\ufb01cantly\nslowed down by saddle points, taking exponential time to escape. 
On the other hand, gradient descent with perturbations [Ge et al., 2015, Jin et al., 2017] is not slowed down by saddle points\u2014it can find an approximate local minimizer in polynomial time. This result implies that GD is inherently slower than perturbed GD, and justifies the importance of adding perturbations for efficient non-convex optimization. While our focus is theoretical, we also present experiments that illustrate our theoretical findings.\n\n1 Introduction\n\nGradient Descent (GD) and its myriad variants provide the core optimization methodology in machine learning problems. Given a function f(x), the basic GD method can be written as:\n\nx(t+1) \u2190 x(t) \u2212 \u03b7\u2207f(x(t)),   (1)\n\nwhere \u03b7 is a step size, assumed fixed in the current paper. While precise characterizations of the rate of convergence of GD are available for convex problems, there is far less understanding of GD for non-convex problems. Indeed, for general non-convex problems, GD is only known to find a stationary point (i.e., a point where the gradient equals zero) in polynomial time [Nesterov, 2013]. A stationary point can be a local minimizer, saddle point, or local maximizer. In recent years, there has been an increasing focus on conditions under which it is possible to escape saddle points (more specifically, strict saddle points as in Definition 2.4) and converge to a local minimizer. Moreover, stronger statements can be made when the following two key properties hold: 1) all local minima are global minima, and 2) all saddle points are strict. These properties hold for a variety of machine learning problems, including tensor decomposition [Ge et al., 2015], dictionary learning [Sun et al., 2017], phase retrieval [Sun et al., 2016], matrix sensing [Bhojanapalli et al., 2016, Park et al., 2017], matrix completion [Ge et al., 2016, 2017], and matrix factorization [Li et al., 2016].
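As a concrete reference point, the fixed-step-size update in equation (1) is a one-line iteration. The sketch below is our own minimal illustration, not the paper's experimental code; the quadratic test function and all parameter values are arbitrary choices:

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, num_steps=500):
    """Basic fixed-step-size GD: x_{t+1} = x_t - eta * grad_f(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x = x - eta * grad_f(x)
    return x

# Illustrative convex test case f(x) = ||x||^2 / 2, whose gradient is x itself;
# GD contracts toward the unique minimizer at the origin.
x_final = gradient_descent(lambda x: x, x0=[1.0, -2.0])
```

On smooth convex problems a fixed step size below the inverse gradient-Lipschitz constant yields convergence; the question this paper studies is how long the very same iteration can take near strict saddle points of a non-convex f.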
For these\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fproblems, any algorithm that is capable of escaping strict saddle points will converge to a global\nminimizer from an arbitrary initialization point.\nRecent work has analyzed variations of GD that include stochastic perturbations. It has been shown\nthat when perturbations are incorporated into GD at each step the resulting algorithm can escape strict\nsaddle points in polynomial time [Ge et al., 2015]. It has also been shown that episodic perturbations\nsuf\ufb01ce; in particular, Jin et al. [2017] analyzed an algorithm that occasionally adds a perturbation\nto GD (see Algorithm 1), and proved that not only does the algorithm escape saddle points in\npolynomial time, but additionally the number of iterations to escape saddle points is nearly dimension-\nindependent1. These papers in essence provide suf\ufb01cient conditions under which a variant of GD\nhas favorable convergence properties for non-convex functions. This leaves open the question as to\nwhether such perturbations are in fact necessary. If not, we might prefer to avoid the perturbations if\npossible, as they involve additional hyper-parameters. The current understanding of gradient descent\nis silent on this issue. The major existing result is provided by Lee et al. [2016], who show that\ngradient descent, with any reasonable random initialization, will always escape strict saddle points\neventually\u2014but without any guarantee on the number of steps required. This motivates the following\nquestion:\n\nDoes randomly initialized gradient descent generally escape saddle points in polynomial time?\n\nIn this paper, perhaps surprisingly, we give a strong negative answer to this question. 
We show that even under a fairly natural initialization scheme (e.g., uniform initialization over a unit cube, or Gaussian initialization) and for non-pathological functions satisfying smoothness properties considered in previous work, GD can take an exponentially long time to escape saddle points and reach local minima, while perturbed GD (Algorithm 1) only needs polynomial time. This result shows that GD is fundamentally slower in escaping saddle points than its perturbed variant, and justifies the necessity of adding perturbations for efficient non-convex optimization.\n\nThe counter-example that supports this conclusion is a smooth function defined on R^d, where GD with random initialization will visit the vicinity of d saddle points before reaching a local minimum. While perturbed GD takes a constant amount of time to escape each saddle point, GD will get closer and closer to the saddle points it encounters later, and thus take an increasing amount of time to escape. Eventually, GD requires time that is exponential in the number of saddle points it needs to escape, i.e., e^\u03a9(d) steps.\n\n1.1 Related Work\n\nOver the past few years, there have been many problem-dependent convergence analyses of non-convex optimization problems. One line of work shows that with smart initialization that is assumed to yield a coarse estimate lying inside a neighborhood of a local minimum, local search algorithms such as gradient descent or alternating minimization enjoy fast local convergence; see, e.g., [Netrapalli et al., 2013, Du et al., 2017, Hardt, 2014, Candes et al., 2015, Sun and Luo, 2016, Bhojanapalli et al., 2016, Yi et al., 2016, Zhang et al., 2017]. On the other hand, Jain et al. [2017] show that gradient descent can stay away from saddle points, and provide global convergence rates for matrix square-root problems, even without smart initialization.
Although these results give relatively strong guarantees in terms of rate, their analyses are heavily tailored to specific problems and it is unclear how to generalize them to a wider class of non-convex functions.\n\nFor general non-convex problems, the study of optimization algorithms converging to minimizers dates back to the study of Morse theory and continuous dynamical systems [Palis and De Melo, 2012, Yin and Kushner, 2003]; a classical result states that gradient flow with random initialization always converges to a minimizer. For stochastic gradient, this was shown by Pemantle [1990], although without explicit running time guarantees. Lee et al. [2016] established that randomly initialized gradient descent with a fixed stepsize also converges to minimizers almost surely. However, these results are all asymptotic in nature and it is unclear how they might be extended to deliver explicit convergence rates. Moreover, it is unclear whether polynomial convergence rates can be obtained for these methods.\n\nNext, we review algorithms that can provably find approximate local minimizers in polynomial time. The classical cubic-regularization [Nesterov and Polyak, 2006] and trust region [Curtis et al., 2014] algorithms require access to the full Hessian matrix. A recent line of work [Carmon et al., 2016, Agarwal et al., 2017, Carmon and Duchi, 2016] shows that the requirement of full Hessian access can be relaxed to Hessian-vector products, which can be computed efficiently in many machine learning applications. For pure gradient-based algorithms without access to Hessian information, Ge et al. [2015] show that adding a perturbation in each iteration suffices to escape saddle points in polynomial time.\n\n1Assuming that the smoothness parameters (see Definitions 2.1\u20132.3) are all independent of dimension.
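To make the per-iteration perturbation idea concrete, here is a schematic of our own (not the exact algorithm or parameters of Ge et al. [2015]), run on the strict saddle f(x1, x2) = x1^2 \u2212 x2^2:

```python
import numpy as np

def noisy_gradient_descent(grad_f, x0, eta=0.05, radius=0.05,
                           num_steps=300, seed=1):
    """Schematic GD with a small perturbation added at every iteration:
    x_{t+1} = x_t - eta * grad_f(x_t) + xi_t, with xi_t uniform in a ball."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        xi = rng.normal(size=x.shape)  # random direction
        xi *= radius * rng.uniform() ** (1 / x.size) / np.linalg.norm(xi)
        x = x - eta * grad_f(x) + xi
    return x

# Strict saddle f(x) = x1^2 - x2^2, saddle point at the origin: plain GD
# started exactly at (0, 0) never moves, but the injected noise gives the
# iterate a component along the escaping direction x2, which then grows.
x = noisy_gradient_descent(lambda x: np.array([2 * x[0], -2 * x[1]]),
                           x0=[0.0, 0.0])
```

Started exactly at the saddle, the unperturbed iteration is stuck forever; with noise, the x2 coordinate is amplified by a factor (1 + 2\u03b7) per step and the iterate leaves the saddle's neighborhood quickly.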
When smoothness parameters are all dimension independent, Levy [2016] analyzed a normalized form of gradient descent with perturbation, and improved the dimension dependence to O(d\u00b3). This dependence has been further improved in recent work [Jin et al., 2017] to polylog(d) via perturbed gradient descent (Algorithm 1).\n\n1.2 Organization\n\nThis paper is organized as follows. In Section 2, we introduce the formal problem setting and background. In Section 3, we discuss some pathological examples and \u201cun-natural\u201d initialization schemes under which gradient descent fails to escape strict saddle points in polynomial time. In Section 4, we show that even under a fairly natural initialization scheme, gradient descent still needs exponential time to escape all saddle points whereas perturbed gradient descent is able to do so in polynomial time. We provide empirical illustrations in Section 5 and conclude in Section 6. We place most of our detailed proofs in the Appendix.\n\n2 Preliminaries\n\nLet \u2016\u00b7\u20162 denote the Euclidean norm of a finite-dimensional vector in R^d. For a symmetric matrix A, let \u2016A\u2016op denote its operator norm and \u03bbmin(A) its smallest eigenvalue. For a function f : R^d \u2192 R, let \u2207f(\u00b7) and \u2207\u00b2f(\u00b7) denote its gradient vector and Hessian matrix. Let Bx(r) denote the d-dimensional \u21132 ball centered at x with radius r, [\u22121, 1]^d denote the d-dimensional cube centered at 0 with side-length 2, and B\u221e(x, R) = x + [\u2212R, R]^d denote the d-dimensional cube centered at x with side-length 2R. We also use O(\u00b7) and \u03a9(\u00b7) as standard Big-O and Big-Omega notation, only hiding absolute constants.\n\nThroughout the paper we consider functions that satisfy the following smoothness assumptions.\n\nDefinition 2.1. A function f(\u00b7) is B-bounded if for any x \u2208 R^d: |f(x)| \u2264 B.\n\nDefinition 2.2. A differentiable function f(\u00b7) is \u2113-gradient Lipschitz if for any x, y \u2208 R^d: \u2016\u2207f(x) \u2212 \u2207f(y)\u20162 \u2264 \u2113\u2016x \u2212 y\u20162.\n\nDefinition 2.3. A twice-differentiable function f(\u00b7) is \u03c1-Hessian Lipschitz if for any x, y \u2208 R^d: \u2016\u2207\u00b2f(x) \u2212 \u2207\u00b2f(y)\u2016op \u2264 \u03c1\u2016x \u2212 y\u20162.\n\nIntuitively, Definition 2.1 says that the function value is both upper and lower bounded; Definitions 2.2 and 2.3 state that the gradient and Hessian of the function cannot change dramatically if two points are close by. Definition 2.2 is a standard assumption in the optimization literature, and Definition 2.3 is also commonly assumed when studying saddle points and local minima.\n\nOur goal is to escape saddle points. The saddle points discussed in this paper are assumed to be \u201cstrict\u201d [Ge et al., 2015]:\n\nDefinition 2.4. A saddle point x\u2217 is called an \u03b1-strict saddle point if there exists some \u03b1 > 0 such that \u2016\u2207f(x\u2217)\u20162 = 0 and \u03bbmin(\u2207\u00b2f(x\u2217)) \u2264 \u2212\u03b1.\n\nThat is, a strict saddle point must have an escaping direction so that the eigenvalue of the Hessian along that direction is strictly negative. It turns out that for many non-convex problems studied in machine learning, all saddle points are strict (see Section 1 for more details).\n\nTo escape strict saddle points and converge to local minima, we can equivalently study the approximation of second-order stationary points.
For \u03c1-Hessian Lipschitz functions, such points are defined as follows by Nesterov and Polyak [2006]:\n\nAlgorithm 1 Perturbed Gradient Descent [Jin et al., 2017]\n1: Input: x(0), step size \u03b7, perturbation radius r, time interval tthres, gradient threshold gthres.\n2: tnoise \u2190 \u2212tthres \u2212 1.\n3: for t = 1, 2, . . . do\n4:   if \u2016\u2207f(x(t))\u20162 \u2264 gthres and t \u2212 tnoise > tthres then\n5:     x(t) \u2190 x(t) + \u03bet, \u03bet \u223c unif(B0(r)), tnoise \u2190 t.\n6:   end if\n7:   x(t+1) \u2190 x(t) \u2212 \u03b7\u2207f(x(t)).\n8: end for\n\nDefinition 2.5. A point x is called a second-order stationary point if \u2016\u2207f(x)\u20162 = 0 and \u03bbmin(\u2207\u00b2f(x)) \u2265 0. We also define its \u03b5-version: a point x is an \u03b5-second-order stationary point for some \u03b5 > 0 if it satisfies \u2016\u2207f(x)\u20162 \u2264 \u03b5 and \u03bbmin(\u2207\u00b2f(x)) \u2265 \u2212\u221a(\u03c1\u03b5).\n\nSecond-order stationary points must have a positive semi-definite Hessian in addition to a vanishing gradient. Note that if all saddle points x\u2217 are strict, then second-order stationary points are exactly equivalent to local minima.\n\nIn this paper, we compare gradient descent and one of its variants\u2014the perturbed gradient descent algorithm (Algorithm 1) proposed by Jin et al. [2017]. We focus on the case where the step size satisfies \u03b7 < 1/\u2113, which is commonly required for finding a minimum even in the convex setting [Nesterov, 2013].\n\nThe following theorem shows that if GD with random initialization converges, then it will converge to a second-order stationary point almost surely.\n\nTheorem 2.6 ([Lee et al., 2016]). Suppose that f is \u2113-gradient Lipschitz, has continuous Hessian, and step size \u03b7 < 1/\u2113.
Furthermore, assume that gradient descent converges, meaning limt\u2192\u221e x(t) exists, and that the initialization distribution \u03bd is absolutely continuous with respect to the Lebesgue measure. Then limt\u2192\u221e x(t) = x\u2217 with probability one, where x\u2217 is a second-order stationary point.\n\nThe assumption that gradient descent converges holds for many non-convex functions (including all the examples considered in this paper). This assumption is used to rule out the case in which \u2016x(t)\u20162 goes to infinity, so that limt\u2192\u221e x(t) would be undefined.\n\nNote that Theorem 2.6 only describes limiting behavior without specifying the convergence rate. On the other hand, if we are willing to add perturbations, the following theorem not only establishes convergence but also provides a sharp convergence rate:\n\nTheorem 2.7 ([Jin et al., 2017]). Suppose f is B-bounded, \u2113-gradient Lipschitz, and \u03c1-Hessian Lipschitz. For any \u03b4 > 0 and \u03b5 \u2264 \u2113\u00b2/\u03c1, there exists a proper choice of \u03b7, r, tthres, gthres (depending on B, \u2113, \u03c1, \u03b4, \u03b5) such that Algorithm 1 will find an \u03b5-second-order stationary point, with probability at least 1 \u2212 \u03b4, in the following number of iterations:\n\nO( (\u2113B/\u03b5\u00b2) log\u2074( d\u2113B / (\u03b5\u00b2\u03b4) ) ).\n\nThis theorem states that with a proper choice of hyperparameters, perturbed gradient descent can consistently escape strict saddle points and converge to a second-order stationary point in a polynomial number of iterations.\n\n3 Warmup: Examples with \u201cUn-natural\u201d Initialization\n\nThe convergence result of Theorem 2.6 raises the following question: can gradient descent find a second-order stationary point in a polynomial number of iterations?
In this section, we discuss two very simple and intuitive counter-examples for which gradient descent with random initialization requires an exponential number of steps to escape strict saddle points. We will also explain, however, that these examples are unnatural and pathological in certain ways, and thus unlikely to arise in practice. A more sophisticated counter-example with natural initialization and non-pathological behavior will be given in Section 4.\n\nFigure 1: (a) Negative gradient field of f(x) = x1\u00b2 \u2212 x2\u00b2. (b) Negative gradient field of the function defined in Equation (2). If the initialization point is in the red rectangle then it takes GD a long time to escape the neighborhood of the saddle point (0, 0).\n\nInitialize uniformly within an extremely thin band. Consider a two-dimensional function f with a strict saddle point at (0, 0). Suppose that inside the neighborhood U = [\u22121, 1]\u00b2 of the saddle point, the function is locally quadratic: f(x1, x2) = x1\u00b2 \u2212 x2\u00b2. For GD with \u03b7 = 1/4, the update equation can be written as\n\nx1(t+1) = x1(t)/2 and x2(t+1) = 3x2(t)/2.\n\nIf we initialize uniformly within [\u22121, 1] \u00d7 [\u2212(3/2)^(\u2212exp(1/\u03b5)), (3/2)^(\u2212exp(1/\u03b5))] then GD requires at least exp(1/\u03b5) steps to get out of the neighborhood U, and thereby escape the saddle point. See Figure 1a for an illustration. Note that in this case the initialization region is exponentially thin (only of width 2 \u00b7 (3/2)^(\u2212exp(1/\u03b5))). We would seldom use such an initialization scheme in practice.\n\nInitialize far away. Consider again a two-dimensional function with a strict saddle point at (0, 0). This time, instead of initializing in an extremely thin band, we construct a very long slope so that a relatively large initialization region necessarily converges to this extremely thin band.
Specifically, consider a function on the domain [\u2212\u221e, 1] \u00d7 [\u22121, 1] that is defined as follows:\n\nf(x1, x2) = x1\u00b2 \u2212 x2\u00b2 if \u22121 < x1 < 1;  \u22124x1 + x2\u00b2 if x1 < \u22122;  h(x1, x2) otherwise,   (2)\n\nwhere h(x1, x2) is a smooth function connecting the regions [\u2212\u221e, \u22122] \u00d7 [\u22121, 1] and [\u22121, 1] \u00d7 [\u22121, 1] while making f have continuous second derivatives and ensuring that x2 does not suddenly increase when x1 \u2208 [\u22122, \u22121].\u00b2 For GD with \u03b7 = 1/4, when \u22121 < x1 < 1 the dynamics are\n\nx1(t+1) = x1(t)/2 and x2(t+1) = 3x2(t)/2,\n\nand when x1 < \u22122 the dynamics are\n\nx1(t+1) = x1(t) + 1 and x2(t+1) = x2(t)/2.\n\nSuppose we initialize uniformly within [\u2212R \u2212 1, \u2212R + 1] \u00d7 [\u22121, 1] for R large. See Figure 1b for an illustration. Letting t denote the first time that x1(t) \u2265 \u22121, we have approximately t \u2248 R and so x2(t) \u2248 x2(0) \u00b7 (1/2)^R. From the previous example, we know that if (1/2)^R \u2248 (3/2)^(\u2212exp(1/\u03b5)), that is, R \u2248 exp(1/\u03b5), then GD will need exponential time to escape from the neighborhood U = [\u22121, 1] \u00d7 [\u22121, 1] of the saddle point (0, 0). In this case, we require an initialization region leading to a saddle point at distance R which is exponentially large. In practice, it is unlikely that we would initialize exponentially far away from the saddle points or optima.\n\n\u00b2We can construct such a function using splines. See Appendix B.\n\n4 Main Result\n\nIn the previous section we have shown that gradient descent takes exponential time to escape saddle points under \u201cun-natural\u201d initialization schemes. Is it possible for the same statement to hold even under \u201cnatural\u201d initialization schemes and non-pathological functions?
The following theorem confirms this:\n\nTheorem 4.1 (Uniform initialization over a unit cube). Suppose the initialization point is uniformly sampled from [\u22121, 1]^d. There exists a function f defined on R^d that is B-bounded, \u2113-gradient Lipschitz and \u03c1-Hessian Lipschitz with parameters B, \u2113, \u03c1 at most poly(d) such that:\n\n1. with probability one, gradient descent with step size \u03b7 \u2264 1/\u2113 will be \u03a9(1) distance away from any local minimum for any T \u2264 e^\u03a9(d).\n\n2. for any \u03b5 > 0, with probability 1 \u2212 e^\u2212d, perturbed gradient descent (Algorithm 1) will find a point x such that \u2016x \u2212 x\u2217\u20162 \u2264 \u03b5 for some local minimum x\u2217 in poly(d, 1/\u03b5) iterations.\n\nRemark: As will be apparent in the next section, in the example we construct there are 2^d symmetric local minima at locations (\u00b1c, . . . , \u00b1c), where c is some constant. The saddle points are of the form (\u00b1c, . . . , \u00b1c, 0, . . . , 0). Both algorithms will travel across the neighborhoods of d saddle points before reaching a local minimum. For GD, the number of iterations needed to escape the i-th saddle point grows as \u03ba^i (where \u03ba > 1 is a multiplicative factor), and thus GD requires exponential time to escape d saddle points. On the other hand, PGD takes about the same number of iterations to escape each saddle point, and so escapes the d saddle points in polynomial time. Notice that B, \u2113, \u03c1 = O(poly(d)), so this does not contradict Theorem 2.7.\n\nWe also note that in our construction, the local minimizers lie outside the initialization region. This is common, especially for unconstrained optimization problems, where the initialization is usually uniform on a rectangle or isotropic Gaussian.
Due to isoperimetry, the initialization concentrates in a thin shell, but frequently the final point obtained by the optimization algorithm is not in this shell.\n\nIt turns out that in our construction, the only second-order stationary points along the path are the final local minima. Therefore, we can also strengthen Theorem 4.1 to provide a negative result for approximating \u03b5-second-order stationary points as well.\n\nCorollary 4.2. Under the same initialization as in Theorem 4.1, there exists a function f satisfying the requirements of Theorem 4.1 such that for some \u03b5 = 1/poly(d), with probability one, gradient descent with step size \u03b7 \u2264 1/\u2113 will not visit any \u03b5-second-order stationary point in T \u2264 e^\u03a9(d) iterations.\n\nThe corresponding positive result, that PGD finds an \u03b5-second-order stationary point in polynomial time, follows immediately from Theorem 2.7.\n\nThe next result shows that gradient descent does not fail merely due to the special choice of initializing uniformly in [\u22121, 1]^d. For a large class of initialization distributions \u03bd, we can generalize Theorem 4.1 to show that gradient descent with random initialization \u03bd requires exponential time, while perturbed gradient descent only requires polynomial time.\n\nCorollary 4.3. Let B\u221e(z, R) = {z} + [\u2212R, R]^d be the \u2113\u221e ball of radius R centered at z. Then for any \u03b4 > 0 and any initialization distribution \u03bd that satisfies \u03bd(B\u221e(z, R)) \u2265 1 \u2212 \u03b4, the conclusion of Theorem 4.1 holds with probability at least 1 \u2212 \u03b4.\n\nThat is, as long as most of the mass of the initialization distribution \u03bd lies in some \u2113\u221e ball, a conclusion similar to that of Theorem 4.1 holds with high probability.
This result applies to random Gaussian initialization \u03bd = N(0, \u03c3\u00b2I), with mean 0 and covariance \u03c3\u00b2I, for which \u03bd(B\u221e(0, \u03c3 log d)) \u2265 1 \u2212 1/poly(d).\n\n4.1 Proof Sketch\n\nIn this section we present a sketch of the proof of Theorem 4.1. The full proof is presented in the Appendix. Since the polynomial-time guarantee for PGD is straightforward to derive from Jin et al. [2017], we focus on showing that GD needs an exponential number of steps. We rely on the following key observation.\n\nKey observation: escaping two saddle points sequentially. Consider, for L > \u03b3 > 0,\n\nf(x1, x2) = \u2212\u03b3x1\u00b2 + Lx2\u00b2 if x1 \u2208 [0, 1], x2 \u2208 [0, 1];  L(x1 \u2212 2)\u00b2 \u2212 \u03b3x2\u00b2 if x1 \u2208 [1, 3], x2 \u2208 [0, 1];  L(x1 \u2212 2)\u00b2 + L(x2 \u2212 2)\u00b2 if x1 \u2208 [1, 3], x2 \u2208 [1, 3].   (3)\n\nNote that this function is not continuous. In the next paragraph we will modify it to make it smooth and satisfy the assumptions of the theorem, but useful intuition is obtained using this discontinuous function. The function has an optimum at (2, 2) and saddle points at (0, 0) and (2, 0). We call [0, 1] \u00d7 [0, 1] the neighborhood of (0, 0) and [1, 3] \u00d7 [0, 1] the neighborhood of (2, 0). Suppose the initialization (x1(0), x2(0)) lies in [0, 1] \u00d7 [0, 1]. Define t1 = min{t : x1(t) \u2265 1} to be the time of first departure from the neighborhood of (0, 0) (thereby escaping the first saddle point). By the dynamics of gradient descent, we have\n\nx1(t1) = (1 + 2\u03b7\u03b3)^t1 x1(0),   x2(t1) = (1 \u2212 2\u03b7L)^t1 x2(0).\n\nNext we calculate the number of iterations needed so that x2 \u2265 1 and the algorithm thus leaves the neighborhood of the saddle point (2, 0) (thereby escaping the second saddle point).
Letting t2 = min{t : x2(t) \u2265 1}, we have\n\n(1 + 2\u03b7\u03b3)^(t2\u2212t1) x2(t1) = (1 + 2\u03b7\u03b3)^(t2\u2212t1) (1 \u2212 2\u03b7L)^t1 x2(0) \u2265 1.\n\nWe can lower bound t2 by\n\nt2 \u2265 [ 2\u03b7(L + \u03b3) t1 + log(1/x2(0)) ] / (2\u03b7\u03b3) \u2265 ((L + \u03b3)/\u03b3) t1.\n\nThe key observation is that the number of steps needed to escape the second saddle point is (L + \u03b3)/\u03b3 times the number of steps needed to escape the first one.\n\nSpline: connecting quadratic regions. To make our function smooth, we create buffer regions and use splines to interpolate the discontinuous parts of Equation (3). Formally, we consider the following function, for some fixed constant \u03c4 > 1:\n\nf(x1, x2) = \u2212\u03b3x1\u00b2 + Lx2\u00b2 if x1 \u2208 [0, \u03c4], x2 \u2208 [0, \u03c4];  g(x1, x2) if x1 \u2208 [\u03c4, 2\u03c4], x2 \u2208 [0, \u03c4];  L(x1 \u2212 4\u03c4)\u00b2 \u2212 \u03b3x2\u00b2 \u2212 \u03bd if x1 \u2208 [2\u03c4, 6\u03c4], x2 \u2208 [0, \u03c4];  L(x1 \u2212 4\u03c4)\u00b2 + g1(x2) \u2212 \u03bd if x1 \u2208 [2\u03c4, 6\u03c4], x2 \u2208 [\u03c4, 2\u03c4];  L(x1 \u2212 4\u03c4)\u00b2 + L(x2 \u2212 4\u03c4)\u00b2 \u2212 2\u03bd if x1 \u2208 [2\u03c4, 6\u03c4], x2 \u2208 [2\u03c4, 6\u03c4],   (4)\n\nwhere g, g1 are spline polynomials and \u03bd > 0 is a constant defined in Lemma B.2. In this case, there are saddle points at (0, 0) and (4\u03c4, 0), and the optimum is at (4\u03c4, 4\u03c4). Intuitively, [\u03c4, 2\u03c4] \u00d7 [0, \u03c4] and [2\u03c4, 6\u03c4] \u00d7 [\u03c4, 2\u03c4] are buffer regions where we use the splines g and g1 to transition between regimes and make f a smooth function. There is no stationary point in these buffer regions, and the smoothness assumptions of the theorem are still satisfied there. Figure 2a shows the surface and stationary points of this function.
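The two escape times can be checked numerically by iterating the region-wise GD updates above. The script below is our own illustrative sketch of the idealized (discontinuous) dynamics of Equation (3); the choices of L, \u03b3, \u03b7, and the starting point are arbitrary:

```python
def escape_times(L=2.0, gamma=1.0, eta=0.1, x0=(0.5, 0.5)):
    """Iterate the idealized GD dynamics of Equation (3) and return (t1, t2):
    the iterations needed to leave the neighborhoods of (0, 0) and (2, 0)."""
    x1, x2 = x0
    t = 0
    # Near (0, 0): f = -gamma*x1^2 + L*x2^2, so x1 expands while x2 contracts.
    while x1 < 1:
        x1, x2 = (1 + 2 * eta * gamma) * x1, (1 - 2 * eta * L) * x2
        t += 1
    t1 = t
    # Near (2, 0): f = L*(x1-2)^2 - gamma*x2^2, so x2 must grow back from a
    # value that was shrunk exponentially during the first phase.
    while x2 < 1:
        x1, x2 = 2 + (1 - 2 * eta * L) * (x1 - 2), (1 + 2 * eta * gamma) * x2
        t += 1
    return t1, t

t1, t2 = escape_times()  # with these parameters: t1 = 4, t2 = 20
```

Consistent with the lower bound, t2/t1 here exceeds (L + \u03b3)/\u03b3 = 3: the second escape is slower precisely because x2 was contracted by the factor (1 \u2212 2\u03b7L)^t1 while the iterate was escaping the first saddle point.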
We call the union of the regions defined in Equation (4) a tube.\n\nFrom two saddle points to d saddle points. We can readily adapt our construction of the tube to d dimensions, such that the function is smooth, the saddle points are located at (0, . . . , 0), (4\u03c4, 0, . . . , 0), . . . , (4\u03c4, . . . , 4\u03c4, 0), and the optimum is at (4\u03c4, . . . , 4\u03c4). Let ti be the number of steps needed to escape the neighborhood of the i-th saddle point. We generalize our key observation to this case and obtain ti+1 \u2265 ((L + \u03b3)/\u03b3) \u00b7 ti for all i. This gives td \u2265 ((L + \u03b3)/\u03b3)^d, which is exponential time. Figure 2b shows the tube and the trajectory of GD.\n\nMirroring trick: from tube to octopus. In the construction thus far, the saddle points are all on the boundary of the tube. To avoid the difficulties of constrained non-convex optimization, we would like to make all saddle points interior points of the domain. We use a simple mirroring trick; i.e., for every coordinate xi we reflect f along its axis. See Figure 2c for an illustration in the case d = 2.\n\nFigure 2: Graphical illustrations of our counter-example with \u03c4 = e. (a) Contour plot of the objective function and tube defined in 2D. (b) Trajectory of gradient descent in the tube for d = 3. (c) Octopus defined in 2D. The blue points are saddle points and the red point is the minimum.
The pink line is the trajectory of gradient descent.\n\nFigure 3: Performance of GD and PGD on our counter-example with d = 5 (objective function value vs. epochs): (a) L = 1, \u03b3 = 1; (b) L = 1.5, \u03b3 = 1; (c) L = 2, \u03b3 = 1; (d) L = 3, \u03b3 = 1.\n\nFigure 4: Performance of GD and PGD on our counter-example with d = 10 (objective function value vs. epochs): (a) L = 1, \u03b3 = 1; (b) L = 1.5, \u03b3 = 1; (c) L = 2, \u03b3 = 1; (d) L = 3, \u03b3 = 1.\n\nExtension: from octopus to R^d. Up to now we have constructed a function defined on a closed subset of R^d. The last step is to extend this function to the entire Euclidean space. Here we apply the classical Whitney Extension Theorem (Theorem B.3) to finish our construction. We remark that the Whitney extension may lead to additional stationary points.
However, we will demonstrate in the proof that GD and PGD stay within the interior of the \u201coctopus\u201d defined above, and hence cannot converge to any other stationary point.\n\n5 Experiments\n\nIn this section we use simulations to verify our theoretical findings. The objective function is defined in (14) and (15) in the Appendix. In Figures 3 and 4, GD stands for gradient descent and PGD stands for Algorithm 1. For both GD and PGD we let the stepsize be \u03b7 = 1/(4L). For PGD, we choose tthres = 1, gthres = \u03b3e/100, and r = e/100. In Figure 3 we fix the dimension d = 5 and vary L as considered in Section 4.1; similarly, in Figure 4 we choose d = 10 and vary L. First, notice that in all experiments PGD converges faster than GD, as suggested by our theorems. Second, observe that the \u201chorizontal\u201d segments in each plot represent the numbers of iterations needed to escape successive saddle points. For GD the length of these segments grows at a fixed rate, which coincides with the result mentioned at the beginning of Section 4.1 (the number of iterations to escape a saddle point increases each time by a multiplicative factor (L + \u03b3)/\u03b3). This is also verified in the figures by the fact that as the ratio (L + \u03b3)/\u03b3 becomes larger, the growth rate of the number of iterations to escape increases. On the other hand, the number of iterations for PGD to escape each saddle point is approximately constant (\u223c 1/(\u03b7\u03b3)).\n\n6 Conclusion\n\nIn this paper we established the failure of gradient descent to efficiently escape saddle points for general non-convex smooth functions. We showed that even under a very natural initialization scheme, gradient descent can require exponential time to converge to a local minimum whereas perturbed gradient descent converges in polynomial time.
Our results demonstrate the necessity of adding perturbations for efficient non-convex optimization.

We expect that our results and constructions will extend naturally to the stochastic setting. In particular, we expect that with random initialization, general stochastic gradient descent will need exponential time to escape saddle points in the worst case. However, if perturbations are added at each iteration, or if the inherent randomness is non-degenerate in every direction (so that the covariance of the noise is lower bounded), then polynomial time is known to suffice [Ge et al., 2015].

One open problem is whether GD is inherently slow when the local optimum lies inside the initialization region, in contrast to the initialization assumptions used in Theorem 4.1 and Corollary 4.3. We believe that a similar construction, in which GD passes through the neighborhoods of d saddle points, will likely still apply, but more work is needed. Another interesting direction is to use our counter-example as a building block to prove a computational lower bound under an oracle model [Nesterov, 2013, Woodworth and Srebro, 2016].

This paper does not rule out the possibility that gradient descent performs well for some non-convex functions with special structure. Indeed, for the matrix square-root problem, Jain et al. [2017] show that with reasonable random initialization, gradient updates stay away from all saddle points and thus converge to a local minimum efficiently. It is an interesting future direction to identify other classes of non-convex functions that gradient descent can optimize efficiently, without suffering from the negative results described in this paper.

7 Acknowledgements

S.S.D. and B.P. were supported by NSF grant IIS1563887 and the ARPA-E Terra program. C.J. and M.I.J. were supported by the Mathematical Data Science program of the Office of Naval Research under grant number N00014-15-1-2670. J.D.L. was supported by ARO W911NF-17-1-0304.
A.S. was supported by DARPA grant D17AP00001, AFRL grant FA8750-17-2-0212, and a CMU ProSEED/BrainHub Seed Grant. The authors thank Rong Ge, Qing Qu, John Wright, Elad Hazan, Sham Kakade, Benjamin Recht, Nathan Srebro, and Lin Xiao for useful discussions. The authors thank Stephen Wright and Michael O'Neill for pointing out calculation errors in an older version.

References

Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima faster than gradient descent. In STOC, 2017. Full version available at http://arxiv.org/abs/1611.01146.

Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems, pages 3873–3881, 2016.

Emmanuel J Candes, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007, 2015.

Yair Carmon and John C Duchi. Gradient descent efficiently finds the cubic-regularized non-convex Newton step. arXiv preprint arXiv:1612.00547, 2016.

Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for non-convex optimization. arXiv preprint arXiv:1611.00756, 2016.

Alan Chang. The Whitney extension theorem in high dimensions. arXiv preprint arXiv:1508.01779, 2015.

Frank E Curtis, Daniel P Robinson, and Mohammadreza Samadi. A trust region algorithm with a worst-case iteration complexity of O(ε^(-3/2)) for nonconvex optimization. Mathematical Programming, pages 1–32, 2014.

Randall L Dougherty, Alan S Edelman, and James M Hyman. Nonnegativity-, monotonicity-, or convexity-preserving cubic and quintic Hermite interpolation. Mathematics of Computation, 52(186):471–494, 1989.

Simon S Du, Jason D Lee, and Yuandong Tian. When is a convolutional filter easy to learn?
arXiv preprint arXiv:1709.06129, 2017.

Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, pages 797–842, 2015.

Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.

Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In Proceedings of the 34th International Conference on Machine Learning, pages 1233–1242, 2017.

Moritz Hardt. Understanding alternating minimization for matrix completion. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pages 651–660. IEEE, 2014.

Prateek Jain, Chi Jin, Sham Kakade, and Praneeth Netrapalli. Global convergence of non-convex gradient descent for computing matrix squareroot. In Artificial Intelligence and Statistics, pages 479–488, 2017.

Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning, pages 1724–1732, 2017.

Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In Conference on Learning Theory, pages 1246–1257, 2016.

Kfir Y Levy. The power of normalization: Faster evasion of saddle points. arXiv preprint arXiv:1611.04831, 2016.

Xingguo Li, Zhaoran Wang, Junwei Lu, Raman Arora, Jarvis Haupt, Han Liu, and Tuo Zhao. Symmetry, saddle points, and global geometry of nonconvex matrix factorization. arXiv preprint arXiv:1612.09296, 2016.

Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.

Yurii Nesterov and Boris T Polyak.
Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.

Praneeth Netrapalli, Prateek Jain, and Sujay Sanghavi. Phase retrieval using alternating minimization. In Advances in Neural Information Processing Systems, pages 2796–2804, 2013.

J Jr Palis and Welington De Melo. Geometric Theory of Dynamical Systems: An Introduction. Springer Science & Business Media, 2012.

Dohyung Park, Anastasios Kyrillidis, Constantine Caramanis, and Sujay Sanghavi. Non-square matrix sensing without spurious local minima via the Burer-Monteiro approach. In Artificial Intelligence and Statistics, pages 65–74, 2017.

Robin Pemantle. Nonconvergence to unstable points in urn models and stochastic approximations. The Annals of Probability, pages 698–712, 1990.

Ju Sun, Qing Qu, and John Wright. A geometric analysis of phase retrieval. In Information Theory (ISIT), 2016 IEEE International Symposium on, pages 2379–2383. IEEE, 2016.

Ju Sun, Qing Qu, and John Wright. Complete dictionary recovery over the sphere I: Overview and the geometric picture. IEEE Transactions on Information Theory, 63(2):853–884, 2017.

Ruoyu Sun and Zhi-Quan Luo. Guaranteed matrix completion via non-convex factorization. IEEE Transactions on Information Theory, 62(11):6535–6579, 2016.

Hassler Whitney. Analytic extensions of differentiable functions defined in closed sets. Transactions of the American Mathematical Society, 36(1):63–89, 1934.

Blake E Woodworth and Nati Srebro. Tight complexity bounds for optimizing composite objectives. In Advances in Neural Information Processing Systems, pages 3639–3647, 2016.

Xinyang Yi, Dohyung Park, Yudong Chen, and Constantine Caramanis. Fast algorithms for robust PCA via gradient descent. In Advances in Neural Information Processing Systems, pages 4152–4160, 2016.

G George Yin and Harold J Kushner.
Stochastic Approximation and Recursive Algorithms and Applications, volume 35. Springer, 2003.

Xiao Zhang, Lingxiao Wang, and Quanquan Gu. Stochastic variance-reduced gradient descent for low-rank matrix recovery from linear measurements. arXiv preprint arXiv:1701.00481, 2017.