{"title": "Escaping from saddle points on Riemannian manifolds", "book": "Advances in Neural Information Processing Systems", "page_first": 7276, "page_last": 7286, "abstract": "We consider minimizing a nonconvex, smooth function $f$ on a Riemannian manifold $\\mathcal{M}$. We show that a perturbed version of the gradient descent algorithm converges to a second-order stationary point for this problem (and hence is able to escape saddle points on the manifold). While the unconstrained problem is well-studied, our result is the first to prove such a rate for nonconvex, manifold-constrained problems.\nThe rate of convergence depends as $1/\\epsilon^2$ on the accuracy $\\epsilon$, which matches a rate known only for unconstrained smooth minimization. The convergence rate also has a polynomial dependence on the parameters denoting the curvature of the manifold and the smoothness of the function.", "full_text": "Escaping from saddle points on Riemannian\n\nmanifolds\n\nYue Sun\n\nUniversity of Washington\n\nSeattle, WA 98105\nyuesun@uw.edu\n\nNicolas Flammarion\n\nEPFL\n\nLausanne, Switzerland\n\nnicolas.flammarion@epfl.ch\n\nMaryam Fazel\n\nUniversity of Washington\n\nSeattle, WA 98105\nmfazel@uw.edu\n\nAbstract\n\nWe consider minimizing a nonconvex, smooth function f on a Riemannian man-\nifold M. We show that a perturbed version of Riemannian gradient descent\nalgorithm converges to a second-order stationary point (and hence is able to escape\nsaddle points on the manifold). The rate of convergence depends as 1/\u00012 on the\naccuracy \u0001, which matches a rate known only for unconstrained smooth minimiza-\ntion. The convergence rate depends polylogarithmically on the manifold dimension\nd, hence is almost dimension-free. The rate also has a polynomial dependence\non the parameters describing the curvature of the manifold and the smoothness of\nthe function. 
While the unconstrained problem (Euclidean setting) is well-studied,\nour result is the first to prove such a rate for nonconvex, manifold-constrained\nproblems.\n\n1 Introduction\nWe consider minimizing a non-convex smooth function on a smooth manifold M,\n\nminimize_{x \u2208 M} f(x),    (1)\n\nwhere M is a d-dimensional smooth manifold\u00b9, and f is twice differentiable, with a Hessian that\nis \u03c1-Lipschitz (assumptions are formalized in Section 4). This framework includes a wide range of\nfundamental problems (often non-convex), such as PCA (Edelman et al., 1998), dictionary learning\n(Sun et al., 2017), low rank matrix completion (Boumal & Absil, 2011), and tensor factorization\n(Ishteva et al., 2011). Finding the global minimum of Eq. (1) is in general NP-hard; our goal is to\nfind an approximate second order stationary point with first order optimization methods. We are\ninterested in first-order methods because they are extremely prevalent in machine learning, partly\nbecause computing Hessians is often too costly. It is then important to understand how first-order\nmethods fare when applied to nonconvex problems, and there has been a wave of recent interest in\nthis topic since Ge et al. (2015), as reviewed below.\nIn Euclidean space, it is known that with random initialization, gradient descent avoids saddle\npoints asymptotically (Pemantle, 1990; Lee et al., 2016). Lee et al. (2017, Section 5.5) show that this\nis also true on smooth manifolds, although the result is expressed in terms of nonstandard manifold\nsmoothness measures. Also, importantly, this line of work does not give quantitative rates for the\nalgorithm\u2019s behaviour near saddle points.\n\n\u00b9 Here d is the dimension of the manifold itself; we do not consider M as a submanifold of a higher\ndimensional space. 
For instance, if M is a 2-dimensional sphere embedded in R^3, its dimension is d = 2.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nDu et al. (2017) show gradient descent can be exponentially slow in the presence of saddle points. To\nalleviate this phenomenon, it is shown that for a \u03b2-gradient Lipschitz, \u03c1-Hessian Lipschitz function,\ncubic regularization (Carmon & Duchi, 2017) and perturbed gradient descent (Ge et al., 2015; Jin\net al., 2017a) converge to an (\u03b5, \u2212\u221a(\u03c1\u03b5)) local minimum\u00b2 in polynomial time, and that momentum-based\nmethods accelerate this convergence (Jin et al., 2017b). Much less is known about inequality constraints: Nouiehed\net al. (2018) and Mokhtari et al. (2018) discuss second order convergence for general inequality-constrained\nproblems, where they need an NP-hard subproblem (checking the co-positivity of a\nmatrix) to admit a polynomial time approximation algorithm. However, such an approximation exists\nonly under very restrictive assumptions.\nAn orthogonal line of work is optimization on Riemannian manifolds. Absil et al. (2009) provide\ncomprehensive background, showing how algorithms such as gradient descent, Newton and trust\nregion methods can be implemented on Riemannian manifolds, together with asymptotic convergence\nguarantees to first order stationary points. Zhang & Sra (2016) provide global convergence guarantees\nfor first order methods when optimizing geodesically convex functions. Bonnabel (2013) obtains the\nfirst asymptotic convergence result for stochastic gradient descent in this setting, which is further\nextended by Tripuraneni et al. (2018); Zhang et al. (2016); Khuzani & Li (2017). If the problem is\nnon-convex, or the Riemannian Hessian is not positive definite, one can use second order methods\nto escape from saddle points. Boumal et al. 
(2016a) show that the Riemannian trust region method\nconverges to a second order stationary point in polynomial time (see also Kasai & Mishra, 2018; Hu\net al., 2018; Zhang & Zhang, 2018). But this method requires a Hessian oracle, whose complexity is\nd times that of computing a gradient. In Euclidean space, the trust region subproblem can sometimes be\nsolved via a Hessian-vector product oracle, whose complexity is about the same as computing\ngradients. Agarwal et al. (2018) discuss its implementation on Riemannian manifolds, but the complexity\nand sensitivity of the Hessian-vector product oracle on a manifold remain unclear.\nThe convergence of gradient descent for non-convex Riemannian problems has previously been\nstudied only in the Euclidean space by modeling the manifold with equality constraints. Ge et al. (2015,\nAppendix B) prove that stochastic projected gradient descent methods converge to second order\nstationary points in polynomial time (here the analysis is not geometric, and depends on the algebraic\nrepresentation of the equality constraints). Sun & Fazel (2018) prove that perturbed projected gradient\ndescent converges with a rate comparable to the unconstrained setting (Jin et al., 2017a) (polylog in\ndimension). The paper applies projections from the ambient Euclidean space to the manifold and\nanalyzes the iterations under the Euclidean metric. This approach loses the geometric perspective\nenabled by Riemannian optimization, and cannot explain convergence rates in terms of inherent\nquantities such as the sectional curvature of the manifold.\nContributions. We provide convergence guarantees for perturbed first order Riemannian optimization\nmethods to second-order stationary points (local minima). 
We prove that as long as the function is\nappropriately smooth and the manifold has bounded sectional curvature, a perturbed Riemannian\ngradient descent algorithm escapes (approximate) saddle points with a rate of 1/\u03b5^2, a polylog\ndependence on the dimension of the manifold (hence almost dimension-free), and a polynomial\ndependence on the smoothness and curvature parameters. This is the first result showing such a rate\nfor Riemannian optimization, and the first to relate the rate to geometric parameters of the manifold.\nDespite analogies with the unconstrained (Euclidean) analysis and with the Riemannian optimization\nliterature, the technical challenge in our proof goes beyond combining two lines of work: we need to\nanalyze the interaction between the first-order method and the second order structure of the manifold\nto obtain second-order convergence guarantees that depend on the manifold curvature. Unlike in\nEuclidean space, the curvature affects the Taylor approximation of gradient steps. On the other hand,\nunlike in the local rate analysis in first-order Riemannian optimization, our second-order analysis\nrequires more refined properties of the manifold structure (whereas in prior work, a first order oracle\nmakes enough progress for a local convergence rate proof, see Lemma 1), and second order algorithms\nsuch as that of Boumal et al. (2016a) use second order oracles (Hessian evaluation). See Section 4 for further\ndiscussion.\n\n\u00b2 Defined as x satisfying \u2016\u2207f(x)\u2016 \u2264 \u03b5 and \u03bb_min(\u2207^2 f(x)) \u2265 \u2212\u221a(\u03c1\u03b5).\n\n2 Notation and Background\nWe consider a complete\u00b3, smooth, d-dimensional Riemannian manifold (M, g), equipped with a\nRiemannian metric g, and we denote by T_xM its tangent space at x \u2208 M (which is a vector space\nof dimension d). We also denote by B_x(r) = {v \u2208 T_xM, \u2016v\u2016 \u2264 r} the ball of radius r in T_xM\ncentered at 0. 
At any point x \u2208 M, the metric g induces a natural inner product on the tangent space\ndenoted by \u27e8\u00b7,\u00b7\u27e9 : T_xM \u00d7 T_xM \u2192 R. We also consider the Levi-Civita connection \u2207 (Absil et al.,\n2009, Theorem 5.3.1). The Riemannian curvature tensor is denoted by R(x)[u, v] where x \u2208 M,\nu, v \u2208 T_xM and is defined in terms of the connection \u2207 (Absil et al., 2009, Theorem 5.3.1). The\nsectional curvature K(x)[u, v] for x \u2208 M and u, v \u2208 T_xM is then defined as in Lee (1997, Prop. 8.8):\n\nK(x)[u, v] = \u27e8R(x)[u, v]u, v\u27e9 / (\u27e8u, u\u27e9\u27e8v, v\u27e9 \u2212 \u27e8u, v\u27e9^2),   x \u2208 M, u, v \u2208 T_xM.\n\nDenote the distance (induced by the Riemannian metric) between two points in M by d(x, y). A\ngeodesic \u03b3 : R \u2192 M is a constant speed curve whose length is equal to d(x, y), so it is the shortest\npath on the manifold linking x and y. \u03b3_{x\u2192y} denotes the geodesic from x to y (thus \u03b3_{x\u2192y}(0) = x and\n\u03b3_{x\u2192y}(1) = y).\nThe exponential map Exp_x(v) maps v \u2208 T_xM to y \u2208 M such that there exists a geodesic \u03b3 with\n\u03b3(0) = x, \u03b3(1) = y and (d/dt)\u03b3(0) = v. The injectivity radius at point x \u2208 M is the maximal radius\nr for which the exponential map is a diffeomorphism on B_x(r) \u2282 T_xM. The injectivity radius of\nthe manifold, denoted by I, is the infimum of the injectivity radii at all points. Since the manifold\nis complete, we have I > 0. When x, y \u2208 M satisfy d(x, y) \u2264 I, the exponential map admits\nan inverse Exp_x^{-1}(y), which satisfies d(x, y) = \u2016Exp_x^{-1}(y)\u2016. Parallel translation \u0393_x^y denotes the\nmap which transports v \u2208 T_xM to \u0393_x^y v \u2208 T_yM along \u03b3_{x\u2192y} such that the vector stays constant by\nsatisfying a zero-acceleration condition (Lee, 1997, equation (4.13)).\nFor a smooth function f : M \u2192 R, grad f(x) \u2208 T_xM denotes the Riemannian gradient of f at\nx \u2208 M which satisfies (d/dt) f(\u03b3(t)) = \u27e8\u03b3\u2032(t), grad f(x)\u27e9 (see Absil et al., 2009, Sec 3.5.1 and (3.31)).\nThe Hessian of f is defined jointly with the Riemannian structure of the manifold. The (directional)\nHessian is H(x)[\u03be_x] := \u2207_{\u03be_x} grad f, and we use H(x)[u, v] := \u27e8u, H(x)[v]\u27e9 as a shorthand. We call\nx \u2208 M an (\u03b5, \u2212\u221a(\u03c1\u03b5)) saddle point when \u2016\u2207f(x)\u2016 \u2264 \u03b5 and \u03bb_min(H(x)) \u2264 \u2212\u221a(\u03c1\u03b5). We refer the\ninterested reader to Do Carmo (2016) and Lee (1997), which provide a thorough review of these\nimportant concepts of Riemannian geometry.\n3 Perturbed Riemannian gradient algorithm\nOur main Algorithm 1 runs as follows:\n\n1. Check the norm of the gradient: if it is large, do one step of Riemannian gradient descent;\nconsequently the function value decreases.\n\n2. If the norm of the gradient is small, the iterate is either an approximate saddle point or a local minimum.\nPerturb the variable by adding an appropriate level of noise in its tangent space, map it back\nto the manifold and run a few iterations.\n(a) If the function value decreases, iterates are escaping from the approximate saddle point\n(and the algorithm continues).\n(b) If the function value does not decrease, then it is an approximate local minimum (the\nalgorithm terminates).\n\nAlgorithm 1 relies on the manifold\u2019s exponential map, and is useful for cases where this map is\neasy to compute\u2074. We refer readers to Lee (1997, pp. 
81-86) for the exponential map of sphere and\nhyperbolic manifolds, and Absil et al. (2009, Example 5.4.2, 5.4.3) for the Stiefel and Grassmann\nmanifolds. If the exponential map is not computable, the algorithm can use a retraction\u2075 instead;\nhowever, our current analysis only covers the case of the exponential map. In Figure 1, we illustrate a\nfunction with a saddle point on the sphere, and plot the trajectory of Algorithm 1 when it is initialized at a\nsaddle point.\n\n\u00b3 Since our results are local, completeness is not necessary and our results can be easily generalized, with\nextra assumptions on the injectivity radius.\n\u2074 Numerous interesting manifolds have closed-form exponential maps: the Grassmannian manifold, the Stiefel\nmanifold, the Minkowski space, the hyperbolic space, SE(n), SO(n), etc. (see Miolane et al. (2018) and Boumal et al.\n(2014) and their open-source packages, and B\u00e9cigneul & Ganea (2019, Sec 5) for an example in NLP).\n\u2075 A retraction is a first-order approximation of the exponential map which is often easier to compute.\n\nAlgorithm 1 Perturbed Riemannian gradient algorithm\nRequire: Initial point x_0 \u2208 M, parameters \u03b2, \u03c1, K, I, accuracy \u03b5, probability of success \u03b4 (parameters\n  defined in Assumptions 1, 2, 3 and assumption of Theorem 1).\nSet constants: \u02c6c \u2265 4, C := C(K, \u03b2, \u03c1) (defined in Lemma 2 and proof of Lemma 8), \u221a(c_max) \u2264 1/(56 \u02c6c^2),\n  r = (\u221a(c_max)/\u03c7^2) \u03b5, and \u03c7 = 3 max{log(d\u03b2(f(x_0) \u2212 f\u2217)/(\u02c6c \u03b5^2 \u03b4)), 4}.\nSet threshold values: f_thres = (c_max/\u03c7^3) \u221a(\u03b5^3/\u03c1), g_thres = (\u221a(c_max)/\u03c7^2) \u03b5,\n  t_thres = (\u03c7/c_max) \u03b2/\u221a(\u03c1\u03b5), t_noise = \u2212t_thres \u2212 1.\nSet stepsize: \u03b7 = c_max/\u03b2.\nwhile 1 do\n  if \u2016grad f(x_t)\u2016 \u2264 g_thres and t \u2212 t_noise > t_thres then\n    t_noise \u2190 t, \u02dcx_t \u2190 x_t, x_t \u2190 Exp_{x_t}(\u03be_t), \u03be_t uniformly sampled from B_{x_t}(r) \u2282 T_{x_t}M.\n  end if\n  if t \u2212 t_noise = t_thres and f(x_t) \u2212 f(\u02dcx_{t_noise}) > \u2212f_thres then\n    output \u02dcx_{t_noise}\n  end if\n  x_{t+1} \u2190 Exp_{x_t}(\u2212min{\u03b7, I/\u2016grad f(x_t)\u2016} grad f(x_t)).\n  t \u2190 t + 1.\nend while\n\nFigure 1: Function f with a saddle point on a sphere, f(x) = x_1^2 \u2212 x_2^2 + 4 x_3^2. We plot the contour of this\nfunction on the unit sphere. Algorithm 1 initializes at x_0 = [1, 0, 0] (a saddle point), perturbs it towards\nx_1 and runs Riemannian gradient descent, and terminates at x\u2217 = [0, \u22121, 0] (a local minimum). We\namplify the first iteration to make the saddle perturbation visible.\n4 Main theorem: escape rate for perturbed Riemannian gradient descent\nWe now turn to our main results, beginning with our assumptions and a statement of our main theorem.\nWe then develop a brief proof sketch.\nOur main result involves two conditions on the function f and one on the curvature of the manifold M.\nAssumption 1 (Lipschitz gradient). There is a finite constant \u03b2 such that\n\n\u2016grad f(y) \u2212 \u0393_x^y grad f(x)\u2016 \u2264 \u03b2 d(x, y)   for all x, y \u2208 M.\n\nAssumption 2 (Lipschitz Hessian). There is a finite constant \u03c1 such that\n\n\u2016H(y) \u2212 \u0393_x^y H(x) \u0393_y^x\u2016_2 \u2264 \u03c1 d(x, y)   for all x, y \u2208 M.\n\nAssumption 3 (Bounded sectional curvature). There is a finite constant K such that\n\n|K(x)[u, v]| \u2264 K   for all x \u2208 M and u, v \u2208 T_xM.\n\nK is an intrinsic parameter of the manifold capturing the curvature. We list a few examples here: (i)\nA sphere of radius R has a constant sectional curvature K = 1/R^2 (Lee, 1997, Theorem 1.9). If\n\n4\n\n\fproperty of manifold. 
If the manifold is a sphere(cid:80)d+1\n\nthe radius is bigger, K is smaller which means the sphere is less curved; (ii) A hyper-bolic space\nR of radius R has K = \u22121/R2 (Lee, 1997, Theorem 1.9); (iii) For sectional curvature of the\nH n\nStiefel and the Grasmann manifolds, we refer readers to Rapcs\u00e1k (2008, Section 5) and Wong (1968),\nrespectively.\nNote that the constant K is not directly related to the RLICQ parameter R de\ufb01ned by Ge et al. (2015)\nwhich \ufb01rst requires describing the manifold by equality constraints. Different representations of the\nsame manifold could lead to different curvature bounds, while sectional curvature is an intrinsic\ni = R2, then K = 1/R2, but more generally\nthere is no simple connection. The smoothness parameters we assume are natural compared to some\nquantity from complicated compositions Lee et al. (2017) (Section 5.5) or pullback (Zhang & Zhang,\n2018). With these assumptions, the main result of this paper is the following:\nTheorem 1. Under Assumptions 1,2,3, let C(K, \u03b2, \u03c1) be a function de\ufb01ned in Lemma 2, \u02c6\u03c1 =\nmax{\u03c1, C(K, \u03b2, \u03c1)}, if \u0001 satis\ufb01es that\n\n(cid:18) I\u02c6\u03c1\n(cid:18) d\u03b2\u221a\n(2)\nwhere c2(K), c3(K) are de\ufb01ned in Lemma 4, then with probability 1 \u2212 \u03b4, perturbed Riemannian\ngradient descent with step size cmax/\u03b2 converges to a (\u0001,\u2212\u221a\n(cid:18) \u03b2d(f (x0) \u2212 f (x\u2217))\n\nlog\n\n(cid:18) d\u03b2\u221a\n(cid:19)(cid:33)\n\n56 max{c2(K), c3(K)}\u03b7\u03b2\n\n\u03b2(f (x0) \u2212 f (x\u2217))\n\n\u02c6\u03c1\u0001)-stationary point of f in\n\n(cid:19)(cid:19)2(cid:41)\n\n\u221a\n12\u02c6c\n\n\u03b7\u03b2\n\n\u0001 \u2264 min\n\n(cid:19)\n\n,\n\ni=1 x2\n\nlog\n\n\u02c6\u03c1\u0001\u03b4\n\n(cid:40)\n\n\u02c6\u03c1\u0001\u03b4\n\n\u02c6\u03c1\n\n(cid:32)\n\nO\n\nlog4\n\n\u00012\n\n\u00012\u03b4\n\niterations.\nProof roadmap. 
For a function satisfying the smoothness conditions (Assumptions 1 and 2), we use a\nlocal upper bound of the objective based on the third-order Taylor expansion (see supplementary\nmaterial Section A for a review),\n\nf(u) \u2264 f(x) + \u27e8grad f(x), Exp_x^{-1}(u)\u27e9 + (1/2) H(x)[Exp_x^{-1}(u), Exp_x^{-1}(u)] + (\u03c1/6) \u2016Exp_x^{-1}(u)\u2016^3.\n\nWhen the norm of the gradient is large (not near a saddle), the following lemma guarantees the\ndecrease of the objective function in one iteration.\nLemma 1. (Boumal et al., 2018) Under Assumption 1, by choosing \u00af\u03b7 = min{\u03b7, I/\u2016grad f(u)\u2016} =\nO(1/\u03b2), the Riemannian gradient descent algorithm is monotonically descending: f(u+) \u2264 f(u) \u2212\n(1/2) \u00af\u03b7 \u2016grad f(u)\u2016^2.\nThus our main challenge in proving the main theorem is the Riemannian gradient behaviour at an\napproximate saddle point:\n1. Similar to the Euclidean case studied by Jin et al. (2017a), we need to bound the \u201cthickness\u201d of the\n\u201cstuck region\u201d where the perturbation fails. We still use a pair of hypothetical auxiliary sequences and\nstudy the \u201ccoupling\u201d sequences. When two perturbations couple in the thinnest direction of the stuck\nregion, their distance grows and one of them escapes from the saddle point.\n2. However, our iterates are evolving on a manifold rather than a Euclidean space, so our strategy is to\nmap the iterates back to an appropriate fixed tangent space where we can use the Euclidean analysis.\nThis is done using the inverse of the exponential map and various parallel transports.\n3. Several key challenges arise in doing this. Unlike Jin et al. (2017a), the structure of the manifold\ninteracts with the local approximation of the objective function in a complicated way. On the other\nhand, unlike recent work on Riemannian optimization by Boumal et al. 
(2016a), we do not have\naccess to a second order oracle and we need to understand how the sectional curvature and the\ninjectivity radius (which both capture intrinsic manifold properties) affect the behavior of the first\norder iterates.\n4. Our main contribution is to carefully investigate how the various approximation errors arising\nfrom (a) the linearization of the iteration couplings and (b) their mappings to a common tangent\nspace can be handled on manifolds with bounded sectional curvature. We address these challenges in\na sequence of lemmas (Lemmas 3 through 6), which we combine to linearize the coupling iterations in a\ncommon tangent space and precisely control the approximation error. This result is formally stated in\nthe following lemma.\n\nFigure 2: (a) Eq. (5). First map w and w+ to T_uM and T_{u+}M, and transport the two vectors to\nT_xM, and get their relation. (b) Lemma 3 bounds the difference of two steps starting from x: (1)\ntake a y + a step in T_xM and map it to the manifold, and (2) take an a step in T_xM, map it to the manifold, call it\nz, then take a \u0393_x^z y step in T_zM and map it to the manifold. Exp_z(\u0393_x^z y) is close to Exp_x(y + a).\n\nLemma 2. Define \u03b3 = \u221a(\u02c6\u03c1\u03b5), \u03ba = \u03b2/\u03b3, and S = \u221a(\u03b7\u03b2) (\u03b3/\u02c6\u03c1) log^{-1}(d\u03ba/\u03b4). Let x be an (\u03b5, \u2212\u221a(\u02c6\u03c1\u03b5))\nsaddle point, and define u+ = Exp_u(\u2212\u03b7 grad f(u)) and w+ = Exp_w(\u2212\u03b7 grad f(w)). Under\nAssumptions 1, 2, 3, if all pairwise distances between u, w, u+, w+, x are less than 12S, then for\nsome explicit constant C(K, \u03c1, \u03b2) depending only on K, \u03c1, \u03b2, we have\n\n\u2016Exp_x^{-1}(w+) \u2212 Exp_x^{-1}(u+) \u2212 (I \u2212 \u03b7H(x))(Exp_x^{-1}(w) \u2212 Exp_x^{-1}(u))\u2016\n  \u2264 C(K, \u03c1, \u03b2) d(u, w) (d(u, w) + d(u, x) + d(w, x)).    (3)\n\nThe proof of this lemma includes novel contributions, by strengthening a known result (Lemma 3) and\nby combining known inequalities in novel ways (Lemmas 4 to 6), which allow us to control all the\napproximation errors and arrive at the tight rate of escape for the algorithm.\n5 Proof of Lemma 2\nLemma 2 controls the error of the linear approximation of the iterates when mapped to T_xM. In this\nsection, we assume that all points are within a region of diameter R := 12S \u2264 I (the inequality follows\nfrom Eq. (2)), i.e., the distance between any two points in the following lemmas is less than R. The proof\nof Lemma 2 is based on the following sequence of lemmas.\nLemma 3. Let x \u2208 M and y, a \u2208 T_xM. Let us denote z = Exp_x(a). Then under Assumption 3,\n\nd(Exp_x(y + a), Exp_z(\u0393_x^z y)) \u2264 c1(K) min{\u2016a\u2016, \u2016y\u2016}(\u2016a\u2016 + \u2016y\u2016)^2.    (4)\n\nThis lemma tightens the result of Karcher (1977, C2.3), which only shows an upper bound\nO(\u2016a\u2016(\u2016a\u2016 + \u2016y\u2016)^2). We prove the upper bound O(\u2016y\u2016(\u2016a\u2016 + \u2016y\u2016)^2) in the supplement. We\nalso need the following lemma showing that both the exponential map and its inverse are Lipschitz.\nLemma 4. Let x, y, z \u2208 M, and the distance of each two points is no bigger than R. 
Then under\nassumption 3\n\n(1 + c2(K)R2)\u22121d(y, z) \u2264 (cid:107)Exp\u22121\n\nx (y) \u2212 Exp\u22121\n\nx (z)(cid:107) \u2264 (1 + c3(K)R2)d(y, z).\n\nIntuitively this lemma relates the norm of the difference of two vectors of TxM to the distance\nbetween the corresponding points on the manifold M and follows from bounds on the Hessian of the\nsquare-distance function (Sakai, 1996, Ex. 4 p. 154). The upper-bound is directly proven by Karcher\n(1977, Proof of Cor. 1.6), and we prove the lower-bound via Lemma 3 in the supplement.\nThe following contraction result is fairly classical and is proven using the Rauch comparison theorem\nfrom differential geometry (Cheeger & Ebin, 2008).\nLemma 5. (Mangoubi et al., 2018, Lemma 1) Under Assumption 3, for x, y \u2208 M and w \u2208 TxM,\n\nd(Expx(w), Expy(\u0393y\n\nxw)) \u2264 c4(K)d(x, y).\n\nFinally we need the following corollary of the Ambrose-Singer theorem (Ambrose & Singer, 1953).\nLemma 6. (Karcher, 1977, Section 6) Under Assumption 3, for x, y, z \u2208 M and w \u2208 TxM,\n\n(cid:107)\u0393z\n\nxw \u2212 \u0393z\n\nxw(cid:107) \u2264 c5(K)d(x, y)d(y, z)(cid:107)w(cid:107).\n\ny\u0393y\n\n6\n\n\fLemma 3 through 6 are mainly proven in the literature, and we make up the missing part in Supple-\nmentary material Section B. Then we prove Lemma 2 in Supplementary material Section B.\nThe spirit of the proof is to linearize the manifold using the exponential map and its inverse,\nand to carefully bounds the various error terms caused by the approximation. Let us denote by\n\u03b8 = d(u, w) + d(u, x) + d(w, x).\n1. We \ufb01rst show using twice Lemma 3 and Lemma 5 that\nd(Expu(Exp\u22121\n2. 
We use Lemma 4 to linearize this iteration in TuM as\n\nwgradf (w)), Expu(\u2212\u03b7gradf (u) + \u0393u\n\n(w+))) = O(\u03b8d(u, w)).\n\nu (w) \u2212 \u03b7\u0393u\n\nExp\u22121\n\nu+\n\nu+\n\nu (w) + \u03b7[gradf (u) \u2212 \u0393u\n\nwgradf (w)](cid:107) = O(\u03b8d(u, w)).\n\n(cid:107)\u0393u\n\n(w+) \u2212 Exp\u22121\n3. Using the Hessian Lipschitzness\n\nExp\u22121\n\nu+\n\nu+\n\n(cid:107)\u0393u\n\nu+\n\nExp\u22121\n\nu+\n\n(w+)) \u2212 Exp\u22121\n\nu (w) + \u03b7H(u)Exp\u22121\n\nu (w)(cid:107) = O(\u03b8d(u, w)).\n\n3. We use Lemma 6 to map to TxM and the Hessian Lipschitzness to compare H(u) to H(x). This\nis an important intermediate result (see Lemma 1 in Supplementary material Section B).\nu (w)(cid:107) = O(\u03b8d(u, w)).\n\nu (w) + \u03b7H(x)\u0393x\n\n(w+) \u2212 \u0393x\n\nuExp\u22121\n\nuExp\u22121\n\nExp\u22121\n\n(cid:107)\u0393x\n\n(5)\n\nu+\n\nu+\n\n4. We use Lemma 3 and 4 to approximate two iteration updates in TxM.\n\n(cid:107)Exp\u22121\n\nx (w) \u2212 (Exp\u22121\n\nx (u) + \u0393x\n\nuExp\u22121\n\nu (w))(cid:107) \u2264 O(\u03b8d(u, w)).\n\n(6)\n\nAnd same for the u+, w+ pair replacing u, w.\n5. Combining Eq. (5) and Eq. (6) together, we obtain\n\n(cid:107)Exp\u22121\n\nx (w+) \u2212 Exp\u22121\n\nx (u+) \u2212 (I \u2212 \u03b7H(x))(Exp\u22121\n\nx (w) \u2212 Exp\u22121\n\nx (u))(cid:107) \u2264 O(\u03b8d(u, w)).\n\nx (ut) and Exp\u22121\n\nx (\u00b7) to map them to the same tangent space at x.\n\nNow note that, the iterations u, u+, w, w+ of the algorithm are both on the manifold. We use\nExp\u22121\nTherefore we have linearized the two coupled trajectories Exp\u22121\nx (wt) in a common\ntangent space, and we can modify the Euclidean escaping saddle analysis thanks to the error bound\nwe proved in Lemma 2.\n6 Proof of main theorem\nIn this section we suppose all assumptions in Section 4 hold. The proof strategy is to show with\nhigh probability that the function value decreases of F in T iterations at an approximate saddle\npoint. 
Lemma 7 suggests that, if after a perturbation and T steps, the iterate is \u2126(S ) far from the\napproximate saddle point, then the function value decreases. If the iterates do not move far, the\nperturbation falls in a stuck region. Lemma 8 uses a coupling strategy, and suggests that the width of\nthe stuck region is small in the negative eigenvector direction of the Riemannian Hessian.\nDe\ufb01ne\n\nF = \u03b7\u03b2\n\n\u03b33\n\u02c6\u03c12 log\n\n\u22123(\n\nd\u03ba\n\u03b4\n\n\u03b32\n\u02c6\u03c1\n\n\u22122(\n\nlog\n\nd\u03ba\n\u03b4\n\n), T =\n\nlog( d\u03ba\n\u03b4 )\n\n.\n\n\u03b7\u03b3\n\nAt an approximate saddle point \u02dcx, let y be in the neighborhood of \u02dcx where d(y, \u02dcx) \u2264 I, denote\n\n), G =(cid:112)\u03b7\u03b2\n\n\u02dcfy(x) := f (y) + (cid:104)gradf (y), Exp\u22121\n\ny (\u02dcx)(cid:105) +\n\n\u0393y\n\u02dcxH(\u02dcx)\u0393\u02dcx\n\ny[Exp\u22121\n\ny (\u02dcx), Exp\u22121\n\ny (\u02dcx)].\n\n1\n2\n\nLet (cid:107)gradf (\u02dcx)(cid:107) \u2264 G and \u03bbmin(H(\u02dcx)) \u2264 \u2212\u03b3. We consider two iterate sequences, u0, u1, ... and\nw0, w1, ... where u0, w0 are two perturbations at \u02dcx.\nLemma 7. Assume Assumptions 1, 2, 3 and Eq. (2) hold. There exists a constant cmax, \u2200\u02c6c > 3, \u03b4 \u2208\ne ], for any u0 with d(\u02dcx, u0) \u2264 2S /(\u03ba log( d\u03ba\n(0, d\u03ba\n\n(cid:110)\n\n(cid:110)\nt| \u02dcfu0(ut) \u2212 f (u0) \u2264 \u22123F(cid:111)\n\n\u03b4 )), \u03ba = \u03b2/\u03b3.\n\n, \u02c6cT(cid:111)\n\n,\n\nT = min\n\ninf\nt\n\nthen \u2200\u03b7 \u2264 cmax/\u03b2, we have \u22000 < t < T , d(u0, ut) \u2264 3(\u02c6cS ).\n\n7\n\n\fLemma 8. Assume Assumptions 1, 2, 3 and Eq. (2) hold. Take two points u0 and w0 which are\nperturbed from an approximate saddle point, where d(\u02dcx, u0) \u2264 2S /(\u03ba log( d\u03ba\n\u02dcx (w0) \u2212\nExp\u22121\nd), 1], and the algorithm\nruns two sequences {ut} and {wt} starting from u0 and w0. 
Denote\n\n\u02dcx (u0) = \u00b5re1, e1 is the smallest eigenvector6 of H(\u02dcx), \u00b5 \u2208 [\u03b4/(2\n\n\u03b4 )), Exp\u22121\n\n\u221a\n\n(cid:110)\n\n(cid:110)\nt| \u02dcfw0(wt) \u2212 f (w0) \u2264 \u22123F(cid:111)\n\n, \u02c6cT(cid:111)\n\n,\n\nT = min\n\ninf\nt\n\nthen \u2200\u03b7 \u2264 cmax/l, if \u22000 < t < T , d(\u02dcx, ut) \u2264 3(\u02c6cS ), we have T < \u02c6cT .\nWe prove Lemma 7 and 8 in supplementary material Section C. We also prove, in the same section,\nthe main theorem using the coupling strategy of Jin et al. (2017a). but with the additional dif\ufb01culty of\ntaking into consideration the effect of the Riemannian geometry (Lemma 2) and the injectivity radius.\n\n7 Examples\nkPCA. We consider the kPCA problem, where we want to \ufb01nd the k \u2264 n principal eigenvectors of\na symmetric matrix H \u2208 Rn\u00d7n, as an example (Tripuraneni et al., 2018). This corresponds to\n\nmin\n\nX\u2208Rn\u00d7k\n\n\u2212 1\n2\n\ntr(X T HX)\n\nsubject to X T X = I,\n\nwhich is an optimization problem on the Grassmann manifold de\ufb01ned by the constraint X T X = I.\nIf the eigenvalues of H are distinct, we denote by v1,...,vn the eigenvectors of H, corresponding to\neigenvalues with decreasing order. Let V \u2217 = [v1, ..., vk] be the matrix with columns composed of the\ntop k eigenvectors of H, then the local minimizers of the objective function are V \u2217G for all unitary\nmatrices G \u2208 Rk\u00d7k. Denote also by V = [vi1 , ..., vik ] the matrix with columns composed of k\ndistinct eigenvectors, then the \ufb01rst order stationary points of the objective function (with Riemannian\ngradient being 0) are V G for all unitary matrices G \u2208 Rk\u00d7k. In our numerical experiment, we choose\nH to be a diagonal matrix H = diag(0, 1, 2, 3, 4) and let k = 3. The Euclidean basis (ei) are an\neigenbasis of H and the \ufb01rst order stationary points of the objective function are [ei1 , ei2 , ei3]G with\ndistinct basis and G being unitary. 
The local minimizers are [e3, e4, e5]G. We start the iteration at\nX0 = [e2, e3, e4] and see in Fig. 3 the algorithm converges to a local minimum.\nBurer-Monteiro approach for certain low rank problems. Following Boumal et al. (2016b), we\nconsider, for A \u2208 Sd\u00d7d and r(r + 1)/2 \u2264 d, the problem\n\ntrace(AX), s.t. diag(X) = 1, X (cid:23) 0, rank(X) \u2264 r.\n\nmin\nX\u2208Sd\u00d7d\n\nWe factorize X by Y Y T with an overparametrized Y \u2208 Rd\u00d7p and p(p + 1)/2 \u2265 d. Then any local\nminimum of\n\nmin\n\nY \u2208Rd\u00d7p\n\ntrace(AY Y T ), s.t. diag(Y Y T ) = 1,\n\nis a global minimum where Y Y T = X\u2217 (Boumal et al., 2016b). Let f (Y ) = 1\n2 trace(AY Y T ). In\nthe experiment, we take A \u2208 R100\u00d720 being a sparse matrix that only the upper left 5 \u00d7 5 block\nis random and other entries are 0. Let the initial point Y0 \u2208 R100\u00d720, such that (Y0)i,j = 1 for\n5j \u2212 4 \u2264 i \u2264 5j and (Y0)i,j = 0 otherwise. Then Y0 is a saddle point. We see in Fig. 3 the algorithm\nconverges to the global optimum.\n\nSummary We have shown that for the constrained optimization problem of minimizing f (x)\nsubject to a manifold constraint as long as the function and the manifold are appropriately smooth, a\nperturbed Riemannian gradient descent algorithm will escape saddle points with a rate of order 1/\u00012\nin the accuracy \u0001, polylog in manifold dimension d, and depends polynomially on the curvature and\nsmoothness parameters.\nA natural extension of our result is to consider other variants of gradient descent, such as the heavy\nball method, Nesterov\u2019s acceleration, and the stochastic setting. 
The question is whether these algorithms, with appropriate modification (with manifold constraints), would converge quickly to a second-order stationary point (not just a first-order stationary point, as studied in recent literature), and whether it is possible to relate the convergence rate to the smoothness of the manifold.\n\n6 \u201csmallest eigenvector\u201d means the eigenvector corresponding to the smallest eigenvalue.\n\nFigure 3: (a) kPCA problem. We start from an approximate saddle point, and it converges to a local minimum (which is also a global minimum). (b) Burer-Monteiro approach. Plot of $f(Y) = \\frac{1}{2} \\operatorname{trace}(A Y Y^T)$ versus iterations. We start from the saddle point, and it converges to a local minimum (which is also a global minimum).\n\nReferences\n\nAbsil, P.-A., Mahony, R., and Sepulchre, R. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009.\n\nAgarwal, N., Boumal, N., Bullins, B., and Cartis, C. Adaptive regularization with cubics on manifolds with a first-order analysis. arXiv preprint arXiv:1806.00065, 2018.\n\nAmbrose, W. and Singer, I. M. A theorem on holonomy. Transactions of the American Mathematical Society, 75(3):428\u2013443, 1953.\n\nB\u00e9cigneul, G. and Ganea, O.-E. Riemannian adaptive optimization methods. In International Conference on Learning Representations, 2019.\n\nBonnabel, S. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217\u20132229, 2013.\n\nBoumal, N. and Absil, P.-A. RTRMC: A Riemannian trust-region method for low-rank matrix completion. In Advances in Neural Information Processing Systems, pp. 406\u2013414, 2011.\n\nBoumal, N., Mishra, B., Absil, P.-A., and Sepulchre, R. Manopt, a Matlab toolbox for optimization on manifolds. JMLR, 2014.\n\nBoumal, N., Absil, P.-A., and Cartis, C. Global rates of convergence for nonconvex optimization on manifolds. 
IMA Journal of Numerical Analysis, 2016a.\n\nBoumal, N., Voroninski, V., and Bandeira, A. The non-convex Burer-Monteiro approach works on smooth semidefinite programs. In Advances in Neural Information Processing Systems, pp. 2757\u20132765, 2016b.\n\nBoumal, N., Absil, P.-A., and Cartis, C. Global rates of convergence for nonconvex optimization on manifolds. IMA Journal of Numerical Analysis, pp. drx080, 2018. doi: 10.1093/imanum/drx080. URL http://dx.doi.org/10.1093/imanum/drx080.\n\nCarmon, Y. and Duchi, J. C. Gradient descent efficiently finds the cubic-regularized non-convex Newton step. arXiv preprint arXiv:1612.00547, 2017.\n\nCheeger, J. and Ebin, D. G. Comparison Theorems in Riemannian Geometry. AMS Chelsea Publishing, Providence, RI, 2008.\n\nDo Carmo, M. P. Differential Geometry of Curves and Surfaces. Courier Dover Publications, 2016.\n\nDu, S. S., Jin, C., Lee, J. D., Jordan, M. I., Singh, A., and Poczos, B. Gradient descent can take exponential time to escape saddle points. In Advances in Neural Information Processing Systems, pp. 1067\u20131077, 2017.\n\nEdelman, A., Arias, T. A., and Smith, S. T. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303\u2013353, 1998.\n\nGe, R., Huang, F., Jin, C., and Yuan, Y. Escaping from saddle points \u2013 online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pp. 797\u2013842, 2015.\n\nHu, J., Milzarek, A., Wen, Z., and Yuan, Y. Adaptive quadratically regularized Newton method for Riemannian optimization. SIAM J. Matrix Anal. Appl., 39(3):1181\u20131207, 2018.\n\nIshteva, M., Absil, P.-A., Van Huffel, S., and De Lathauwer, L. Best low multilinear rank approximation of higher-order tensors, based on the Riemannian trust-region scheme. 
SIAM Journal on Matrix Analysis and Applications, 32(1):115\u2013135, 2011.\n\nJin, C., Ge, R., Netrapalli, P., Kakade, S., and Jordan, M. I. How to escape saddle points efficiently. In ICML, 2017a.\n\nJin, C., Netrapalli, P., and Jordan, M. I. Accelerated gradient descent escapes saddle points faster than gradient descent. arXiv preprint arXiv:1711.10456, 2017b.\n\nKarcher, H. Riemannian center of mass and mollifier smoothing. Communications on Pure and Applied Mathematics, 30(5):509\u2013541, 1977.\n\nKasai, H. and Mishra, B. Inexact trust-region algorithms on Riemannian manifolds. In Advances in Neural Information Processing Systems 31, pp. 4254\u20134265, 2018.\n\nKhuzani, M. B. and Li, N. Stochastic primal-dual method on Riemannian manifolds with bounded sectional curvature. arXiv preprint arXiv:1703.08167, 2017.\n\nLee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. Gradient descent only converges to minimizers. In Conference on Learning Theory, pp. 1246\u20131257, 2016.\n\nLee, J. D., Panageas, I., Piliouras, G., Simchowitz, M., Jordan, M. I., and Recht, B. First-order methods almost always avoid saddle points. arXiv preprint arXiv:1710.07406, 2017.\n\nLee, J. M. Riemannian Manifolds: An Introduction to Curvature. Graduate Texts in Mathematics, 176. Springer, New York, 1997. ISBN 9780387227269.\n\nMangoubi, O., Smith, A., et al. Rapid mixing of geodesic walks on manifolds with positive curvature. The Annals of Applied Probability, 28(4):2501\u20132543, 2018.\n\nMiolane, N., Mathe, J., Donnat, C., Jorda, M., and Pennec, X. geomstats: a Python package for Riemannian geometry in machine learning. arXiv preprint arXiv:1805.08308, 2018.\n\nMokhtari, A., Ozdaglar, A., and Jadbabaie, A. Escaping saddle points in constrained optimization. arXiv preprint arXiv:1809.02162, 2018.\n\nNouiehed, M., Lee, J. D., and Razaviyayn, M. Convergence to second-order stationarity for constrained non-convex optimization. 
arXiv preprint arXiv:1810.02024, 2018.\n\nPemantle, R. Nonconvergence to unstable points in urn models and stochastic approximations. The Annals of Probability, pp. 698\u2013712, 1990.\n\nRapcs\u00e1k, T. Sectional curvatures in nonlinear optimization. Journal of Global Optimization, 40(1-3):375\u2013388, 2008.\n\nSakai, T. Riemannian Geometry, volume 149 of Translations of Mathematical Monographs. American Mathematical Society, 1996.\n\nSun, J., Qu, Q., and Wright, J. Complete dictionary recovery over the sphere II: Recovery by Riemannian trust-region method. IEEE Transactions on Information Theory, 63(2):885\u2013914, 2017.\n\nSun, Y. and Fazel, M. Escaping saddle points efficiently in equality-constrained optimization problems. In Workshop on Modern Trends in Nonconvex Optimization for Machine Learning, International Conference on Machine Learning, 2018.\n\nTripuraneni, N., Flammarion, N., Bach, F., and Jordan, M. I. Averaging Stochastic Gradient Descent on Riemannian Manifolds. arXiv preprint arXiv:1802.09128, 2018.\n\nWong, Y.-c. Sectional curvatures of Grassmann manifolds. Proc. Nat. Acad. Sci. U.S.A., 60:75\u201379, 1968.\n\nZhang, H. and Sra, S. First-order methods for geodesically convex optimization. arXiv:1602.06053, 2016. Preprint.\n\nZhang, H., Reddi, S. J., and Sra, S. Riemannian SVRG: fast stochastic optimization on Riemannian manifolds. In Advances in Neural Information Processing Systems, pp. 4592\u20134600, 2016.\n\nZhang, J. and Zhang, S. A cubic regularized Newton\u2019s method over Riemannian manifolds. arXiv preprint arXiv:1805.05565, 2018.", "award": [], "sourceid": 3961, "authors": [{"given_name": "Yue", "family_name": "Sun", "institution": "University of Washington"}, {"given_name": "Nicolas", "family_name": "Flammarion", "institution": "EPFL"}, {"given_name": "Maryam", "family_name": "Fazel", "institution": "University of Washington"}]}