{"title": "Riemannian SVRG: Fast Stochastic Optimization on Riemannian Manifolds", "book": "Advances in Neural Information Processing Systems", "page_first": 4592, "page_last": 4600, "abstract": "We study optimization of finite sums of \\emph{geodesically} smooth functions on Riemannian manifolds. Although variance reduction techniques for optimizing finite-sums have witnessed tremendous attention in the recent years, existing work is limited to vector space problems. We introduce \\emph{Riemannian SVRG} (\\rsvrg), a new variance reduced Riemannian optimization method. We analyze \\rsvrg for both geodesically \\emph{convex} and \\emph{nonconvex} (smooth) functions. Our analysis reveals that \\rsvrg inherits advantages of the usual SVRG method, but with factors depending on curvature of the manifold that influence its convergence. To our knowledge, \\rsvrg is the first \\emph{provably fast} stochastic Riemannian method. Moreover, our paper presents the first non-asymptotic complexity analysis (novel even for the batch setting) for nonconvex Riemannian optimization. Our results have several implications; for instance, they offer a Riemannian perspective on variance reduced PCA, which promises a short, transparent convergence analysis.", "full_text": "Riemannian SVRG: Fast Stochastic Optimization on\n\nRiemannian Manifolds\n\nHongyi Zhang\n\nSashank J. Reddi\n\nMIT\n\nCarnegie Mellon University\n\nSuvrit Sra\n\nMIT\n\nAbstract\n\nWe study optimization of \ufb01nite sums of geodesically smooth functions on Rieman-\nnian manifolds. Although variance reduction techniques for optimizing \ufb01nite-sums\nhave witnessed tremendous attention in the recent years, existing work is lim-\nited to vector space problems. We introduce Riemannian SVRG (RSVRG), a new\nvariance reduced Riemannian optimization method. We analyze RSVRG for both\ngeodesically convex and nonconvex (smooth) functions. 
Our analysis reveals that RSVRG inherits advantages of the usual SVRG method, but with factors depending on the curvature of the manifold that influence its convergence. To our knowledge, RSVRG is the first provably fast stochastic Riemannian method. Moreover, our paper presents the first non-asymptotic complexity analysis (novel even for the batch setting) for nonconvex Riemannian optimization. Our results have several implications; for instance, they offer a Riemannian perspective on variance reduced PCA, which promises a short, transparent convergence analysis.

1 Introduction

We study the following rich class of (possibly nonconvex) finite-sum optimization problems:

$$\min_{x \in \mathcal{X} \subseteq \mathcal{M}} f(x) \triangleq \frac{1}{n}\sum_{i=1}^{n} f_i(x), \qquad (1)$$

where $(\mathcal{M}, g)$ is a Riemannian manifold with the Riemannian metric $g$, and $\mathcal{X} \subseteq \mathcal{M}$ is a geodesically convex set. We assume that each $f_i : \mathcal{M} \to \mathbb{R}$ is geodesically $L$-smooth (see §2). Problem (1) generalizes the fundamental machine learning problem of empirical risk minimization, which is usually cast in vector spaces, to a Riemannian setting. It also includes as special cases important problems such as principal component analysis (PCA), independent component analysis (ICA), dictionary learning, and mixture modeling, among others (see, e.g., the related work section).

The Euclidean version of (1), where $\mathcal{M} = \mathbb{R}^d$ and $g$ is the Euclidean inner product, has been the subject of intense algorithmic development in machine learning and optimization, starting with the classical work of Robbins and Monro [26] up to the recent spate of work on variance reduction [10; 18; 20; 25; 28]. However, when $(\mathcal{M}, g)$ is a nonlinear Riemannian manifold, much less is known beyond [7; 38].

When solving problems with manifold constraints, one common approach is to alternate between optimizing in the ambient Euclidean space and "projecting" onto the manifold.
For example, two well-known methods to compute the leading eigenvector of a symmetric matrix, power iteration and Oja's algorithm [23], are in essence projected gradient and projected stochastic gradient algorithms. For certain manifolds (e.g., positive definite matrices), projections can be quite expensive to compute.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

An effective alternative is to use Riemannian optimization¹, which directly operates on the manifold in question. This mode of operation allows Riemannian optimization to view the constrained optimization problem (1) as an unconstrained problem on a manifold, and thus to be "projection-free." More important is its conceptual value: viewing a problem through the Riemannian lens, one can discover insights into problem geometry, which can translate into better optimization algorithms.

Although the Riemannian approach is appealing, our knowledge of it is fairly limited. In particular, there is little analysis of its global complexity (a.k.a. non-asymptotic convergence rate), in part due to the difficulty posed by the nonlinear metric. Only very recently did Zhang and Sra [38] develop the first global complexity analysis of batch and stochastic gradient methods for geodesically convex functions. However, the batch and stochastic gradient methods in [38] suffer from problems similar to their vector space counterparts. For solving finite-sum problems with $n$ components, the full-gradient method requires $n$ derivatives at each step; the stochastic method requires only one derivative, but at the expense of slower $O(1/\epsilon^2)$ convergence to an $\epsilon$-accurate solution.

These issues have motivated much of the recent progress on faster stochastic optimization in vector spaces by using variance reduction [10; 18; 28] techniques.
However, all ensuing methods critically rely on properties of vector spaces, so adapting them to the context of Riemannian manifolds poses major challenges. Given the richness of Riemannian optimization (it includes vector space optimization as a special case) and its growing number of applications, developing fast stochastic Riemannian optimization is important. It will help us apply Riemannian optimization to large-scale problems, while offering a new set of algorithmic tools for the practitioner's repertoire.

Contributions. We summarize the key contributions of this paper below.

• We introduce Riemannian SVRG (RSVRG), a variance reduced Riemannian stochastic gradient method based on SVRG [18]. We analyze RSVRG for geodesically strongly convex functions through a novel theoretical analysis that accounts for the nonlinear (curved) geometry of the manifold to yield linear convergence rates.

• Building on recent advances in variance reduction for nonconvex optimization [3; 25], we generalize the convergence analysis of RSVRG to (geodesically) nonconvex functions and also to gradient dominated functions (see §2 for the definition). Our analysis provides the first stochastic Riemannian method that is provably superior to both batch and stochastic (Riemannian) gradient methods for nonconvex finite-sum problems.

• Using a Riemannian formulation and applying our result for (geodesically) gradient-dominated functions, we provide new insights, and a short, transparent analysis explaining the fast convergence of variance reduced PCA for computing the leading eigenvector of a symmetric matrix.

To our knowledge, this paper provides the first stochastic gradient method with global linear convergence rates for geodesically strongly convex functions, as well as the first non-asymptotic convergence rates for geodesically nonconvex optimization (even in the batch case).
Our analysis reveals how manifold geometry, in particular curvature, impacts convergence rates. We illustrate the benefits of RSVRG by showing an application to computing leading eigenvectors of a symmetric matrix and to the task of computing the Riemannian centroid of covariance matrices, a problem that has received great attention in the literature [5; 16; 38].

Related Work. Variance reduction techniques, such as control variates, are widely used in Monte Carlo simulations [27]. In linear spaces, variance reduced methods for solving finite-sum problems have recently witnessed a huge surge of interest [e.g., 4; 10; 14; 18; 20; 28; 36]. They have been shown to accelerate stochastic optimization for strongly convex objectives, convex objectives, nonconvex $f_i$ ($i \in [n]$), and even when both $f$ and $f_i$ ($i \in [n]$) are nonconvex [3; 25]. Reddi et al. [25] further proved global linear convergence for gradient dominated nonconvex problems. Our analysis is inspired by [18; 25], but applies to the substantially more general Riemannian optimization setting.

References on Riemannian optimization can be found in [1; 33], where the analysis is limited to asymptotic convergence (except [33, Theorem 4.2], which proves linear rate convergence for a first-order line search method with bounded and positive definite Hessian). Stochastic Riemannian optimization has been previously considered in [7; 21], though with only asymptotic convergence analysis, and without any rates.

¹Riemannian optimization is optimization on a known manifold structure. Note the distinction from manifold learning, which attempts to learn a manifold structure from data. We briefly review some Riemannian optimization applications in the related work.
Many applications of Riemannian optimization are known, including matrix factorization\non \ufb01xed-rank manifold [32; 34], dictionary learning [8; 31], optimization under orthogonality con-\nstraints [11; 22], covariance estimation [35], learning elliptical distributions [30; 39], and Gaussian\nmixture models [15]. Notably, some nonconvex Euclidean problems are geodesically convex, for\nwhich Riemannian optimization can provide similar guarantees to convex optimization. Zhang and\nSra [38] provide the \ufb01rst global complexity analysis for \ufb01rst-order Riemannian algorithms, but their\nanalysis is restricted to geodesically convex problems with full or stochastic gradients. In contrast,\nwe propose RSVRG, a variance reduced Riemannian stochastic gradient algorithm, and analyze its\nglobal complexity for both geodesically convex and nonconvex problems.\nIn parallel with our work, [19] also proposed and analyzed RSVRG speci\ufb01cally for the Grassmann\nmanifold. Their complexity analysis is restricted to local convergence to strict local minima, which\nessentially corresponds to our analysis of (locally) geodesically strongly convex functions.\n\n2 Preliminaries\nBefore formally discussing Riemannian optimization, let us recall some foundational concepts of\nRiemannian geometry. For a thorough review one can refer to any classic text, e.g.,[24].\nA Riemannian manifold (M, g) is a real smooth manifold M equipped with a Riemannain metric\ng. The metric g induces an inner product structure in each tangent space TxM associated with\nevery x 2M . We denote the inner product of u, v 2 TxM as hu, vi , gx(u, v); and the norm\nof u 2 TxM is de\ufb01ned as kuk ,pgx(u, u). The angle between u, v is de\ufb01ned as arccos hu,vi\n.\nkukkvk\nA geodesic is a constant speed curve : [0, 1] !M that is locally distance minimizing. An\nexponential map Expx : TxM!M maps v in TxM to y on M, such that there is a geodesic \nwith (0) = x, (1) = y and \u02d9(0) , d\ndt (0) = v. 
If between any two points in $\mathcal{X} \subseteq \mathcal{M}$ there is a unique geodesic, the exponential map has an inverse $\mathrm{Exp}_x^{-1} : \mathcal{X} \to T_x\mathcal{M}$, and the geodesic is the unique shortest path, with $\|\mathrm{Exp}_x^{-1}(y)\| = \|\mathrm{Exp}_y^{-1}(x)\|$ the geodesic distance between $x, y \in \mathcal{X}$.

Parallel transport $\Gamma_x^y : T_x\mathcal{M} \to T_y\mathcal{M}$ maps a vector $v \in T_x\mathcal{M}$ to $\Gamma_x^y v \in T_y\mathcal{M}$, while preserving norm, and, roughly speaking, "direction," analogous to translation in $\mathbb{R}^d$. A tangent vector of a geodesic $\gamma$ remains tangent if parallel transported along $\gamma$. Parallel transport preserves inner products.

Figure 1: Illustration of manifold operations. (Left) A vector $v$ in $T_x\mathcal{M}$ is mapped to $\mathrm{Exp}_x(v)$; (right) a vector $v$ in $T_x\mathcal{M}$ is parallel transported to $T_y\mathcal{M}$ as $\Gamma_x^y v$.

The geometry of a Riemannian manifold is determined by its Riemannian metric tensor through various characterizations of curvature. Let $u, v \in T_x\mathcal{M}$ be linearly independent, so that they span a two dimensional subspace of $T_x\mathcal{M}$. Under the exponential map, this subspace is mapped to a two dimensional submanifold $\mathcal{U} \subseteq \mathcal{M}$. The sectional curvature $\kappa(x, \mathcal{U})$ is defined as the Gauss curvature of $\mathcal{U}$ at $x$. As we will mainly analyze manifold trigonometry, for worst-case analysis it is sufficient to consider sectional curvature.

Function Classes. We now define some key terms. A set $\mathcal{X}$ is called geodesically convex if for any $x, y \in \mathcal{X}$, there is a geodesic $\gamma$ with $\gamma(0) = x$, $\gamma(1) = y$ and $\gamma(t) \in \mathcal{X}$ for $t \in [0, 1]$. Throughout the paper, we assume that the function $f$ in (1) is defined on a geodesically convex set $\mathcal{X}$ on a Riemannian manifold $\mathcal{M}$.

We call a function $f : \mathcal{X} \to \mathbb{R}$ geodesically convex (g-convex) if for any $x, y \in \mathcal{X}$ and any geodesic $\gamma$ such that $\gamma(0) = x$, $\gamma(1) = y$ and $\gamma(t) \in \mathcal{X}$ for $t \in [0, 1]$, it holds that

$$f(\gamma(t)) \le (1 - t) f(x) + t f(y).$$

It can be shown that if the inverse exponential map is well-defined, an equivalent definition is that for any $x, y \in \mathcal{X}$, $f(y) \ge f(x) + \langle g_x, \mathrm{Exp}_x^{-1}(y)\rangle$, where $g_x$ is a subgradient of $f$ at $x$ (or the gradient if $f$ is differentiable). A function $f : \mathcal{X} \to \mathbb{R}$ is called geodesically $\mu$-strongly convex ($\mu$-strongly g-convex) if for any $x, y \in \mathcal{X}$ and subgradient $g_x$, it holds that

$$f(y) \ge f(x) + \langle g_x, \mathrm{Exp}_x^{-1}(y)\rangle + \tfrac{\mu}{2}\|\mathrm{Exp}_x^{-1}(y)\|^2.$$

We call a vector field $g : \mathcal{X} \to \mathbb{R}^d$ geodesically $L$-Lipschitz ($L$-g-Lipschitz) if for any $x, y \in \mathcal{X}$,

$$\|g(x) - \Gamma_y^x g(y)\| \le L\,\|\mathrm{Exp}_x^{-1}(y)\|,$$

where $\Gamma_y^x$ is the parallel transport from $y$ to $x$. We call a differentiable function $f : \mathcal{X} \to \mathbb{R}$ geodesically $L$-smooth ($L$-g-smooth) if its gradient is $L$-g-Lipschitz, in which case we have

$$f(y) \le f(x) + \langle g_x, \mathrm{Exp}_x^{-1}(y)\rangle + \tfrac{L}{2}\|\mathrm{Exp}_x^{-1}(y)\|^2.$$

We say $f : \mathcal{X} \to \mathbb{R}$ is $\tau$-gradient dominated if $x^*$ is a global minimizer of $f$ and for every $x \in \mathcal{X}$

$$f(x) - f(x^*) \le \tau \|\nabla f(x)\|^2. \qquad (2)$$

We recall the following trigonometric distance bound that is essential for our analysis:

Lemma 1 ([7; 38]). If $a, b, c$ are the side lengths of a geodesic triangle in a Riemannian manifold with sectional curvature lower bounded by $\kappa_{\min}$, and $A$ is the angle between sides $b$ and $c$ (defined through the inverse exponential map and inner product in tangent space), then

$$a^2 \le \frac{\sqrt{|\kappa_{\min}|}\, c}{\tanh(\sqrt{|\kappa_{\min}|}\, c)}\, b^2 + c^2 - 2bc\cos(A). \qquad (3)$$

An Incremental First-order Oracle (IFO) [2] for (1) takes an $i \in [n]$ and a point $x \in \mathcal{X}$, and returns a pair $(f_i(x), \nabla f_i(x)) \in \mathbb{R} \times T_x\mathcal{M}$. We measure non-asymptotic complexity in terms of IFO calls.

3 Riemannian SVRG

In this section we introduce RSVRG formally.
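The manifold operations defined in §2 have simple closed forms on the unit hypersphere $S^{d-1}$, which is also the manifold used in the PCA application of §4.1. The following sketch (a minimal illustration under the standard sphere formulas, not code from the paper) checks numerically that the inverse exponential map inverts the exponential map, and that parallel transport preserves norms and tangency:

```python
import numpy as np

def exp_map(x, v):
    """Exponential map on the unit sphere: Exp_x(v) = cos(|v|) x + sin(|v|) v/|v|."""
    t = np.linalg.norm(v)
    return x if t < 1e-12 else np.cos(t) * x + np.sin(t) * v / t

def log_map(x, y):
    """Inverse exponential map Exp_x^{-1}(y), assuming y != -x."""
    theta = np.arccos(np.clip(x @ y, -1.0, 1.0))   # geodesic distance d(x, y)
    w = y - (x @ y) * x                            # direction of motion in T_x
    n = np.linalg.norm(w)
    return np.zeros_like(x) if n < 1e-12 else theta * w / n

def transport(x, y, u):
    """Parallel transport of u in T_x to T_y along the minimizing geodesic."""
    v = log_map(x, y)
    theta = np.linalg.norm(v)
    if theta < 1e-12:
        return u
    w = v / theta
    # rotate the component of u along w; the orthogonal component is unchanged
    return u + (w @ u) * ((np.cos(theta) - 1.0) * w - np.sin(theta) * x)

rng = np.random.default_rng(0)
x = rng.normal(size=5); x /= np.linalg.norm(x)     # a point on S^4
v = rng.normal(size=5); v -= (x @ v) * x           # a tangent vector at x
v *= 0.8 / np.linalg.norm(v)
y = exp_map(x, v)
u = rng.normal(size=5); u -= (x @ u) * x           # another tangent vector at x
u_y = transport(x, y, u)
```

Here `transport(x, y, u)` realizes $\Gamma_x^y u$; verifying $\langle y, \Gamma_x^y u\rangle = 0$ and $\|\Gamma_x^y u\| = \|u\|$ confirms the tangency- and norm-preserving properties stated above.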
We make the following standing assumptions: (a) $f$ attains its optimum at $x^* \in \mathcal{X}$; (b) $\mathcal{X}$ is compact, and the diameter of $\mathcal{X}$ is bounded by $D$, that is, $\max_{x,y\in\mathcal{X}} d(x, y) \le D$; (c) the sectional curvature in $\mathcal{X}$ is upper bounded by $\kappa_{\max}$, and within $\mathcal{X}$ the exponential map is invertible; and (d) the sectional curvature in $\mathcal{X}$ is lower bounded by $\kappa_{\min}$. We define the following key geometric constant that captures the impact of manifold curvature:

$$\zeta = \begin{cases} \dfrac{\sqrt{|\kappa_{\min}|}\, D}{\tanh(\sqrt{|\kappa_{\min}|}\, D)}, & \text{if } \kappa_{\min} < 0,\\[1ex] 1, & \text{if } \kappa_{\min} \ge 0. \end{cases} \qquad (4)$$

We note that most (if not all) practical manifold optimization problems can satisfy these assumptions. Our proposed RSVRG algorithm is shown in Algorithm 1. Compared with the Euclidean SVRG, it differs in two key aspects: the variance reduction step uses parallel transport to combine gradients from different tangent spaces; and the exponential map is used (instead of the update $x_t^{s+1} - \eta v_t^{s+1}$).

3.1 Convergence analysis for strongly g-convex functions

In this section, we analyze the global complexity of RSVRG for solving (1), where each $f_i$ ($i \in [n]$) is g-smooth and $f$ is strongly g-convex. In this case, we show that RSVRG has a linear convergence rate. This is in contrast with the $O(1/t)$ rate of the Riemannian stochastic gradient algorithm for strongly g-convex functions [38].

Theorem 1. Assume in (1) each $f_i$ is $L$-g-smooth, and $f$ is $\mu$-strongly g-convex; then if we run Algorithm 1 with Option I and parameters that satisfy

$$\alpha = \frac{3\zeta\eta L^2}{\mu - 2\zeta\eta L^2} + \frac{(1 + 4\zeta\eta^2 - 2\eta\mu)^m (\mu - 5\zeta\eta L^2)}{\mu - 2\zeta\eta L^2} < 1,$$

then with $S$ outer loops, the Riemannian SVRG algorithm produces an iterate $x_a$ that satisfies

$$\mathbb{E}\, d^2(x_a, x^*) \le \alpha^S d^2(x^0, x^*).$$

Algorithm 1: RSVRG$(x^0, m, \eta, S)$
Parameters: update frequency $m$, learning rate $\eta$, number of epochs $S$
initialize $\tilde{x}^0 = x^0$;
for $s = 0, 1, \ldots, S-1$ do
  $x_0^{s+1} = \tilde{x}^s$;
  $g^{s+1} = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(\tilde{x}^s)$;
  for $t = 0, 1, \ldots, m-1$ do
    Randomly pick $i_t \in \{1, \ldots, n\}$;
    $v_t^{s+1} = \nabla f_{i_t}(x_t^{s+1}) - \Gamma_{\tilde{x}^s}^{x_t^{s+1}}\big(\nabla f_{i_t}(\tilde{x}^s) - g^{s+1}\big)$;
    $x_{t+1}^{s+1} = \mathrm{Exp}_{x_t^{s+1}}(-\eta v_t^{s+1})$;
  end
  Set $\tilde{x}^{s+1} = x_m^{s+1}$;
end
Option I: output $x_a = \tilde{x}^S$;
Option II: output $x_a$ chosen uniformly at random from $\{\{x_t^{s+1}\}_{t=0}^{m-1}\}_{s=0}^{S-1}$.

The proof of Theorem 1 is in the appendix, and takes a different route compared with the original SVRG proof [18]. Specifically, due to the nonlinear Riemannian metric, we are not able to bound the squared norm of the variance reduced gradient by $f(x) - f(x^*)$. Instead, we bound this quantity by the squared distances to the minimizer, and show linear convergence of the iterates. A bound on $\mathbb{E}[f(x) - f(x^*)]$ is then implied by $L$-g-smoothness, albeit with a stronger dependence on the condition number. Theorem 1 leads to the following more digestible corollary on the global complexity of the algorithm:

Corollary 1. With assumptions as in Theorem 1 and properly chosen parameters, after $O\big((n + \frac{\zeta L^2}{\mu^2})\log(\frac{1}{\epsilon})\big)$ IFO calls, the output $x_a$ satisfies $\mathbb{E}[f(x_a) - f(x^*)] \le \epsilon$.

We give a proof with specific parameter choices in the appendix. Observe the dependence on $\zeta$ in our result: for $\kappa_{\min} < 0$ we have $\zeta > 1$, which implies that negative space curvature adversarially affects the convergence rate; while for $\kappa_{\min} \ge 0$ we have $\zeta = 1$, which implies that for nonnegatively curved manifolds the impact of curvature is not explicit. In the rest of our analysis we will see a similar effect of sectional curvature; this phenomenon seems innate to manifold optimization (also see [38]). In the analysis we do not assume each $f_i$ to be g-convex, which results in a worse dependence on the condition number. We note that a similar result was obtained in linear spaces [12].
However, we will see in the next section that by generalizing the analysis for gradient dominated functions in [25], we are able to greatly improve this dependence.

3.2 Convergence analysis for geodesically nonconvex functions

In this section, we analyze the global complexity of RSVRG for solving (1), where each $f_i$ is only required to be $L$-g-smooth, and neither $f_i$ nor $f$ need be g-convex. We measure convergence to a stationary point using $\|\nabla f(x)\|^2$, following [13]. Note, however, that here $\nabla f(x) \in T_x\mathcal{M}$ and $\|\nabla f(x)\|$ is defined via the inner product in $T_x\mathcal{M}$. We first note that Riemannian SGD on nonconvex $L$-g-smooth problems attains the same $O(1/\epsilon^2)$ convergence as SGD [13]; we relegate the details to the appendix.

Recently, two groups independently proved that variance reduction also benefits stochastic gradient methods for nonconvex smooth finite-sum optimization problems, with different analyses [3; 25]. Our analysis for nonconvex RSVRG is inspired by [25]. Our main result for this section is Theorem 2.

Theorem 2. Assume in (1) each $f_i$ is $L$-g-smooth, the sectional curvature in $\mathcal{X}$ is lower bounded by $\kappa_{\min}$, and we run Algorithm 1 with Option II. Then there exist universal constants $\mu_0 \in (0, 1)$, $\nu > 0$ such that if we set $\eta = \mu_0/(L n^{\alpha_1} \zeta^{\alpha_2})$ (with $0 < \alpha_1 \le 1$ and $0 \le \alpha_2 \le 2$), $m = \lfloor n^{3\alpha_1/2}/(3\mu_0 \zeta^{1-2\alpha_2})\rfloor$ and $T = mS$, we have

$$\mathbb{E}[\|\nabla f(x_a)\|^2] \le \frac{L n^{\alpha_1} \zeta^{\alpha_2} \left[f(x^0) - f(x^*)\right]}{T\nu},$$

where $x^*$ is an optimal solution to (1).

Algorithm 2: GD-SVRG$(x^0, m, \eta, S, K)$
Parameters: update frequency $m$, learning rate $\eta$, number of epochs $S$, $K$, $x^0$
for $k = 0, \ldots, K-1$ do
  $x^{k+1} = \mathrm{RSVRG}(x^k, m, \eta, S)$ with Option II;
end
Output: $x^K$

The key challenge in proving Theorem 2 in the Riemannian setting is to incorporate the impact of using a nonlinear metric.
Similar to the g-convex case, the nonlinear metric impacts the convergence, notably through the constant $\zeta$ that depends on a lower bound on the sectional curvature. Reddi et al. [25] suggested setting $\alpha_1 = 2/3$, in which case we obtain the following corollary.

Corollary 2. With the assumptions and parameters in Theorem 2, choosing $\alpha_1 = 2/3$, the IFO complexity for achieving an $\epsilon$-accurate solution is:

$$\text{IFO calls} = \begin{cases} O\big(n + n^{2/3}\zeta^{1-\alpha_2}/\epsilon\big), & \text{if } \alpha_2 \le 1/2,\\ O\big(n\zeta^{2\alpha_2-1} + n^{2/3}\zeta^{\alpha_2}/\epsilon\big), & \text{if } \alpha_2 > 1/2. \end{cases}$$

Setting $\alpha_2 = 1/2$ in Corollary 2 immediately leads to Corollary 3:

Corollary 3. With the assumptions in Theorem 2 and $\alpha_1 = 2/3$, $\alpha_2 = 1/2$, the IFO complexity for achieving an $\epsilon$-accurate solution is $O\big(n + n^{2/3}\zeta^{1/2}/\epsilon\big)$.

The same reasoning allows us to also capture the class of gradient dominated functions (2), for which Reddi et al. [25] proved that SVRG converges linearly to a global optimum. We have the following corresponding theorem for RSVRG:

Theorem 3. Suppose that in addition to the assumptions in Theorem 2, $f$ is $\tau$-gradient dominated. Then there exist universal constants $\mu_0 \in (0, 1)$, $\nu > 0$ such that if we run Algorithm 2 with $\eta = \mu_0/(L n^{2/3} \zeta^{1/2})$, $m = \lfloor n/(3\mu_0)\rfloor$, $S = \lceil (6 + \frac{18\mu_0^3}{n}) L\tau \zeta^{1/2} \mu_0/(\nu n^{1/3})\rceil$, we have

$$\mathbb{E}[\|\nabla f(x^K)\|^2] \le 2^{-K}\|\nabla f(x^0)\|^2, \qquad \mathbb{E}[f(x^K) - f(x^*)] \le 2^{-K}[f(x^0) - f(x^*)].$$

We summarize the implication of Theorem 3 as follows (note the dependence on curvature):

Corollary 4. With Algorithm 2 and the parameters in Theorem 3, the IFO complexity to compute an $\epsilon$-accurate solution for a gradient dominated function $f$ is $O\big((n + L\tau\zeta^{1/2} n^{2/3})\log(1/\epsilon)\big)$.

A typical example of a gradient dominated function is a strongly g-convex function (see appendix). Specifically, we have the following corollary, which proves a linear convergence rate for RSVRG under the same assumptions as in Theorem 1, improving the dependence on the condition number.

Corollary 5. With Algorithm 2 and the parameters in Theorem 3, the IFO complexity to compute an $\epsilon$-accurate solution for a $\mu$-strongly g-convex function $f$ is $O\big((n + \mu^{-1}L\zeta^{1/2} n^{2/3})\log(1/\epsilon)\big)$.

4 Applications

4.1 Computing the leading eigenvector

In this section, we apply our analysis of RSVRG for gradient dominated functions (Theorem 3) to fast eigenvector computation, a fundamental problem that is still being actively researched in the big-data setting [12; 17; 29]. For the problem of computing the leading eigenvector, i.e.,

$$\min_{x^\top x = 1} \; -x^\top\Big(\sum_{i=1}^{n} z_i z_i^\top\Big) x \triangleq -x^\top A x = f(x), \qquad (5)$$

existing analyses for state-of-the-art algorithms typically result in an $O(1/\delta^2)$ dependence on the eigengap $\delta$ of $A$, as opposed to the conjectured $O(1/\delta)$ dependence [29], as well as the $O(1/\delta)$ dependence of power iteration. Here we give new support for the $O(1/\delta)$ conjecture. Note that Problem (5), seen as one in $\mathbb{R}^d$, is nonconvex, with negative semidefinite Hessian everywhere, and has nonlinear constraints. However, we show that on the hypersphere $S^{d-1}$ Problem (5) is unconstrained, and has a gradient dominated objective. In particular we have the following result:

Theorem 4. Suppose $A$ has eigenvalues $\lambda_1 > \lambda_2 \ge \cdots \ge \lambda_d$ and $\delta = \lambda_1 - \lambda_2$, and $x^0$ is drawn uniformly at random on the hypersphere.
Then with probability $1 - p$, $x^0$ falls in a Riemannian ball around a global optimum of the objective function, within which the objective function is $O(\frac{d}{p^2\delta})$-gradient dominated.

We provide the proof of Theorem 4 in the appendix. Theorem 4 gives new insight into why the conjecture might be true: once it is shown that with a constant stepsize and with high probability (both independent of $\delta$) the iterates remain in such a Riemannian ball, applying Corollary 4 one can immediately prove the $O(1/\delta)$ dependence conjecture. We leave this analysis as future work.

Next we show that variance reduced PCA (VR-PCA) [29] is closely related to RSVRG. We implement Riemannian SVRG for PCA, and use the code for VR-PCA from [29]. Analytic forms for the exponential map and parallel transport on the hypersphere can be found in [1, Example 5.4.1; Example 8.1.1]. We conduct well-controlled experiments comparing the performance of the two algorithms. Specifically, to investigate the dependence of convergence on $\delta$, for each $\delta = 10^{-3}/k$ where $k = 1, \ldots, 25$, we generate a $d \times n$ matrix $Z = (z_1, \ldots, z_n)$ where $d = 10^3$, $n = 10^4$, using the method $Z = UDV^\top$ where $U, V$ are orthonormal matrices and $D$ is a diagonal matrix, as described in [29]. Note that $A$ has the same eigenvalues as $D^2$. All the data matrices share the same $U, V$ and only differ in $\delta$ (thus also in $D$). We also fix the same random initialization $x^0$ and random seed. We run both algorithms on each matrix for 50 epochs. For every five epochs, we estimate the number of epochs required to double the accuracy². This number can serve as an indicator of the global complexity of the algorithm. We plot this number for different epochs against $1/\delta$, shown in Figure 2. Note that the performance of RSVRG and VR-PCA with the same stepsize is very similar, which implies a close connection between the two.
Indeed, the update $\frac{x+v}{\|x+v\|}$ used in [29] and elsewhere is a well-known approximation to the exponential map $\mathrm{Exp}_x(v)$ for small stepsizes (a.k.a. a retraction). Also note that the complexity of both algorithms seems to have an asymptotically linear dependence on $1/\delta$.

Figure 2: Computing the leading eigenvector. Left: RSVRG and VR-PCA are indistinguishable in terms of IFO complexity (accuracy vs. number of IFO calls, $\delta = 10^{-3}$). Middle and right: complexity appears to depend on $1/\delta$; the x-axis shows the inverse of the eigengap $\delta$, the y-axis shows the estimated number of epochs required to double the accuracy, and lines represent different epoch indices. All variables are controlled except for $\delta$.

4.2 Computing the Riemannian centroid

In this subsection we validate that RSVRG converges linearly when averaging PSD matrices under the Riemannian metric. The problem of finding the Riemannian centroid of a set of PSD matrices $\{A_i\}_{i=1}^n$ is

$$X^* = \arg\min_{X \succeq 0} f\big(X; \{A_i\}_{i=1}^n\big) \triangleq \sum_{i=1}^{n} \big\|\log\big(X^{-1/2} A_i X^{-1/2}\big)\big\|_{\mathrm{F}}^2,$$

where $X$ is also a PSD matrix. This is a geodesically strongly convex problem, yet nonconvex in Euclidean space. It has been studied both in matrix computation and in various applications [5; 16]. We use the same experiment setting as described in [38]³, and compare RSVRG against Riemannian full gradient (RGD) and stochastic gradient (RSGD) algorithms (Figure 3). Other methods for this problem include the relaxed Richardson iteration algorithm [6], the approximate joint diagonalization algorithm [9], and Riemannian Newton and quasi-Newton type methods, notably the limited-memory Riemannian BFGS [37]. However, none of these methods were shown to greatly outperform RGD, especially in data science applications where $n$ is large and extremely small optimization error is not required.

Note that the objective is a sum of squared Riemannian distances in a nonpositively curved space, and thus is $(2n)$-strongly g-convex and $(2n\zeta)$-g-smooth. According to the proof of Corollary 1 (see appendix), the optimal stepsize for RSVRG is $O(1/(\zeta^3 n))$. For all the experiments, we initialize all the algorithms using the arithmetic mean of the matrices. We set $\eta = \frac{1}{100n}$ and choose $m = n$ in Algorithm 1 for RSVRG, and use the suggested parameters from [38] for the other algorithms.

²Accuracy is measured by $\frac{f(x)-f(x^*)}{|f(x^*)|}$, i.e., the relative error between the objective value and the optimum. We measure how much the error has been reduced after each five epochs, which is a multiplicative factor $c < 1$ on the error at the start of each five epochs. Then we use $\log(2)/\log(1/c) \times 5$ as the estimate, assuming $c$ stays constant.

³We generate $100 \times 100$ random PSD matrices using the Matrix Mean Toolbox [6], with normalization so that the norm of each matrix equals 1.
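The centroid objective admits a simple batch (RGD) baseline: the exponential-map gradient step $X \leftarrow X^{1/2}\exp\big(\eta\,\tfrac{1}{n}\sum_i \log(X^{-1/2} A_i X^{-1/2})\big)X^{1/2}$, which vanishes exactly at the centroid. The sketch below is illustrative only (ad hoc step size and iteration count, and the paper's experiments use RSVRG rather than this baseline), using the standard closed forms for the SPD manifold:

```python
import numpy as np

def sym_fn(X, fn):
    """Apply a scalar function to a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(X)
    return (V * fn(w)) @ V.T

def karcher_grad_dir(X, As):
    """Tangent direction S = (1/n) sum_i log(X^{-1/2} A_i X^{-1/2});
    S = 0 exactly at the Riemannian centroid."""
    Xih = sym_fn(X, lambda w: 1.0 / np.sqrt(w))
    Ms = [Xih @ A @ Xih for A in As]
    return sum(sym_fn((M + M.T) / 2.0, np.log) for M in Ms) / len(As)

def karcher_mean(As, eta=0.2, iters=500):
    """Riemannian (full) gradient descent for the centroid of SPD matrices."""
    X = sum(As) / len(As)                 # arithmetic mean as initialization
    for _ in range(iters):
        S = karcher_grad_dir(X, As)
        Xh = sym_fn(X, np.sqrt)
        X = Xh @ sym_fn(eta * S, np.exp) @ Xh   # exponential-map step
        X = (X + X.T) / 2.0               # enforce symmetry numerically
    return X
```

For commuting matrices the centroid reduces to the elementwise geometric mean, which gives a quick sanity check: the centroid of diag(1, 4) and diag(4, 9) is diag(2, 6).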
The results suggest RSVRG has a clear advantage in the large-scale setting.

Figure 3: Riemannian mean of PSD matrices, for $N \in \{100, 1000\}$ matrices with per-matrix condition number $Q \in \{10^2, 10^8\}$. The x-axis shows the actual number of IFO calls, the y-axis shows $f(X) - f(X^*)$ on a log scale, and lines show the performance of the different algorithms (RGD, RSGD, RSVRG). Note that RSVRG achieves linear convergence and is especially advantageous for large datasets.

5 Discussion

We introduce Riemannian SVRG, the first variance reduced stochastic gradient algorithm for Riemannian optimization. In addition, we analyze its global complexity for optimizing geodesically strongly convex, convex, and nonconvex functions, explicitly showing their dependence on sectional curvature. Our experiments validate our analysis that Riemannian SVRG is much faster than full gradient and stochastic gradient methods for solving finite-sum optimization problems on Riemannian manifolds.

Our analysis of computing the leading eigenvector as a Riemannian optimization problem is also worth noting: a nonconvex problem with nonpositive Hessian and nonlinear constraints in the ambient space turns out to be gradient dominated on the manifold.
We believe this shows the promise of the theoretical study of Riemannian optimization, and geometric optimization in general, and we hope it encourages other researchers in the community to join this endeavor.

Our work also has limitations: most practical Riemannian optimization algorithms use retraction and vector transport to efficiently approximate the exponential map and parallel transport, which we do not analyze in this work. A systematic study of retraction and vector transport is an important topic for future research. For other applications of Riemannian optimization, such as low-rank matrix completion [34], covariance matrix estimation [35], and subspace tracking [11], we believe it would also be promising to apply fast incremental gradient algorithms in the large-scale setting.

Acknowledgment: SS acknowledges support of NSF grant IIS-1409802. HZ acknowledges support from the Leventhal Fellowship.

References

[1] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.

[2] A. Agarwal and L. Bottou. A lower bound for the optimization of finite sums. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 78–86, 2015.

[3] Z. Allen-Zhu and E. Hazan. Variance reduction for faster non-convex optimization. arXiv:1603.05643, 2016.

[4] F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems, pages 773–781, 2013.

[5] R. Bhatia. Positive Definite Matrices. Princeton University Press, 2007.

[6] D. A. Bini and B. Iannazzo. Computing the Karcher mean of symmetric positive definite matrices. Linear Algebra and its Applications, 438(4):1700–1710, 2013.

[7] S. Bonnabel. Stochastic gradient descent on Riemannian manifolds.
IEEE Transactions on Automatic Control, 58(9):2217–2229, 2013.
[8] A. Cherian and S. Sra. Riemannian dictionary learning and sparse coding for positive definite matrices. arXiv:1507.02772, 2015.
[9] M. Congedo, B. Afsari, A. Barachant, and M. Moakher. Approximate joint diagonalization and geometric mean of symmetric positive definite matrices. PLoS ONE, 10(4):e0121423, 2015.
[10] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, pages 1646–1654, 2014.
[11] A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.
[12] D. Garber and E. Hazan. Fast and simple PCA via convex optimization. arXiv:1509.05647, 2015.
[13] S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
[14] P. Gong and J. Ye. Linear convergence of variance-reduced stochastic gradient without strong convexity. arXiv:1406.1102, 2014.
[15] R. Hosseini and S. Sra. Matrix manifold optimization for Gaussian mixtures. In NIPS, 2015.
[16] B. Jeuris, R. Vandebril, and B. Vandereycken. A survey and comparison of contemporary algorithms for computing the matrix geometric mean. Electronic Transactions on Numerical Analysis, 39:379–402, 2012.
[17] C. Jin, S. M. Kakade, C. Musco, P. Netrapalli, and A. Sidford. Robust shift-and-invert preconditioning: Faster and more sample efficient algorithms for eigenvector computation. arXiv:1510.08896, 2015.
[18] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
[19] H. Kasai, H. Sato, and B.
Mishra. Riemannian stochastic variance reduced gradient on Grassmann manifold. arXiv:1605.07367, 2016.
[20] J. Konečný and P. Richtárik. Semi-stochastic gradient descent methods. arXiv:1312.1666, 2013.
[21] X. Liu, A. Srivastava, and K. Gallivan. Optimal linear representations of images for object recognition. IEEE TPAMI, 26(5):662–666, 2004.
[22] M. Moakher. Means and averaging in the group of rotations. SIAM Journal on Matrix Analysis and Applications, 24(1):1–16, 2002.
[23] E. Oja. Principal components, minor components, and linear neural networks. Neural Networks, 5(6):927–935, 1992.
[24] P. Petersen. Riemannian Geometry, volume 171. Springer Science & Business Media, 2006.
[25] S. J. Reddi, A. Hefny, S. Sra, B. Póczós, and A. Smola. Stochastic variance reduction for nonconvex optimization. arXiv:1603.06160, 2016.
[26] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
[27] R. Y. Rubinstein and D. P. Kroese. Simulation and the Monte Carlo Method, volume 707. John Wiley & Sons, 2011.
[28] M. Schmidt, N. L. Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. arXiv:1309.2388, 2013.
[29] O. Shamir. A stochastic PCA and SVD algorithm with an exponential convergence rate. In International Conference on Machine Learning (ICML-15), pages 144–152, 2015.
[30] S. Sra and R. Hosseini. Geometric optimisation on positive definite matrices for elliptically contoured distributions. In Advances in Neural Information Processing Systems, pages 2562–2570, 2013.
[31] J. Sun, Q. Qu, and J. Wright. Complete dictionary recovery over the sphere II: Recovery by Riemannian trust-region method. arXiv:1511.04777, 2015.
[32] M. Tan, I. W. Tsang, L. Wang, B. Vandereycken, and S. J. Pan.
Riemannian pursuit for big matrix recovery. In International Conference on Machine Learning (ICML-14), pages 1539–1547, 2014.
[33] C. Udriste. Convex Functions and Optimization Methods on Riemannian Manifolds, volume 297. Springer Science & Business Media, 1994.
[34] B. Vandereycken. Low-rank matrix completion by Riemannian optimization. SIAM Journal on Optimization, 23(2):1214–1236, 2013.
[35] A. Wiesel. Geodesic convexity and covariance estimation. IEEE Transactions on Signal Processing, 60(12):6182–6189, 2012.
[36] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
[37] X. Yuan, W. Huang, P.-A. Absil, and K. Gallivan. A Riemannian limited-memory BFGS algorithm for computing the matrix geometric mean. Procedia Computer Science, 80:2147–2157, 2016.
[38] H. Zhang and S. Sra. First-order methods for geodesically convex optimization. arXiv:1602.06053, 2016.
[39] T. Zhang, A. Wiesel, and M. S. Greco. Multivariate generalized Gaussian distribution: Convexity and graphical models. IEEE Transactions on Signal Processing, 61(16):4141–4148, 2013.