{"title": "Gradient Descent Meets Shift-and-Invert Preconditioning for Eigenvector Computation", "book": "Advances in Neural Information Processing Systems", "page_first": 2825, "page_last": 2834, "abstract": "Shift-and-invert preconditioning, as a classic acceleration technique for the leading eigenvector computation, has received much attention again recently, owing to fast least-squares solvers for efficiently approximating matrix inversions in power iterations. In this work, we adopt an inexact Riemannian gradient descent perspective to investigate this technique on the effect of the step-size scheme. The shift-and-inverted power method is included as a special case with adaptive step-sizes. Particularly, two other step-size settings, i.e., constant step-sizes and Barzilai-Borwein (BB) step-sizes, are examined theoretically and/or empirically. We present a novel convergence analysis for the constant step-size setting that achieves a rate at $\\tilde{O}(\\sqrt{\\frac{\\lambda_{1}}{\\lambda_{1}-\\lambda_{p+1}}})$, where $\\lambda_{i}$ represents the $i$-th largest eigenvalue of the given real symmetric matrix and $p$ is the multiplicity of $\\lambda_{1}$. 
Our experimental studies show that the proposed algorithm can be significantly faster than the shift-and-inverted power method in practice.", "full_text": "Gradient Descent Meets Shift-and-Invert Preconditioning for Eigenvector Computation\n\nZhiqiang Xu\n\nCognitive Computing Lab (CCL), Baidu Research\n\nNational Engineering Laboratory of Deep Learning Technology and Application, China\n\nxuzhiqiang04@baidu.com\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nAbstract\n\nShift-and-invert preconditioning, as a classic acceleration technique for the leading eigenvector computation, has received much attention again recently, owing to fast least-squares solvers for efficiently approximating matrix inversions in power iterations. In this work, we adopt an inexact Riemannian gradient descent perspective to investigate this technique on the effect of the step-size scheme. The shift-and-inverted power method is included as a special case with adaptive step-sizes. Particularly, two other step-size settings, i.e., constant step-sizes and Barzilai-Borwein (BB) step-sizes, are examined theoretically and/or empirically. We present a novel convergence analysis for the constant step-size setting that achieves a rate at $\\tilde{O}(\\sqrt{\\frac{\\lambda_{1}}{\\lambda_{1}-\\lambda_{p+1}}})$, where $\\lambda_{i}$ represents the $i$-th largest eigenvalue of the given real symmetric matrix and $p$ is the multiplicity of $\\lambda_{1}$. Our experimental studies show that the proposed algorithm can be significantly faster than the shift-and-inverted power method in practice.\n\n1 Introduction\n\nEigenvector computation is a fundamental problem in numerical algebra and often of central importance to a variety of scientific and engineering computing tasks such as principal component analysis [Fan et al., 2018], spectral clustering [Ng et al., 2001], low-rank matrix approximation [Hastie et al., 2015, Liu and Li, 2014], among others. Classic solvers for this problem are power methods and Lanczos algorithms [Golub and Van Loan, 1996]. Although Lanczos algorithms possess the optimal convergence rate $\\tilde{O}(\\frac{1}{\\sqrt{\\lambda_{1}-\\lambda_{2}}})$, they seem not to be amenable to stochastic optimization. Researchers thus tend to develop faster algorithms on top of power methods [Arora et al., 2012, 2013, Hardt and Price, 2014, Shamir, 2015, Garber and Hazan, 2015, Garber et al., 2016, Lei et al., 2016, Wang et al., 2017]. One notable technique among them is the shift-and-invert preconditioning, which has recently been revived for this purpose [Garber and Hazan, 2015, Garber et al., 2016, Wang et al., 2017, Gao et al., 2017]. Using this technique, each power iteration step can be reduced to approximately solving a linear system subproblem that can leverage fast least-squares solvers, e.g., accelerated gradient descent (AGD) [Nesterov, 2014] or stochastic variance reduced gradient (SVRG) [Johnson and Zhang, 2013].\n\nIn this work, we take a Riemannian gradient descent view to investigate the shift-and-invert preconditioning for the leading eigenvector computation on the effect of the step-size scheme. The resulting algorithm is thus termed the shift-and-inverted Riemannian gradient descent eigensolver, or SI-rgEIGS for short. It includes the shift-and-invert preconditioned power method (SI-PM for short) as a special case with adaptive step-sizes. Applying the shift-and-invert preconditioning technique requires locating an appropriate upper bound of the largest eigenvalue, i.e., $\\sigma > \\lambda_{1}$, as the shift parameter. We rely on the crude phase of the shift-and-inverted power method [Garber and Hazan, 2015, Garber et al., 2016] to get this upper bound in theory. However, in practice, the plain power method often works via the proposed heuristics in experiments. 
In addition, the crude phase can warm-start the Riemannian gradient descent method. Similarly, Shamir [2016a] adopted the plain power method to warm-start the stochastic variance reduced projected gradient descent without preconditioning for principal component analysis (VR-PCA). The crude phase takes only non-dominant time because its cost is independent of the final accuracy parameter $\\epsilon$ [Wang et al., 2017]. The algorithm then steps into an accurate phase by calling the Riemannian gradient descent solver on the shift-and-inverted matrix $(\\sigma I - A)^{-1}$, i.e., solving the following problem:\n\n$$\\min_{x \\in \\mathbb{R}^{n \\times 1}:\\, \\|x\\|_2 = 1} h(x) = -\\frac{1}{2} x^\\top (\\sigma I - A)^{-1} x. \\tag{1}$$\n\nIn each gradient descent step, we have to solve a linear system $(\\sigma I - A) z = x_{t-1}$ in order to get the Euclidean gradient $(\\sigma I - A)^{-1} x_{t-1}$. The key advantage of the preconditioning technique is that we only need to solve the system to an approximate level commensurate with the quality of the current iterate. This can be easily accomplished by performing convex optimization on the associated least-squares problem (see Equation (3)). Another advantage of the reduction to convex optimization is that it enables stochastic optimization [Garber and Hazan, 2015, Garber et al., 2016], especially for the covariance structure of $A = \\frac{1}{m} Y Y^\\top$, where $Y \\in \\mathbb{R}^{m \\times d}$. Approximate solutions to the linear systems require one to cope with inexact Riemannian gradients. In fact, as we will see for Problem (1), the inexact Riemannian gradient method includes the shift-and-inverted power method as a special case with adaptive step-sizes. In the present paper, two other step-size schemes, i.e., constant step-sizes and Barzilai-Borwein (BB) step-sizes, are examined theoretically and/or empirically. Different from Shamir [2015] and Wang et al. 
[2017], which only consider the positive eigengap between $\\lambda_{1}$ and $\\lambda_{2}$, i.e., $\\lambda_{1} > \\lambda_{2}$, for the constant step-size setting we explicitly take care of all the cases of this eigengap and achieve a unified convergence rate of $\\tilde{O}(\\sqrt{\\frac{\\lambda_{1}}{\\lambda_{1}-\\lambda_{p+1}}})$ via a novel analysis (e.g., the potential function and the way we cope with the solution space), where $p < n$ is the multiplicity of $\\lambda_{1}$ and $\\lambda_{1} - \\lambda_{p+1} > 0$ always holds without loss of generality. To the best of our knowledge, this is the first time that a gradient descent solver for the problem with fixed step-sizes reaches this type of rate, which is a nearly biquadratic improvement over $\\tilde{O}(\\frac{1}{(\\lambda_{1}-\\lambda_{2})^2})$ [Shamir, 2015]. In addition, the rate depends logarithmically on the initial iterate, instead of quadratically as in Shamir [2015]. Theoretical properties are verified on synthetic data in experiments. For real data, we explore an automatic step-size scheme, i.e., Barzilai-Borwein (BB) step-sizes, to eliminate the difficulty of hand-tuning step-sizes. Experimental results indicate that the shift-and-inverted Riemannian gradient descent method can be significantly faster than the shift-and-inverted power method that has gained much popularity recently.\n\nThe rest of the paper is organized as follows. We briefly discuss recent literature in Section 2 and then present our shift-and-inverted Riemannian gradient descent solver with theoretical analysis in Section 3. Experiments are reported in Section 4. The paper then ends with discussions in Section 5.\n\n2 Related Work\n\nRecent research on eigenvector computation has been mainly focusing on theoretically scaling up related algorithms. Halko et al. 
[2011] surveyed and extended randomized algorithms for truncated singular value decomposition (SVD), while Musco and Musco [2015] proposed randomized block Krylov methods for stronger and faster approximate SVD. Convergence rates for both versions are provided in Musco and Musco [2015]. Hardt and Price [2014] studied the noisy power method for the small noise case, and Balcan et al. [2016] extended this method to achieve an improved gap dependency by using subspace iterates of larger dimensions. Garber et al. [2016] presented a robust analysis of the shift-and-invert preconditioned power method and achieved optimal convergence rates. Allen-Zhu and Li [2016] reproved the result for this method and extended it to the case that $k > 1$ by deflation via a careful analysis, while Wang et al. [2017] improved the associated analysis and advocated coordinate descent as the solver for linear systems. Lei et al. [2016] proposed a different coordinate-wise power method. Sa et al. [2017] proposed the accelerated (stochastic) power method with optimal rate. However, its empirical performance seems not as good as expected in our experiments. Our work is more related to another line of work on gradient descent solvers. Arora et al. [2012] proposed the stochastic power method without theoretical guarantees, which runs the projected stochastic gradient descent (PSGD) for the PCA problem. Arora et al. [2013] subsequently extended this method via the convex relaxation with theoretical guarantees. Balsubramani et al. [2013] achieved a better guarantee for PCA via the martingale analysis. Shamir [2015, 2016a] proposed the VR-PCA, which extended the projected stochastic variance reduced gradient (SVRG) to the non-convex PCA problem with global convergence guarantees for the case that $\\lambda_{1} > \\lambda_{2}$. Shamir [2016b] also studied SGD for the non-convex PCA problem and established its sub-linear convergence rates. Wen and Yin [2013] proposed a practical curvilinear search method for addressing the eigenvalue problem but without theoretical analysis. It actually belongs to the Riemannian gradient descent method. By proving an explicit \u0141ojasiewicz exponent at $\\frac{1}{2}$, Liu et al. [2016] established the local and linear convergence rate of the Riemannian gradient method with a line-search procedure for quadratic optimization problems under orthogonality constraints. Details of typical theoretical results are summarized in Table 1.\n\nTable 1: Typical convergence rates. $\\tilde{O}$ notations hide logarithmic factors, e.g., $\\log \\frac{1}{\\epsilon}$, $\\log \\frac{1}{\\lambda_{1}-\\lambda_{2}}$.\n\nPaper | Rate\nPSGD [Arora et al., 2013] | $O(1/\\epsilon^2)$\nOja's algorithm [Balsubramani et al., 2013] | $O(1/((\\lambda_{1}-\\lambda_{2})^2 \\epsilon))$\nNoisy PM [Hardt and Price, 2014] | $\\tilde{O}(1/(\\lambda_{1}-\\lambda_{2}))$\nVR-PCA [Shamir, 2015] | $\\tilde{O}(1/(\\lambda_{1}-\\lambda_{2})^2)$\nPower Method (PM) [Musco and Musco, 2015] | $\\tilde{O}(1/(\\lambda_{1}-\\lambda_{2}))$\nBlock Krylov [Musco and Musco, 2015] | $\\tilde{O}(1/\\sqrt{\\lambda_{1}-\\lambda_{2}})$\nSGD-PCA [Shamir, 2016b] | $O(1/((\\lambda_{1}-\\lambda_{2}) \\epsilon))$\nShift-and-Inverted PM [Garber et al., 2016] | $\\tilde{O}(1/\\sqrt{\\lambda_{1}-\\lambda_{2}})$\nCoordinate-wise PM [Lei et al., 2016] | $\\tilde{O}(1/(\\lambda_{1}-\\lambda_{2}))$\nAccelerated PM [Sa et al., 2017] | $\\tilde{O}(1/\\sqrt{\\lambda_{1}-\\lambda_{2}})$\nThis work | $\\tilde{O}(1/\\sqrt{\\lambda_{1}-\\lambda_{p+1}})$\n\n3 Shift-and-Inverted Riemannian Gradient Descent Solver\n\nIn this section, we present our shift-and-inverted Riemannian gradient descent solver. Without loss of generality, eigenvalues of the given real symmetric matrix $A$ are assumed to be in $[0, 1]$ and the multiplicity of the largest eigenvalue $\\lambda_{1}$ is $p$, i.e., $1 \\geq \\lambda_{1} = \\cdots = \\lambda_{p} > \\lambda_{p+1} \\geq \\cdots \\geq \\lambda_{n} \\geq 0$. Define the $i$-th eigengap of $A$ as $\\Delta_i = \\lambda_{i} - \\lambda_{i+1}$. Most existing work handles only the case that $\\Delta_1 = \\lambda_{1} - \\lambda_{2} > 0$, ignoring the case that $\\Delta_1 = 0$. In this work, the two cases are unified via $\\Delta_p > 0$, which always holds without loss of generality$^1$, i.e., $p < n$. Suppose that the corresponding eigenvectors are $v_1, \\cdots, v_n$. Our goal then is to find one of the leading eigenvectors, i.e., $v \\in \\mathrm{span}(v_1, \\cdots, v_p)$ and $\\|v\\|_2 = 1$. Let $V_p = (v_1, \\cdots, v_p)$ and denote $B = (\\sigma I - A)^{-1}$ as the shift-and-inverted matrix, where $\\sigma > \\lambda_{1}$. $B$'s eigenvalues then are $\\mu_i = \\frac{1}{\\sigma - \\lambda_{i}}$, satisfying $\\mu_1 = \\cdots = \\mu_p > \\mu_{p+1} \\geq \\cdots \\geq \\mu_n$, while eigenvectors remain unchanged. Accordingly, define the $i$-th eigengap of $B$ as $\\tau_i = \\mu_i - \\mu_{i+1}$. In particular,\n\n$$\\frac{\\tau_p}{\\mu_1} = \\frac{\\mu_p - \\mu_{p+1}}{\\mu_1} = \\frac{\\frac{1}{\\sigma - \\lambda_{p}} - \\frac{1}{\\sigma - \\lambda_{p+1}}}{\\frac{1}{\\sigma - \\lambda_{1}}} = \\frac{\\Delta_p}{\\sigma - \\lambda_{p+1}}.$$\n\nA faster rate can be obtained if the relative eigengap can be enlarged from $A$ to $B$, which is exactly the idea behind the shift-and-invert preconditioning. 
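As a quick numerical illustration of the relative eigengap identity above, the following Python sketch compares the relative gap of A with that of B for a shift of the form sigma = lambda_1 + c * Delta_p. The function name relative_gaps and the sample eigenvalues are illustrative assumptions and do not come from the paper.

```python
def relative_gaps(lam1, lam_next, c):
    '''Return (Delta_p / lambda_1, tau_p / mu_1) for sigma = lam1 + c * (lam1 - lam_next).

    lam1 is the largest eigenvalue of A and lam_next is lambda_{p+1},
    the largest eigenvalue strictly below it.
    '''
    gap = lam1 - lam_next                 # Delta_p
    sigma = lam1 + c * gap                # shift slightly above lambda_1
    mu1 = 1.0 / (sigma - lam1)            # leading eigenvalue of B
    mu_next = 1.0 / (sigma - lam_next)    # next distinct eigenvalue of B
    rel_a = gap / lam1                    # relative eigengap of A
    rel_b = (mu1 - mu_next) / mu1         # relative eigengap of B
    return rel_a, rel_b

# Hypothetical spectrum: lambda_1 = 1 and lambda_{p+1} = 0.99, with c = 1.
rel_a, rel_b = relative_gaps(1.0, 0.99, 1.0)
```

In this example the relative gap of A is about 0.01, while that of B is about 0.5, which agrees with Delta_p / (sigma - lambda_{p+1}) = 1 / (1 + c).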
To this end, we follow the procedure of Garber and Hazan [2015] and Wang et al. [2017] (see the supplementary material) to choose an appropriate constant $\\sigma$ such that it is only slightly larger than $\\lambda_{1}$, i.e., $\\sigma = \\lambda_{1} + c\\Delta_p$ where $c \\in [\\frac{1}{4}, \\frac{3}{2}]$, as guaranteed by the following theorem.\n\nTheorem 3.1 [Garber and Hazan, 2015, Wang et al., 2017] Let $\\epsilon(x) = l(x) - \\min_x l(x)$ be the function error with the least-squares subproblem. If the initial to final error ratio for the least-squares subproblems can be maintained as $\\frac{\\epsilon(z_0)}{\\epsilon(z_*)} = \\frac{32 \\cdot 10^{2m+1}}{\\eta^{2m}}$ and $\\frac{\\epsilon(w_0)}{\\epsilon(w_*)} = \\frac{1024}{\\eta^2}$ where $m = \\lceil 8 \\log \\frac{16}{\\|V_p^\\top \\tilde{y}_0\\|_2^2} \\rceil$, then we have the output $\\sigma = \\lambda_{1} + c\\Delta_p$ for certain $c \\in [\\frac{1}{4}, \\frac{3}{2}]$ after $S = O(\\log \\frac{1}{\\eta})$ iterations in the outer repeat-until loop.\n\n$^1$ If $p = n$, the objective function $h(x)$ is constant and Problem (1) is trivial.\n\nWe then have $\\frac{\\tau_p}{\\mu_1} = \\frac{1}{c+1} \\geq \\frac{2}{5}$ and can run Riemannian gradient descent to solve Problem (1). The algorithmic steps are described in Algorithm 1, which caters to all the step-size settings. Riemannian gradient descent with normalization retraction [Absil et al., 2008], i.e., $R(x, \\xi) = \\frac{x + \\xi}{\\|x + \\xi\\|_2}$ for any $\\xi$ from the tangent space of the sphere $\\|x\\|_2 = 1$ at $x$, can be written as\n\n$$\\begin{aligned} x_t &= R(x_{t-1}, -\\alpha_t \\tilde{\\nabla} h(x_{t-1})) \\\\\\\\ &= R(x_{t-1}, -\\alpha_t (I - x_{t-1} x_{t-1}^\\top) \\nabla h(x_{t-1})) \\\\\\\\ &= R(x_{t-1}, \\alpha_t (I - x_{t-1} x_{t-1}^\\top) B x_{t-1}), \\end{aligned} \\tag{2}$$\n\nwhere $\\tilde{\\nabla} h(x_{t-1})$ and $\\nabla h(x_{t-1})$ represent the Riemannian gradient$^2$ and Euclidean gradient, respectively. 
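The update in Equation (2) can be sketched in a few lines of plain Python. This is a minimal illustration with an exact gradient, assuming a small hand-picked symmetric positive definite matrix B. The helper names (matvec, dot, normalize, riemannian_step) and the numbers are hypothetical and not part of the paper. The sketch also checks one property that follows directly from the formula: with the adaptive choice alpha = 1 / (x^T B x), the step coincides with a normalized power step B x / ||B x||_2.

```python
import math

def matvec(mat, vec):
    '''Dense matrix-vector product on plain Python lists.'''
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def normalize(vec):
    norm = math.sqrt(dot(vec, vec))
    return [v / norm for v in vec]

def riemannian_step(b_mat, x, alpha):
    '''One step of Equation (2): x_new = R(x, alpha * (I - x x^T) B x).'''
    bx = matvec(b_mat, x)
    rayleigh = dot(x, bx)                                   # x^T B x
    tangent = [g - rayleigh * xi for g, xi in zip(bx, x)]   # (I - x x^T) B x
    # Normalization retraction: move along the tangent direction, then rescale.
    return normalize([xi + alpha * ti for xi, ti in zip(x, tangent)])

# Illustrative symmetric positive definite B (an assumption for this demo).
B = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 0.5],
     [0.0, 0.5, 1.0]]
x0 = normalize([1.0, 1.0, 1.0])

x_const = riemannian_step(B, x0, 0.1)            # constant step-size
alpha_pm = 1.0 / dot(x0, matvec(B, x0))          # adaptive step-size
x_adaptive = riemannian_step(B, x0, alpha_pm)
x_power = normalize(matvec(B, x0))               # plain power step on B
```

The iterate stays on the unit sphere for any step-size because of the final normalization; only the direction of travel depends on alpha.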
As $(\\frac{\\mu_1}{\\tau_p})^2 = O(1)$, gradient descent now takes only a logarithmic number of iterations $O(\\log \\frac{1}{\\epsilon})$ to converge, which no longer has the quadratic dependence on $\\frac{\\lambda_{1}}{\\lambda_{1}-\\lambda_{2}}$ [Shamir, 2015, 2016a, Xu et al., 2017, Xu and Gao, 2018, Xu et al., 2018]. However, we need to calculate the Euclidean gradient $\\nabla h(x) = -Bx$, which is inefficient if done by inverting the shifted matrix $\\sigma I - A$. Fortunately, as stated in Line 3 of Algorithm 1, we can make it efficient by solving an equivalent least-squares subproblem, and an approximate solution to the subproblem will suffice. It is worth noting that when $\\alpha_t = 1/(x_{t-1}^\\top B x_{t-1})$, Algorithm 1 recovers the shift-and-inverted power method, since then, with an exact gradient, $x_{t-1} + \\alpha_t (I - x_{t-1} x_{t-1}^\\top) B x_{t-1} = \\alpha_t B x_{t-1}$.\n\n$^2$ It can be obtained by projecting the Euclidean gradient onto the tangent space [Absil et al., 2008] at $x_{t-1}$, i.e., $\\tilde{\\nabla} h(x_{t-1}) = P_{x_{t-1}} \\nabla h(x_{t-1})$, where $P_{x_{t-1}} = I - x_{t-1} x_{t-1}^\\top$.\n\nAlgorithm 1 Shift-and-Inverted Riemannian Gradient Descent Eigensolver\n1: Input: matrix $A$, shift $\\sigma$, and initial $x_0$.\n2: for $t = 1, 2, \\cdots$ do\n3:   approximate negative Euclidean gradient\n$$y_{t-1} \\approx \\arg\\min_z\\, l_t(z) = z^\\top B^{-1} z / 2 - x_{t-1}^\\top z \\tag{3}$$\n     by a fast least-squares solver, e.g., AGD, starting from $z_0 = x_{t-1} / (x_{t-1}^\\top B^{-1} x_{t-1})$\n4:   approximate Riemannian gradient $\\hat{g}_{t-1} = -(I - x_{t-1} x_{t-1}^\\top) y_{t-1}$\n5:   choose a step size $\\alpha_t > 0$\n6:   set $x_t = (x_{t-1} - \\alpha_t \\hat{g}_{t-1}) / \\|x_{t-1} - \\alpha_t \\hat{g}_{t-1}\\|_2$\n7:   terminate if stopping criterion is met\n8: end for\n\n3.1 Analysis\n\nWe now provide the convergence analysis of Algorithm 1 under the constant step-size setting. To measure the progress of iterates to one of the leading eigenvectors, we use a novel potential function defined by\n\n$$\\psi(x_t, V_p) = -2 \\log \\|V_p^\\top x_t\\|_2$$\n\nfor analysis. As $\\|V_p^\\top x_t\\|_2 \\leq \\|V_p\\|_2 \\|x_t\\|_2 = 1$, we have $\\psi(x_t, V_p) \\geq 0$. In fact, $\\|V_p^\\top x_t\\|_2 = \\cos \\theta(x_t, V_p)$, where $\\theta(x_t, V_p) \\in [0, \\frac{\\pi}{2}]$ represents the principal angle [Golub and Van Loan, 1996] between $x_t$ and the space of the leading eigenvectors $\\mathrm{span}(V_p)$. Particularly, it is worth noting that\n\n$$\\theta(x_t, V_p) = \\min_{v \\in \\mathrm{span}(V_p)} \\theta(x_t, v),$$\n\nwhere $\\theta(x_t, v) \\in [0, \\frac{\\pi}{2}]$. That is, the angle between a vector $x$ and a $p$-dimensional subspace $\\mathrm{span}(V_p)$ is equal to the minimum angle between $x$ and any $v \\in \\mathrm{span}(V_p)$. Thus, we can write\n\n$$\\psi(x_t, V_p) = \\min_{v \\in V_{p,1}} \\psi(x_t, v),$$\n\nwhere $V_{p,1} \\triangleq \\{v \\in \\mathrm{span}(V_p) : \\|v\\|_2 = 1\\}$ and $\\psi(x_t, v) = -2 \\log |v^\\top x_t| = -2 \\log \\cos \\theta(x_t, v)$ for any $v \\in V_{p,1}$. This property will play an important role in our analysis. It is easy to see that if $\\psi(x_t, V_p)$ goes to 0, $x_t$ must converge to a certain vector $v \\in V_{p,1}$. 
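To make Algorithm 1 and the potential function concrete, here is a minimal Python sketch under loudly stated assumptions: A is taken to be diagonal, so that the leading eigenvectors are the first p coordinate axes and the potential is easy to evaluate; the inner solver is plain gradient descent, used as a simple stand-in for AGD; and the constant step-size 0.1 is hand-picked for the demo rather than derived from the analysis. All function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def normalize(vec):
    norm = math.sqrt(dot(vec, vec))
    return [v / norm for v in vec]

def solve_subproblem(d, x, inner_steps):
    '''Approximately minimize l(z) = z^T B^{-1} z / 2 - x^T z, with B^{-1} = diag(d).

    Plain gradient descent (a stand-in for AGD), warm-started at
    z0 = x / (x^T B^{-1} x) as in Line 3 of Algorithm 1.
    '''
    step = 1.0 / max(d)                                   # 1 / largest eigenvalue of B^{-1}
    scale = sum(di * xi * xi for di, xi in zip(d, x))     # x^T B^{-1} x
    z = [xi / scale for xi in x]
    for _ in range(inner_steps):
        z = [zi - step * (di * zi - xi) for zi, di, xi in zip(z, d, x)]
    return z

def potential(x, p):
    '''psi(x, V_p) = -2 log ||V_p^T x||_2 when V_p holds the first p coordinate axes.'''
    return -2.0 * math.log(math.sqrt(sum(xi * xi for xi in x[:p])))

def si_rg_eigs(eigs, sigma, x0, alpha, outer_steps, inner_steps, p):
    d = [sigma - lam for lam in eigs]                     # diagonal of sigma*I - A
    x = normalize(x0)
    history = [potential(x, p)]
    for _ in range(outer_steps):
        y = solve_subproblem(d, x, inner_steps)           # y approximates B x
        xy = dot(x, y)
        g_hat = [-(yi - xy * xi) for yi, xi in zip(y, x)] # -(I - x x^T) y
        x = normalize([xi - alpha * gi for xi, gi in zip(x, g_hat)])
        history.append(potential(x, p))
    return x, history

eigs = [1.0, 1.0, 0.9, 0.5, 0.1]          # lambda_1 has multiplicity p = 2
sigma = 1.0 + 1.0 * (1.0 - 0.9)           # sigma = lambda_1 + c * Delta_p with c = 1
x_final, psi_history = si_rg_eigs(eigs, sigma, [1.0] * 5, 0.1, 40, 30, 2)
```

Because the top eigenvalue is repeated here, the iterate converges to some unit vector in the span of the first two axes rather than to a single fixed eigenvector, which is exactly what the potential measures.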
We also use another potential function\n\n$$\\sin^2 \\theta(x_t, V_p) = 1 - \\|V_p^\\top x_t\\|_2^2.$$\n\nOur main results then can be stated as follows.\n\nTheorem 3.2 Given a shift parameter $\\sigma = \\lambda_{1} + c\\Delta_p$ for $c \\in (0, \\frac{3}{2}]$, Algorithm 1 with fixed step-sizes and using accelerated gradient descent as a least-squares solver is able to converge to one of the leading eigenvectors of $A$, i.e., $\\psi(x_T, V_p) < \\epsilon$, after $T = O(\\log \\frac{\\psi(x_0, V_p)}{\\epsilon})$ gradient steps, and the overall complexity is $O(\\sqrt{\\frac{\\lambda_{1}}{\\Delta_p}} \\log \\frac{\\lambda_{1}}{\\Delta_p} \\log \\frac{\\psi(x_0, V_p)}{\\epsilon})$.\n\nTo prove the theorem, we need the following auxiliary lemmas whose proofs are given in the supplementary material.\n\nLemma 3.3 $\\tau_p \\sin^2 \\theta(x, V_p) \\leq \\mu_1 - x^\\top B x \\leq (\\mu_1 - \\mu_n) \\sin^2 \\theta(x, V_p)$ and $\\|\\tilde{\\nabla} h(x)\\|_2 \\leq 2 \\mu_1 \\sin \\theta(x, V_p)$.\n\nLemma 3.4 $\\frac{x}{1+x} \\leq \\log(1 + x) \\leq x$ for any $x > -1$, while for any $x \\in (0, 1)$ it holds that $\\frac{x}{-\\log(1-x)} \\geq \\frac{1}{1 - \\log(1-x)}$.\n\nLemma 3.5 [Wang et al., 2017] Let $z_\\star = \\arg\\min l_t(z) = B x_{t-1}$, $\\xi_t = y_t - z_\\star$, and $\\epsilon_t = l_t(y_t) - l_t(z_\\star)$. Then $\\|\\xi_t\\|_2 \\leq \\sqrt{2 \\mu_1 \\epsilon_t}$ and $l_t(z_0) - l_t(z_\\star) \\leq \\frac{\\mu_1^2}{2 \\mu_n} \\sin^2 \\theta(x_{t-1}, V_p)$. Moreover, Nesterov's accelerated gradient descent takes $O(\\sqrt{\\frac{\\lambda_{1}}{\\Delta_p}} \\log \\frac{l_t(z_0) - l_t(z_\\star)}{\\epsilon_t})$ complexity for solving Problem (3) to sub-optimality $\\epsilon_t$.\n\nSince the least-squares solver for Problem (3) is warm-started with $z_0 = \\frac{x_{t-1}}{x_{t-1}^\\top B^{-1} x_{t-1}}$, the initial error $l_t(z_0) - l_t(z_\\star)$ is much smaller than the error from a random initial $z_0$. We can also try other least-squares solvers, such as SVRG [Johnson and Zhang, 2013], accelerated SVRG [Garber et al., 2016], and coordinate descent [Wang et al., 2017].\n\nProof of Theorem 3.2\n\nProof For brevity, denote $\\theta_t = \\theta(x_t, V_p)$ and $\\psi_t = \\psi(x_t, V_p)$ throughout the proof. First, for any $v \\in V_{p,1}$,\n\n$$\\begin{aligned} \\psi(x_{t+1}, v) &= -2 \\log |v^\\top x_{t+1}| \\\\\\\\ &= -2 \\log |v^\\top (x_t - \\alpha_{t+1} \\hat{g}_t)| + 2 \\log \\|x_t - \\alpha_{t+1} \\hat{g}_t\\|_2. \\end{aligned} \\tag{4}$$\n\nFrom Lemma 3.5 and Equation (2), we can write $\\hat{g}_t = \\tilde{\\nabla} h(x_t) - (I - x_t x_t^\\top) \\xi_t$, where $\\xi_t$ is the error in the approximate negative Euclidean gradient in Line 3 of Algorithm 1 incurred from least-squares subproblems (3). We then can expand\n\n$$\\begin{aligned} |v^\\top (x_t - \\alpha_{t+1} \\hat{g}_t)|^2 &= |v^\\top (x_t - \\alpha_{t+1} \\tilde{\\nabla} h(x_t)) + \\alpha_{t+1} v^\\top (I - x_t x_t^\\top) \\xi_t|^2 \\\\\\\\ &\\geq |v^\\top (x_t - \\alpha_{t+1} \\tilde{\\nabla} h(x_t))|^2 + \\alpha_{t+1}^2 |v^\\top (I - x_t x_t^\\top) \\xi_t|^2 - 2 \\alpha_{t+1} |v^\\top (x_t - \\alpha_{t+1} \\tilde{\\nabla} h(x_t))| \\cdot |v^\\top (I - x_t x_t^\\top) \\xi_t| \\\\\\\\ &\\geq |v^\\top (x_t - \\alpha_{t+1} \\tilde{\\nabla} h(x_t))|^2 - 2 \\alpha_{t+1} |v^\\top (x_t - \\alpha_{t+1} \\tilde{\\nabla} h(x_t))| \\cdot |v^\\top (I - x_t x_t^\\top) \\xi_t| \\\\\\\\ &\\geq |v^\\top (x_t - \\alpha_{t+1} \\tilde{\\nabla} h(x_t))|^2 \\left(1 - \\frac{2 \\alpha_{t+1} \\|v^\\top (I - x_t x_t^\\top)\\|_2 \\|\\xi_t\\|_2}{|v^\\top (x_t - \\alpha_{t+1} \\tilde{\\nabla} h(x_t))|}\\right), \\end{aligned} \\tag{5}$$\n\nwhere the last inequality is by the Cauchy-Schwarz inequality. 
To proceed, we note that\n\n$$v^\\top \\tilde{\\nabla} h(x_t) = -(v^\\top B x_t - v^\\top x_t x_t^\\top B x_t) = -(\\mu_1 - x_t^\\top B x_t) v^\\top x_t.$$\n\nTogether with Lemma 3.3, we then have\n\n$$|v^\\top (x_t - \\alpha_{t+1} \\tilde{\\nabla} h(x_t))| = (1 + \\alpha_{t+1} (\\mu_1 - x_t^\\top B x_t)) |v^\\top x_t| \\geq (1 + \\alpha_{t+1} \\tau_p \\sin^2 \\theta_t) |v^\\top x_t|.$$\n\nIn addition, one can write\n\n$$\\|v^\\top (I - x_t x_t^\\top)\\|_2 = \\|v^\\top x_t^\\perp (x_t^\\perp)^\\top\\|_2 = (v^\\top x_t^\\perp (x_t^\\perp)^\\top v)^{1/2} = (v^\\top (I - x_t x_t^\\top) v)^{1/2} = (1 - (v^\\top x_t)^2)^{1/2} = \\sin \\theta(x_t, v), \\tag{6}$$\n\nand\n\n$$\\|x_t - \\alpha_{t+1} \\hat{g}_t\\|_2^2 = (x_t - \\alpha_{t+1} \\hat{g}_t)^\\top (x_t - \\alpha_{t+1} \\hat{g}_t) = 1 + \\alpha_{t+1}^2 \\|\\hat{g}_t\\|_2^2 \\leq 1 + 2 \\alpha_{t+1}^2 (\\|\\tilde{\\nabla} h(x_t)\\|_2^2 + \\|\\xi_t\\|_2^2) \\leq 1 + 2 \\alpha_{t+1}^2 (4 \\mu_1^2 \\sin^2 \\theta_t + \\|\\xi_t\\|_2^2), \\tag{7}$$\n\nwhere the last inequality is due to Lemma 3.3. By (4)-(7), one can arrive at\n\n$$\\psi(x_{t+1}, v) \\leq \\psi(x_t, v) - 2 \\log(1 + \\alpha_{t+1} \\tau_p \\sin^2 \\theta_t) - \\log\\left(1 - \\frac{2 \\alpha_{t+1} \\|\\xi_t\\|_2 \\tan \\theta(x_t, v)}{1 + \\alpha_{t+1} \\tau_p \\sin^2 \\theta_t}\\right) + \\log(1 + 2 \\alpha_{t+1}^2 (4 \\mu_1^2 \\sin^2 \\theta_t + \\|\\xi_t\\|_2^2)).$$\n\nTaking the minimum with respect to $v$ over $V_{p,1}$ on both sides and noting that $\\|\\xi_t\\|_2 \\leq \\sqrt{2 \\mu_1 \\epsilon_t}$ by Lemma 3.5, we then get\n\n$$\\psi_{t+1} \\leq \\psi_t - 2 \\log(1 + \\alpha_{t+1} \\tau_p \\sin^2 \\theta_t) - \\log\\left(1 - \\frac{2 \\alpha_{t+1} \\sqrt{2 \\mu_1 \\epsilon_t} \\tan \\theta_t}{1 + \\alpha_{t+1} \\tau_p \\sin^2 \\theta_t}\\right) + \\log(1 + 2 \\alpha_{t+1}^2 (4 \\mu_1^2 \\sin^2 \\theta_t + 2 \\mu_1 \\epsilon_t)).$$\n\nLetting $\\epsilon_t = \\frac{\\tau_p^2}{\\mu_1} \\frac{\\sin^2(2\\theta_t)}{32}$, so that $\\sqrt{2 \\mu_1 \\epsilon_t} = \\frac{\\tau_p}{2} \\sin \\theta_t \\cos \\theta_t$, the above inequality can be reduced:\n\n$$\\begin{aligned} \\psi_{t+1} &\\leq \\psi_t - 2 \\log(1 + \\alpha_{t+1} \\tau_p \\sin^2 \\theta_t) - \\log\\left(1 - \\frac{\\alpha_{t+1} \\tau_p \\sin^2 \\theta_t}{1 + \\alpha_{t+1} \\tau_p \\sin^2 \\theta_t}\\right) + \\log\\left(1 + 2 \\alpha_{t+1}^2 \\left(4 \\mu_1^2 \\sin^2 \\theta_t + \\frac{\\tau_p^2}{4} \\sin^2 \\theta_t \\cos^2 \\theta_t\\right)\\right) \\\\\\\\ &\\leq \\psi_t - \\log(1 + \\alpha_{t+1} \\tau_p \\sin^2 \\theta_t) + \\log(1 + 10 \\alpha_{t+1}^2 \\mu_1^2 \\sin^2 \\theta_t) \\\\\\\\ &\\leq \\psi_t - \\frac{\\alpha_{t+1} \\tau_p \\sin^2 \\theta_t}{1 + \\alpha_{t+1} \\tau_p \\sin^2 \\theta_t} + 10 \\alpha_{t+1}^2 \\mu_1^2 \\sin^2 \\theta_t \\quad \\text{(by Lemma 3.4)} \\\\\\\\ &\\leq \\psi_t - \\alpha_{t+1} \\left(\\frac{\\tau_p}{1 + \\alpha_{t+1} \\tau_p} - 10 \\alpha_{t+1} \\mu_1^2\\right) \\sin^2 \\theta_t. \\end{aligned}$$\n\nThus, if $\\frac{\\tau_p}{2(1 + \\alpha_{t+1} \\tau_p)} - 10 \\alpha_{t+1} \\mu_1^2 > 0$, i.e., $\\alpha_{t+1} < \\frac{\\tau_p}{20 \\mu_1^2 (1 + \\alpha_{t+1} \\tau_p)}$, we then get $\\psi_{t+1} < \\psi_t$ and $\\psi_{t+1} \\leq \\psi_t - \\frac{\\alpha_{t+1} \\tau_p \\sin^2 \\theta_t}{2(1 + \\alpha_{t+1} \\tau_p)}$. Note that\n\n$$\\sin^2 \\theta_t = \\frac{\\sin^2 \\theta_t}{-\\log(1 - \\sin^2 \\theta_t)} \\cdot \\psi_t \\geq \\frac{1}{1 - \\log(1 - \\sin^2 \\theta_t)} \\cdot \\psi_t = \\frac{\\psi_t}{1 + \\psi_t} \\geq \\frac{\\psi_t}{1 + \\psi_0},$$\n\nwhere the first inequality is by Lemma 3.4. If $\\alpha_t \\equiv \\alpha$, we then can arrive at\n\n$$\\psi_T \\leq \\left(1 - \\frac{\\alpha \\tau_p}{2(1 + \\alpha \\tau_p)} \\cdot \\frac{1}{1 + \\psi_0}\\right) \\psi_{T-1} \\leq \\left(1 - \\frac{\\alpha \\tau_p}{2(1 + \\alpha \\tau_p)} \\cdot \\frac{1}{1 + \\psi_0}\\right)^T \\psi_0 \\leq \\exp\\left\\{-T \\frac{\\alpha \\tau_p}{2(1 + \\alpha \\tau_p)} \\cdot \\frac{1}{1 + \\psi_0}\\right\\} \\psi_0 \\triangleq \\Xi.$$\n\nSetting $\\Xi = \\epsilon$ and noting $\\alpha < \\frac{\\tau_p}{20 \\mu_1^2 (1 + \\alpha \\tau_p)}$ yields\n\n$$T = \\frac{2(1 + \\alpha \\tau_p)(1 + \\psi_0)}{\\alpha \\tau_p} \\log \\frac{\\psi_0}{\\epsilon} = O\\left(\\frac{1}{\\alpha \\tau_p} \\log \\frac{\\psi_0}{\\epsilon}\\right) = O\\left(\\left(\\frac{\\mu_1}{\\tau_p}\\right)^2 \\log \\frac{\\psi_0}{\\epsilon}\\right) = O\\left(\\log \\frac{\\psi_0}{\\epsilon}\\right).$$\n\nOn the other hand, by Lemma 3.5, the complexity for computing $y_t$ is\n\n$$\\begin{aligned} O\\left(\\sqrt{\\frac{\\lambda_{1}}{\\Delta_p}} \\log \\frac{l_t(z_0) - l_t(B x_{t-1})}{\\epsilon_t}\\right) &= O\\left(\\sqrt{\\frac{\\lambda_{1}}{\\Delta_p}} \\log \\frac{\\frac{\\mu_1^2}{\\mu_n} \\sin^2 \\theta_t}{\\frac{\\tau_p^2}{\\mu_1} \\frac{\\sin^2(2\\theta_t)}{32}}\\right) = O\\left(\\sqrt{\\frac{\\lambda_{1}}{\\Delta_p}} \\left(\\log \\frac{\\mu_1}{\\mu_n} + \\psi_t\\right)\\right) \\\\\\\\ &= O\\left(\\sqrt{\\frac{\\lambda_{1}}{\\Delta_p}} \\left(\\log \\frac{\\mu_1}{\\mu_n} + \\psi_0\\right)\\right) = O\\left(\\sqrt{\\frac{\\lambda_{1}}{\\Delta_p}} \\log \\frac{\\lambda_{1}}{\\Delta_p}\\right). \\end{aligned}$$\n\nThus, the overall complexity is $O(\\sqrt{\\frac{\\lambda_{1}}{\\Delta_p}} \\log \\frac{\\lambda_{1}}{\\Delta_p} \\log \\frac{\\psi_0}{\\epsilon}) = \\tilde{O}(\\sqrt{\\frac{\\lambda_{1}}{\\Delta_p}})$. $\\square$\n\nFigure 1: Synthetic data. Panels: relative function error and potential function $\\sin^2 \\theta_t$.\n\n4 Experiments\n\nWe test our algorithm on both synthetic and real data. Throughout experiments, our SI-rgEIGS solver is warm-started by a few power iterations, and four iterations of Nesterov's AGD are run to approximately solve the least-squares subproblems. 
The same initial $x_0$ is used for different solvers. All the algorithms are implemented in MATLAB and run single-threaded. All the ground-truth information is obtained by MATLAB's eigs function for benchmarking purposes. The implementation of our algorithm is available at https://github.com/zhiqiangxu2001/SI-rgEIGS.\n\n4.1 Synthetic Data\n\nWe follow Shamir [2015] to generate synthetic data. Note that $A$'s full eigenvalue decomposition can be written as $A = V_n \\Sigma V_n^\\top$, where $\\Sigma$ is diagonal. Thus, it suffices to generate a random orthogonal matrix $V_n$ and set $\\Sigma = \\mathrm{diag}(1, 1 - \\Delta, 1 - 1.1\\Delta, \\cdots, 1 - 1.4\\Delta, g_1/n, \\cdots, g_{n-6}/n)$ with $g_i$ being standard normal samples, i.e., $g_i \\sim N(0, 1)$. Here we set $n = 1000$ and $\\sigma = 1.005$, and three solvers are compared: Riemannian gradient descent solver with/without shift-and-invert preconditioning under the constant step-size setting, and the shift-and-inverted power method [Garber et al., 2016]. Constant step-sizes are hand-tuned. Figure 1 reports the performance of the three algorithms, in terms of relative function error $(f(x_t) - f(v_1))/f(v_1)$ or the potential $\\sin^2 \\theta_t$, where we use $f(x) = x^\\top A x$ and then $f(v_1) = \\lambda_{1} = \\max_{x \\in \\mathbb{R}^{n \\times 1}:\\, \\|x\\|_2 = 1} f(x)$. First, we see that Riemannian gradient descent with shift-and-invert preconditioning indeed outperforms the counterpart without preconditioning, which is also worse than the SI-PM. This demonstrates the effectiveness of the shift-and-invert preconditioning for acceleration again. Second, unexpectedly, SI-rgEIGS runs faster than SI-PM, despite an extra log factor in theory. This may hint at the possibility of removing this factor in the analysis of our method. Last, note that convergence behaviors are consistent in terms of the two quality measures.\n\n[Figure 1 plots: relative function error $(f(x_t) - f(v_1))/f(v_1)$ and $\\sin^2 \\theta(x_t, v_1)$ against time (seconds) for two settings titled $5 \\times 10^{-3}$ and $5 \\times 10^{-4}$; curves are rgEIGS, SI-rgEIGS, and SI-PM, with legend values 1.84, 1.92, 2.00 (rgEIGS) and 0.012, 0.016, 0.020 (SI-rgEIGS) in the first setting and 1.37, 1.99, 2.00 (rgEIGS) and 0.011, 0.015, 0.019 (SI-rgEIGS) in the second.]\n\n4.2 Real Data\n\nWe now demonstrate the performance of Algorithm 1 on real data from the sparse matrix collection, and also compare with the accelerated power method with optimal momentum $\\beta = \\lambda_{2}^2/4$ (abbreviated as APM-OM) [Sa et al., 2017]. However, two issues need to be fixed. First, the crude phase of Garber and Hazan [2015], Wang et al. [2017] for locating the shift parameter is hard to use as there are three parameters that need to be tuned. We use heuristics based on Lemma 3.3. The lemma shows that $\\lambda_{1} \\leq x_t^\\top A x_t + (\\lambda_{1} - \\lambda_{n}) \\sin^2 \\theta_t$ and $\\|\\tilde{\\nabla} f(x_t)\\|_2 \\leq 2 \\lambda_{1} \\sin \\theta_t$. Then we have the upper bound on $\\lambda_{1}$: $\\sigma = x_{T_c}^\\top A x_{T_c} + \\beta \\|\\tilde{\\nabla} f(x_{T_c})\\|_2^2$ for proper constant $\\beta > \\frac{1}{2}$ and warm-start $x_{T_c}$ from the crude phase. We find that setting $\\beta = 1/\\|\\tilde{\\nabla} f(x_{T_c})\\|_2$ works well on our data. Second, hand-tuning of step-sizes, even for constant step-sizes, is a difficult task. We thus use an automatic step-size scheme, specifically, the Barzilai-Borwein (BB) step-size, which is a non-monotone step-size scheme and performs well in practice [Wen and Yin, 2013]. In our context, it is set as follows:\n\n$$\\alpha_{t+1} = \\frac{\\|x_t - x_{t-1}\\|_2^2}{|(x_t - x_{t-1})^\\top (\\hat{g}_t - \\hat{g}_{t-1})|}, \\quad \\text{or} \\quad \\alpha_{t+1} = \\frac{|(x_t - x_{t-1})^\\top (\\hat{g}_t - \\hat{g}_{t-1})|}{\\|\\hat{g}_t - \\hat{g}_{t-1}\\|_2^2}.$$\n\nNote that we use inexact Riemannian gradients $\\hat{g}_t$ here, instead of exact ones $\\tilde{\\nabla} h(x)$ as in the traditional case. Nonetheless, it still performs well and significantly better than the shift-and-inverted power method as observed in Figure 2. See the supplementary material for the description of the real data.\n\nFigure 2: Real data.\n\n5 Discussions\n\nIn this work, we investigated Riemannian gradient descent with shift-and-invert preconditioning for the leading eigenvector computation on the effect of step-size schemes, in comparison to the recently popular shift-and-inverted power method. Specifically, the constant step-size scheme and the Barzilai-Borwein (BB) step-size scheme were considered theoretically and/or empirically. The algorithm was theoretically analyzed under the constant step-size setting and shown for the first time to be able to achieve a rate of the type $\\tilde{O}(\\sqrt{\\frac{\\lambda_{1}}{\\lambda_{1}-\\lambda_{p+1}}})$ and a logarithmic dependence on the initial iterate. It is a nearly biquadratic improvement for the gradient descent solver, covering both $\\Delta_1 > 0$ and $\\Delta_1 = 0$. Experimental results demonstrated that the shift-and-invert preconditioning can indeed accelerate the gradient descent solver. Unexpectedly, the adaptive step-size setting with the shift-and-inverted power method is outperformed by the considered step-size settings, especially the BB step-size scheme on real data, albeit with a provable optimal rate. 
For future work, we may further investigate if the log factor $\\log\\frac{\\lambda_1}{\\lambda_1 - \\lambda_{p+1}}$ can be removed from the overall complexity and test our algorithms with other least-squares solvers for deeper understanding of their performance.\n\n[Figure 2 plots: ( f(x_t) - f(v_1) ) / f(v_1) and sin^2(x_t, v_1) versus time (seconds) on hangGlider5, Boeing35, indef_d, indef_a, dimacs10_ct, and dimacs10_nv, comparing SI-rgEIGS, SI-PM, and APM-OM.]\n\nAcknowledgments\n\nThe authors would like to thank the reviewers, AC, and SAC for their valuable comments.\n\nReferences\n\nP-A Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2008.\n\nZeyuan Allen-Zhu and Yuanzhi Li. 
Even faster svd decomposition yet without agonizing pain. In Advances in Neural Information Processing Systems, pages 974\u2013982, 2016.\n\nRaman Arora, Andrew Cotter, Karen Livescu, and Nathan Srebro. Stochastic optimization for PCA and PLS. In 50th Annual Allerton Conference on Communication, Control, and Computing, Allerton 2012, Allerton Park & Retreat Center, Monticello, IL, USA, October 1-5, 2012, pages 861\u2013868, 2012. doi: 10.1109/Allerton.2012.6483308. URL https://doi.org/10.1109/Allerton.2012.6483308.\n\nRaman Arora, Andrew Cotter, and Nati Srebro. Stochastic optimization of PCA with capped MSG. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 1815\u20131823, 2013.\n\nMaria-Florina Balcan, Simon Shaolei Du, Yining Wang, and Adams Wei Yu. An improved gap-dependency analysis of the noisy power method. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23-26, 2016, pages 284\u2013309, 2016. URL http://jmlr.org/proceedings/papers/v49/balcan16a.html.\n\nAkshay Balsubramani, Sanjoy Dasgupta, and Yoav Freund. The fast convergence of incremental pca. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3174\u20133182. Curran Associates, Inc., 2013.\n\nJianqing Fan, Qiang Sun, Wen-Xin Zhou, and Ziwei Zhu. Principal component analysis for big data. arXiv preprint arXiv:1801.01602, 2018.\n\nChao Gao, Dan Garber, Nathan Srebro, Jialei Wang, and Weiran Wang. Stochastic canonical correlation analysis. 
CoRR, abs/1702.06533, 2017. URL http://arxiv.org/abs/1702.06533.\n\nDan Garber and Elad Hazan. Fast and simple pca via convex optimization. arXiv preprint arXiv:1509.05647, 2015.\n\nDan Garber, Elad Hazan, Chi Jin, Sham M. Kakade, Cameron Musco, Praneeth Netrapalli, and Aaron Sidford. Faster eigenvector computation via shift-and-invert preconditioning. In International Conference on Machine Learning, pages 2626\u20132634, 2016.\n\nGene H. Golub and Charles F. Van Loan. Matrix Computations (3rd Ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0-8018-5414-8.\n\nNathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217\u2013288, 2011. doi: 10.1137/090771806.\n\nMoritz Hardt and Eric Price. The noisy power method: A meta algorithm with applications. In Advances in Neural Information Processing Systems, pages 2861\u20132869, 2014.\n\nTrevor Hastie, Rahul Mazumder, Jason D. Lee, and Reza Zadeh. Matrix completion and low-rank SVD via fast alternating least squares. Journal of Machine Learning Research, 16:3367\u20133402, 2015. URL http://dl.acm.org/citation.cfm?id=2912106.\n\nRie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 315\u2013323, 2013.\n\nQi Lei, Kai Zhong, and Inderjit S. Dhillon. 
Coordinate-wise power method. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 2056\u20132064, 2016. URL http://papers.nips.cc/paper/6103-coordinate-wise-power-method.\n\nGuangcan Liu and Ping Li. Recovery of coherent data via low-rank dictionary pursuit. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 1206\u20131214, 2014.\n\nHuikang Liu, Weijie Wu, and Anthony Man-Cho So. Quadratic optimization with orthogonality constraints: Explicit lojasiewicz exponent and linear convergence of line-search methods. In ICML, pages 1158\u20131167, 2016.\n\nCameron Musco and Christopher Musco. Randomized block krylov methods for stronger and faster approximate singular value decomposition. In Advances in Neural Information Processing Systems, pages 1396\u20131404, 2015.\n\nYurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer Publishing Company, Incorporated, 1 edition, 2014. ISBN 1461346916, 9781461346913.\n\nAndrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 14 [Neural Information Processing Systems: Natural and Synthetic, NIPS 2001, December 3-8, 2001, Vancouver, British Columbia, Canada], pages 849\u2013856, 2001. URL http://www-2.cs.cmu.edu/Groups/NIPS/NIPS2001/papers/psgz/AA35.ps.gz.\n\nChristopher De Sa, Bryan D. 
He, Ioannis Mitliagkas, Christopher R\u00e9, and Peng Xu. Accelerated stochastic power iteration. CoRR, abs/1707.02670, 2017. URL http://arxiv.org/abs/1707.02670.\n\nOhad Shamir. A stochastic PCA and SVD algorithm with an exponential convergence rate. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 144\u2013152, 2015.\n\nOhad Shamir. Fast stochastic algorithms for SVD and PCA: convergence properties and convexity. In International Conference on Machine Learning, pages 248\u2013256, 2016a.\n\nOhad Shamir. Convergence of stochastic gradient descent for PCA. In ICML, pages 257\u2013265, 2016b.\n\nJialei Wang, Weiran Wang, Dan Garber, and Nathan Srebro. Efficient coordinate-wise leading eigenvector computation. CoRR, abs/1702.07834, 2017. URL http://arxiv.org/abs/1702.07834.\n\nZaiwen Wen and Wotao Yin. A feasible method for optimization with orthogonality constraints. Mathematical Programming, 142(1-2):397\u2013434, 2013. doi: 10.1007/s10107-012-0584-1. URL http://dx.doi.org/10.1007/s10107-012-0584-1.\n\nZhiqiang Xu and Xin Gao. On truly block eigensolvers via riemannian optimization. In International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain, pages 168\u2013177, 2018. URL http://proceedings.mlr.press/v84/xu18b.html.\n\nZhiqiang Xu, Yiping Ke, and Xin Gao. A fast stochastic riemannian eigensolver. In UAI, 2017.\n\nZhiqiang Xu, Xin Cao, and Xin Gao. 
Convergence analysis of gradient descent for eigenvector computation. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 2933\u20132939. International Joint Conferences on Artificial Intelligence Organization, 7 2018. doi: 10.24963/ijcai.2018/407. URL https://doi.org/10.24963/ijcai.2018/407.", "award": [], "sourceid": 1485, "authors": [{"given_name": "Zhiqiang", "family_name": "Xu", "institution": "Baidu Inc."}]}