{"title": "Diffusion Approximations for Online Principal Component Estimation and Global Convergence", "book": "Advances in Neural Information Processing Systems", "page_first": 645, "page_last": 655, "abstract": "In this paper, we propose to adopt the diffusion approximation tools to study the dynamics of Oja's iteration which is an online stochastic gradient method for the principal component analysis. Oja's iteration maintains a running estimate of the true principal component from streaming data and enjoys less temporal and spatial complexities. We show that the Oja's iteration for the top eigenvector generates a continuous-state discrete-time Markov chain over the unit sphere. We characterize the Oja's iteration in three phases using diffusion approximation and weak convergence tools. Our three-phase analysis further provides a finite-sample error bound for the running estimate, which matches the minimax information lower bound for PCA under the additional assumption of bounded samples.", "full_text": "Diffusion Approximations for Online Principal\nComponent Estimation and Global Convergence\n\nChris Junchi Li\n\nMengdi Wang\n\nPrinceton University\n\nHan Liu\n\nDepartment of Operations Research and Financial Engineering, Princeton, NJ 08544\n\n{junchil,mengdiw,hanliu}@princeton.edu\n\nTong Zhang\nTencent AI Lab\n\nShennan Ave, Nanshan District, Shenzhen, Guangdong Province 518057, China\n\ntongzhang@tongzhang-ml.org\n\nAbstract\n\nIn this paper, we propose to adopt the diffusion approximation tools to study the\ndynamics of Oja\u2019s iteration which is an online stochastic gradient descent method\nfor the principal component analysis. Oja\u2019s iteration maintains a running estimate\nof the true principal component from streaming data and enjoys less temporal\nand spatial complexities. We show that the Oja\u2019s iteration for the top eigenvector\ngenerates a continuous-state discrete-time Markov chain over the unit sphere. 
We\ncharacterize the Oja\u2019s iteration in three phases using diffusion approximation and\nweak convergence tools. Our three-phase analysis further provides a \ufb01nite-sample\nerror bound for the running estimate, which matches the minimax information\nlower bound for principal component analysis under the additional assumption of\nbounded samples.\n\n1\n\nIntroduction\n\nIn the procedure of Principal Component Analysis (PCA) we aim at learning the principal leading\neigenvector of the covariance matrix of a d-dimensional random vector Z from its independent\nand identically distributed realizations Z1, . . . , Zn. Let E[Z] = 0, and let the eigenvalues of \u03a3 be\n\u03bb1 > \u03bb2 \u2265 \u00b7\u00b7\u00b7 \u2265 \u03bbd > 0, then the PCA problem can be formulated as minimizing the expectation of\na nonconvex function:\n\nminimize \u2212 w(cid:62)E(cid:2)ZZ(cid:62)(cid:3) w,\n\n(1.1)\nwhere (cid:107) \u00b7 (cid:107) denotes the Euclidean norm. Since the eigengap \u03bb1 \u2212 \u03bb2 is nonzero, the solution to\n(1.1) is unique, denoted by w\u2217. The classical method of \ufb01nding the estimator of the \ufb01rst leading\neigenvector w\u2217 can be formulated as the solution to the empirical covariance problem as\n\nsubject to (cid:107)w(cid:107) = 1, w \u2208 Rd,\n\n(cid:107)w(cid:107)=1\n\nIn words, (cid:98)\u03a3(n) denotes the empirical covariance matrix for the \ufb01rst n samples. The estimator (cid:98)w(n)\nproduced via this process provides a statistical optimal solution (cid:98)w(n). 
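As a point of comparison, the batch estimator above amounts to forming the empirical covariance and taking its leading eigenvector. The following NumPy sketch is illustrative only (it is not code from the paper; the function name is ours), and makes the O(nd^2) time / O(d^2) space cost visible:

```python
import numpy as np

def batch_pca_top_eigvec(Z):
    """Classical batch estimate of the top principal component.

    Z: (n, d) array of mean-zero samples. Forms the empirical
    covariance matrix (O(n d^2) time, O(d^2) space) and returns the
    eigenvector attaining the argmin formulation above.
    """
    n, d = Z.shape
    Sigma_hat = (Z.T @ Z) / n            # empirical covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma_hat)
    return eigvecs[:, -1]                # eigenvalues ascend, so take the last column
```

Note the returned eigenvector is only determined up to sign, matching the sign ambiguity of the PCA objective.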
Precisely, [43] shows that the\nangle between any estimator (cid:101)w(n) that is a function of the \ufb01rst n samples and w\u2217 has the following\n\ni=1\n\nn\n\n(cid:98)w(n) = argmin\n\n\u2212w(cid:62)(cid:98)\u03a3(n)w,\n\nwhere (cid:98)\u03a3(n) \u2261 1\n\nn(cid:88)\n\nZ(i)(cid:16)\n\nZ(i)(cid:17)(cid:62)\n\n.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFigure 1: Left: an objective function for the top-1 PCA, where we use both the radius and heatmap to\nrepresent the function value at each point of the unit sphere. Right: A quiver plot on the unit sphere\ndenoting the directions of negative gradient of the PCA objective.\n\nE(cid:104)\n\nsin2 \u2220((cid:101)w(n), w\u2217)\n\n(cid:105) \u2265 c \u00b7 \u03c32\u2217 \u00b7 d \u2212 1\n\n,\n\nminimax lower bound\n\ninf(cid:101)w(n)\n\nsup\n\nn\n\nZ\u2208M(\u03c32\u2217,d)\n\nwhere c is some positive constant. Here the in\ufb01mum of (cid:101)w(n) is taken over all principal eigenvector\nestimators, and M(\u03c32\u2217, d) is the collection of all d-dimensional subgaussian distributions with mean\nzero and eigengap \u03bb1 \u2212 \u03bb2 > 0 satisfying \u03bb1\u03bb2/(\u03bb1 \u2212 \u03bb2)2 \u2264 \u03c32\u2217. Classical PCA method has time\ncomplexity O(nd2) and space complexity O(d2). The drawback of this method is that, when the\ndata samples are high-dimensional, computing and storage of a large empirical covariance matrix can\nbe costly.\nIn this paper we concentrate on the streaming or online method for PCA that processes online data\nand estimates the principal component sequentially without explicitly computing and storing the\n\nempirical covariance matrix (cid:98)\u03a3. 
Over thirty years ago, Oja [30] proposed an online PCA iteration that\n\n(1.2)\n\ncan be regarded as a projected stochastic gradient descent method as\n\n(cid:104)\nw(n\u22121) + \u03b2Z(n)(Z(n))(cid:62)w(n\u22121)(cid:105)\n\n.\n\nw(n) = \u03a0\n\n(1.3)\nHere \u03b2 is some positive learning rule or stepsize, and \u03a0 is de\ufb01ned as \u03a0w = (cid:107)w(cid:107)\u22121w for each\nnonzero vector w, namely, \u03a0 projects any vector onto the unit sphere S d\u22121 = {w \u2208 Rd | (cid:107)w(cid:107) = 1}.\nOja\u2019s iteration enjoys a less expensive time complexity O(nd) and space complexity O(d) and\nthereby has been used as an alternative method for PCA when both the dimension d and number of\nsamples n are large.\nIn this paper, we adopt the diffusion approximation method to characterize the stochastic algorithm\nusing Markov processes and its differential equation approximations. The diffusion process approxi-\nmation is a fundamental and powerful analytic tool for analyzing complicated stochastic process. By\nleveraging the tool of weak convergence, we are able to conduct a heuristic \ufb01nite-sample analysis of\nthe Oja\u2019s iteration and obtain a convergence rate which, by carefully choosing the stepsize \u03b2, matches\nthe PCA minimax information lower bound. Our analysis involves the weak convergence theory for\nMarkov processes [11], which is believed to have a potential for a broader class of stochastic algo-\nrithms for nonconvex optimization, such as tensor decomposition, phase retrieval, matrix completion,\nneural network, etc.\n\nOur Contributions We provide a Markov chain characterization of the stochastic process {w(n)}\ngenerated by the Oja\u2019s iteration with constant stepsize. We show that upon appropriate scalings, the\niterates as a Markov process weakly converges to the solution of an ordinary differential equation\nsystem, which is a multi-dimensional analogue to the logistic equations. 
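For concreteness, the update (1.3) can be sketched in a few lines of NumPy. This is an illustrative implementation (not code from the paper), and the diagonal spectrum used in the test is a synthetic stand-in for the covariance:

```python
import numpy as np

def oja_iteration(samples, beta, w0):
    """One pass of Oja's iteration (1.3): a stochastic gradient step
    followed by projection back onto the unit sphere.

    samples: iterable of d-dimensional vectors Z^(n);
    beta: constant stepsize; w0: starting point (normalized here).
    Each step costs O(d) time and O(d) memory.
    """
    w = w0 / np.linalg.norm(w0)
    for z in samples:
        w = w + beta * z * (z @ w)       # w + beta * Z Z^T w, computed in O(d)
        w = w / np.linalg.norm(w)        # the projection Pi onto S^{d-1}
    return w
```

Compared with the batch approach, no d-by-d matrix is ever formed, which is the point of the streaming method.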
Also locally around the\nneighborhood of a stationary point, upon a different scaling the process weakly converges to the\nmultidimensional Ornstein-Uhlenbeck processes. Moreover, we identify from differential equation\napproximations that the global convergence dynamics of the Oja\u2019s iteration has three distinct phases:\n\n2\n\n\fFigure 2: A simulation plot of Oja\u2019s method, marked with the three phases.\n\n(i) The initial phase corresponds to escaping from unstable stationary points;\n(ii) The second phase corresponds to fast deterministic crossing period;\n(iii) The third phase corresponds to stable oscillation around the true principal component.\n\nLastly, this is the \ufb01rst work that analyze the global rate of convergence analysis of Oja\u2019s iteration,\ni.e., the convergence rate does not have any initialization requirements.\n\nRelated Literatures This paper is a natural companion to paper by the authors\u2019 recent work [23]\nthat gives explicit rate analysis using a discrete-time martingale-based approach. In this paper, we\nprovide a much simpler and more insightful heuristic analysis based on diffusion approximation\nmethod under the additional assumption of bounded samples.\nThe idea of stochastic approximation for PCA problem can be traced back to Krasulina [19] published\nalmost \ufb01fty years ago. His work proposed an algorithm that is regarded as the stochastic gradient\ndescent method for the Rayleigh quotient. In contrast, Oja\u2019s iteration can be regarded as a projected\nstochastic gradient descent method. The method of using differential equation tools for PCA appeared\nin the \ufb01rst papers [19, 31] to prove convergence result to the principal component, among which, [31]\nalso analyze the subspace learning for PCA. See also [16, Chap. 
1] for a gradient \ufb02ow dynamical\nsystem perspective of Oja\u2019s iteration.\nThe convergence rate analysis of the online PCA iteration has been very few until the recent big data\ntsunami, when the need to handle massive amounts of data emerges. Recent works by [6, 10, 17, 34]\nstudy the convergence of online PCA from different perspectives, and obtain some useful rate results.\nOur analysis using the tools of diffusion approximations suggests a rate that is sharper than all existing\nresults, and our global convergence rate result poses no requirement for initialization.\n\nMore Literatures Our work is related to a very recent line of work [3, 13, 21, 33, 38\u201341] on\nthe global dynamics of nonconvex optimization with statistical structures. These works carefully\ncharacterize the global geometry of the objective functions, and in special, around the unstable\nstationary points including saddle points and local maximizers. To solve the optimization problem\nvarious algorithms were used, including (stochastic) gradient method with random initialization or\nnoise injection as well as variants of Newton\u2019s method. The unstable stationary points can hence be\navoided, enabling the global convergence to desirable local minimizers.\nOur diffusion process-based characterization of SGD is also related to another line of work [8, 10, 24,\n26, 37]. Among them, [10] uses techniques based on martingales in discrete time to quantify the global\n\n3\n\n\fconvergence of SGD on matrix decomposition problems. In comparison, our techniques are based on\nStroock and Varadhan\u2019s weak convergence of Markov chains to diffusion processes, which yield the\ncontinuous-time dynamics of SGD. The rest of these results mostly focus on analyzing continuous-\ntime dynamics of gradient descent or SGD on convex optimization problems. In comparison, we are\nthe \ufb01rst to characterize the global dynamics for nonconvex statistical optimization. 
In particular, the\n\ufb01rst and second phases of our characterization, especially the unstable Ornstein-Uhlenbeck process,\nare unique to nonconvex problems. Also, it is worth noting that, using the arguments of [26], we can\nshow that the diffusion process-based characterization admits a variational Bayesian interpretation of\nnonconvex statistical optimization. However, we do not pursue this direction in this paper.\nIn the mathematical programming and statistics communities, the computational and statistical\naspects of PCA are often studied separately. From the statistical perspective, recent developments\nhave focused on estimating principal components for very high-dimensional data. When the data\ndimension is much larger than the sample size, i.e., d (cid:29) n, classical method using decomposition of\nthe empirical convariance matrix produces inconsistent estimates [18, 29]. Sparsity-based methods\nhave been studied, such as the truncated power method studied by [45] and [44]. Other sparsity\nregularization methods for high dimensional PCA has been studied in [2, 7, 9, 18, 25, 42, 43, 46], etc.\nNote that in this paper we do not consider the high-dimensional regime and sparsity regularization.\nFrom the computational perspective, power iterations or the Lanczos method are well studied. These\niterative methods require performing multiple products between vectors and empirical covariance\nmatrices. Such operation usually involves multiple passes over the data, whose complexity may scale\nwith the eigengap and dimensions [20, 28]. Recently, randomized algorithms have been developed to\nreduce the computation complexity [12, 35, 36]. A critical trend today is to combine the computational\nand statistical aspects and to develop algorithmic estimator that admits fast computation as well as\ngood estimation properties. Related literatures include [4, 5, 10, 14, 27].\n\nOrganization \u00a72 introduces the settings and distributional assumptions. 
\u00a73 brie\ufb02y discusses the\nOja\u2019s iteration from the Markov processes perspective and characterizes that it globally admits\nordinary differential equation approximation upon appropriate scaling, and also stochastic differential\nequation approximation locally in the neighborhood of each stationary point. \u00a74 utilizes the weak\nconvergence results and provides a three-phase argument for the global convergence rate analysis,\nwhich is near-optimal for the Oja\u2019s iteration. Concluding remarks are provided in \u00a75.\n\n2 Settings\n\nIn this section, we present the basic settings for the Oja\u2019s iteration. The algorithm maintains a running\nestimate w(n) of the true principal component w\u2217, and updates it while receiving streaming samples\nfrom exterior data source. We summarize our distributional assumptions.\nAssumption 2.1. The random vectors Z \u2261 Z(1), . . . , Z(n) \u2208 Rd are independent and identically\ndistributed and have the following properties:\n\n(i) E[Z] = 0 and E(cid:2)ZZ(cid:62)(cid:3) = \u03a3;\n\n(ii) \u03bb1 > \u03bb2 \u2265 \u00b7\u00b7\u00b7 \u2265 \u03bbd > 0;\n(iii) There is a constant B such that (cid:107)Z(cid:107)2 \u2264 B.\n\nFor the easiness of presentation, we transform the iterates w(n) and de\ufb01ne the rescaled samples, as\nfollows. First we let the eigendecomposition of the covariance matrix be\n\n\u03a3 = E(cid:2)ZZ(cid:62)(cid:3) = U\u039bU(cid:62),\n\nwhere \u039b = diag(\u03bb1, \u03bb2, . . . , \u03bbd) is a diagonal matrix with diagonal entries \u03bb1, \u03bb2, . . . , \u03bbd, and U is\nan orthogonal matrix consisting of column eigenvectors of \u03a3. Clearly the \ufb01rst column of U is equal\nto the principal component w\u2217. Note that the diagonal decomposition might not be unique, in which\ncase we work with an arbitrary one. 
Second, let\n\nY (n) = U(cid:62)Z(n), v(n) = U(cid:62)w(n), v\u2217 = U(cid:62)w\u2217.\n\n(2.1)\n\nOne can easily verify that\n\nE[Y ] = 0,\n\nE(cid:2)Y Y (cid:62)(cid:3) = \u039b;\n\n4\n\n\fThe principal component of the rescaled random variable Y , which we denote by v\u2217, is equal to e1,\nwhere {e1, . . . , ed} is the canonical basis of Rd. By applying the orthonormal transformation U(cid:62)\nto the stochastic process {w(n)}, we obtain an iterative process {v(n) = U(cid:62)w(n)} in the rescaled\nspace:\n\n(cid:26)\n(cid:26)\n\nU(cid:62)w(n\u22121) + \u03b2U(cid:62)Z(n)(cid:16)\nY (n)(cid:17)(cid:62)\nv(n\u22121) + \u03b2Y (n)(cid:16)\n\nZ(n)(cid:17)(cid:62)\n(cid:27)\n\nv(n\u22121)\n\nv(n) = U(cid:62)w(n) = \u03a0\n\n= \u03a0\n\nUU(cid:62)w(n\u22121)\n\n(2.2)\n\n.\n\n(cid:27)\n\nMoreover, the angle processes associated with {w(n)} and {v(n)} are equivalent, i.e.,\n\n(2.3)\nTherefore it would be suf\ufb01cient to study the rescaled iteration v(n) in (2.2) and the transformed\niteration Y (n) throughout the rest of this paper.\n\n\u2220(w(n), w\u2217) = \u2220(v(n), v\u2217).\n\n3 A Theory of Diffusion Approximation for PCA\n\nIn this section we show that the stochastic iterates generated by the Oja\u2019s iteration can be approximated\nby the solution of an ODE system upon appropriate scaling, as long as \u03b2 is small. To work on the\napproximation we \ufb01rst observe that the iteration v(n), n = 0, 1, . . . generated by (2.2) forms a\ndiscrete-time, time-homogeneous Markov process that takes values on S d\u22121. Furthermore, v(n)\nholds strong Markov property.\n\n3.1 Global ODE Approximation\n\nTo state our results on differential equation approximations, let us de\ufb01ne a new process, which is\nobtained by rescaling the time index n according to the stepsize \u03b2\n\n(3.1)\nWe add the superscript \u03b2 in the notation to emphasize the dependence of the process on \u03b2. 
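Before proceeding, the angle-preserving property (2.3) of the orthonormal change of coordinates (2.1) is easy to verify numerically. A small sketch with a randomly drawn orthogonal U (illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
U, _ = np.linalg.qr(rng.standard_normal((d, d)))   # an arbitrary orthogonal matrix
w = rng.standard_normal(d); w /= np.linalg.norm(w)
w_star = rng.standard_normal(d); w_star /= np.linalg.norm(w_star)
v, v_star = U.T @ w, U.T @ w_star                  # rescaled coordinates as in (2.1)

def sin2_angle(a, b):
    """sin^2 of the angle between unit vectors a and b."""
    return 1.0 - (a @ b) ** 2

# The rescaling preserves norms and angles, which justifies studying
# the transformed iteration v^(n) in place of w^(n).
assert np.isclose(np.linalg.norm(v), 1.0)
assert np.isclose(sin2_angle(w, w_star), sin2_angle(v, v_star))
```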
We will\n\nshow that (cid:101)V \u03b2(t) converges weakly to a deterministic function V (t), as \u03b2 \u2192 0+.\n\nFurthermore, we can identify the limit V (t) as the closed-form solution to an ODE system. Under\n\nAssumption 2.1 and using an in\ufb01nitesimal generator analysis we have\n\n(cid:101)V \u03b2(t) \u2261 v\u03b2,((cid:98)t\u03b2\u22121(cid:99)).\n\n(cid:12)(cid:12)(cid:101)V \u03b2(t + \u03b2) \u2212 (cid:101)V \u03b2(t)(cid:12)(cid:12) = O(B\u03b2).\n(cid:105)\n(cid:104)(cid:101)V \u03b2(t + \u03b2) \u2212 (cid:101)V \u03b2(t)(cid:12)(cid:12)(cid:101)V \u03b2(t) = v\n\nIt follows that, as \u03b2 \u2192 0+, the in\ufb01nitesimal conditional variance tends to 0:\n= O(B\u03b2),\n\n\u03b2\u22121var\nand the in\ufb01nitesimal mean is\n\n\u03b2\u22121E(cid:104)(cid:101)V \u03b2(t + \u03b2) \u2212 (cid:101)V \u03b2(t)(cid:12)(cid:12)(cid:101)V \u03b2(t) = v\n\n=(cid:0)\u039b \u2212 V (cid:62)\u039bV(cid:1) V + O(B2\u03b22).\n\n(cid:105)\n\nUsing the classical weak convergence to diffusion argument [11, Corollary 4.2 in \u00a77.4], we obtain\nthe following result.\nTheorem 3.1. If v\u03b2,(0) converges weakly to some constant vector V o \u2208 S d\u22121 as \u03b2 \u2192 0+ then the\nMarkov process v\u03b2,((cid:98)t\u03b2\u22121(cid:99)) converges weakly to the solution V = V (t) to the following ordinary\ndifferential equation system\n\n=(cid:0)\u039b \u2212 V (cid:62)\u039bV(cid:1) V ,\n\n(3.2)\n\nwith initial values V (0) = V o.\n\ndV\ndt\n\nWe can straightforwardly check for sanity that the solution vector V (t) lies on the unit sphere\nS d\u22121, i.e., (cid:107)V (t)(cid:107) = 1 for all t \u2265 0. Written in coordinates V (t) = (V1(t), . . . , Vd(t))(cid:62), the ODE\nis expressed for k = 1, . . . 
, d\n\ndVk\ndt\n\n= Vk\n\n(\u03bbk \u2212 \u03bbi)V 2\ni .\n\nd(cid:88)\n\ni=1\n\n5\n\n\fOne can straightforwardly verify that the solution to (3.2) has\n\nVk(t) = (Z(t))\n\n\u22121/2 Vk(0) exp(\u03bbkt),\n\n(3.3)\n\nwhere Z(t) is the normalization function\n\nZ(t) =\n\nd(cid:88)\n\ni=1\n\n(V o\n\ni )2 exp(2\u03bbit).\n\n(cid:16)\n\n(cid:16)\n\n1 )2(cid:17)\n1 )2(cid:17)\n\nTo understand the limit function given by (3.3), we note that in the special case where \u03bb2 = \u00b7\u00b7\u00b7 = \u03bbd\n\nZ(t) = (V o\n\n1 )2 exp(2\u03bb1t) +\n\n1 \u2212 (V o\n\nexp(2\u03bb2t),\n\nand\n\n(V1(t))2 =\n\n(V o\n\n1 )2 exp(2\u03bb1t)\n1 \u2212 (V o\n\n(V o\n\n1 )2 exp(2\u03bb1t) +\n\n.\n\nexp(2\u03bb2t)\n\n(3.4)\n\nThis is the formula of the logistic curve. Hence analogously, V (t) in (3.3) is namely the generalized\nlogistic curves.\n\n3.2 Local Approximation by Diffusion Processes\n\nThe weak convergence to ODE theorem introduced in \u00a73.1 characterizes the global dynamics of the\nOja\u2019s iteration. Such approximation explains many behaviors, but neglected the presence of noise that\nplays a role in the algorithm. In this section we aim at understanding the Oja\u2019s iteration via stochastic\ndifferential equations (SDE). We refer the readers to [32] for more on basic concepts of SDE.\n\nIn this section, we instead show that under some scaling, the process admits an approximation\nof multidimensional Ornstein-Uhlenbeck process within a neighborhood of each of the unstable\nstationary points, both stable and unstable. Afterwards, we develop some weak convergence results\nto give a rough estimate on the rate of convergence of the Oja\u2019s iteration. For purposes of illustration\nand brevity, we restrict ourselves to the case of starting point v(0) being the stationary point ek for\nsome k = 1, . . . , d, and denote an arbitrary vector xk to be a (d \u2212 1)-dimensional vector that keeps\nall but the kth coordinate of x. 
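Returning to the ODE (3.2) of Section 3.1, the generalized logistic solution (3.3) can be checked against a direct Euler integration. The sketch below uses an arbitrary synthetic spectrum; it is a numerical sanity check, not part of the paper's analysis:

```python
import numpy as np

def ode_closed_form(V0, lams, t):
    """Generalized logistic solution (3.3) of (3.2):
    V_k(t) = Z(t)^{-1/2} V_k(0) exp(lambda_k t), with Z(t) the
    normalizer; note Z(t)^{1/2} is exactly the Euclidean norm of u."""
    u = V0 * np.exp(lams * t)
    return u / np.linalg.norm(u)

def ode_euler(V0, lams, t, steps=200000):
    """Direct Euler integration of dV/dt = (Lam - V^T Lam V) V."""
    V, h = V0.copy(), t / steps
    for _ in range(steps):
        V = V + h * (lams * V - (V @ (lams * V)) * V)
    return V / np.linalg.norm(V)
```

As t grows, the closed form converges to the top coordinate vector e1, consistent with the logistic-curve picture in (3.4).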
Using theory from [11] we conclude the following theorem.\nTheorem 3.2. Let k = 1, . . . , d be arbitrary. If \u03b2\u22121/2v\u03b2,(0)\nas \u03b2 \u2192 0+, then the Markov process\n\nconverges weakly to some Uo\n\nk \u2208 Rd\u22121\n\nk\n\n\u03b2\u22121/2v\u03b2,((cid:98)t\u03b2\u22121(cid:99))\n\ndUk(t) = \u2212(\u03bbkId\u22121 \u2212 \u039bk)Uk dt +(cid:0)\u03bbk\u039bk\n\nk\n\n(cid:1)1/2\n\nconverges weakly to the solution of the multidimensional stochastic differential equation\n\nwith initial values Uk(0) = Uo\n\n(3.5)\nk. Here Bk(t) is a standard (d \u2212 1)-dimensional Brownian motion. 1\nThe solution to (3.5) can be solved explicitly. We let for a matrix A \u2208 Rn\u00d7n the matrix expo-\nfor the\n\n, . . . , \u03bb1/2\npositive semide\ufb01nite diagonal matrix \u039b = diag(\u03bb1, . . . , \u03bbd). The solution to (3.5) is hence\n\nnentiation exp(A) as exp(A) =(cid:80)\u221e\nUk(t) = exp(cid:2)\u2212t(\u03bbkId\u22121 \u2212 \u039bk)(cid:3) U o\n\nexp(cid:2)(s \u2212 t)(\u03bbkId\u22121 \u2212 \u039bk)(cid:3) dBk(s),\n\nn=0(1/n!)An. Also, let \u039b1/2 = diag\n\nk +(cid:0)\u03bbk\u039bk\n\n(cid:1)1/2(cid:90) t\n\ndBk(t),\n\n\u03bb1/2\n1\n\n(cid:17)\n\n(cid:16)\n\nd\n\n0\n\nwhich is known as the multidimensional Ornstein-Uhlenbeck process, whose behavior depends on\nthe matrix \u2212(\u03bbkId\u22121 \u2212 \u039bk) and is discussed in details in \u00a74.\n\nBefore concluding this section, we emphasize that the weak convergence to diffusions results in\n\u00a73.1 and \u00a73.2 should be distinguished from the convergence of the Oja\u2019s iteration. 
From a random\nprocess theoretical perspective, the former one treats the weak convergence of \ufb01nite dimensional\ndistributions of a sequence of rescaled processes as \u03b2 tends to 0, while the latter one charaterizes the\nlong-time behavior of a single realization of iterates generated by algorithm for a \ufb01xed \u03b2 > 0.\n\n1 The reason we have a (d \u2212 1)-dimensional Ornstein-Uhlenbeck process is because the objective function\n\nof PCA is de\ufb01ned on a (d \u2212 1)-dimensional manifold S d\u22121 and has d \u2212 1 independent variables.\n\n6\n\n\f4 Global Three-Phase Analysis of Oja\u2019s Iteration\n\nPreviously \u00a73.1 and \u00a73.2 develop the tools of weak convergence to diffusion under global and local\nscalings. In this section, we apply these tools to analyze the dynamics of online PCA iteration in\nthree phases in sequel. For purposes of illustration and brevity, we restrict ourselves to the case of\nstarting point v(0) that is near a saddle point ek. Let A\u03b2 (cid:46) B\u03b2 denotes lim sup\u03b2\u21920+ A\u03b2/B\u03b2 \u2264 1,\na.s., and A\u03b2 (cid:16) B\u03b2 when both A\u03b2 (cid:46) B\u03b2 and B\u03b2 (cid:46) A\u03b2 hold.\n\n4.1 Phase I: Noise Initialization\n\nIn consideration of global convergence, we analyze the initial phase where the iteration starts at a\npoint on or around Se and eventually escapes an O(1)-neighborhood of the set\n\nSe =(cid:8)v \u2208 S d\u22121 : v1 = 0(cid:9) .\n\nWhen thinking the sphere S d\u22121 as the globe with \u00b1e1 being the north and south poles, Se corresponds\nto the equator of the globe. Therefore, all unstable stationary points (including saddle points and\nlocal maximizers) lie on the equator Se.\n\n4.2 Phase II: Deterministic Crossing\nIn Phase II, the iteration escapes from the neighborhood of equator Se and converges to a basin of\nattraction of the local minimizer v\u2217. 
From strong Markov property of the Oja\u2019s iteration introduced\nin the beginning of \u00a73, one can forget the iteration steps in Phase I and analyze the iteration from\n1 )2 (cid:16) \u03b4, where\nthe \ufb01nal iterate of Phase I. Suppose we have an initial point v(0) that satis\ufb01es (v(0)\n\u03b4 is a \ufb01xed constant in (0, 1/2), Theorem 3.1 concludes that the iteration moves in a deterministic\npattern and quickly evolves into a small neighborhood of the principal component e1 such that\n(v(n)\n\n)2 (cid:16) 1 \u2212 \u03b4.\n\n1\n\n4.3 Phase III: Convergence to Principal Component\n\nE(cid:107)U1(\u221e)(cid:107)2 = tr E(cid:0)(cid:2)U1(t)U1(t)(cid:62)(cid:3)(cid:1) = (\u03bb1/2) tr(cid:0)\u039b1(\u03bb1Id\u22121 \u2212 \u039b1)\u22121(cid:1) . Rescaling the Markov\n\nIn Phase III, the iteration quickly converges to and \ufb02uctuates around the true principal component\nv\u2217 = e1. We start our iteration from a neighborhood around the principal component, where\n1 )2 = 1 \u2212 \u03b4. Letting k = 1 in (3.5) and taking the limit t \u2192 \u221e, we have the limit\nv(0) has (v(0)\nprocess along with some calculations gives as n \u2192 \u221e, in very rough sense,\n\nlim\nn\u2192\u221e\n\nE sin2 \u2220(v(n), v\u2217) (cid:16) \u03b2 \u00b7 E(cid:107)U1(\u221e)(cid:107)2 = \u03b2 \u00b7 \u03bb1\n2\n\ntr(cid:0)\u039b1(\u03bb1Id\u22121 \u2212 \u039b1)\u22121(cid:1)\n\n\u03bb1\u03bbk\n\n2(\u03bb1 \u2212 \u03bbk)\n\n.\n\n(4.1)\n\n= \u03b2 \u00b7 d(cid:88)\n\nk=2\n\nThe above display implies that there will be some nondiminishing \ufb02uctuations, variance being\nproportional to the constant stepsize \u03b2, as time goes to in\ufb01nity or at stationarity. 
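The leading-order stationary error (4.1) is a simple function of the spectrum and the stepsize, and can be tabulated directly. A sketch (the function name and the example spectrum used in the test are illustrative):

```python
import numpy as np

def stationary_angle_error(beta, lams):
    """Predicted stationary value of E sin^2(angle(v^(n), v*)) from (4.1):
    beta * sum_{k>=2} lambda_1 lambda_k / (2 (lambda_1 - lambda_k)).

    lams: eigenvalues sorted in decreasing order, lams[0] > lams[1].
    """
    l1, rest = lams[0], np.asarray(lams[1:], dtype=float)
    return beta * np.sum(l1 * rest / (2.0 * (l1 - rest)))
```

The prediction scales linearly in beta, matching the observation that the oscillation at stationarity is proportional to the constant stepsize.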
Therefore in terms of\nangle, at stationarity the Markov process concentrates within a O(\u03b21/2)-radius neighborhood of zero.\n\n4.4 Crossing Time Estimate\n\nWe turn to estimate the running time, namely the crossing time, which is the number of iterates\nrequired for the iteration to cross the corresponding regions in different phases. We will use the\nrelation v(n) \u2248 V (n\u03b2) to bridge the discrete-time algorithm and its continuous-time approximation.\nPhase I. For illustrative purposes we only consider the special case where v is close to ek the kth\ncoordinate vector, which is a saddle point that has a negative Hessian eigenvalue. In this situation, the\nSDE (3.5) in terms of the \ufb01rst coordinate U (t) of Uk reduces to\n\ndU (t) = (\u03bb1 \u2212 \u03bbk)U (t) dt + (\u03bb1\u03bbk)1/2 dB(t),\n\n(4.2)\n\n7\n\n\f(cid:90) t\n\n(4.3)\n\nwith initial value U (0) = 0. Solution to (4.2) is known as unstable Ornstein-Uhlenbeck process [1]\nand can be expressed explicitly in closed-form, as\nU (t) = W \u03b2(t) exp ((\u03bb1 \u2212 \u03bbk)t) , where W \u03b2(t) \u2261 (\u03bb1\u03bbk)1/2\nRescaling the time back to the discrete-time iteration, we let n = t\u03b2\u22121 and obtain\n\nexp (\u2212(\u03bb1 \u2212 \u03bbk)s) dB(s).\n\n0\n\n(cid:19)1/2\n\nW \u03b2(n\u03b2) (cid:16)\n\n2(\u03bb1 \u2212 \u03bbk)\n\n(cid:18) \u03bb1\u03bbk\n\nwhere \u03c7 stands for a standard normal variable. We have\n\n1 (cid:16) \u03b21/2W \u03b2(n\u03b2) exp (\u03b2(\u03bb1 \u2212 \u03bbk)n) .\nv(n)\nIn (4.3), the term W \u03b2(n\u03b2) is approximately distributed as t = n\u03b2 \u2192 \u221e\n\n(cid:18) \u03bb1\u03bbk\n(cid:19)1/2\n(cid:19)\n\u22121 \u03b2\u22121 log(cid:0)\u03b4|\u03c7|\u22121(cid:1) + (\u03bb1 \u2212 \u03bbk)\n\u22121 \u03b2\u22121 log(cid:0)\u03b2\u22121(cid:1) . 
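The Phase I estimate (4.5), together with the Phase II and Phase III estimates (4.6) and (4.7) derived below, can be compared numerically. The sketch below encodes the three leading-order formulas under the stated asymptotics, with a synthetic spectrum (illustrative only):

```python
import numpy as np

def crossing_times(beta, lam1, lamk, lam2, delta=0.1):
    """Leading-order crossing-time estimates for a start near the
    saddle e_k: N1 from (4.5), the upper bound on N2 from (4.6),
    and N3 from (4.7). Returns (N1, N2, N3)."""
    N1 = 0.5 / (lam1 - lamk) / beta * np.log(1.0 / beta)
    N2 = 1.0 / (lam1 - lam2) / beta * np.log((1.0 - delta) / delta)
    N3 = 0.5 / (lam1 - lam2) / beta * np.log(delta / beta)
    return N1, N2, N3
```

Since N2 carries only a constant log factor while N1 and N3 carry a log(1/beta), Phase II is asymptotically negligible as beta tends to 0, and N3/N1 tends to 1 when lamk = lam2; this is the cutoff behavior discussed in the concluding remarks.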
This suggests that the noise helps the iteration to move away\n\n1 (cid:16) \u03b21/2\nv(n)\n)2 = \u03b4 in (4.4), we have as \u03b2 \u2192 0+ the crossing time is approximately\n\u03b2\u22121/2\n\n(cid:18)(cid:18) \u03bb1\u03bbd\n\n.\n(4.5)\nTherefore we have whenever the smallest eigenvalue \u03bbd is bounded away from 0, then asymptotically\n1 (cid:16) 0.5 (\u03bb1 \u2212 \u03bbk)\nN \u03b2\nfrom ek rapidly.\nPhase II. We turn to estimate the crossing time N \u03b2\nensures the existence of a constant T , that depends only on \u03b4 such that V 2\nT has the following bounds:\n\n2 in Phase II. (3.3) together with simple calculation\n1 (T ) \u2265 1 \u2212 \u03b4. Furthermore\n\nIn order to have (v(n)\n1 (cid:16) (\u03bb1 \u2212 \u03bbk)\nN \u03b2\n\n\u03c7 exp (\u03b2(\u03bb1 \u2212 \u03bbk)n) .\n\n(cid:19)\u22121/2\n\n2(\u03bb1 \u2212 \u03bbk)\n\n2(\u03bb1 \u2212 \u03bbd)\n\n\u22121 \u03b2\u22121 log\n\n(4.4)\n\n\u03c7,\n\n1\n\n(\u03bb1 \u2212 \u03bbd)\u22121 log ((1 \u2212 \u03b4)/\u03b4) (cid:46) T (cid:46) (\u03bb1 \u2212 \u03bb2)\u22121 log ((1 \u2212 \u03b4)/\u03b4) .\n\n(4.6)\n\nTranslating back to the timescale of the iteration, it takes asymptotically\n\n(cid:46) (\u03bb1 \u2212 \u03bb2)\u22121\u03b2\u22121 log ((1 \u2212 \u03b4)/\u03b4)\n\nN \u03b2\n2\n\n1\n\n)2 \u2265 1 \u2212 \u03b4. Theorem 3.1 indicates that when \u03b2 is positively small, the\niterates to achieve (v(N \u03b2\n2 )\niterates needed for the \ufb01rst coordinate squared to cross from \u03b4 to 1\u2212\u03b4 is O(\u03b2\u22121). This is substantiated\nby simulation results [4] suggesting that the Oja\u2019s iteration moves fast from the warm initialization.\nPhase III. 
To estimate the crossing time N \u03b2\n3 or the number of iterates needed in Phase III, we restart\nour counter and have from the approximation in Theorem 3.2 and (3.5) that\n\nE(v(n)\n\nk )2 = (v(0)\n\nk )2 exp (\u22122(\u03bb1 \u2212 \u03bbk)\u03b2n) + \u03b2\u03bb1\u03bbk\n\nexp (\u22122(\u03bb1 \u2212 \u03bbk)(t \u2212 s)) ds\n\n(cid:90) \u03b2n\n\n(cid:19)\n\n0\n\n\u03bb1\u03bbk\n\n2(\u03bb1 \u2212 \u03bbk)\n\n(cid:18)\n\nd(cid:88)\n\nk=2\n\n\u03bb1\u03bbk\n\n2(\u03bb1 \u2212 \u03bbk)\n\n+\n\nk )2 \u2212 \u03b2 \u00b7\n(v(0)\n\nexp (\u22122\u03b2(\u03bb1 \u2212 \u03bbk)n)\n\n= \u03b2 \u00b7 d(cid:88)\n(cid:16) \u03b2 \u00b7 d(cid:88)\n\nk=2\n\nIn terms of the iterations v(n), note the relationship E sin2 \u2220(v, e1) =(cid:80)d\n\n+ \u03b4 exp (\u22122\u03b2(\u03bb1 \u2212 \u03bb2)n) .\n\n2(\u03bb1 \u2212 \u03bbk)\n\n\u03bb1\u03bbk\n\nk=2\n\nof Phase II implies that E sin2 \u2220(v(0), e1) = 1 \u2212 (v(0)\n\n1 )2 = \u03b4, and hence by setting\n\nk=2 v2\n\nk = 1 \u2212 v2\n\n1. The end\n\n3 ), e1) = \u03b2 \u00b7 d(cid:88)\n\nE sin2 \u2220(v(N \u03b2\n\n\u03bb1\u03bbk\n\n2(\u03bb1 \u2212 \u03bbk)\n\n+ o(\u03b2),\n\nk=2\n\n3 (cid:16) 0.5(\u03bb1 \u2212 \u03bb2)\u22121\u03b2\u22121 log(cid:0)\u03b4\u03b2\u22121(cid:1) .\n\nN \u03b2\n\nwe conclude that as \u03b2 \u2192 0+\n\n(4.7)\n\n8\n\n\f4.5 Finite-Sample Rate Bound\n\nIn this subsection we establish the global \ufb01nite-sample convergence rate using the crossing time\nestimates in the previous subsection. Starting from v(0) = ek where k = 2, . . . 
, d is arbitrary, the\n3 as \u03b2 \u2192 0+ such that, by choosing \u03b4 \u2208 (0, 1/2) as a\nglobal convergence time N \u03b2 = N \u03b2\nsmall \ufb01xed constant,\n\n2 + N \u03b2\n1 + N \u03b2\nN \u03b2 (cid:16) (\u03bb1 \u2212 \u03bb2)\n\n\u22121 \u03b2\u22121 log(cid:0)\u03b2\u22121(cid:1) ,\n\nwith the following estimation on global convergence rate as in (4.1)\n\nsin2 \u2220(v(N \u03b2 ), v\u2217) = \u03b2 \u00b7 d(cid:88)\n\n\u03bb1\u03bbk\n\n2(\u03bb1 \u2212 \u03bbk)\n\n.\n\n\u00af\u03b2(T )), v\u2217) \u2264 d(cid:88)\n\nk=2\n\nGiven a \ufb01xed number of samples T , by choosing \u03b2 as\n\nk=2\n\nwe have T (cid:16) (\u03bb1 \u2212 \u03bb2)\u22121 \u00af\u03b2(T )\u22121 log(cid:0) \u00af\u03b2(T )(cid:1)\u22121\n\n\u03b2 = \u00af\u03b2(T ) \u2261\n\nlog T\n\n(\u03bb1 \u2212 \u03bb2)T\n= N \u00af\u03b2(T ). Plugging in \u03b2 as in (4.8) we have, by\n\n(4.8)\n\nthe angle-preserving property of coordinate transformation (2.3), that\n\nE sin2 \u2220(w(N\n\n\u00af\u03b2(T )), w\u2217) = E sin2 \u2220(v(N\n\n\u03bb1\u03bbk\n\n2(\u03bb1 \u2212 \u03bbk)\n\n\u00b7\n\nlog T\n\n(\u03bb1 \u2212 \u03bb2)T\n\n.\n\n(4.9)\n\nThe \ufb01nite sample bound in (4.9) is sharper than any existing results and matches the information lower\nbound. Moreover, (4.9) implies that the rate in terms of sine-squared angle is sin2 \u2220(w(T ), w\u2217) \u2264\nC \u00b7 \u03bb1\u03bb2/(\u03bb1 \u2212 \u03bb2)2 \u00b7 d log T /T, which matches the minimax information lower bound (up to a log T\nfactor), see for example, Theorem 3.1 of [43]. Limited by space, details about the rate comparison is\nprovided in the supplementary material.\n\n5 Concluding Remarks\n\n(i) As \u03b2 \u2192 0+ we have N \u03b2\n\nWe make several concluding remarks on the global convergence rate estimations, as follows.\nCrossing Time Comparison. From the crossing time estimates in (4.5), (4.6), (4.7) we conclude\n1 \u2192 0. 
This implies that the algorithm demonstrates the cutoff phenomenon that frequently occurs in discrete-time Markov processes [22]. In words, Phase II, in which the objective value in the Rayleigh quotient drops from $1-\delta$ to $\delta$, is asymptotically a short phase compared to Phases I and III, so the convergence curve exhibits a sharp drop instead of an exponentially decaying curve.

(ii) As $\beta\to 0^+$ we have $N_3^\beta/N_1^\beta \asymp 1$. This suggests that, in the high-dimensional case, Phase I of escaping from the equator consumes roughly the same number of iterations as Phase III.

To summarize, the cold-initialization iteration takes roughly twice as many steps as the warm-initialization version, which is consistent with the simulation discussions in [31].

Subspace Learning. In this work we primarily concentrate on the problem of finding the top-1 eigenvector. We believe that the problem of finding the top-$k$ eigenvectors, a.k.a. the subspace PCA problem, can be analyzed using our approximation methods. This will involve a careful characterization of subspace angles and is hence more complex. We leave this for future investigation.

References

[1] Aldous, D. (1989). Probability Approximations via the Poisson Clumping Heuristic, volume 77. Springer.

[2] Amini, A. & Wainwright, M. (2009). High-dimensional analysis of semidefinite relaxations for sparse principal components. The Annals of Statistics, 37(5B), 2877–2921.

[3] Anandkumar, A. & Ge, R. (2016). Efficient approaches for escaping higher order saddle points in non-convex optimization. arXiv preprint arXiv:1602.05908.

[4] Arora, R., Cotter, A., Livescu, K., & Srebro, N. (2012). Stochastic optimization for PCA and PLS. In 50th Annual Allerton Conference on Communication, Control, and Computing (pp. 861–868).

[5] Arora, R., Cotter, A., & Srebro, N. (2013).
Stochastic optimization of PCA with capped MSG. In Advances in Neural Information Processing Systems (pp. 1815–1823).

[6] Balsubramani, A., Dasgupta, S., & Freund, Y. (2013). The fast convergence of incremental PCA. In Advances in Neural Information Processing Systems (pp. 3174–3182).

[7] Cai, T. T., Ma, Z., & Wu, Y. (2013). Sparse PCA: Optimal rates and adaptive estimation. The Annals of Statistics, 41(6), 3074–3110.

[8] Darken, C. & Moody, J. (1991). Towards faster stochastic gradient search. In Advances in Neural Information Processing Systems (pp. 1009–1016).

[9] d'Aspremont, A., Bach, F., & El Ghaoui, L. (2008). Optimal solutions for sparse principal component analysis. Journal of Machine Learning Research, 9, 1269–1294.

[10] De Sa, C., Olukotun, K., & Ré, C. (2015). Global convergence of stochastic gradient descent for some non-convex matrix problems. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15) (pp. 2332–2341).

[11] Ethier, S. N. & Kurtz, T. G. (2005). Markov Processes: Characterization and Convergence, volume 282. John Wiley & Sons.

[12] Garber, D. & Hazan, E. (2015). Fast and simple PCA via convex optimization. arXiv preprint arXiv:1509.05647.

[13] Ge, R., Huang, F., Jin, C., & Yuan, Y. (2015). Escaping from saddle points – online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory (pp. 797–842).

[14] Hardt, M. & Price, E. (2014). The noisy power method: A meta algorithm with applications. In Advances in Neural Information Processing Systems (pp. 2861–2869).

[16] Helmke, U. & Moore, J. B. (1994). Optimization and Dynamical Systems. Springer.

[17] Jain, P., Jin, C., Kakade, S. M., Netrapalli, P., & Sidford, A. (2016).
Matching matrix Bernstein with little memory: Near-optimal finite sample guarantees for Oja's algorithm. arXiv preprint arXiv:1602.06929.

[18] Johnstone, I. M. & Lu, A. Y. (2009). On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104(486), 682–693.

[19] Krasulina, T. (1969). The method of stochastic approximation for the determination of the least eigenvalue of a symmetrical matrix. USSR Computational Mathematics and Mathematical Physics, 9(6), 189–195.

[20] Kuczynski, J. & Wozniakowski, H. (1992). Estimating the largest eigenvalue by the power and Lanczos algorithms with a random start. SIAM Journal on Matrix Analysis and Applications, 13(4), 1094–1122.

[21] Lee, J. D., Simchowitz, M., Jordan, M. I., & Recht, B. (2016). Gradient descent only converges to minimizers. In Conference on Learning Theory (pp. 1246–1257).

[22] Levin, D. A., Peres, Y., & Wilmer, E. L. (2009). Markov Chains and Mixing Times. American Mathematical Society.

[23] Li, C. J., Wang, M., Liu, H., & Zhang, T. (2016). Near-optimal stochastic approximation for online principal component estimation. arXiv preprint arXiv:1603.05305.

[24] Li, Q., Tai, C., & E, W. (2015). Dynamics of stochastic gradient algorithms. arXiv preprint arXiv:1511.06251.

[25] Ma, Z. (2013). Sparse principal component analysis and iterative thresholding. The Annals of Statistics, 41(2), 772–801.

[26] Mandt, S., Hoffman, M. D., & Blei, D. M. (2016). A variational analysis of stochastic gradient algorithms. arXiv preprint arXiv:1602.02666.

[27] Mitliagkas, I., Caramanis, C., & Jain, P. (2013). Memory limited, streaming PCA. In Advances in Neural Information Processing Systems (pp. 2886–2894).

[28] Musco, C. & Musco, C. (2015). Stronger approximate singular value decomposition via the block Lanczos and power methods.
arXiv preprint arXiv:1504.05477.

[29] Nadler, B. (2008). Finite sample approximation results for principal component analysis: A matrix perturbation approach. The Annals of Statistics, 36(6), 2791–2817.

[30] Oja, E. (1982). Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3), 267–273.

[31] Oja, E. & Karhunen, J. (1985). On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. Journal of Mathematical Analysis and Applications, 106(1), 69–84.

[32] Oksendal, B. (2003). Stochastic Differential Equations. Springer.

[33] Panageas, I. & Piliouras, G. (2016). Gradient descent converges to minimizers: The case of non-isolated critical points. arXiv preprint arXiv:1605.00405.

[34] Shamir, O. (2015a). Convergence of stochastic gradient descent for PCA. arXiv preprint arXiv:1509.09002.

[35] Shamir, O. (2015b). Fast stochastic algorithms for SVD and PCA: Convergence properties and convexity. arXiv preprint arXiv:1507.08788.

[36] Shamir, O. (2015c). A stochastic PCA and SVD algorithm with an exponential convergence rate. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15) (pp. 144–152).

[37] Su, W., Boyd, S., & Candes, E. J. (2016). A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. Journal of Machine Learning Research, 17(153), 1–43.

[38] Sun, J., Qu, Q., & Wright, J. (2015a). Complete dictionary recovery over the sphere I: Overview and the geometric picture. arXiv preprint arXiv:1511.03607.

[39] Sun, J., Qu, Q., & Wright, J. (2015b). Complete dictionary recovery over the sphere II: Recovery by Riemannian trust-region method. arXiv preprint arXiv:1511.04777.

[40] Sun, J., Qu, Q., & Wright, J. (2015c). When are nonconvex problems not scary? arXiv preprint arXiv:1510.06096.

[41] Sun, J., Qu, Q., & Wright, J. (2016).
A geometric analysis of phase retrieval. arXiv preprint arXiv:1602.06664.

[42] Vu, V. Q. & Lei, J. (2012). Minimax rates of estimation for sparse PCA in high dimensions. In AISTATS (pp. 1278–1286).

[43] Vu, V. Q. & Lei, J. (2013). Minimax sparse principal subspace estimation in high dimensions. The Annals of Statistics, 41(6), 2905–2947.

[44] Wang, Z., Lu, H., & Liu, H. (2014). Nonconvex statistical optimization: Minimax-optimal sparse PCA in polynomial time. arXiv preprint arXiv:1408.5352.

[45] Yuan, X.-T. & Zhang, T. (2013). Truncated power method for sparse eigenvalue problems. Journal of Machine Learning Research, 14(Apr), 899–925.

[46] Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429.