{"title": "The Fast Convergence of Incremental PCA", "book": "Advances in Neural Information Processing Systems", "page_first": 3174, "page_last": 3182, "abstract": "We prove the first finite-sample convergence rates for any incremental PCA algorithm using sub-quadratic time and memory per iteration. The algorithm analyzed is Oja's learning rule, an efficient and well-known scheme for estimating the top principal component. Our analysis of this non-convex problem yields expected and high-probability convergence rates of $\\tilde{O}(1/n)$ through a novel technique. We relate our guarantees to existing rates for stochastic gradient descent on strongly convex functions, and extend those results. We also include experiments which demonstrate convergence behaviors predicted by our analysis.", "full_text": "The Fast Convergence of Incremental PCA\n\nAkshay Balsubramani\n\nUC San Diego\n\nabalsubr@cs.ucsd.edu\n\nSanjoy Dasgupta\n\nUC San Diego\n\ndasgupta@cs.ucsd.edu\n\nYoav Freund\nUC San Diego\n\nyfreund@cs.ucsd.edu\n\nAbstract\n\nWe consider a situation in which we see samples Xn \u2208 Rd drawn i.i.d. from some\ndistribution with mean zero and unknown covariance A. We wish to compute the\ntop eigenvector of A in an incremental fashion - with an algorithm that maintains\nan estimate of the top eigenvector in O(d) space, and incrementally adjusts the\nestimate with each new data point that arrives. Two classical such schemes are\ndue to Krasulina (1969) and Oja (1983). We give \ufb01nite-sample convergence rates\nfor both.\n\n1\n\nIntroduction\n\nPrincipal component analysis (PCA) is a popular form of dimensionality reduction that projects a\ndata set on the top eigenvector(s) of its covariance matrix. The default method for computing these\neigenvectors uses O(d2) space for data in Rd, which can be prohibitive in practice. 
It is therefore of interest to study incremental schemes that take one data point at a time, updating their estimates of the desired eigenvectors with each new point. For computing one eigenvector, such methods use O(d) space.\n\nFor the case of the top eigenvector, this problem has long been studied, and two elegant solutions were obtained by Krasulina [7] and Oja [9]. Their methods are closely related. At time n − 1, they have some estimate Vn−1 ∈ R^d of the top eigenvector. Upon seeing the next data point, Xn, they update this estimate as follows:\n\nVn = Vn−1 + γn (XnXn^T − ((Vn−1^T XnXn^T Vn−1)/‖Vn−1‖²) Id) Vn−1   (Krasulina)\n\nVn = (Vn−1 + γn XnXn^T Vn−1) / ‖Vn−1 + γn XnXn^T Vn−1‖   (Oja)\n\nHere γn is a “learning rate” that is typically proportional to 1/n.\n\nSuppose the points X1, X2, . . . are drawn i.i.d. from a distribution on R^d with mean zero and covariance matrix A. The original papers proved that these estimators converge almost surely to the top eigenvector of A (call it v∗) under mild conditions:\n\n• Σn γn = ∞ while Σn γn² < ∞.\n• If λ1, λ2 denote the top two eigenvalues of A, then λ1 > λ2.\n• E‖Xn‖^k < ∞ for some suitable k (for instance, k = 8 works).\n\nThere are also other incremental estimators for which convergence has not been established; see, for instance, [12] and [16].\n\nIn this paper, we analyze the rate of convergence of the Krasulina and Oja estimators. 
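The two update rules above can be sketched in NumPy as follows (a minimal illustration of the formulas as stated, not the authors' code; the function names are ours):

```python
import numpy as np

def krasulina_update(v, x, gamma):
    """One step of Krasulina's rule: a stochastic gradient step on the
    Rayleigh quotient. The increment is exactly orthogonal to v, so the
    iterate is not renormalized."""
    xv = x @ v
    return v + gamma * (xv * x - (xv ** 2 / (v @ v)) * v)

def oja_update(v, x, gamma):
    """One step of Oja's rule: the same first-order step, followed by
    renormalization to unit length."""
    w = v + gamma * (x @ v) * x
    return w / np.linalg.norm(w)
```

With γn = c/n and data satisfying the conditions above, both iterates converge to the top eigenvector; note that the Krasulina increment satisfies (Vn − Vn−1) · Vn−1 = 0, which is why no normalization step is needed there.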
They can be treated in a common framework, as stochastic approximation algorithms for maximizing the Rayleigh quotient\n\nG(v) = (v^T A v)/(v^T v).\n\nThe maximum value of this function is λ1, and is achieved at v∗ (or any nonzero multiple thereof). The gradient is\n\n∇G(v) = (2/‖v‖²) (A − ((v^T A v)/(v^T v)) Id) v.\n\nSince E[XnXn^T] = A, we see that Krasulina’s method is stochastic gradient descent. The Oja procedure is closely related: as pointed out in [10], the two are identical to within second-order terms.\n\nRecently, there has been a lot of work on rates of convergence for stochastic gradient descent (for instance, [11]), but this has typically been limited to convex cost functions. These results do not apply to the non-convex Rayleigh quotient, except at the very end, when the system is near convergence. Most of our analysis focuses on the buildup to this finale.\n\nWe measure the quality of the solution Vn at time n using the potential function\n\nΨn = 1 − (Vn · v∗)²/‖Vn‖²,\n\nwhere v∗ is taken to have unit norm. This quantity lies in the range [0, 1], and we are interested in the rate at which it approaches zero. The result, in brief, is that E[Ψn] = O(1/n), under conditions that are similar to those above, but stronger. In particular, we require that γn be proportional to 1/n and that ‖Xn‖ be bounded.\n\n1.1 The algorithm\n\nWe analyze the following procedure.\n\n1. Set starting time. Set the clock to time no.\n2. Initialization. Initialize Vno uniformly at random from the unit sphere in R^d.\n3. For time n = no + 1, no + 2, . . .:\n(a) Receive the next data point, Xn.\n(b) Update step. 
Perform either the Krasulina or Oja update, with γn = c/n.\n\nThe first step is similar to using a learning rate of the form γn = c/(n + no), as is often done in stochastic gradient descent implementations [1]. We have adopted it because the initial sequence of updates is highly noisy: during this phase Vn moves around wildly, and cannot be shown to make progress. It becomes better behaved when the step size γn becomes smaller, that is to say when n gets larger than some suitable no. By setting the start time to no, we can simply fast-forward the analysis to this moment.\n\n1.2 Initialization\n\nOne possible initialization is to set Vno to the first data point that arrives, or to the average of a few data points. This seems sensible enough, but can fail dramatically in some situations.\n\nHere is an example. Suppose X can take on just 2d possible values: ±e1, ±σe2, . . . , ±σed, where the ei are coordinate directions and 0 < σ < 1 is a small constant. Suppose further that the distribution of X is specified by a single positive number p < 1:\n\nPr(X = e1) = Pr(X = −e1) = p/2\nPr(X = σei) = Pr(X = −σei) = (1 − p)/(2(d − 1)) for i > 1\n\nThen X has mean zero and covariance diag(p, σ²(1 − p)/(d − 1), . . . , σ²(1 − p)/(d − 1)). We will assume that p and σ are chosen so that p > σ²(1 − p)/(d − 1); in our notation, the top eigenvalues are then λ1 = p and λ2 = σ²(1 − p)/(d − 1), and the target vector is v∗ = e1.\n\nIf Vn is ever orthogonal to some ei, it will remain so forever. 
This is because both the Krasulina and Oja updates have the following properties:\n\nVn−1 · Xn = 0 ⟹ Vn = Vn−1\nVn−1 · Xn ≠ 0 ⟹ Vn ∈ span(Vn−1, Xn).\n\nIf Vno is initialized to a random data point, then with probability 1 − p, it will be assigned to some ei with i > 1, and will converge to a multiple of that same ei rather than to e1. Likewise, if it is initialized to the average of ≤ 1/p data points, then with constant probability it will be orthogonal to e1 and remain so always.\n\nSetting Vno to a random unit vector avoids this problem. However, there are doubtless cases, for instance when the data has intrinsic dimension ≪ d, in which a better initializer is possible.\n\n1.3 The setting of the learning rate\n\nIn order to get a sense of what rates of convergence we might expect, let’s return to the example of a random vector X with 2d possible values. In the Oja update Vn = Vn−1 + γnXnXn^T Vn−1, we can ignore normalization if we are merely interested in the progress of the potential function Ψn. Since the Xn correspond to coordinate directions, each update changes just one coordinate of V:\n\nXn = ±e1 ⟹ Vn,1 = Vn−1,1(1 + γn)\nXn = ±σei ⟹ Vn,i = Vn−1,i(1 + σ²γn)\n\nRecall that we initialize Vno to a random vector from the unit sphere. For simplicity, let’s just suppose that no = 0 and that this initial value is the all-ones vector (again, we don’t have to worry about normalization). On each iteration the first coordinate is updated with probability exactly p = λ1, and thus\n\nE[Vn,1] = (1 + λ1γ1)(1 + λ1γ2)···(1 + λ1γn) ∼ exp(λ1(γ1 + ··· + γn)) ∼ n^(cλ1)\n\nsince γn = c/n. 
Likewise, for i > 1,\n\nE[Vn,i] = (1 + λ2γ1)(1 + λ2γ2)···(1 + λ2γn) ∼ n^(cλ2).\n\nIf all goes according to expectation, then at time n,\n\nΨn = 1 − Vn,1²/‖Vn‖² ∼ 1 − n^(2cλ1)/(n^(2cλ1) + (d − 1)n^(2cλ2)) ∼ (d − 1)/n^(2c(λ1−λ2)).\n\n(This is all very rough, but can be made precise by obtaining concentration bounds for ln Vn,i.) From this, we can see that it is not possible to achieve an O(1/n) rate unless c ≥ 1/(2(λ1 − λ2)). Therefore, we will assume this when stating our final results, although most of our analysis is in terms of general γn. An interesting practical question, to which we do not have an answer, is how one would empirically set c without prior knowledge of the eigenvalue gap.\n\n1.4 Nested sample spaces\n\nFor n ≥ no, let Fn denote the sigma-field of all outcomes up to and including time n: Fn = σ(Vno, Xno+1, . . . , Xn). We start by showing that\n\nE[Ψn|Fn−1] ≤ Ψn−1(1 − 2γn(λ1 − λ2)(1 − Ψn−1)) + O(γn²).\n\nInitially Ψn is likely to be close to 1. For instance, if the initial Vno is picked uniformly at random from the surface of the unit sphere in R^d, then we’d expect Ψno ≈ 1 − 1/d. This means that the initial rate of decrease is very small, because of the (1 − Ψn−1) term.\n\nTo deal with this, we divide the analysis into epochs: the first takes Ψn from 1 − 1/d to 1 − 2/d, the second from 1 − 2/d to 1 − 4/d, and so on until Ψn finally drops below 1/2. We use martingale large deviation bounds to bound the length of each epoch, and also to argue that Ψn does not regress. 
In particular, we establish a sequence of times nj such that (with high probability)\n\nsup_{n≥nj} Ψn ≤ 1 − 2^j/d.   (1)\n\nThe analysis of each epoch uses martingale arguments, but at the same time, assumes that Ψn remains bounded above. Combining the two requires a careful specification of the sample space at each step. Let Ω denote the sample space of all realizations (vno, xno+1, xno+2, . . .), and P the probability distribution on these sequences. For any δ > 0, we define a nested sequence of spaces Ω ⊃ Ω′no ⊃ Ω′no+1 ⊃ ··· such that each Ω′n is Fn−1-measurable, has probability P(Ω′n) ≥ 1 − δ, and moreover consists exclusively of realizations ω ∈ Ω that satisfy the constraints (1) up to and including time n − 1. We can then build martingale arguments by restricting attention to Ω′n when computing the conditional expectations of quantities at time n.\n\n1.5 Main result\n\nWe make the following assumptions:\n\n(A1) The Xn ∈ R^d are i.i.d. with mean zero and covariance A.\n(A2) There is a constant B such that ‖Xn‖² ≤ B.\n(A3) The eigenvalues λ1 ≥ λ2 ≥ ··· ≥ λd of A satisfy λ1 > λ2.\n(A4) The step sizes are of the form γn = c/n.\n\nUnder these conditions, we get the following rate of convergence for the Krasulina update.\n\nTheorem 1.1. There are absolute constants Ao, A1 > 0 and 1 < a < 4 for which the following holds. Pick any 0 < δ < 1, and any co > 2. Set the step sizes to γn = c/n, where c = co/(2(λ1 − λ2)), and set the starting time to no ≥ (AoB²c²d²/δ⁴) ln(1/δ). 
Then there is a nested sequence of subsets of the sample space Ω ⊃ Ω′no ⊃ Ω′no+1 ⊃ ··· such that for any n ≥ no, we have P(Ω′n) ≥ 1 − δ and\n\nEn[(Vn · v∗)²/‖Vn‖²] ≥ 1 − (c²B²e^(co/no)/(2(co − 2))) · 1/(n + 1) − A1 (d/δ²)^a ((no + 1)/(n + 1))^(co/2),\n\nwhere En denotes expectation restricted to Ω′n.\n\nSince co > 2, this bound is of the form En[Ψn] = O(1/n). The result above also holds for the Oja update up to absolute constants.\n\nWe also remark that a small modification to the final step in the proof of the above yields a rate of En[Ψn] = O(n^(−a)), with an identical definition of En[Ψn]. The details are in the proof, in Appendix D.2.\n\n1.6 Related work\n\nThere is an extensive line of work analyzing PCA from the statistical perspective, in which the convergence of various estimators is characterized under certain conditions, including generative models of the data [5] and various assumptions on the covariance matrix spectrum [14, 4] and eigenvalue spacing [17]. Such works do provide finite-sample guarantees, but they apply only to the batch case and/or are computationally intensive, rather than considering an efficient incremental algorithm.\n\nAmong incremental algorithms, the work of Warmuth and Kuzmin [15] describes and analyzes worst-case online PCA, using an experts-setting algorithm with a super-quadratic per-iteration cost. More efficient general-purpose incremental PCA algorithms have lacked finite-sample analyses [2]. There have been recent attempts to remedy this situation by relaxing the nonconvexity inherent in the problem [3] or making generative assumptions [8]. 
The present paper directly analyzes the oldest known incremental PCA algorithms under relatively mild assumptions.\n\n2 Outline of proof\n\nWe now sketch the proof of Theorem 1.1; almost all the details are relegated to the appendix.\n\nRecall that for n ≥ no, we take Fn to be the sigma-field of all outcomes up to and including time n, that is, Fn = σ(Vno, Xno+1, . . . , Xn).\n\nAn additional piece of notation: we will use û to denote u/‖u‖, the unit vector in the direction of u ∈ R^d. Thus, for instance, the Rayleigh quotient can be written G(v) = v̂^T A v̂.\n\n2.1 Expected per-step change in potential\n\nWe first bound the expected improvement in Ψn in each step of the Krasulina or Oja algorithms.\n\nTheorem 2.1. For any n > no, we can write Ψn ≤ Ψn−1 + βn − Zn, where\n\nβn = γn²B²/4 (Krasulina) or βn = 5γn²B² + 2γn³B³ (Oja)\n\nand where Zn is an Fn-measurable random variable with the following properties:\n\n• E[Zn|Fn−1] = 2γn(V̂n−1 · v∗)²(λ1 − G(Vn−1)) ≥ 2γn(λ1 − λ2)Ψn−1(1 − Ψn−1) ≥ 0.\n• |Zn| ≤ 4γnB.\n\nThe theorem follows from Lemmas ?? and ?? in the appendix. Its characterization of the two estimators is almost identical, and for simplicity we will henceforth deal only with Krasulina’s estimator. All the subsequent results hold also for Oja’s method, up to constants.\n\n2.2 A large deviation bound for Ψn\n\nWe know from Theorem 2.1 that Ψn ≤ Ψn−1 + βn − Zn, where βn is non-stochastic and Zn is a quantity of positive expected value. Thus, in expectation, and modulo a small additive term, Ψn decreases monotonically. However, the amount of decrease at the nth time step can be arbitrarily small when Ψn is close to 1. 
Thus, we need to show that Ψn is eventually bounded away from 1, i.e. there exists some εo > 0 and some time no such that for any n ≥ no, we have Ψn ≤ 1 − εo. Recall from the algorithm specification that we advance the clock so as to skip the pre-no phase. Given this, what can we expect εo to be? If the initial estimate Vno is a random unit vector, then E[Ψno] = 1 − 1/d and, roughly speaking, Pr(Ψno > 1 − ε/d) = O(√ε). If no is sufficiently large, then Ψn may subsequently increase a little bit, but not by very much. In this section, we establish the following bound.\n\nTheorem 2.2. Suppose the initial estimate Vno is chosen uniformly at random from the surface of the unit sphere in R^d. Assume also that the step sizes are of the form γn = c/n, for some constant c > 0. Then for any 0 < ε < 1, if no ≥ 2B²c²d²/ε², we have\n\nPr(sup_{n≥no} Ψn ≥ 1 − ε/d) ≤ √(2eε).\n\nTo prove this, we start with a simple recurrence for the moment-generating function of Ψn.\n\nLemma 2.3. Consider a filtration (Fn) and random variables Yn, Zn ∈ Fn such that there are two sequences of nonnegative constants, (βn) and (ζn), for which:\n\n• Yn ≤ Yn−1 + βn − Zn.\n• Each Zn takes values in an interval of length ζn.\n\nThen for any t > 0, we have E[e^(tYn)|Fn−1] ≤ exp(t(Yn−1 − E[Zn|Fn−1] + βn + tζn²/8)).\n\nThis relation shows how to define a supermartingale based on e^(tYn), from which we can derive a large deviation bound on Yn.\n\nLemma 2.4. Assume the conditions of Lemma 2.3, and also that E[Zn|Fn−1] ≥ 0. 
Then, for any integer m and any Δ, t > 0,\n\nPr(sup_{n≥m} Yn ≥ Δ) ≤ E[e^(tYm)] exp(−t(Δ − Σ_{ℓ>m}(βℓ + tζℓ²/8))).\n\nIn order to apply this to the sequence (Ψn), we need to first calculate the moment-generating function of its starting value Ψno.\n\nLemma 2.5. Suppose a vector V is picked uniformly at random from the surface of the unit sphere in R^d, where d ≥ 3. Define Y = 1 − V1²/‖V‖². Then, for any t > 0,\n\nE[e^(tY)] ≤ e^t √((d − 1)/(2t)).\n\nPutting these pieces together yields Theorem 2.2.\n\n2.3 Intermediate epochs of improvement\n\nWe have seen that, for suitable ε and no, it is likely that Ψn ≤ 1 − ε/d for all n ≥ no. We now define a series of epochs in which 1 − Ψn successively doubles, until Ψn finally drops below 1/2. To do this, we specify intermediate goals (no, εo), (n1, ε1), (n2, ε2), . . . , (nJ, εJ), where no < n1 < ··· < nJ and εo < ε1 < ··· < εJ = 1/2, with the intention that:\n\nFor all 0 ≤ j ≤ J, we have sup_{n≥nj} Ψn ≤ 1 − εj.   (2)\n\nOf course, this can only hold with a certain probability. Let Ω denote the sample space of all realizations (vno, xno+1, xno+2, . . .), and P the probability distribution on these sequences. We will show that, for a certain choice of {(nj, εj)}, all J + 1 constraints (2) can be met by excluding just a small portion of Ω.\n\nWe consider a specific realization ω ∈ Ω to be good if it satisfies (2). 
Call this set Ω′:\n\nΩ′ = {ω ∈ Ω : sup_{n≥nj} Ψn(ω) ≤ 1 − εj for all 0 ≤ j ≤ J}.\n\nFor technical reasons, we also need to look at realizations that are good up to time n − 1. Specifically, for each n, define\n\nΩ′n = {ω ∈ Ω : Ψℓ(ω) ≤ 1 − εj whenever nj ≤ ℓ < n, for all 0 ≤ j ≤ J}.\n\nCrucially, this is Fn−1-measurable. Also note that Ω′ = ∩_{n>no} Ω′n.\n\nWe can talk about expectations under the distribution P restricted to subsets of Ω. In particular, let Pn be the restriction of P to Ω′n; that is, for any A ⊂ Ω, we have Pn(A) = P(A ∩ Ω′n)/P(Ω′n). As for expectations with respect to Pn, for any function f : Ω → R, we define\n\nEn f = (1/P(Ω′n)) ∫_{Ω′n} f(ω) P(dω).\n\nHere is the main result of this section.\n\nTheorem 2.6. Assume that γn = c/n, where c = co/(2(λ1 − λ2)) and co > 0. Pick any 0 < δ < 1 and select a schedule (no, εo), . . . , (nJ, εJ) that satisfies the conditions\n\nεo = δ²/(8ed), and (3/2)εj ≤ εj+1 ≤ 2εj for 0 ≤ j < J, and εJ−1 ≤ 1/4,\n(nj+1 + 1) ≥ e^(5/co)(nj + 1) for 0 ≤ j < J,   (3)\n\nas well as no ≥ (20c²B²/εo²) ln(4/δ). Then Pr(Ω′) ≥ 1 − δ.\n\nThe first step towards proving this theorem is bounding the moment-generating function of Ψn in terms of that of Ψn−1.\n\nLemma 2.7. Suppose n > nj. Suppose also that γn = c/n, where c = co/(2(λ1 − λ2)). 
Then for any t > 0,\n\nEn[e^(tΨn)] ≤ En[exp(tΨn−1(1 − coεj/n))] · exp(c²B²t(1 + 32t)/(4n²)).\n\nWe would like to use this result to bound En[Ψn] in terms of Em[Ψm] for m < n. The shift in sample spaces is easily handled using the following observation.\n\nLemma 2.8. If g : R → R is nondecreasing, then En[g(Ψn−1)] ≤ En−1[g(Ψn−1)] for any n > no.\n\nA repeated application of Lemmas 2.7 and 2.8 yields the following.\n\nLemma 2.9. Suppose that conditions (3) hold. Then for 0 ≤ j < J and any t > 0,\n\nE_{nj+1}[e^(tΨ_{nj+1})] ≤ exp(t(1 − εj+1) − tεj + (tc²B²(1 + 32t)/4)(1/nj − 1/nj+1)).\n\nNow that we have bounds on the moment-generating functions of intermediate Ψn, we can apply martingale deviation bounds, as in Lemma 2.4, to obtain the following, from which Theorem 2.6 ensues.\n\nLemma 2.10. Assume conditions (3) hold. Pick any 0 < δ < 1, and set no ≥ (20c²B²/εo²) ln(4/δ). Then\n\nΣ_{j=1}^{J} P_{nj}(sup_{n≥nj} Ψn > 1 − εj) ≤ δ/2.\n\n2.4 The final epoch\n\nRecall the definition of the intermediate goals (nj, εj) in (2), (3). The final epoch is the period n ≥ nJ, at which point Ψn ≤ 1/2. The following consequence of Lemmas ?? and 2.8 captures the rate at which Ψ decreases during this phase.\n\nLemma 2.11. 
For all n > nJ,\n\nEn[Ψn] ≤ (1 − αn)En−1[Ψn−1] + βn,\n\nwhere αn = (λ1 − λ2)γn and βn = (B²/4)γn².\n\nBy solving this recurrence relation, and piecing together the various epochs, we get the overall convergence result of Theorem 1.1.\n\nNote that Lemma 2.11 closely resembles the recurrence relation followed by the squared L2 distance from the optimum of stochastic gradient descent (SGD) on a strongly convex function [11]. As Ψn → 0, the incremental PCA algorithms we study have convergence rates of the same form as SGD in this scenario.\n\n3 Experiments\n\nWhen performing PCA in practice with massive d and a large/growing dataset, an incremental method like that of Krasulina or Oja remains practically viable, even as quadratic-time and -memory algorithms become increasingly impractical. Arora et al. [2] have a more complete discussion of the empirical necessity of incremental PCA algorithms, including a version of Oja’s method which is shown to be extremely competitive in practice.\n\nSince the efficiency benefits of these types of algorithms are well understood, we now instead focus on the effect of the learning rate on the performance of Oja’s algorithm (results for Krasulina’s are extremely similar). We use the CMU PIE faces [13], consisting of 11554 images of size 32 × 32, as a prototypical example of a dataset with most of its variance captured by a few PCs, as shown in Fig. 1. We set no = 0.\n\nWe expect from Theorem 1.1 and the discussion in the introduction that varying c (the constant in the learning rate) will influence the overall rate of convergence. In particular, if c is low, then halving it can be expected to halve the exponent of n, and the slope of the log-log convergence graph (ref. the remark after Thm. 1.1). This is exactly what occurs in practice, as illustrated in Fig. 2. 
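The qualitative dependence on c is easy to reproduce on synthetic data. The following sketch (our own toy stand-in for the PIE experiment; the data model and function name are hypothetical, not the paper's setup) runs Oja's rule on a Gaussian stream whose top eigenvector is e1 and reports the final potential:

```python
import numpy as np

def final_potential(c, n_steps=20000, d=10, seed=0):
    """Run Oja's rule with step size c/n on a synthetic Gaussian stream
    (variance 4 on the first coordinate, 1 elsewhere, so v* = e1) and
    return the final potential 1 - (Vn . v*)^2 / ||Vn||^2."""
    rng = np.random.default_rng(seed)
    scales = np.ones(d)
    scales[0] = 2.0
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    for n in range(1, n_steps + 1):
        x = rng.standard_normal(d) * scales
        v = v + (c / n) * (x @ v) * x  # Oja step ...
        v /= np.linalg.norm(v)         # ... followed by renormalization
    return 1.0 - v[0] ** 2
```

Here λ1 = 4 and λ2 = 1, so any c ≥ 1/(2(λ1 − λ2)) = 1/6 falls in the O(1/n) regime; a multiplier well below that threshold, such as c = 0.05, leaves a visibly larger residual potential after the same number of steps.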
The dotted line in that figure is a convergence rate of 1/n, drawn as a guide.\n\n[Figures 1 and 2: the covariance spectrum of the PIE dataset (eigenvalue vs. component number), and the reconstruction error of the Oja subspace rule over iterations for c ∈ {6, 3, 1.5, 1, 0.666, 0.444, 0.296}.]\n\n4 Open problems\n\nSeveral fundamental questions remain unanswered. First, the convergence rates of the two incremental schemes depend on the multiplier c in the learning rate γn. If it is too low, convergence will be slower than O(1/n). If it is too high, the constant in the rate of convergence will be large. Is there a simple and practical scheme for setting c?\n\nSecond, what can be said about incrementally estimating the top p eigenvectors, for p > 1? Both methods we consider extend easily to this case [10]; the estimate at time n is a d × p matrix Vn whose columns correspond to the eigenvectors, with the invariant Vn^T Vn = Ip always maintained. In Oja’s algorithm, for instance, when a new data point Xn ∈ R^d arrives, the following update is performed:\n\nWn = Vn−1 + γnXnXn^T Vn−1\nVn = orth(Wn)\n\nwhere the second step orthonormalizes the columns, for instance by Gram-Schmidt. It would be interesting to characterize the rate of convergence of this scheme.\n\nFinally, our analysis applies to a modified procedure in which the starting time no is artificially set to a large constant. This seems unnecessary in practice, and it would be useful to extend the analysis to the case where no = 0.\n\nAcknowledgments\n\nThe authors are grateful to the National Science Foundation for support under grant IIS-1162581.\n\nReferences\n\n[1] A. Agarwal, O. Chapelle, M. Dudík, and J. Langford. A reliable effective terascale linear learning system. CoRR, abs/1110.4198, 2011.\n\n[2] R. Arora, A. Cotter, K. Livescu, and N. Srebro. Stochastic optimization for PCA and PLS. In 50th Annual Allerton Conference on Communication, Control, and Computing, pages 861–868, 2012.\n\n[3] R. Arora, A. Cotter, and N. Srebro. Stochastic optimization of PCA with capped MSG. 
In Advances in Neural Information Processing Systems, 2013.\n\n[4] G. Blanchard, O. Bousquet, and L. Zwald. Statistical properties of kernel principal component analysis. Machine Learning, 66(2-3):259–294, 2007.\n\n[5] T. T. Cai, Z. Ma, and Y. Wu. Sparse PCA: Optimal rates and adaptive estimation. CoRR, abs/1211.1309, 2012.\n\n[6] R. Durrett. Probability: Theory and Examples. Duxbury, second edition, 1995.\n\n[7] T. P. Krasulina. A method of stochastic approximation for the determination of the least eigenvalue of a symmetrical matrix. USSR Computational Mathematics and Mathematical Physics, 9(6):189–195, 1969.\n\n[8] I. Mitliagkas, C. Caramanis, and P. Jain. Memory limited, streaming PCA. In Advances in Neural Information Processing Systems, 2013.\n\n[9] E. Oja. Subspace Methods of Pattern Recognition. Research Studies Press, 1983.\n\n[10] E. Oja and J. Karhunen. On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. Journal of Mathematical Analysis and Applications, 106:69–84, 1985.\n\n[11] A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In International Conference on Machine Learning, 2012.\n\n[12] S. Roweis. EM algorithms for PCA and SPCA. In Advances in Neural Information Processing Systems, 1997.\n\n[13] T. Sim, S. Baker, and M. Bsat. The CMU pose, illumination, and expression database. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(12):1615–1618, 2003.\n\n[14] V. Q. Vu and J. Lei. Minimax rates of estimation for sparse PCA in high dimensions. 
Journal of Machine Learning Research - Proceedings Track, 22:1278–1286, 2012.\n\n[15] M. K. Warmuth and D. Kuzmin. Randomized PCA algorithms with regret bounds that are logarithmic in the dimension. In Advances in Neural Information Processing Systems, 2007.\n\n[16] J. Weng, Y. Zhang, and W.-S. Hwang. Candid covariance-free incremental principal component analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(8):1034–1040, 2003.\n\n[17] L. Zwald and G. Blanchard. On the convergence of eigenspaces in kernel principal component analysis. In Advances in Neural Information Processing Systems, 2005.\n", "award": [], "sourceid": 1464, "authors": [{"given_name": "Akshay", "family_name": "Balsubramani", "institution": "UC San Diego"}, {"given_name": "Sanjoy", "family_name": "Dasgupta", "institution": "UC San Diego"}, {"given_name": "Yoav", "family_name": "Freund", "institution": "UC San Diego"}]}