{"title": "Memory Limited, Streaming PCA", "book": "Advances in Neural Information Processing Systems", "page_first": 2886, "page_last": 2894, "abstract": "We consider streaming, one-pass principal component analysis (PCA), in the high-dimensional regime, with limited memory. Here, $p$-dimensional samples are presented sequentially, and the goal is to produce the $k$-dimensional subspace that best approximates these points. Standard algorithms require $O(p^2)$ memory; meanwhile no algorithm can do better than $O(kp)$ memory, since this is what the output itself requires. Memory (or storage) complexity is most meaningful when understood in the context of computational and sample complexity. Sample complexity for high-dimensional PCA is typically studied in the setting of the {\\em spiked covariance model}, where $p$-dimensional points are generated from a population covariance equal to the identity (white noise) plus a low-dimensional perturbation (the spike) which is the signal to be recovered. It is now well-understood that the spike can be recovered when the number of samples, $n$, scales proportionally with the dimension, $p$. Yet, all algorithms that provably achieve this, have memory complexity $O(p^2)$. Meanwhile, algorithms with memory-complexity $O(kp)$ do not have provable bounds on sample complexity comparable to $p$. We present an algorithm that achieves both: it uses $O(kp)$ memory (meaning storage of any kind) and is able to compute the $k$-dimensional spike with $O(p \\log p)$ sample-complexity -- the first algorithm of its kind. While our theoretical analysis focuses on the spiked covariance model, our simulations show that our algorithm is successful on much more general models for the data.", "full_text": "Memory Limited, Streaming PCA\n\nIoannis Mitliagkas\n\nConstantine Caramanis\n\nDept. of Electrical and Computer Engineering\n\nDept. 
of Electrical and Computer Engineering\n\nThe University of Texas at Austin\n\nioannis@utexas.edu\n\nThe University of Texas at Austin\n\nconstantine@utexas.edu\n\nPrateek Jain\n\nMicrosoft Research\n\nBangalore, India\n\nprajain@microsoft.com\n\nAbstract\n\nWe consider streaming, one-pass principal component analysis (PCA), in the high-dimensional regime, with limited memory. Here, p-dimensional samples are presented sequentially, and the goal is to produce the k-dimensional subspace that best approximates these points. Standard algorithms require $O(p^2)$ memory; meanwhile no algorithm can do better than O(kp) memory, since this is what the output itself requires. Memory (or storage) complexity is most meaningful when understood in the context of computational and sample complexity. Sample complexity for high-dimensional PCA is typically studied in the setting of the spiked covariance model, where p-dimensional points are generated from a population covariance equal to the identity (white noise) plus a low-dimensional perturbation (the spike) which is the signal to be recovered. It is now well-understood that the spike can be recovered when the number of samples, n, scales proportionally with the dimension, p. Yet, all algorithms that provably achieve this, have memory complexity $O(p^2)$. Meanwhile, algorithms with memory-complexity O(kp) do not have provable bounds on sample complexity comparable to p. We present an algorithm that achieves both: it uses O(kp) memory (meaning storage of any kind) and is able to compute the k-dimensional spike with O(p log p) sample-complexity \u2013 the first algorithm of its kind. 
While our theoretical analysis focuses on the spiked covariance model, our simulations show that our algorithm is successful on much more general models for the data.\n\n1 Introduction\n\nPrincipal component analysis is a fundamental tool for dimensionality reduction, clustering, classification, and many more learning tasks. It is a basic preprocessing step for learning, recognition, and estimation procedures. The core computational element of PCA is performing a (partial) singular value decomposition, and much work over the last half century has focused on efficient algorithms (e.g., Golub & Van Loan (2012) and references therein) and hence on computational complexity.\n\nThe recent focus on understanding high-dimensional data, where the dimensionality of the data scales together with the number of available sample points, has led to an exploration of the sample complexity of covariance estimation. This direction was largely influenced by Johnstone\u2019s spiked covariance model, where data samples are drawn from a distribution whose (population) covariance is a low-rank perturbation of the identity matrix (Johnstone, 2001). Work initiated there, and also work done in Vershynin (2010a) (and references therein), has explored the power of batch PCA in the p-dimensional setting with sub-Gaussian noise, and demonstrated that the singular value decomposition (SVD) of the empirical covariance matrix succeeds in recovering the principal components (extreme eigenvectors of the population covariance) with high probability, given n = O(p) samples.\n\nThis paper brings the focus to another critical quantity: memory/storage. The only currently available algorithms with provable sample complexity guarantees either store all n = O(p) samples (note that for more than a single pass over the data, the samples must all be stored) or explicitly form or approximate the empirical p \u00d7 p (typically dense) covariance matrix. 
All cases require as much as $O(p^2)$ storage for exact recovery. In certain high-dimensional applications, where data points are high resolution photographs, biometrics, video, etc., p often is of the order of $10^{10}$ to $10^{12}$, making the need for $O(p^2)$ memory prohibitive. At many computing scales, manipulating vectors of length O(p) is possible, when storage of $O(p^2)$ is not. A typical desktop may have 10-20 GB of RAM, but will not have more than a few TB of total storage. A modern smart-phone may have as much as a GB of RAM, but has a few GB, not TB, of storage. In distributed storage systems, the scalability in storage comes at the heavy cost of communication.\n\nIn this light, we consider the streaming data setting, where the samples $x_t \\in \\mathbb{R}^p$ are collected sequentially, and unless we store them, they are irretrievably gone.\u00b9 On the spiked covariance model (and natural generalizations), we show that a simple algorithm requiring O(kp) storage \u2013 the best possible \u2013 performs as well as batch algorithms (namely, SVD on the empirical covariance matrix), with sample complexity O(p log p). To the best of our knowledge, this is the only algorithm with both storage complexity and sample complexity guarantees.\n\nWe discuss connections to past work in detail in Section 2, introduce the model in Section 3, and present the solution to the rank 1 case, the rank k case, and the perturbed-rank-k case in Sections 4.1, 4.2 and 4.3, respectively. In Section 5 we provide experiments that not only confirm the theoretical results, but demonstrate that our algorithm works well outside the assumptions of our main theorems.\n\n2 Related Work\n\nMemory- and computation-efficient algorithms that operate on streaming data are plentiful in the literature and many seem to do well in practice. 
However, there is no algorithm that provably recovers the principal components in the same noise and sample-complexity regime as the batch PCA algorithm does and maintains a provably light memory footprint. Because of the practical relevance, there is renewed interest in this problem. The fact that it is an important unresolved issue has been pointed out in numerous places, e.g., Warmuth & Kuzmin (2008); Arora et al. (2012).\n\nOnline-PCA for regret minimization is considered in several papers, most recently in Warmuth & Kuzmin (2008). There the multiplicative weights approach is adapted to this problem, with experts corresponding to subspaces. The goal is to control the regret, improving on the natural follow-the-leader algorithm that performs batch-PCA at each step. However, the algorithm can require $O(p^2)$ memory, in order to store the multiplicative weights. A memory-light variant described in Arora et al. (2012) typically requires much less memory, but there are no guarantees for this, and moreover, for certain problem instances, its memory requirement is on the order of $p^2$.\n\nSub-sampling, dimensionality-reduction and sketching form another family of low-complexity and low-memory techniques, see, e.g., Clarkson & Woodruff (2009); Nadler (2008); Halko et al. (2011). These save on memory and computation by performing SVD on the resulting smaller matrix. The results in this line of work provide worst-case guarantees over the pool of data, and typically require a rapidly decaying spectrum, not required in our setting, to produce good bounds. More fundamentally, these approaches are not appropriate for data coming from a statistical model such as the spiked covariance model. It is clear that subsampling approaches, for instance, simply correspond to discarding most of the data, and for fundamental sample complexity reasons, cannot work. 
Sketching produces a similar effect: each column of the sketch is a random (+/\u2212) sum of the data points. If the data points are, e.g., independent Gaussian vectors, then so is each element of the sketch, and thus this approach again runs against fundamental sample complexity constraints. Indeed, it is straightforward to check that the guarantees presented in Clarkson & Woodruff (2009) and Halko et al. (2011) are not strong enough to guarantee recovery of the spike. This is not because the results are weak; it is because they are geared towards worst-case bounds.\n\n\u00b9 This is similar to what is sometimes referred to as the single pass model.\n\nAlgorithms focused on sequential SVD (e.g., Brand (2002, 2006), Comon & Golub (1990), Li (2004) and more recently Balzano et al. (2010); He et al. (2011)) seek to have the best subspace estimate at every time (i.e., each time a new data sample arrives) but without performing full-blown SVD at each step. While these algorithms indeed reduce both the computational and memory burden of batch-PCA, there are no rigorous guarantees on the quality of the principal components or on the statistical performance of these methods.\n\nIn a Bayesian mindset, some researchers have come up with expectation maximization approaches Roweis (1998); Tipping & Bishop (1999), that can be used in an incremental fashion. The finite sample behavior is not known.\n\nStochastic-approximation-based algorithms along the lines of Robbins & Monro (1951) are also quite popular, due to their low computational and memory complexity, and excellent performance. They go under a variety of names, including Incremental PCA (though the term Incremental has been used in the online setting as well Herbster & Warmuth (2001)), Hebbian learning, and stochastic power method Arora et al. (2012). 
The basic algorithms are some version of the following: upon receiving data point $x_t$ at time t, update the estimate of the top k principal components via:\n\n$$U^{(t+1)} = \\mathrm{Proj}\\left( U^{(t)} + \\eta_t x_t x_t^\\top U^{(t)} \\right), \\qquad (1)$$\n\nwhere Proj(\u00b7) denotes the \u201cprojection\u201d that takes the SVD of the argument, and sets the top k singular values to 1 and the rest to zero (see Arora et al. (2012) for discussion). While empirically these algorithms perform well, to the best of our knowledge - and efforts - there is no associated finite sample guarantee. The analytical challenge lies in the high variance at each step, which makes direct analysis difficult.\n\nIn summary, while much work has focused on memory-constrained PCA, there has as of yet been no work that simultaneously provides sample complexity guarantees competitive with batch algorithms, and also memory/storage complexity guarantees close to the minimal requirement of O(kp) \u2013 the memory required to store only the output. We present an algorithm that provably does both.\n\n3 Problem Formulation and Notation\n\nWe consider the streaming model: at each time step t, we receive a point $x_t \\in \\mathbb{R}^p$. Any point that is not explicitly stored can never be revisited. Our goal is to compute the top k principal components of the data: the k-dimensional subspace that offers the best squared-error estimate for the points. We assume a probabilistic generative model, from which the data is sampled at each step t. 
Specifically,\n\n$$x_t = A z_t + w_t, \\qquad (2)$$\n\nwhere $A \\in \\mathbb{R}^{p \\times k}$ is a fixed matrix, $z_t \\in \\mathbb{R}^{k \\times 1}$ is a multivariate normal random variable, i.e.,\n\n$$z_t \\sim \\mathcal{N}(0_{k \\times 1}, I_{k \\times k}),$$\n\nand $w_t \\in \\mathbb{R}^{p \\times 1}$ is the \u201cnoise\u201d vector, also sampled from a multivariate normal distribution, i.e.,\n\n$$w_t \\sim \\mathcal{N}(0_{p \\times 1}, \\sigma^2 I_{p \\times p}).$$\n\nFurthermore, we assume that all 2n random vectors ($z_t, w_t$, $\\forall\\, 1 \\le t \\le n$) are mutually independent. In this regime, it is well-known that batch-PCA is asymptotically consistent (hence recovering A up to unitary transformations) with number of samples scaling as n = O(p) Vershynin (2010b). It is interesting to note that in this high-dimensional regime, the signal-to-noise ratio quickly approaches zero, as the signal, or \u201celongation\u201d of the major axis, $\\|Az\\|_2$, is O(1), while the noise magnitude, $\\|w\\|_2$, scales as $O(\\sqrt{p})$. The central goal of this paper is to provide finite sample guarantees for a streaming algorithm that requires memory no more than O(kp) and matches the consistency results of batch PCA in the sampling regime n = O(p) (possibly with additional log factors, or factors depending on $\\sigma$ and k).\n\nWe denote matrices by capital letters (e.g. A) and vectors by lower-case bold-face letters (x). $\\|x\\|_q$ denotes the $\\ell_q$ norm of x; $\\|x\\|$ denotes the $\\ell_2$ norm of x. $\\|A\\|$ or $\\|A\\|_2$ denotes the spectral norm of A while $\\|A\\|_F$ denotes the Frobenius norm of A. Without loss of generality (WLOG), we assume that $\\|A\\|_2 = 1$, where $\\|A\\|_2 = \\max_{\\|x\\|_2 = 1} \\|Ax\\|_2$ denotes the spectral norm of A. Finally, we write $\\langle a, b \\rangle = a^\\top b$ for the inner product between a, b. In proofs the constant C is used loosely and its value may vary from line to line.\n\nAlgorithm 1 Block-Stochastic Power Method (rank-1); the corresponding step of Block-Stochastic Orthogonal Iteration (rank-k) is given in brackets\ninput $\\{x_1, \\ldots, x_n\\}$, Block size: B\n$q_0 \\sim \\mathcal{N}(0, I_{p \\times p})$ (Initialization) [$H^i \\sim \\mathcal{N}(0, I_{p \\times p})$, $1 \\le i \\le k$ (Initialization)]\n$q_0 \\leftarrow q_0 / \\|q_0\\|_2$ [$H \\leftarrow Q_0 R_0$ (QR-decomposition)]\nfor $\\tau = 0, \\ldots, n/B - 1$ do\n  $s_{\\tau+1} \\leftarrow 0$ [$S_{\\tau+1} \\leftarrow 0$]\n  for $t = B\\tau + 1, \\ldots, B(\\tau + 1)$ do\n    $s_{\\tau+1} \\leftarrow s_{\\tau+1} + \\frac{1}{B} \\langle q_\\tau, x_t \\rangle x_t$ [$S_{\\tau+1} \\leftarrow S_{\\tau+1} + \\frac{1}{B} x_t x_t^\\top Q_\\tau$]\n  end for\n  $q_{\\tau+1} \\leftarrow s_{\\tau+1} / \\|s_{\\tau+1}\\|_2$ [$S_{\\tau+1} = Q_{\\tau+1} R_{\\tau+1}$ (QR-decomposition)]\nend for\noutput\n\n4 Algorithm and Guarantees\n\nIn this section, we present our proposed algorithm and its finite sample analysis. It is a block-wise stochastic variant of the classical power-method. Stochastic versions of the power method already exist in the literature; see Arora et al. (2012). The main impediment to the analysis of such stochastic algorithms (as in (1)) is the large variance of each step, in the presence of noise. This motivates us to consider a modified stochastic power method algorithm, that has a variance reduction step built in. At a high level, our method updates only once in a \u201cblock\u201d and within one block we average out noise to reduce the variance.\n\nBelow, we first illustrate the main ideas of our method as well as our sample complexity proof for the simpler rank-1 case. The rank-1 and rank-k algorithms are so similar, that we present them in the same panel. We provide the rank-k analysis in Section 4.2. We note that, while our algorithm describes $\\{x_1, \\ldots, x_n\\}$ as \u201cinput,\u201d we mean this in the streaming sense: the data are nowhere stored, and can never be revisited unless the algorithm explicitly stores them.\n\n4.1 Rank-One Case\n\nWe first consider the rank-1 case for which each sample $x_t$ is generated using: $x_t = u z_t + w_t$ where $u \\in \\mathbb{R}^p$ is the principal component that we wish to recover. 
Our algorithm is a block-wise method where all the n samples are divided in n/B blocks (for simplicity we assume that n/B is an integer). In the $(\\tau + 1)$-st block, we compute\n\n$$s_{\\tau+1} = \\left( \\frac{1}{B} \\sum_{t=B\\tau+1}^{B(\\tau+1)} x_t x_t^\\top \\right) q_\\tau. \\qquad (3)$$\n\nThen, the iterate $q_\\tau$ is updated using $q_{\\tau+1} = s_{\\tau+1} / \\|s_{\\tau+1}\\|_2$. Note that $s_{\\tau+1}$ can be computed online, with O(p) operations per step. Furthermore, the storage requirement is also linear in p.\n\n4.1.1 Analysis\n\nWe now present the sample complexity analysis of our proposed method. Using $O(\\sigma^4 p \\log(p)/\\epsilon^2)$ samples, Algorithm 1 obtains a solution $q_T$ of accuracy $\\epsilon$, i.e. $\\|q_T - u\\|_2 \\le \\epsilon$.\n\nTheorem 1. Denote the data stream by $x_1, \\ldots, x_n$, where $x_t \\in \\mathbb{R}^p$, $\\forall t$, is generated by (2). Set the total number of iterations $T = \\Omega\\left( \\frac{\\log(p/\\epsilon)}{\\log((\\sigma^2 + .75)/(\\sigma^2 + .5))} \\right)$ and the block size $B = \\Omega\\left( \\frac{(1 + 3(\\sigma + \\sigma^2)\\sqrt{p})^2 \\log(T)}{\\epsilon^2} \\right)$. Then, with probability 0.99, $\\|q_T - u\\|_2 \\le \\epsilon$, where $q_T$ is the T-th iterate of Algorithm 1. That is, Algorithm 1 obtains an $\\epsilon$-accurate solution with number of samples (n) given by:\n\n$$n = \\tilde{\\Omega}\\left( \\frac{(1 + 3(\\sigma + \\sigma^2)\\sqrt{p})^2 \\log(p/\\epsilon)}{\\epsilon^2 \\log((\\sigma^2 + .75)/(\\sigma^2 + .5))} \\right).$$\n\nNote that in the total sample complexity, we use the notation $\\tilde{\\Omega}(\\cdot)$ to suppress the extra log(T) factor for clarity of exposition, as T already appears in the expression linearly.\n\nProof. The proof decomposes the current iterate, $q_\\tau$, into its component in the direction of the true principal component (the spike) u, and the perpendicular component, showing that the former eventually dominates. 
Doing so hinges on three key components: (a) for large enough B, the empirical covariance matrix $F_{\\tau+1} = \\frac{1}{B} \\sum_{t=B\\tau+1}^{B(\\tau+1)} x_t x_t^\\top$ is close to the true covariance matrix $M = uu^\\top + \\sigma^2 I$, i.e., $\\|F_{\\tau+1} - M\\|_2$ is small. In the process, we obtain \u201ctighter\u201d bounds for $\\|u^\\top (F_{\\tau+1} - M) u\\|$ for fixed u; (b) with probability 0.99 (or any other constant probability), the initial point $q_0$ has a component of at least $O(1/\\sqrt{p})$ magnitude along the true direction u; (c) after $\\tau$ iterations, the error in estimation is at most $O(\\gamma^\\tau)$ where $\\gamma < 1$ is a constant.\n\nThere are several results that we use repeatedly, which we collect here, and prove individually in the full version of the paper (Mitliagkas et al. (2013)).\n\nLemmas 4, 5 and 6. Let B, T and the data stream $\\{x_i\\}$ be as defined in the theorem. Then:\n\n\u2022 (Lemma 4): With probability 1 \u2212 C/T, for C a universal constant, we have:\n\n$$\\left\\| \\frac{1}{B} \\sum_t x_t x_t^\\top - uu^\\top - \\sigma^2 I \\right\\|_2 \\le \\epsilon.$$\n\n\u2022 (Lemma 5): With probability 1 \u2212 C/T, for C a universal constant, we have:\n\n$$u^\\top s_{\\tau+1} \\ge u^\\top q_\\tau (1 + \\sigma^2) \\left( 1 - \\frac{\\epsilon}{4(1 + \\sigma^2)} \\right),$$\n\nwhere st = 1\n\nB PB\u03c4 0 is a universal constant.\n\nStep (a) is proved in Lemmas 4 and 5, while Lemma 6 provides the required result for the initial vector $q_0$. Using these lemmas, we next complete the proof of the theorem. We note that both (a) and (b) follow from well-known results; we provide them for completeness.\n\nLet $q_\\tau = \\sqrt{1 - \\delta_\\tau}\\, u + \\sqrt{\\delta_\\tau}\\, g_\\tau$, $1 \\le \\tau \\le n/B$, where $g_\\tau$ is the component of $q_\\tau$ that is perpendicular to u and $\\sqrt{1 - \\delta_\\tau}$ is the magnitude of the component of $q_\\tau$ along u. 
Note that $g_\\tau$ may well change at each iteration; we only wish to show $\\delta_\\tau \\to 0$.\n\nNow, using Lemma 5, the following holds with probability at least 1 \u2212 C/T:\n\n$$u^\\top s_{\\tau+1} \\ge \\sqrt{1 - \\delta_\\tau}\\, (1 + \\sigma^2) \\left( 1 - \\frac{\\epsilon}{4(1 + \\sigma^2)} \\right). \\qquad (4)$$\n\nNext, we consider the component of $s_{\\tau+1}$ that is perpendicular to u:\n\n$$g_{\\tau+1}^\\top s_{\\tau+1} = g_{\\tau+1}^\\top \\left( \\frac{1}{B} \\sum_{t=B\\tau+1}^{B(\\tau+1)} x_t x_t^\\top \\right) q_\\tau = g_{\\tau+1}^\\top (M + E_\\tau) q_\\tau,$$\n\nwhere $M = uu^\\top + \\sigma^2 I$ and $E_\\tau$ is the error matrix: $E_\\tau = M - \\frac{1}{B} \\sum_{t=B\\tau+1}^{B(\\tau+1)} x_t x_t^\\top$. Using Lemma 4, $\\|E_\\tau\\|_2 \\le \\epsilon$ (w.p. $\\ge 1 - C/T$). Hence, w.p. $\\ge 1 - C/T$:\n\n$$g_{\\tau+1}^\\top s_{\\tau+1} \\le \\sigma^2 g_{\\tau+1}^\\top q_\\tau + \\|g_{\\tau+1}\\|_2 \\|E_\\tau\\|_2 \\|q_\\tau\\|_2 \\le \\sigma^2 \\sqrt{\\delta_\\tau} + \\epsilon. \\qquad (5)$$\n\nNow, since $q_{\\tau+1} = s_{\\tau+1} / \\|s_{\\tau+1}\\|_2$,\n\n$$\\delta_{\\tau+1} = (g_{\\tau+1}^\\top q_{\\tau+1})^2 = \\frac{(g_{\\tau+1}^\\top s_{\\tau+1})^2}{(u^\\top s_{\\tau+1})^2 + (g_{\\tau+1}^\\top s_{\\tau+1})^2} \\overset{(i)}{\\le} \\frac{(g_{\\tau+1}^\\top s_{\\tau+1})^2}{(1 - \\delta_\\tau)\\left(1 + \\sigma^2 - \\frac{\\epsilon}{4}\\right)^2 + (g_{\\tau+1}^\\top s_{\\tau+1})^2} \\overset{(ii)}{\\le} \\frac{(\\sigma^2 \\sqrt{\\delta_\\tau} + \\epsilon)^2}{(1 - \\delta_\\tau)\\left(1 + \\sigma^2 - \\frac{\\epsilon}{4}\\right)^2 + (\\sigma^2 \\sqrt{\\delta_\\tau} + \\epsilon)^2}, \\qquad (6)$$\n\nwhere (i) follows from (4) and (ii) follows from (5) along with the fact that $\\frac{x}{c+x}$ is an increasing function in x for c, x \u2265 0. Assuming $\\sqrt{\\delta_\\tau} \\ge 2\\epsilon$ and using (6) and bounding the failure probability with a union bound, we get (w.p. $\\ge 1 - \\tau \\cdot C/T$)\n\n$$\\delta_{\\tau+1} \\le \\frac{\\delta_\\tau (\\sigma^2 + 1/2)^2}{(1 - \\delta_\\tau)(\\sigma^2 + 3/4)^2 + \\delta_\\tau (\\sigma^2 + 1/2)^2} \\overset{(i)}{\\le} \\frac{\\gamma^{2\\tau} \\delta_0}{1 - (1 - \\gamma^{2\\tau}) \\delta_0} \\overset{(ii)}{\\le} C_1 \\gamma^{2\\tau} p, \\qquad (7)$$\n\nwhere $\\gamma = \\frac{\\sigma^2 + 1/2}{\\sigma^2 + 3/4}$ and $C_1 > 0$ is a global constant. Inequality (ii) follows from Lemma 6; to prove (i), we need the following lemma. It shows that in the recursion given by (7), $\\delta_\\tau$ decreases at a fast rate. The rate of decrease in $\\delta_\\tau$ might be initially (for small $\\tau$) sub-linear, but for large enough $\\tau$ the rate is linear. We defer the proof to the full version of the paper (Mitliagkas et al. (2013)).\n\nLemma 2. If for any $\\tau \\ge 0$ and $0 < \\gamma < 1$, we have $\\delta_{\\tau+1} \\le \\frac{\\gamma^2 \\delta_\\tau}{1 - \\delta_\\tau + \\gamma^2 \\delta_\\tau}$, then\n\n$$\\delta_{\\tau+1} \\le \\frac{\\gamma^{2\\tau+2} \\delta_0}{1 - (1 - \\gamma^{2\\tau+2}) \\delta_0}.$$\n\nHence, using the above equation after $T = O(\\log(p/\\epsilon)/\\log(1/\\gamma))$ updates, with probability at least 1 \u2212 C, $\\sqrt{\\delta_T} \\le 2\\epsilon$. The result now follows by noting that $\\|u - q_T\\|_2 \\le 2\\sqrt{\\delta_T}$.\n\nRemark: In Theorem 1, the probability of recovery is a constant and does not decay with p. One can correct this by either paying a price of O(log p) in storage, or in sample complexity: for the former, we can run O(log p) instances of Algorithm 1 in parallel; alternatively, we can run Algorithm 1 O(log p) times on fresh data each time, using the next block of data to evaluate the old solutions, always keeping the best one. 
Either approach guarantees a success probability of at least $1 - \\frac{1}{p^{O(1)}}$.\n\n4.2 General Rank-k Case\n\nIn this section, we consider the general rank-k PCA problem where each sample is assumed to be generated using the model of equation (2), where $A \\in \\mathbb{R}^{p \\times k}$ represents the k principal components that need to be recovered. Let $A = U \\Lambda V^\\top$ be the SVD of A where $U \\in \\mathbb{R}^{p \\times k}$, $\\Lambda, V \\in \\mathbb{R}^{k \\times k}$. The matrices U and V are orthogonal, i.e., $U^\\top U = I$, $V^\\top V = I$, and $\\Lambda$ is a diagonal matrix with diagonal elements $\\lambda_1 \\ge \\lambda_2 \\ge \\cdots \\ge \\lambda_k$. The goal is to recover the space spanned by A, i.e., span(U). Without loss of generality, we can assume that $\\|A\\|_2 = \\lambda_1 = 1$.\n\nSimilar to the rank-1 problem, our algorithm for the rank-k problem can be viewed as a streaming variant of the classical orthogonal iteration used for SVD. But unlike the rank-1 case, we require a more careful analysis as we need to bound spectral norms of various quantities in intermediate steps and simple, crude analysis can lead to significantly worse bounds. Interestingly, the analysis is entirely different from the standard analysis of the orthogonal iteration as there, the empirical estimate of the covariance matrix is fixed while in our case it varies with each block.\n\nFor the general rank-k problem, we use the largest-principal-angle-based distance function between any two given subspaces:\n\n$$\\mathrm{dist}(\\mathrm{span}(U), \\mathrm{span}(V)) = \\mathrm{dist}(U, V) = \\|U_\\perp^\\top V\\|_2 = \\|V_\\perp^\\top U\\|_2,$$\n\nwhere $U_\\perp$ and $V_\\perp$ represent an orthogonal basis of the perpendicular subspace to span(U) and span(V), respectively. For the spiked covariance model, it is straightforward to see that this is equivalent to the usual PCA figure-of-merit, the expressed variance.\n\nTheorem 3. 
Consider a data stream, where $x_t \\in \\mathbb{R}^p$ for every t is generated by (2), and the SVD of $A \\in \\mathbb{R}^{p \\times k}$ is given by $A = U \\Lambda V^\\top$. Let, wlog, $\\lambda_1 = 1 \\ge \\lambda_2 \\ge \\cdots \\ge \\lambda_k > 0$. Let\n\n$$T = \\Omega\\left( \\log\\left( \\frac{p}{k\\epsilon} \\right) \\Big/ \\log\\left( \\frac{\\sigma^2 + 0.75\\lambda_k^2}{\\sigma^2 + 0.5\\lambda_k^2} \\right) \\right), \\qquad B = \\Omega\\left( \\frac{\\left( (1 + \\sigma)^2 \\sqrt{k} + \\sigma \\sqrt{1 + \\sigma^2}\\, k \\sqrt{p} \\right)^2 \\log(T)}{\\lambda_k^4 \\epsilon^2} \\right).$$\n\nThen, after T B-size-block-updates, w.p. 0.99, $\\mathrm{dist}(U, Q_T) \\le \\epsilon$. Hence, the sufficient number of samples for $\\epsilon$-accurate recovery of all the top-k principal components is:\n\n$$n = \\tilde{\\Omega}\\left( \\frac{\\left( (1 + \\sigma)^2 \\sqrt{k} + \\sigma \\sqrt{1 + \\sigma^2}\\, k \\sqrt{p} \\right)^2 \\log(p/k\\epsilon)}{\\lambda_k^4 \\epsilon^2 \\log\\left( \\frac{\\sigma^2 + 0.75\\lambda_k^2}{\\sigma^2 + 0.5\\lambda_k^2} \\right)} \\right).$$\n\nAgain, we use $\\tilde{\\Omega}(\\cdot)$ to suppress the extra log(T) factor.\n\nThe key part of the proof requires the following additional lemmas that bound the energy of the current iterate along the desired subspace and its perpendicular space (Lemmas 8 and 9), and Lemma 10, which controls the quality of the initialization.\n\nLemmas 8, 9 and 10. Let the data stream, A, B, and T be as defined in Theorem 3, $\\sigma^2$ be the variance of noise, F\u03c4 +1 = 1\n\nB PB\u03c4 \n0) then the initialization step provides us better distance, i.e., dist(U, Q0) \u2264 C\u2032/\u221ap rather than\ndist(U, Q0) \u2264 C\u2032/\u221akp bound if r = k. This initialization step enables us to give tighter sample complexity as the r\u221ap in the numerator above can be replaced by \u221arp.\n\n5 Experiments\n\nIn this section, we show that, as predicted by our theoretical results, our algorithm performs close to the optimal batch SVD. 
We provide the results from simulating the spiked covariance model, and demonstrate the phase-transition in the probability of successful recovery that is inherent to the statistical problem. Then we stray from the analyzed model and performance metric and test our algorithm on real-world (and some very big) datasets, using the metric of explained variance.\n\nIn the experiments for Figures 1 (a)-(b), we draw data from the generative model of (2). Our results are averaged over at least 200 independent runs. Algorithm 1 uses the block size prescribed in Theorem 3, with the empirically tuned constant of 0.2. As expected, our algorithm exhibits linear scaling with respect to the ambient dimension p \u2013 the same as the batch SVD. The missing point on batch SVD\u2019s curve (Figure 1(a)) corresponds to $p > 2.4 \\cdot 10^4$. Performing SVD on a dense p \u00d7 p matrix either fails or takes a very long time on most modern desktop computers; in contrast, our streaming algorithm easily runs on this size problem. 
The phase transition plot in Figure 1(b) shows the empirical sample complexity on a large class of problems and corroborates the scaling with respect to the noise variance we obtain theoretically.\n\n[Figure 1. Panel (a), Samples to retrieve spike (\u03c3=0.5, \u03b5=0.05): n (samples) versus p (dimension), curves for Batch SVD and Our algorithm (streaming). Panel (b), Probability of success (n=1000, \u03b5=0.05): ambient dimension (p) versus noise standard deviation (\u03c3). Panel (c), NIPS bag-of-words dataset: explained variance versus k (number of components), curves for Optimal (batch), Our algorithm (streaming), and Optimal using B samples. Panel (d), Our algorithm on large bag-of-words datasets: explained variance versus k (number of components), curves for NY Times (300K samples, p=103K) and PubMed (8.2M samples, p=140K).]\n\nFigure 1: (a) Number of samples required for recovery of a single component (k = 1) from the spiked covariance model, with noise standard deviation \u03c3 = 0.5 and desired accuracy \u03b5 = 0.05. (b) Fraction of trials in which Algorithm 1 successfully recovers the principal component (k = 1) in the same model, with \u03b5 = 0.05 and n = 1000 samples. (c) Explained variance by Algorithm 1 compared to the optimal batch SVD, on the NIPS bag-of-words dataset. (d) Explained variance by Algorithm 1 on the NY Times and PubMed datasets.\n\nFigures 1 (c)-(d) complement our complete treatment of the spiked covariance model, with some out-of-model experiments. We used three bag-of-words datasets from Porteous et al. (2008). 
We evaluated our algorithm\u2019s performance with respect to the fraction of explained variance metric: given the p \u00d7 k matrix V output from the algorithm, and all the provided samples in matrix X, the fraction of explained variance is defined as $\\mathrm{Tr}(V^\\top X X^\\top V) / \\mathrm{Tr}(X X^\\top)$. To be consistent with our theory, for a dataset of n samples of dimension p, we set the number of blocks to be $T = \\lceil \\log(p) \\rceil$ and the size of blocks to $B = \\lfloor n/T \\rfloor$ in our algorithm. The NIPS dataset is the smallest, with 1500 documents and 12K words and allowed us to compare our algorithm with the optimal, batch SVD. We had the two algorithms work on the document space (p = 1500) and report the results in Figure 1(c). The dashed line represents the optimal using B samples. The figure is consistent with our theoretical result: our algorithm performs as well as the batch, with an added log(p) factor in the sample complexity.\n\nFinally, in Figure 1 (d), we show our algorithm\u2019s ability to tackle very large problems. Both the NY Times and PubMed datasets are of prohibitive size for traditional batch methods \u2013 the latter including 8.2 million documents on a vocabulary of 141 thousand words \u2013 so we just report the performance of Algorithm 1. It was able to extract the top 7 components for each dataset in a few hours on a desktop computer. A second pass was made on the data to evaluate the results, and we saw 7-10 percent of the variance explained on spaces with $p > 10^4$.\n\nReferences\n\nArora, R., Cotter, A., Livescu, K., and Srebro, N. Stochastic optimization for PCA and PLS. In 50th Allerton Conference on Communication, Control, and Computing, Monticello, IL, 2012.\n\nBalzano, L., Nowak, R., and Recht, B. Online identification and tracking of subspaces from highly incomplete information. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pp. 704\u2013711, 2010.\n\nBrand, M. 
Fast low-rank modifications of the thin singular value decomposition. Linear Algebra and its Applications, 415(1):20\u201330, 2006.\n\nBrand, Matthew. Incremental singular value decomposition of uncertain data with missing values. Computer Vision\u2014ECCV 2002, pp. 707\u2013720, 2002.\n\nClarkson, Kenneth L. and Woodruff, David P. Numerical linear algebra in the streaming model. In Proceedings of the 41st annual ACM symposium on Theory of computing, pp. 205\u2013214, 2009.\n\nComon, P. and Golub, G. H. Tracking a few extreme singular values and vectors in signal processing. Proceedings of the IEEE, 78(8):1327\u20131343, 1990.\n\nGolub, Gene H. and Van Loan, Charles F. Matrix computations, volume 3. JHUP, 2012.\n\nHalko, Nathan, Martinsson, Per-Gunnar, and Tropp, Joel A. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217\u2013288, 2011.\n\nHe, J., Balzano, L., and Lui, J. Online robust subspace tracking from partial information. arXiv preprint arXiv:1109.3827, 2011.\n\nHerbster, Mark and Warmuth, Manfred K. Tracking the best linear predictor. The Journal of Machine Learning Research, 1:281\u2013309, 2001.\n\nJohnstone, Iain M. On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist., 29(2):295\u2013327, 2001.\n\nLi, Y. On incremental and robust subspace learning. Pattern Recognition, 37(7):1509\u20131518, 2004.\n\nMitliagkas, Ioannis, Caramanis, Constantine, and Jain, Prateek. Memory limited, streaming PCA. arXiv preprint arXiv:1307.0032, 2013.\n\nNadler, Boaz. Finite sample approximation results for principal component analysis: a matrix perturbation approach. The Annals of Statistics, pp. 2791\u20132817, 2008.\n\nPorteous, Ian, Newman, David, Ihler, Alexander, Asuncion, Arthur, Smyth, Padhraic, and Welling, Max. Fast\ncollapsed Gibbs sampling for latent Dirichlet allocation. 
In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 569\u2013577, 2008.\n\nRobbins, Herbert and Monro, Sutton. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400\u2013407, 1951.\n\nRoweis, Sam. EM algorithms for PCA and SPCA. Advances in neural information processing systems, pp. 626\u2013632, 1998.\n\nRudelson, Mark and Vershynin, Roman. Smallest singular value of a random rectangular matrix. Communications on Pure and Applied Mathematics, 62(12):1707\u20131739, 2009.\n\nTipping, Michael E. and Bishop, Christopher M. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611\u2013622, 1999.\n\nVershynin, R. How close is the sample covariance matrix to the actual covariance matrix? Journal of Theoretical Probability, pp. 1\u201332, 2010a.\n\nVershynin, Roman. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010b.\n\nWarmuth, Manfred K. and Kuzmin, Dima. Randomized online PCA algorithms with regret bounds that are logarithmic in the dimension. Journal of Machine Learning Research, 9:2287\u20132320, 2008.\n", "award": [], "sourceid": 1318, "authors": [{"given_name": "Ioannis", "family_name": "Mitliagkas", "institution": "UT Austin"}, {"given_name": "Constantine", "family_name": "Caramanis", "institution": "UT Austin"}, {"given_name": "Prateek", "family_name": "Jain", "institution": "Microsoft Research"}]}