{"title": "Low Rank Approximation Lower Bounds in Row-Update Streams", "book": "Advances in Neural Information Processing Systems", "page_first": 1781, "page_last": 1789, "abstract": "We study low-rank approximation in the streaming model in which the rows of an $n \\times d$ matrix $A$ are presented one at a time in an arbitrary order. At the end of the stream, the streaming algorithm should output a $k \\times d$ matrix $R$ so that $\\|A-AR^{\\dagger}R\\|_F^2 \\leq (1+\\eps)\\|A-A_k\\|_F^2$, where $A_k$ is the best rank-$k$ approximation to $A$. A deterministic streaming algorithm of Liberty (KDD, 2013), with an improved analysis of Ghashami and Phillips (SODA, 2014), provides such a streaming algorithm using $O(dk/\\epsilon)$ words of space. A natural question is if smaller space is possible. We give an almost matching lower bound of $\\Omega(dk/\\epsilon)$ bits of space, even for randomized algorithms which succeed only with constant probability. Our lower bound matches the upper bound of Ghashami and Phillips up to the word size, improving on a simple $\\Omega(dk)$ space lower bound.", "full_text": "Low Rank Approximation Lower Bounds in\n\nRow-Update Streams\n\nDavid P. Woodruff\n\nIBM Research Almaden\n\ndpwoodru@us.ibm.com\n\nAbstract\n\nF \u2264 (1 + \u0001)(cid:107)A\u2212 Ak(cid:107)2\n\nWe study low-rank approximation in the streaming model in which the rows of\nan n \u00d7 d matrix A are presented one at a time in an arbitrary order. At the end\nof the stream, the streaming algorithm should output a k \u00d7 d matrix R so that\n(cid:107)A\u2212 AR\u2020R(cid:107)2\nF , where Ak is the best rank-k approximation\nto A. A deterministic streaming algorithm of Liberty (KDD, 2013), with an im-\nproved analysis of Ghashami and Phillips (SODA, 2014), provides such a stream-\ning algorithm using O(dk/\u0001) words of space. A natural question is if smaller\nspace is possible. 
We give an almost matching lower bound of $\Omega(dk/\epsilon)$ bits of space, even for randomized algorithms which succeed only with constant probability. Our lower bound matches the upper bound of Ghashami and Phillips up to the word size, improving on a simple $\Omega(dk)$ space lower bound.

1 Introduction

In the last decade many algorithms for numerical linear algebra problems have been proposed, often providing substantial gains over more traditional algorithms based on the singular value decomposition (SVD). Much of this work was influenced by the seminal work of Frieze, Kannan, and Vempala [8]. These include algorithms for matrix product, low rank approximation, regression, and many other problems. These algorithms are typically approximate and succeed with high probability. Moreover, they generally require only one or a small number of passes over the data.

When the algorithm makes only a single pass over the data and uses a small amount of memory, it is typically referred to as a streaming algorithm. The memory restriction is especially important for large-scale data sets, e.g., matrices whose elements arrive online and/or are too large to fit in main memory. These elements may be seen one entry or one entire row at a time; we refer to the former as the entry-update model and the latter as the row-update model. The row-update model often makes sense when the rows correspond to individual entities. Typically one is interested in designing robust streaming algorithms which do not need to assume a particular order of the arriving elements for their correctness. Indeed, if data is collected online, such an assumption may be unrealistic.

Muthukrishnan asked the question of determining the memory required of data stream algorithms for numerical linear algebra problems, including best rank-$k$ approximation, matrix product, eigenvalues, determinants, and inverses [18].
This question was posed again by Sarlós [21]. A number of exciting streaming algorithms now exist for matrix problems. Sarlós [21] gave 2-pass algorithms for matrix product, low rank approximation, and regression, which were sharpened by Clarkson and Woodruff [5], who also proved lower bounds in the entry-update model for a number of these problems. See also work by Andoni and Nguyen for estimating eigenvalues in a stream [2], and work in [1, 4, 6] which implicitly provides algorithms for approximate matrix product.

In this work we focus on the low rank approximation problem. In this problem we are given an $n \times d$ matrix $A$ and would like to compute a matrix $B$ of rank at most $k$ for which $\|A - B\|_F \leq (1+\epsilon)\|A - A_k\|_F$. Here, for a matrix $A$, $\|A\|_F = \sqrt{\sum_{i=1}^n \sum_{j=1}^d A_{i,j}^2}$ denotes its Frobenius norm, and $A_k$ is the best rank-$k$ approximation to $A$ in this norm, given by the SVD.

Clarkson and Woodruff [5] show that in the entry-update model, one can compute a factorization $B = L \cdot U \cdot R$ with $L \in \mathbb{R}^{n \times k}$, $U \in \mathbb{R}^{k \times k}$, and $R \in \mathbb{R}^{k \times d}$, with a streaming algorithm using $O(k\epsilon^{-2}(n + d/\epsilon^2)\log(nd))$ bits of space. They also show a lower bound of $\Omega(k\epsilon^{-1}(n + d)\log(nd))$ bits of space. One limitation of these bounds is that they hold only when the algorithm is required to output a factorization $L \cdot U \cdot R$. In many cases $n \gg d$, and using memory that grows linearly with $n$ (as the above lower bounds show is unavoidable) is prohibitive. As observed in previous work [9, 16], in downstream applications we are often only interested in an approximation to the top $k$ principal components, i.e., the matrix $R$ above, and so the lower bounds of Clarkson and Woodruff can be too restrictive.
For example, in PCA the goal is to compute the most important directions in the row space of $A$.

By reanalyzing an algorithm of Liberty [16], Ghashami and Phillips [9] were able to overcome this restriction in the row-update model, showing that Liberty's algorithm is a streaming algorithm which finds a $k \times d$ matrix $R$ for which $\|A - AR^{\dagger}R\|_F \leq (1+\epsilon)\|A - A_k\|_F$ using only $O(dk/\epsilon)$ words of space. Here $R^{\dagger}$ is the Moore-Penrose pseudoinverse of $R$, so that $R^{\dagger}R$ denotes the projection onto the row space of $R$. Importantly, this space bound no longer depends on $n$. Moreover, their algorithm is deterministic and achieves relative error. We note that Liberty's algorithm itself is similar in spirit to earlier work on incremental PCA [3, 10, 11, 15, 19], but that work missed the idea of using a Misra-Gries heavy hitters subroutine [17], which is used to bound the additive error (and which was then improved to relative error by Ghashami and Phillips). It also seems possible to obtain a streaming algorithm using $O(dk(\log n)/\epsilon)$ words of space via the coreset approach in an earlier paper by Feldman et al. [7].

This work is motivated by the following questions: Is the $O(dk/\epsilon)$ space bound tight, or can one achieve an even smaller amount of space? What if one also allows randomization?

In this work we answer the above questions. Our main theorem is the following.

Theorem 1. Any, possibly randomized, streaming algorithm in the row-update model which outputs a $k \times d$ matrix $R$ and guarantees that $\|A - AR^{\dagger}R\|_F^2 \leq (1+\epsilon)\|A - A_k\|_F^2$ with probability at least $2/3$ must use $\Omega(kd/\epsilon)$ bits of space.

Up to a factor of the word size (which is typically $O(\log(nd))$ bits), our main theorem shows that the algorithm of Liberty is optimal.
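Liberty's algorithm is not reproduced in this paper; purely for intuition about the upper-bound side being matched, the following is a minimal Python sketch of a Frequent-Directions-style update in the spirit of [16] (our own illustrative rendering with arbitrary parameter choices, not the paper's pseudocode):

```python
import numpy as np

def frequent_directions(rows, sketch_rows):
    """Minimal Frequent-Directions-style sketch (after Liberty [16]).

    Maintains a sketch B with at most `sketch_rows` rows; whenever B fills
    up, it is compressed via an SVD and the squared singular values are all
    shrunk by the median one, which zeroes out half of the rows.
    """
    d = len(rows[0])
    ell = sketch_rows
    B = np.zeros((ell, d))
    next_free = 0
    for row in rows:
        if next_free == ell:
            # Compress: SVD, then shrink squared singular values by delta.
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            delta = s[ell // 2] ** 2
            s_shrunk = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
            B = s_shrunk[:, None] * Vt  # rows ell//2.. are now zero
            next_free = ell // 2
        B[next_free] = row
        next_free += 1
    return B

# Usage: sketch a random 300 x 40 matrix with a 20-row sketch.
rng = np.random.default_rng(0)
A = rng.standard_normal((300, 40))
B = frequent_directions(A, 20)
```

The sketch uses $O(\ell d)$ words for $\ell$ sketch rows and satisfies $\|A^TA - B^TB\|_2 \leq 2\|A\|_F^2/\ell$; the relative-error analysis of [9] takes $\ell$ of order $k + k/\epsilon$, which is the source of the $O(dk/\epsilon)$ word bound discussed above.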
It also shows that allowing for randomization and a small probability of error does not significantly help in reducing the memory required. We note that a simple argument gives an $\Omega(kd)$ bit lower bound; see Lemma 2 below. Intuitively, if $A$ itself has rank $k$, then $R$ needs to have the same row space as $A$, and specifying a random $k$-dimensional subspace of $\mathbb{R}^d$ requires $\Omega(kd)$ bits. Hence, the main interest here is in improving upon this lower bound to $\Omega(kd/\epsilon)$ bits of space. This extra $1/\epsilon$ factor is significant for small values of $\epsilon$, e.g., if one wants approximations as close to machine precision as possible with a given amount of memory.

The only other lower bounds for streaming algorithms for low rank approximation that we know of are due to Clarkson and Woodruff [5]. As in their work, we use the Index problem in communication complexity to establish our bounds. This is a communication game between two players, Alice and Bob, holding a string $x \in \{0,1\}^r$ and an index $i \in [r] := \{1, 2, \ldots, r\}$, respectively. In this game Alice sends a single message to Bob, who should output $x_i$ with constant probability. It is known (see, e.g., [13]) that this problem requires Alice's message to be $\Omega(r)$ bits long. If Alg is a streaming algorithm for low rank approximation, and Alice can create a matrix $A_x$ while Bob can create a matrix $B_i$ (depending on their respective inputs $x$ and $i$), then if Bob can output $x_i$ with constant probability from the output of Alg on the concatenated matrix $[A_x; B_i]$, the memory required of Alg is $\Omega(r)$ bits, since Alice's message is the state of Alg after running it on $A_x$.

The main technical challenges are thus in showing how to choose $A_x$ and $B_i$, as well as showing how the output of Alg on $[A_x; B_i]$ can be used to solve Index. This is where our work departs significantly from that of Clarkson and Woodruff [5].
Indeed, a major challenge is that in Theorem 1 we only require the output to be the matrix $R$, whereas in Clarkson and Woodruff's work one can reconstruct $AR^{\dagger}R$ from the output. This causes technical complications, since there is much less information in the output of the algorithm to use to solve the communication game.

The intuition behind the proof of Theorem 1 is that given a $2 \times d$ matrix $A = [1, x; 1, 0^d]$, where $x$ is a random unit vector, if $P = R^{\dagger}R$ is a sufficiently good projection matrix for the low rank approximation problem on $A$, then the second row of $AP$ actually reveals a lot of information about $x$. This may be counterintuitive at first, since one may think that $[1, 0^d; 1, 0^d]$ is a perfectly good low rank approximation. However, it turns out that $[1, x/2; 1, x/2]$ is a much better low rank approximation in Frobenius norm, and even this is not optimal. Therefore Bob, who has $[1, 0^d]$ together with the output $P$, can compute the second row of $AP$, which necessarily reveals a lot of information about $x$ (e.g., if $AP \approx [1, x/2; 1, x/2]$, its second row would reveal a lot of information about $x$), and therefore one could hope to embed an instance of the Index problem into $x$. Most of the technical work is about reducing the general problem to this $2 \times d$ primitive problem.

2 Main Theorem

This section is devoted to proving Theorem 1. We start with a simple lemma showing an $\Omega(kd)$ lower bound, which we will refer to. The proof of this lemma is in the full version.

Lemma 2. Any streaming algorithm which, for every input $A$, with constant probability (over its internal randomness) succeeds in outputting a matrix $R$ for which $\|A - AR^{\dagger}R\|_F \leq (1+\epsilon)\|A - A_k\|_F$ must use $\Omega(kd)$ bits of space.

Returning to the proof of Theorem 1, let $c > 0$ be a small constant to be determined.
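Before setting up the reduction, the $2 \times d$ intuition described above can be sanity-checked numerically (an illustration only, not part of the proof; the dimension and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 40
x = rng.standard_normal(d)
x /= np.linalg.norm(x)  # random unit vector

# A = [1, x; 1, 0^d] as a 2 x (d+1) matrix.
A = np.zeros((2, d + 1))
A[0, 0] = 1.0
A[0, 1:] = x
A[1, 0] = 1.0

def sq_err(B):
    return np.linalg.norm(A - B, 'fro') ** 2

# Candidate 1: [1, 0^d; 1, 0^d], squared error 1.
B1 = np.zeros_like(A); B1[:, 0] = 1.0
# Candidate 2: [1, x/2; 1, x/2], squared error 1/2.
B2 = np.tile(np.concatenate(([1.0], x / 2)), (2, 1))
# Optimal rank-1 approximation via the SVD, squared error (3 - sqrt(5))/2.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Bopt = s[0] * np.outer(U[:, 0], Vt[0])

print(sq_err(B1), sq_err(B2), sq_err(Bopt))
```

The best rank-1 error is strictly below $1/2$, so a near-optimal projector must correlate its second row with $x$, which is exactly what the reduction below exploits.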
We consider the following two-player problem between Alice and Bob. Alice has a $ck/\epsilon \times d$ matrix $A$ which can be written as a block matrix $[I, R]$, where $I$ is the $ck/\epsilon \times ck/\epsilon$ identity matrix and $R$ is a $ck/\epsilon \times (d - ck/\epsilon)$ matrix in which the entries are in $\{-1/(d-ck/\epsilon)^{1/2}, +1/(d-ck/\epsilon)^{1/2}\}$. Here $[I, R]$ means we append the columns of $I$ to the left of the columns of $R$. Bob is given a set of $k$ standard unit vectors $e_{i_1}, \ldots, e_{i_k}$, for distinct $i_1, \ldots, i_k \in [ck/\epsilon] = \{1, 2, \ldots, ck/\epsilon\}$. Here we need $c/\epsilon > 1$, but we can assume $\epsilon$ is less than a sufficiently small constant, as otherwise we would just need to prove an $\Omega(kd)$ lower bound, which is established by Lemma 2.

Let $B$ be the matrix $[A; e_{i_1}, \ldots, e_{i_k}]$ obtained by stacking $A$ on top of the vectors $e_{i_1}, \ldots, e_{i_k}$. The goal is for Bob to output a rank-$k$ projection matrix $P \in \mathbb{R}^{d \times d}$ for which $\|B - BP\|_F \leq (1+\epsilon)\|B - B_k\|_F$.

Denote this problem by $f$. We will show that the randomized 1-way communication complexity of this problem, $R^{1\text{-}way}_{1/4}(f)$, in which Alice sends a single message to Bob and Bob fails with probability at most $1/4$, is $\Omega(kd/\epsilon)$ bits. More precisely, let $\mu$ be the following product distribution on Alice and Bob's inputs: the entries of $R$ are chosen independently and uniformly at random in $\{-1/(d-ck/\epsilon)^{1/2}, +1/(d-ck/\epsilon)^{1/2}\}$, while $\{i_1, \ldots, i_k\}$ is a uniformly random set among all sets of $k$ distinct indices in $[ck/\epsilon]$. We will show that $D^{1\text{-}way}_{\mu,1/4}(f) = \Omega(kd/\epsilon)$, where $D^{1\text{-}way}_{\mu,1/4}(f)$ denotes the minimum communication cost over all deterministic 1-way (from Alice to Bob) protocols which fail with probability at most $1/4$ when the inputs are distributed according to $\mu$. By Yao's minimax principle (see, e.g., [14]), $R^{1\text{-}way}_{1/4}(f) \geq D^{1\text{-}way}_{\mu,1/4}(f)$.

We use the following two-player problem Index in order to lower bound $D^{1\text{-}way}_{\mu,1/4}(f)$. In this problem Alice is given a string $x \in \{0,1\}^r$, while Bob is given an index $i \in [r]$. Alice sends a single message to Bob, who needs to output $x_i$ with probability at least $2/3$. Again by Yao's minimax principle, we have $R^{1\text{-}way}_{1/3}(\mathrm{Index}) \geq D^{1\text{-}way}_{\nu,1/3}(\mathrm{Index})$, where $\nu$ is the distribution for which $x$ and $i$ are chosen independently and uniformly at random from their respective domains. The following is well-known.

Fact 3. [13] $D^{1\text{-}way}_{\nu,1/3}(\mathrm{Index}) = \Omega(r)$.

Theorem 4. For $c$ a small enough positive constant and $d \geq k/\epsilon$, we have $D^{1\text{-}way}_{\mu,1/4}(f) = \Omega(dk/\epsilon)$.

Proof. We reduce from the Index problem with $r = (ck/\epsilon)(d - ck/\epsilon)$. Alice, given her string $x$ for Index, creates the $ck/\epsilon \times d$ matrix $A = [I, R]$ as follows. The matrix $I$ is the $ck/\epsilon \times ck/\epsilon$ identity matrix, while the matrix $R$ is a $ck/\epsilon \times (d-ck/\epsilon)$ matrix with entries in $\{-1/(d-ck/\epsilon)^{1/2}, +1/(d-ck/\epsilon)^{1/2}\}$. For an arbitrary bijection between the coordinates of $x$ and the entries of $R$, Alice sets a given entry in $R$ to $-1/(d-ck/\epsilon)^{1/2}$ if the corresponding coordinate of $x$ is $0$, and otherwise Alice sets the given entry in $R$ to $+1/(d-ck/\epsilon)^{1/2}$. In the Index problem, Bob is given an index which, under the bijection between coordinates of $x$ and entries of $R$, corresponds to being given a row index $i$ and an entry $j$ in the $i$-th row of $R$ that he needs to recover.
He sets $i_\ell = i$ for a random $\ell \in [k]$, and chooses $k-1$ distinct random indices $i_j \in [ck/\epsilon] \setminus \{i_\ell\}$, for $j \in [k] \setminus \{\ell\}$. Observe that if $(x, i) \sim \nu$, then $(R, i_1, \ldots, i_k) \sim \mu$. Suppose there is a protocol in which Alice sends a single message to Bob, who solves $f$ with probability at least $3/4$ under $\mu$. We show that this can be used to solve Index with probability at least $2/3$ under $\nu$. The theorem will follow by Fact 3. Consider the matrix $B$ which is the matrix $A$ stacked on top of the rows $e_{i_1}, \ldots, e_{i_k}$, in that order, so that $B$ has $ck/\epsilon + k$ rows.

We proceed to lower bound $\|B - BP\|_F^2$ in a certain way, which will allow our reduction to Index to be carried out. We need the following fact:

Fact 5. ((2.4) of [20]) Let $A$ be an $m \times n$ matrix with i.i.d. entries which are each $+1/\sqrt{n}$ with probability $1/2$ and $-1/\sqrt{n}$ with probability $1/2$, and suppose $m/n < 1$. Then for all $t > 0$,
$$\Pr[\|A\|_2 > 1 + t + \sqrt{m/n}] \leq \alpha e^{-\alpha' n t^{3/2}},$$
where $\alpha, \alpha' > 0$ are absolute constants. Here $\|A\|_2$ is the operator norm $\sup_x \|Ax\|/\|x\|$ of $A$.

We apply Fact 5 to the matrix $R$, which implies
$$\Pr[\|R\|_2 > 1 + \sqrt{c} + \sqrt{(ck/\epsilon)/(d - ck/\epsilon)}] \leq \alpha e^{-\alpha'(d - ck/\epsilon)c^{3/4}},$$
and using that $d \geq k/\epsilon$ and $c > 0$ is a sufficiently small constant, this implies
$$\Pr[\|R\|_2 > 1 + 3\sqrt{c}] \leq e^{-\beta d}, \qquad (1)$$
where $\beta > 0$ is an absolute constant (depending on $c$). Note that for $c > 0$ sufficiently small, $(1 + 3\sqrt{c})^2 \leq 1 + 7\sqrt{c}$.
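The concentration in Fact 5 is easy to probe empirically (the dimensions below are arbitrary illustrative choices, not ones used in the proof):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 40, 4000
# i.i.d. +/- 1/sqrt(n) entries, as in Fact 5.
A = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(n)
op_norm = np.linalg.norm(A, 2)  # largest singular value
print(op_norm)  # typically close to 1 + sqrt(m/n) = 1.1
```

Each row of $A$ has unit norm, so $\|A\|_2 \geq 1$ always; Fact 5 says the excess over $1 + \sqrt{m/n}$ is exponentially unlikely, which is what drives the bound (1) on $\|R\|_2$.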
Let E be the event that $\|R\|_2^2 \leq 1 + 7\sqrt{c}$, which we condition on.

We partition the rows of $B$ into $B_1$ and $B_2$, where $B_1$ contains those rows whose projection onto the first $ck/\epsilon$ coordinates equals $e_i$ for some $i \notin \{i_1, \ldots, i_k\}$. Note that $B_1$ is $(ck/\epsilon - k) \times d$ and $B_2$ is $2k \times d$. Here, $B_2$ is $2k \times d$ since it includes the rows of $A$ indexed by $i_1, \ldots, i_k$, together with the rows $e_{i_1}, \ldots, e_{i_k}$. Let us also partition the rows of $R$ into $R_T$ and $R_S$, so that the union of the rows in $R_T$ and in $R_S$ is equal to $R$: the rows of $R_T$ are the rows of $R$ in $B_1$, and the rows of $R_S$ are the non-zero rows of $R$ in $B_2$ (note that $k$ of the rows are non-zero and $k$ are zero in $B_2$ restricted to the columns of $R$).

Lemma 6. For any unit vector $u$, write $u = u_R + u_S + u_T$, where $S = \{i_1, \ldots, i_k\}$, $T = [ck/\epsilon] \setminus S$, and $R = [d] \setminus [ck/\epsilon]$, and where $u_A$ for a set $A$ is $0$ on indices $j \notin A$. Then, conditioned on E occurring,
$$\|Bu\|^2 \leq (1 + 7\sqrt{c})(2 - \|u_T\|^2 - \|u_R\|^2 + 2\|u_S + u_T\|\|u_R\|).$$

[Figure: the block structure of $B$, with row blocks $B_1$ and $B_2$, the identity columns split into the index sets $S$ and $T$, the sign part split into $R_S$ and $R_T$; Alice holds the top $ck/\epsilon$ rows and Bob the bottom $k$ rows.]

Proof. Let $C$ be the matrix consisting of the top $ck/\epsilon$ rows of $B$, so that $C$ has the form $[I, R]$, where $I$ is a $ck/\epsilon \times ck/\epsilon$ identity matrix. By construction of $B$, $\|Bu\|^2 = \|u_S\|^2 + \|Cu\|^2$.
Now $Cu = u_S + u_T + Ru_R$, and so
$$\|Cu\|^2 = \|u_S + u_T\|^2 + \|Ru_R\|^2 + 2(u_S + u_T)^T R u_R$$
$$\leq \|u_S + u_T\|^2 + (1 + 7\sqrt{c})\|u_R\|^2 + 2\|u_S + u_T\|\|Ru_R\|$$
$$\leq (1 + 7\sqrt{c})(\|u_S\|^2 + \|u_T\|^2 + \|u_R\|^2) + (1 + 3\sqrt{c}) \cdot 2\|u_S + u_T\|\|u_R\|$$
$$\leq (1 + 7\sqrt{c})(1 + 2\|u_S + u_T\|\|u_R\|),$$
and so
$$\|Bu\|^2 \leq (1 + 7\sqrt{c})(1 + \|u_S\|^2 + 2\|u_S + u_T\|\|u_R\|) = (1 + 7\sqrt{c})(2 - \|u_R\|^2 - \|u_T\|^2 + 2\|u_S + u_T\|\|u_R\|),$$
where the last equality uses $\|u_S\|^2 = 1 - \|u_T\|^2 - \|u_R\|^2$ for a unit vector $u$.

We will also make use of the following simple but tedious fact, shown in the full version.

Fact 7. For $x \in [0, 1]$, the function $f(x) = 2x\sqrt{1 - x^2} - x^2$ is maximized when $x = \sqrt{1/2 - \sqrt{5}/10}$. We define $\zeta$ to be the value of $f(x)$ at its maximum, where $\zeta = 2/\sqrt{5} + \sqrt{5}/10 - 1/2 \approx .618$.

Corollary 8. Conditioned on E occurring, $\|B\|_2^2 \leq (1 + 7\sqrt{c})(2 + \zeta)$.

Proof. By Lemma 6, for any unit vector $u$,
$$\|Bu\|^2 \leq (1 + 7\sqrt{c})(2 - \|u_T\|^2 - \|u_R\|^2 + 2\|u_S + u_T\|\|u_R\|).$$
Suppose we replace the vector $u_S + u_T$ with an arbitrary vector supported on coordinates in $S$ with the same norm as $u_S + u_T$.
Then the right hand side of this expression cannot increase, which means it is maximized when $\|u_T\| = 0$, for which it equals $(1 + 7\sqrt{c})(2 - \|u_R\|^2 + 2\sqrt{1 - \|u_R\|^2}\,\|u_R\|)$, and setting $\|u_R\|$ to equal the $x$ in Fact 7, we see that this expression is at most $(1 + 7\sqrt{c})(2 + \zeta)$.

Write the projection matrix $P$ output by the streaming algorithm as $UU^T$, where $U$ is $d \times k$ with orthonormal columns $u^i$ (so $R^{\dagger}R = P$ in the notation of Section 1). Applying Lemma 6 and Fact 7 to each of the columns $u^i$, we show in the full version:
$$\|BP\|_F^2 \leq (1 + 7\sqrt{c})\Big((2 + \zeta)k - \sum_{i=1}^k \|u^i_T\|^2\Big). \qquad (2)$$
Using the matrix Pythagorean theorem, we thus have
$$\|B - BP\|_F^2 = \|B\|_F^2 - \|BP\|_F^2$$
$$\geq 2ck/\epsilon + k - (1 + 7\sqrt{c})\Big((2 + \zeta)k - \sum_{i=1}^k \|u^i_T\|^2\Big) \quad \text{using } \|B\|_F^2 = 2ck/\epsilon + k$$
$$\geq 2ck/\epsilon + k - (1 + 7\sqrt{c})(2 + \zeta)k + (1 + 7\sqrt{c})\sum_{i=1}^k \|u^i_T\|^2. \qquad (3)$$

We now argue that $\|B - BP\|_F^2$ cannot be too large if Alice and Bob succeed in solving $f$. First, we need to upper bound $\|B - B_k\|_F^2$. To do so, we create a rank-$k$ matrix $\tilde{B}_k$ and bound $\|B - \tilde{B}_k\|_F^2$. Matrix $\tilde{B}_k$ will be $0$ on the rows in $B_1$. We can group the rows of $B_2$ into $k$ pairs so that each pair has the form $(e_i + v^i; e_i)$, where $i \in [ck/\epsilon]$ and $v^i$ is a unit vector supported on $[d] \setminus [ck/\epsilon]$. We let $Y_i$ be the optimal (in Frobenius norm) rank-1 approximation to the matrix $[e_i + v^i; e_i]$. By direct computation,1 the maximum squared singular value of this matrix is $2 + \zeta$. Our matrix $\tilde{B}_k$ then consists of a single $Y_i$ for each pair in $B_2$.
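The two constants used here, $\zeta \approx .618$ and the maximum squared singular value $2 + \zeta$ of $[e_i + v^i; e_i]$, can be verified numerically (an illustrative check, not part of the proof):

```python
import numpy as np

# Maximize f(x) = 2x*sqrt(1 - x^2) - x^2 on [0, 1] by a fine grid search.
xs = np.linspace(0.0, 1.0, 2_000_001)
f = 2 * xs * np.sqrt(1 - xs ** 2) - xs ** 2
zeta = f.max()
print(zeta, xs[f.argmax()])  # zeta ~ 0.618, argmax ~ sqrt(1/2 - sqrt(5)/10)

# Singular values of [e + v; e]: with v a unit vector orthogonal to e,
# this matrix is [[1, 1], [1, 0]] in the two-dimensional basis (e, v).
M = np.array([[1.0, 1.0], [1.0, 0.0]])
s = np.linalg.svd(M, compute_uv=False)
print(s[0] ** 2)  # top squared singular value, equal to 2 + zeta
```

Note $2 + \zeta = (3 + \sqrt{5})/2$, the square of the golden ratio, which is why the same constant shows up both in Fact 7 and in the spectrum of the $2 \times 2$ primitive.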
Observe that $\tilde{B}_k$ has rank at most $k$ and
$$\|B - B_k\|_F^2 \leq \|B - \tilde{B}_k\|_F^2 \leq 2ck/\epsilon + k - (2 + \zeta)k.$$

1 For an online SVD calculator, see http://www.bluebit.gr/matrix-calculator/

Therefore, if Bob succeeds in solving $f$ on input $B$, then
$$\|B - BP\|_F^2 \leq (1 + \epsilon)(2ck/\epsilon + k - (2 + \zeta)k) \leq 2ck/\epsilon + k - (2 + \zeta)k + 2ck. \qquad (4)$$
Comparing (3) and (4), we arrive at, conditioned on E:
$$\sum_{i=1}^k \|u^i_T\|^2 \leq \frac{1}{1 + 7\sqrt{c}} \cdot \big(7\sqrt{c}(2 + \zeta)k + 2ck\big) \leq c_1 k, \qquad (5)$$
where $c_1 > 0$ is a constant that can be made arbitrarily small by making $c > 0$ arbitrarily small.

Since $P$ is a projector, $\|BP\|_F = \|BU\|_F$. Write $U = \hat{U} + \bar{U}$, where the vectors in $\hat{U}$ are supported on $T$, and the vectors in $\bar{U}$ are supported on $[d] \setminus T$. We have
$$\|B\hat{U}\|_F^2 \leq \|B\|_2^2 \, c_1 k \leq (1 + 7\sqrt{c})(2 + \zeta)c_1 k \leq c_2 k,$$
where the first inequality uses $\|B\hat{U}\|_F \leq \|B\|_2 \|\hat{U}\|_F$ and (5), the second inequality uses that event E occurs, and the third inequality holds for a constant $c_2 > 0$ that can be made arbitrarily small by making the constant $c > 0$ arbitrarily small.

Combining with (4) and using the triangle inequality,
$$\|B\bar{U}\|_F \geq \|BP\|_F - \|B\hat{U}\|_F \quad \text{using the triangle inequality}$$
$$\geq \|BP\|_F - \sqrt{c_2 k} \quad \text{using our bound on } \|B\hat{U}\|_F^2$$
$$= \sqrt{\|B\|_F^2 - \|B - BP\|_F^2} - \sqrt{c_2 k} \quad \text{by the matrix Pythagorean theorem}$$
$$\geq \sqrt{(2 + \zeta)k - 2ck} - \sqrt{c_2 k} \quad \text{by (4)}$$
$$\geq \sqrt{(2 + \zeta)k - c_3 k}, \qquad (6)$$
where $c_3 > 0$ is a constant that can be made arbitrarily small for $c > 0$ an arbitrarily small constant (note that $c_2 > 0$ also becomes arbitrarily small as $c > 0$ becomes arbitrarily small). Hence, $\|B\bar{U}\|_F^2 \geq (2 + \zeta)k - c_3 k$, and together with Corollary 8, this implies $\|\bar{U}\|_F^2 \geq k - c_4 k$ for a constant $c_4$ that can be made arbitrarily small by making $c > 0$ arbitrarily small.

Our next goal is to show that $\|B_2 \bar{U}\|_F^2$ is almost as large as $\|B\bar{U}\|_F^2$. Consider any column $\bar{u}$ of $\bar{U}$, and write it as $\bar{u}_S + \bar{u}_R$. Hence,
$$\|B\bar{u}\|^2 = \|R_T \bar{u}_R\|^2 + \|B_2 \bar{u}\|^2 \quad \text{using } B_1\bar{u} = R_T \bar{u}_R$$
$$\leq \|R_T \bar{u}_R\|^2 + \|\bar{u}_S + R_S \bar{u}_R\|^2 + \|\bar{u}_S\|^2 \quad \text{by definition of the components}$$
$$= \|R\bar{u}_R\|^2 + 2\|\bar{u}_S\|^2 + 2\bar{u}_S^T R_S \bar{u}_R \quad \text{using the Pythagorean theorem}$$
$$\leq 1 + 7\sqrt{c} + \|\bar{u}_S\|^2 + 2\|\bar{u}_S\|\|R_S \bar{u}_R\|$$
(also using Cauchy-Schwarz to bound the last term). Suppose $\|R_S \bar{u}_R\| = \tau\|\bar{u}_R\|$ for a value $0 \leq \tau \leq 1 + 7\sqrt{c}$.
Then
$$\|B\bar{u}\|^2 \leq 1 + 7\sqrt{c} + \|\bar{u}_S\|^2 + 2\tau\|\bar{u}_S\|\sqrt{1 - \|\bar{u}_S\|^2},$$
using $\|R\bar{u}_R\|^2 \leq (1 + 7\sqrt{c})\|\bar{u}_R\|^2$ and $\|\bar{u}_R\|^2 + \|\bar{u}_S\|^2 \leq 1$. We thus have
$$\|B\bar{u}\|^2 \leq 1 + 7\sqrt{c} + (1 - \tau)\|\bar{u}_S\|^2 + \tau\big(\|\bar{u}_S\|^2 + 2\|\bar{u}_S\|\sqrt{1 - \|\bar{u}_S\|^2}\big)$$
$$\leq 1 + 7\sqrt{c} + (1 - \tau) + \tau(1 + \zeta) \quad \text{by Fact 7}$$
$$\leq 2 + \tau\zeta + 7\sqrt{c}, \qquad (7)$$
and hence, letting $\tau_1, \ldots, \tau_k$ denote the corresponding values of $\tau$ for the $k$ columns of $\bar{U}$, we have
$$\|B\bar{U}\|_F^2 \leq (2 + 7\sqrt{c})k + \zeta \sum_{i=1}^k \tau_i. \qquad (8)$$
Comparing the square of (6) with (8), we have
$$\sum_{i=1}^k \tau_i \geq k - c_5 k, \qquad (9)$$
where $c_5 > 0$ is a constant that can be made arbitrarily small by making $c > 0$ an arbitrarily small constant.
Now, $\|\bar{U}\|_F^2 \geq k - c_4 k$ as shown above, while $\|R_S \bar{u}_R\| = \tau_i \|\bar{u}_R\|$ if $\bar{u}_R$ is (the restriction to $R$ of) the $i$-th column of $\bar{U}$, so by (9) we have
$$\|R_S \bar{U}_R\|_F^2 \geq (1 - c_6)k \qquad (10)$$
for a constant $c_6$ that can be made arbitrarily small by making $c > 0$ an arbitrarily small constant. Now $\|R\bar{U}_R\|_F^2 \leq (1 + 7\sqrt{c})k$ since event E occurs, and $\|R\bar{U}_R\|_F^2 = \|R_T \bar{U}_R\|_F^2 + \|R_S \bar{U}_R\|_F^2$ since the rows of $R$ are the concatenation of the rows of $R_S$ and $R_T$, so combining with (10) we arrive at
$$\|R_T \bar{U}_R\|_F^2 \leq c_7 k \qquad (11)$$
for a constant $c_7 > 0$ that can be made arbitrarily small by making $c > 0$ arbitrarily small.

Combining the square of (6) with (11), we thus have
$$\|B_2 \bar{U}\|_F^2 = \|B\bar{U}\|_F^2 - \|B_1 \bar{U}\|_F^2 = \|B\bar{U}\|_F^2 - \|R_T \bar{U}_R\|_F^2 \geq (2 + \zeta)k - c_3 k - c_7 k \geq (2 + \zeta)k - c_8 k, \qquad (12)$$
where the constant $c_8 > 0$ can be made arbitrarily small by making $c > 0$ arbitrarily small.

By the triangle inequality,
$$\|B_2 U\|_F \geq \|B_2 \bar{U}\|_F - \|B_2 \hat{U}\|_F \geq ((2 + \zeta)k - c_8 k)^{1/2} - (c_2 k)^{1/2}. \qquad (13)$$
Hence,
$$\|B_2 - B_2 P\|_F = \sqrt{\|B_2\|_F^2 - \|B_2 U\|_F^2} \quad \text{(matrix Pythagorean theorem, } \|B_2 U\|_F = \|B_2 P\|_F\text{)}$$
$$\leq \sqrt{\|B_2\|_F^2 - (\|B_2 \bar{U}\|_F - \|B_2 \hat{U}\|_F)^2} \quad \text{(triangle inequality)}$$
$$\leq \sqrt{3k - \big(((2 + \zeta)k - c_8 k)^{1/2} - (c_2 k)^{1/2}\big)^2} \quad \text{using (13) and } \|B_2\|_F^2 = 3k, \qquad (14)$$
or equivalently,
$$\|B_2 - B_2 P\|_F^2 \leq 3k - ((2 + \zeta)k - c_8 k) - c_2 k + 2k(((2 + \zeta) - c_8)c_2)^{1/2}$$
$$\leq (1 - \zeta)k + c_8 k + 2k(((2 + \zeta) - c_8)c_2)^{1/2} \leq (1 - \zeta)k + c_9 k \qquad (15)$$
for a constant $c_9 > 0$ that can be made arbitrarily small by making the constant $c > 0$ small enough. This intuitively says that $P$ provides a good low rank approximation for the matrix $B_2$. Notice that by (14),
$$\|B_2 P\|_F^2 = \|B_2\|_F^2 - \|B_2 - B_2 P\|_F^2 \geq 3k - (1 - \zeta)k - c_9 k \geq (2 + \zeta)k - c_9 k. \qquad (16)$$

Now $B_2$ is a $2k \times d$ matrix, and we can partition its rows into $k$ pairs of rows of the form $Z_\ell = (e_{i_\ell} + R_{i_\ell}; e_{i_\ell})$, for $\ell = 1, \ldots, k$. Here we abuse notation and think of $R_{i_\ell}$ as a $d$-dimensional vector with its first $ck/\epsilon$ coordinates set to $0$. Each such pair of rows is a rank-2 matrix, which we abuse notation and call $Z_\ell^T$. By direct computation,2 $Z_\ell^T$ has maximum squared singular value $2 + \zeta$. We would like to argue that the projection of $P$ onto the row span of most $Z_\ell$ has length very close to $1$. To this end, for each $Z_\ell$ consider the orthonormal basis $V_\ell^T$ of right singular vectors for its row space (which is $\mathrm{span}(e_{i_\ell}, R_{i_\ell})$). We let $v_{\ell,1}^T, v_{\ell,2}^T$ be these two right singular vectors, with corresponding singular values $\sigma_1$ and $\sigma_2$ (which will be the same for all $\ell$; see below). We are interested in the quantity $\Delta = \sum_{\ell=1}^k \|V_\ell^T P\|_F^2$, which intuitively measures how much of $P$ gets projected onto the row spaces of the $Z_\ell^T$.
The following lemma and corollary are shown in the full version.

Lemma 9. Conditioned on event E, $\Delta \in [k - c_{10}k, k + c_{10}k]$, where $c_{10} > 0$ is a constant that can be made arbitrarily small by making $c > 0$ arbitrarily small.

Corollary 10. Conditioned on event E, for a $1 - \sqrt{c_9 + 2c_{10}}$ fraction of $\ell \in [k]$, $\|V_\ell^T P\|_F^2 \leq 1 + c_{11}$, and for a $99/100$ fraction of $\ell \in [k]$, we have $\|V_\ell^T P\|_F^2 \geq 1 - c_{11}$, where $c_{11} > 0$ is a constant that can be made arbitrarily small by making the constant $c > 0$ arbitrarily small.

2 We again used the calculator at http://www.bluebit.gr/matrix-calculator/

Recall that Bob holds $i = i_\ell$ for a random $\ell \in [k]$. It follows (conditioned on E) by a union bound that with probability at least $49/50$, $\|V_\ell^T P\|_F^2 \in [1 - c_{11}, 1 + c_{11}]$, which we call the event F and condition on. We also condition on the event G that $\|Z_\ell^T P\|_F^2 \geq (2 + \zeta) - c_{12}$, for a constant $c_{12} > 0$ that can be made arbitrarily small by making $c > 0$ an arbitrarily small constant. Combining the first part of Corollary 10 together with (16), event G holds with probability at least $99.5/100$, provided $c > 0$ is a sufficiently small constant. By a union bound, it follows that E, F, and G occur simultaneously with probability at least $49/51$.

As $\|Z_\ell^T P\|_F^2 = \sigma_1^2 \|v_{\ell,1}^T P\|^2 + \sigma_2^2 \|v_{\ell,2}^T P\|^2$, with $\sigma_1^2 = 2 + \zeta$ and $\sigma_2^2 = 1 - \zeta$, events E, F, and G imply that $\|v_{\ell,1}^T P\|^2 \geq 1 - c_{13}$, where $c_{13} > 0$ is a constant that can be made arbitrarily small by making the constant $c > 0$ arbitrarily small.
Observe that $\|v_{\ell,1}^T P\|^2 = \langle v_{\ell,1}, z\rangle^2$, where $z$ is a unit vector in the direction of the projection of $v_{\ell,1}$ onto $P$. By the Pythagorean theorem, $\|v_{\ell,1} - \langle v_{\ell,1}, z\rangle z\|^2 = 1 - \langle v_{\ell,1}, z\rangle^2$, and so

$$\|v_{\ell,1} - \langle v_{\ell,1}, z\rangle z\|^2 \le c_{14}, \qquad (17)$$

for a constant $c_{14} > 0$ that can be made arbitrarily small by making $c > 0$ arbitrarily small.

We thus have $Z_\ell^T P = \sigma_1\langle v_{\ell,1}, z\rangle u_{\ell,1}z^T + \sigma_2\langle v_{\ell,2}, w\rangle u_{\ell,2}w^T$, where $w$ is a unit vector in the direction of the projection of $v_{\ell,2}$ onto $P$, and $u_{\ell,1}, u_{\ell,2}$ are the left singular vectors of $Z_\ell^T$. Since $F$ occurs, we have that $|\langle v_{\ell,2}, w\rangle| \le c_{11}$, where $c_{11} > 0$ is a constant that can be made arbitrarily small by making the constant $c > 0$ arbitrarily small. It follows now by (17) that

$$\|Z_\ell^T P - \sigma_1 u_{\ell,1}v_{\ell,1}^T\|_F^2 \le c_{15}, \qquad (18)$$

where $c_{15} > 0$ is a constant that can be made arbitrarily small by making the constant $c > 0$ arbitrarily small.

By direct calculation,$^3$ $u_{\ell,1} = -.851e_{i_\ell} - .526R_{i_\ell}$ and $v_{\ell,1} = -.851e_{i_\ell} - .526R_{i_\ell}$. It follows that $\|Z_\ell^T P - (2+\zeta)[.724e_{i_\ell} + .448R_{i_\ell}; .448e_{i_\ell} + .277R_{i_\ell}]\|_F^2 \le c_{15}$. Since $e_{i_\ell}$ is the second row of $Z_\ell^T$, it follows that $\|e_{i_\ell}^T P - (2+\zeta)(.448e_{i_\ell} + .277R_{i_\ell})\|^2 \le c_{15}$.

Observe that Bob has $e_{i_\ell}$ and $P$, and can therefore compute $e_{i_\ell}^T P$.
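Bob's decoding step can be simulated in the same idealized orthonormal setting as above (illustration only: a single block, $e$ and $R$ exactly orthonormal, and $P$ taken to be the exact projection onto the block's top singular direction). The coordinates of $e^T P$ on the support of $R$ then carry exactly the signs of $R$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 20, 16            # the last m coordinates carry the sign vector

# Hypothetical instance: e on coordinate 0, R a unit-norm signed vector
# supported on the last m coordinates (the signs are what Bob must recover).
e = np.zeros(d); e[0] = 1.0
signs = rng.choice([-1.0, 1.0], size=m)
R = np.zeros(d); R[d - m:] = signs / np.sqrt(m)

Z = np.vstack([e + R, e])
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
v1 = Vt[0] * np.sign(Vt[0, 0])   # fix the sign so <e, v1> > 0
P = np.outer(v1, v1)             # projection onto the top singular direction

# Bob's decoding: read off the sign of each coordinate of e^T P.
x = e @ P                        # here equals .724*e + .448*R
recovered = np.sign(x[d - m:])
assert np.array_equal(recovered, signs)
```

In the actual proof $P$ is only close to such a projection, which is why the argument settles for recovering a $1-c_{16}$ fraction of the signs rather than all of them.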
Moreover, as $c_{15} > 0$ can be made arbitrarily small by making the constant $c > 0$ arbitrarily small, it follows that a $1-c_{16}$ fraction of the signs of coordinates of $e_{i_\ell}^T P$, restricted to coordinates in $[d] \setminus [ck/\epsilon]$, must agree with those of $(2+\zeta)\cdot.277R_{i_\ell}$, which in turn agree with those of $R_{i_\ell}$. Here $c_{16} > 0$ is a constant that can be made arbitrarily small by making the constant $c > 0$ arbitrarily small. Hence, in particular, the sign of the $j$-th coordinate of $R_{i_\ell}$, which Bob needs to output, agrees with that of the $j$-th coordinate of $e_{i_\ell}^T P$ with probability at least $1-c_{16}$. Call this event $H$.

By a union bound over the occurrence of events $E$, $F$, $G$, and $H$, and the streaming algorithm succeeding (which occurs with probability $3/4$), it follows that Bob succeeds in solving Index with probability at least $49/51 - 1/4 - c_{16} > 2/3$, as required. This completes the proof.

3 Conclusion

We have shown an $\Omega(dk/\epsilon)$ bit lower bound for streaming algorithms in the row-update model for outputting a $k \times d$ matrix $R$ with $\|A - AR^{\dagger}R\|_F \le (1+\epsilon)\|A - A_k\|_F$, thus showing that the algorithm of [9] is optimal up to the word size. The next natural goal would be to obtain multi-pass lower bounds, which seem quite challenging.
Such lower bound techniques may also be useful for showing the optimality of a constant-round $O(sdk/\epsilon) + (sk/\epsilon)^{O(1)}$ communication protocol in [12] for low-rank approximation in the distributed communication model.

Acknowledgments. I would like to thank Edo Liberty and Jeff Phillips for many useful discussions and detailed comments on this work (thanks to Jeff for the figure!). I would also like to thank the XDATA program of the Defense Advanced Research Projects Agency (DARPA), administered through Air Force Research Laboratory contract FA8750-12-C0323, for supporting this work.

$^3$Using the online calculator in earlier footnotes.

References

[1] N. Alon, P. B. Gibbons, Y. Matias, and M. Szegedy. Tracking join and self-join sizes in limited storage. J. Comput. Syst. Sci., 64(3):719–747, 2002.
[2] A. Andoni and H. L. Nguyen. Eigenvalues of a matrix in the streaming model. In SODA, pages 1729–1737, 2013.
[3] M. Brand. Incremental singular value decomposition of uncertain data with missing values. In ECCV (1), pages 707–720, 2002.
[4] M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. Theor. Comput. Sci., 312(1):3–15, 2004.
[5] K. L. Clarkson and D. P. Woodruff. Numerical linear algebra in the streaming model. In STOC, pages 205–214, 2009.
[6] G. Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. J. Algorithms, 55(1):58–75, 2005.
[7] D. Feldman, M. Schmidt, and C. Sohler. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In SODA, pages 1434–1453, 2013.
[8] A. M. Frieze, R. Kannan, and S. Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. J. ACM, 51(6):1025–1041, 2004.
[9] M. Ghashami and J. M. Phillips. Relative errors for deterministic low-rank matrix approximations.
In SODA, pages 707–717, 2014.
[10] G. H. Golub and C. F. van Loan. Matrix Computations (3rd ed.). Johns Hopkins University Press, 1996.
[11] P. M. Hall, A. D. Marshall, and R. R. Martin. Incremental eigenanalysis for classification. In BMVC, pages 1–10, 1998.
[12] R. Kannan, S. Vempala, and D. P. Woodruff. Nimble algorithms for cloud computing. CoRR, 2013.
[13] I. Kremer, N. Nisan, and D. Ron. On randomized one-round communication complexity. Computational Complexity, 8(1):21–49, 1999.
[14] E. Kushilevitz and N. Nisan. Communication Complexity. Cambridge University Press, 1997.
[15] A. Levy and M. Lindenbaum. Efficient sequential Karhunen-Loève basis extraction. In ICCV, page 739, 2001.
[16] E. Liberty. Simple and deterministic matrix sketching. In KDD, pages 581–588, 2013.
[17] J. Misra and D. Gries. Finding repeated elements. Sci. Comput. Program., 2(2):143–152, 1982.
[18] S. Muthukrishnan. Data streams: Algorithms and applications. Foundations and Trends in Theoretical Computer Science, 1(2), 2005.
[19] D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang. Incremental learning for robust visual tracking. International Journal of Computer Vision, 77(1-3):125–141, 2008.
[20] M. Rudelson and R. Vershynin. Non-asymptotic theory of random matrices: extreme singular values. CoRR, 2010.
[21] T. Sarlós. Improved approximation algorithms for large matrices via random projections. In FOCS, pages 143–152, 2006.