{"title": "QUIC-SVD: Fast SVD Using Cosine Trees", "book": "Advances in Neural Information Processing Systems", "page_first": 673, "page_last": 680, "abstract": "The Singular Value Decomposition is a key operation in many machine learning methods. Its computational cost, however, makes it unscalable and impractical for the massive-sized datasets becoming common in applications. We present a new method, QUIC-SVD, for fast approximation of the full SVD with automatic sample size minimization and empirical relative error control. Previous Monte Carlo approaches have not addressed the full SVD nor benefited from the efficiency of automatic, empirically-driven sample sizing. Our empirical tests show speedups of several orders of magnitude over exact SVD. Such scalability should enable QUIC-SVD to meet the needs of a wide array of methods and applications.", "full_text": "QUIC-SVD: Fast SVD Using Cosine Trees\n\nMichael P. Holmes, Alexander G. Gray and Charles Lee Isbell, Jr.\n\nCollege of Computing\nGeorgia Tech\nAtlanta, GA 30327\n\n{mph, agray, isbell}@cc.gatech.edu\n\nAbstract\n\nThe Singular Value Decomposition is a key operation in many machine learning methods. Its computational cost, however, makes it unscalable and impractical for applications involving large datasets or real-time responsiveness, which are becoming increasingly common. We present a new method, QUIC-SVD, for fast approximation of the whole-matrix SVD based on a new sampling mechanism called the cosine tree. Our empirical tests show speedups of several orders of magnitude over exact SVD. Such scalability should enable QUIC-SVD to accelerate and enable a wide array of SVD-based methods and applications.\n\n1 Introduction\n\nThe Singular Value Decomposition (SVD) is a fundamental linear algebraic operation whose abundant useful properties have placed it at the computational center of many methods in machine learning and related fields. 
Principal component analysis (PCA) and its kernel and nonlinear variants are prominent examples, and countless other instances are found in manifold and metric learning, clustering, natural language processing/search, collaborative filtering, bioinformatics and more. Notwithstanding the utility of the SVD, it is critically bottlenecked by a computational complexity that renders it impractical on massive datasets. Yet massive datasets are increasingly common in applications, many of which require real-time responsiveness. Such applications could use SVD-based methods more liberally if the SVD were not so slow to compute. We present a new method, QUIC-SVD, for fast, sample-based SVD approximation with automatic relative error control. This algorithm is based on a new type of data partitioning tree, the cosine tree, that shows excellent ability to home in on the subspaces needed for good SVD approximation. We demonstrate several-order-of-magnitude speedups on medium-sized datasets, and verify that approximation error is properly controlled. Based on these results, QUIC-SVD seems able to help address the scale of modern problems and datasets, with the potential to benefit a wide array of methods and applications.\n\n2 Background\n\nFor A ∈ R^{m×n}, we write A_(i) for the ith row of A and A^(j) for the jth column. We use O^{m×n} to represent the subset of R^{m×n} whose columns are orthonormal. Since the columns of V ∈ O^{m×n} are an orthonormal basis, we sometimes use expressions such as “the subspace V” to refer to the subspace spanned by the columns of V. Throughout this paper we assume m ≥ n, such that sampling rows gives bigger speedup than sampling columns. This is no loss of generality, since whenever m < n we can perform SVD on the transpose, then swap U and V to get the SVD of the original matrix. 
Alternatively, row-sampling-based methods have analogous column-sampling versions that can be used in place of transposition; we leave this implicit and develop only the row-sampling version of our algorithm.\n\nAlgorithm 1 Optimal approximate SVD within a row subspace V̂.\n\nEXTRACTSVD\nInput: target matrix A ∈ R^{m×n}, subspace basis V̂ ∈ O^{n×k}\nOutput: U, Σ, V, the SVD of the best approximation to A within the subspace spanned by V̂'s columns\n1. Compute AV̂, then (AV̂)^T AV̂ and its SVD: U′Σ′V′^T = (AV̂)^T AV̂\n2. Let V = V̂V′, Σ = (Σ′)^{1/2}, and U = (AV̂)V′Σ^{−1}\n3. Return U, Σ, V\n\nThe singular value decomposition is defined as follows:\n\nDefinition 1. Let A be an m × n real matrix of rank ρ. Then there exists a factorization of the form\n\nA = UΣV^T,   (1)\n\nwhere U and V each have orthonormal columns and are of size m × ρ and n × ρ, respectively, and Σ is diagonal with entries σ_1 ≥ σ_2 ≥ ... ≥ σ_ρ > 0.\n\nEquivalently, we can write the SVD as a weighted sum of rank-one outer products: A = ∑_{i=1}^{ρ} σ_i u_i v_i^T, where u_i and v_i represent the ith columns of U and V. The columns u_i and v_i are referred to as the left and right singular vectors, while the weights σ_i are the singular values. Though it is sometimes overkill, the SVD can be used to solve essentially any problem in numerical linear algebra. Instances of such problems abound in machine learning.\n\nGiven m ≥ n, the exact SVD has O(mn²) runtime (O(n³) for square matrices). This is highly unscalable, rendering exact SVD impractical for large datasets. However, it is often the case that good approximations can be found using subsets of the rows or columns. Of significant interest are low-rank approximations to a matrix. The optimal k-rank approximation, in the sense of minimizing the squared error ||A − Â||²_F, is the k-rank truncation of the SVD:\n\nA_k = ∑_{i=1}^{k} σ_i u_i v_i^T = U_k Σ_k V_k^T.   (2)\n\nA_k is the projection of A's rows onto the subspace spanned by the top k right singular vectors, i.e., A_k = AV_kV_k^T. The optimality of A_k implies that the columns of V_k span the subspace of dimension at most k in which the squared error of A's row-wise projection is minimized. This leads us to a formulation of SVD approximation in which we seek to find a subspace in which A's projection has sufficiently low error, then perform the SVD of A in that subspace. If the subspace is substantially lower in rank/dimension than A, the SVD of the projection can be computed significantly faster than the SVD of the original A (quadratically so, as we will have decreased the n in O(mn²)). An important procedure we will require is the extraction of the best approximate SVD within a given subspace V̂. Algorithm 1 describes this process; portions of this idea appeared in [1] and [2], but without enumeration of its properties. We state some of the key properties as a lemma.\n\nLemma 1. Given a target matrix A and a row subspace basis stored in the columns of V̂, EXTRACTSVD has the following properties:\n1. Returns a full SVD, meaning U and V with orthonormal columns, and Σ diagonal.\n2. UΣV^T = AV̂V̂^T, i.e., the extracted SVD reconstructs exactly to the projection of A's rows onto the subspace spanned by V̂.\n3. UΣV^T minimizes squared-error reconstruction of A among all SVDs whose rows are restricted to the span of V̂.\n\nWe omit the fairly straightforward proof. The runtime of the procedure is O(kmn), where k is the rank of V̂. 
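To make EXTRACTSVD concrete, here is a minimal NumPy sketch (our own illustration, not the authors' code; the name extract_svd and the assumption that AV̂ has full column rank, so that Σ is invertible, are ours):

```python
import numpy as np

def extract_svd(A, V_hat):
    """Sketch of Algorithm 1 (EXTRACTSVD): best approximate SVD of A restricted
    to the row subspace spanned by the orthonormal columns of V_hat (n x k).
    Assumes A @ V_hat has full column rank so the singular values are nonzero."""
    AV = A @ V_hat                        # m x k: coordinates of A's rows in the subspace
    # SVD of the small k x k matrix (A V_hat)^T (A V_hat); it is symmetric PSD,
    # so its left and right singular factors coincide.
    Vp, Sp, _ = np.linalg.svd(AV.T @ AV)
    V = V_hat @ Vp                        # right singular vectors in the original space
    Sigma = np.sqrt(Sp)                   # singular values of the projection
    U = (AV @ Vp) / Sigma                 # left singular vectors: columns scaled by 1/sigma_i
    return U, Sigma, V
```

Per Lemma 1, (U * Sigma) @ V.T reconstructs exactly to the projection A V̂ V̂^T, and the dominant costs are the m × n × k products involving V̂, i.e., O(kmn).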
As this SVD extraction will constitute the last and most expensive step of our algorithm, we require a subspace discovery method that finds a subspace of sufficient quality with as low a rank k as possible. This motivates the essential idea of our approach, which is to leverage the geometric structure of a matrix to efficiently derive compact (i.e., minimal-rank) subspaces in which to carry out the approximate SVD.\n\nTable 1: Distinctions between whole-matrix SVD approximation and LRMA.\nWhole-Matrix SVD Approximation | Low-Rank Matrix Approximation\nTrue SVD: U, Σ, and V | Â or unaligned V̂ & Σ̂ only\nAddresses full-rank matrix | Fixed low-rank k\nFull-rank relative error bound | k-rank error bound, additive or relative\n\nTable 2: Distinctions between subspace construction in QUIC-SVD and previous LRMA methods.\nQUIC-SVD | Previous LRMA Methods\nIterative buildup, fast empirical error control | One-off computation, loose error bound\nAdaptive sample size minimization | Fixed a priori sample size (loose)\nCosine tree sampling | Various sampling schemes\n\nPrevious Work. A recent vein of work in the theory and algorithms community has focused on using sampling to solve the problem of low-rank matrix approximation (LRMA). The user specifies a desired low rank k, and the algorithms try to output something close to the optimal k-rank approximation. This problem is different from the whole-matrix SVD approximation we address, but a close relationship allows us to draw on some of the LRMA ideas. Table 1 highlights the distinctions between whole-matrix SVD approximation and LRMA. Table 2 summarizes the differences between our algorithmic approach and the more theoretically-oriented approaches taken in the LRMA work. Each LRMA algorithm has a way of sampling to build up a subspace in which the matrix projection has bounded error. Our SVD also samples to build a subspace, so the LRMA sampling methods are directly comparable to our tree-based approach. Three main LRMA sampling techniques have emerged,¹ and we will discuss each from the perspective of iteratively sampling a row, updating a subspace so it spans the new row, and continuing until the subspace captures the input matrix to within a desired error threshold. This is how our method works, and it is similar to the framework used by Friedland et al. [1]. The key to efficiency (i.e., rank-compactness) is for each sampled row to represent well the rows that are not yet well represented in the subspace.\n\nLength-squared (LS) sampling. Rows are sampled with probability proportional to their squared lengths: p_i = ||A_(i)||²_F / ||A||²_F. LS sampling was used in the seminal work of Frieze, Kannan, and Vempala [3], and in much of the follow-on work [4, 5]. It is essentially an importance sampling scheme for the squared error objective. However, it has two important weaknesses. First, a row can have high norm while not being representative of other rows. Second, the distribution is non-adaptive, in that a point is equally likely to be drawn whether or not it is already well represented in the subspace. Both of these lead to wasted samples and needless inflation of the subspace rank.\n\nResidual length-squared (RLS) sampling. Introduced by Deshpande and Vempala [2], RLS modifies the LS probabilities after each subspace update by setting p_i = ||A_(i) − Π_V(A_(i))||²_F / ||A − Π_V(A)||²_F, where Π_V represents projection onto the current subspace V. By adapting the LS distribution to be over residuals, this method avoids drawing samples that are already well represented in the subspace. Unfortunately, there is still nothing to enforce that any sample will be representative of other high-residual samples. 
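For illustration, the two sampling distributions described above can be sketched in a few lines of NumPy (our own code; the function names are hypothetical):

```python
import numpy as np

def ls_probs(A):
    """Length-squared sampling: p_i = ||A_(i)||^2 / ||A||_F^2."""
    sq = np.sum(A * A, axis=1)          # squared length of each row
    return sq / sq.sum()

def rls_probs(A, V):
    """Residual length-squared sampling: LS probabilities over the residuals
    A_(i) - Proj_V(A_(i)), where V has orthonormal columns."""
    R = A - (A @ V) @ V.T               # residual of each row after projection onto span(V)
    sq = np.sum(R * R, axis=1)
    return sq / sq.sum()
```

A row already contained in span(V) gets (numerically) zero RLS probability, which is exactly the adaptivity that plain LS sampling lacks.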
Further, updating residuals requires an expensive s passes through the matrix for every s samples that are added, which significantly limits practical utility.\n\nRandom projections (RP). Introduced by Sarlós [6], the idea is to sample linear combinations of rows, with random combination coefficients drawn from a Gaussian. This method is strong where LS and RLS are weak: because all rows influence every sample, each sample is likely to represent a sizeable number of rows. Unfortunately the combination coefficients are not informed by importance (squared length), and the sampling distribution is non-adaptive. Further, each linear combination requires a full matrix pass, again limiting practicality.\n\nAlso deserving mention is the randomized sparsification used by Achlioptas et al. [7]. Each of the LRMA sampling methods has strengths we can draw on and weaknesses we can improve upon. In particular, our cosine tree sampling method can be viewed as combining the representativeness of RP sampling with the adaptivity of RLS, which explains its empirically dominant rank efficiency.\n\n¹ Note that our summary of related work is necessarily incomplete due to space constraints; our intent is to summarize the essential results from the LRMA literature inasmuch as they pertain to our approach.\n\nAlgorithm 2 Cosine tree construction.\n\nCTNODE\nInput: A ∈ R^{m×n}\nOutput: cosine tree node containing the rows of A\n1. N ← new cosine tree node\n2. N.A ← A\n3. N.splitPt ← ROWSAMPLELS(A) // split point sampled from length-squared distribution\n4. return N\n\nCTNODESPLIT\nInput: cosine tree node N\nOutput: left and right children obtained by cosine-splitting of N\n1. for each N.A_(i), compute c_i = |cos(N.A_(i), N.splitPt)|\n2. if ∀i, c_i = 1, return nil\n3. c_max = max{c_i | c_i < 1}; c_min = min{c_i}\n4. A_l ← [ ]; A_r ← [ ]\n5. for i = 1 to N.nRows\n(a) if c_max − c_i ≤ c_i − c_min, append N.A_(i) to A_l\n(b) else append N.A_(i) to A_r\n6. return CTNODE(A_l), CTNODE(A_r)\n\n3 Our Approach\n\nRather than a fixed low-rank matrix approximation, our objective is to approximate the whole-matrix SVD with as high a rank as is required to obtain the following whole-matrix relative error bound:\n\n||A − Â||²_F ≤ ε||A||²_F,   (3)\n\nwhere Â = UΣV^T is the matrix reconstructed by our SVD approximation. In contrast to the error bounds of previous methods, which are stated in terms of the unknown low-rank A_k, our error bound is in terms of the known A. This enables us to use a fast, empirical Monte Carlo technique to determine with high confidence when we have achieved the error target, and therefore to terminate with as few samples and as compact a subspace as possible. Minimizing subspace rank is crucial for speed, as the final SVD extraction is greatly slowed by excess rank when the input matrix is large. We use an iterative subspace buildup as described in the previous section, with sampling governed by a new spatial partitioning structure we call the cosine tree. Cosine trees are designed to leverage the geometrical structure of a matrix and a partial subspace in order to quickly home in on good representative samples from the regions least well represented. Key to the efficiency of our algorithm is an efficient error checking scheme, which we accomplish by Monte Carlo error estimation at judiciously chosen stages. Such a combination of spatial partitioning trees and Monte Carlo estimation has been used before to good effect [8], and we find it to be a successful pairing here as well.\n\nCosine Trees for Efficient Subspace Discovery. The ideal subspace discovery algorithm would oracularly choose as samples the singular vectors v_i. 
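A sketch of the CTNODESPLIT step of Algorithm 2 may help here (our own code, not the authors' implementation; it takes the pivot as an argument, whereas the algorithm samples it from the LS distribution, and it uses an explicit tolerance for "cosine equal to 1"):

```python
import numpy as np

def cosine_split(A, pivot):
    """One cosine-tree split: rows whose |cosine| with the pivot is closer to
    the maximum go left, rows closer to the minimum go right. Returns
    (A_left, A_right), or None if every row is parallel to the pivot."""
    norms = np.linalg.norm(A, axis=1) * np.linalg.norm(pivot)
    c = np.abs(A @ pivot) / norms                 # c_i = |cos(A_(i), pivot)|
    if np.all(c >= 1 - 1e-12):                    # all rows parallel: no split possible
        return None
    c_max = c[c < 1 - 1e-12].max()                # max over non-parallel rows
    c_min = c.min()
    left = (c_max - c) <= (c - c_min)             # closer to the high-cosine end
    return A[left], A[~left]
```

Rows exactly parallel to the pivot always satisfy the left condition, so highly parallel groups are kept together, which is what makes a node's centroid a good summary of its rows.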
Each v_i is precisely the direction that, added to the subspace spanned by the previous singular vectors, will maximally decrease residual error over all rows of the matrix. This intuition is the guiding idea for cosine trees.\n\nA cosine tree is constructed as follows. Starting with a root node, which contains all points (rows), we take its centroid as a representative to include in our subspace span, and randomly sample a point to serve as the pivot for splitting. We sample the pivot from the basic LS distribution, that being the cheapest source of information as to sample importance. The remaining points are sorted by their absolute cosines relative to the pivot point, then split according to whether they are closer to the high or low end of the cosines. The two groups are assigned to two child nodes, which are placed in a queue prioritized by the residual error of each node. The process is then repeated according to the priority order of the queue. Algorithm 2 defines the splitting process.\n\nWhy do cosine trees improve sampling efficiency? By prioritizing expansion by the residual error of the frontier nodes, sampling is always focused on the areas with maximum potential for error reduction. Since cosine-based splitting guides the nodes toward groupings with higher parallelism, the residual magnitude of each node is increasingly likely to be well captured along the direction of the node centroid. Expanding the subspace in the direction of the highest-priority node centroid is therefore a good guess as to the direction that will maximally reduce residual error. Thus, cosine tree sampling approximates the ideal of oracularly sampling the true singular vectors.\n\nAlgorithm 3 Monte Carlo estimation of the squared error of a matrix projection onto a subspace.\n\nMCSQERROR\nInput: A ∈ R^{m×n}, V̂ ∈ O^{n×k}, s ∈ {1 ... m}, δ ∈ [0, 1]\nOutput: sqErr ∈ R s.t. with probability at least 1 − δ, ||A − AV̂V̂^T||²_F ≤ sqErr\n1. S = ROWSAMPLELS(A, s) // sample s rows from the length-squared distribution\n2. for i = 1 to s: // compute weighted sq. mag. of each sampled row's projection onto V̂\n(a) wgtMagSq[i] = (1/p_S(i)) ||S_(i)V̂||² // p_S(i) is prob. of drawing S_(i) under LS sampling\n3. µ̂ = avg(wgtMagSq); σ̂² = var(wgtMagSq); magSqLB = lowBound(µ̂, σ̂², s, δ)\n4. return ||A||²_F − magSqLB\n\nAlgorithm 4 QUIC-SVD: fast whole-matrix approximate SVD with relative error control.\n\nQUIC-SVD\nInput: A ∈ R^{m×n}, ε ∈ [0, 1], and δ ∈ [0, 1]\nOutput: an SVD U, Σ, V s.t. Â = UΣV^T satisfies ||A − Â||²_F ≤ ε||A||²_F with probability at least 1 − δ\n1. V = [ ]; mcSqErr = ||A||²_F; N_root = CTNODE(A)\n2. Q = EMPTYPRIORITYQUEUE(); Q.insert(N_root, 0)\n3. do until mcSqErr ≤ ε||A||²_F:\n(a) N = Q.pop(); C = CTNODESPLIT(N) // C = {N_l, N_r}, the children of N\n(b) Remove N's contributed basis vector from V\n(c) for each N_c ∈ C:\ni. V = [V MGS(V, N_c.centroid)] // MGS = modified Gram-Schmidt orthonormalization\n(d) for each N_c ∈ C:\ni. errC = MCSQERROR(N_c.A, V, O(log[N_c.nRows]), δ)\nii. Q.insert(N_c, errC)\n(e) mcSqErr = MCSQERROR(A, V, O(log m), δ)\n4. return EXTRACTSVD(A, V)\n\n3.1 QUIC-SVD\n\nStrong error control. Algorithm 4, QUIC-SVD (QUantized Iterative Cosine tree)², specifies a way to leverage cosine trees in the construction of an approximate SVD while providing a strong probabilistic error guarantee. The algorithm builds a subspace by expanding a cosine tree as described above, checking residual error after each expansion. Once the residual error is sufficiently low, we return the SVD of the projection into the subspace. 
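A minimal sketch of the MCSQERROR estimate of Algorithm 3 (our own code; it returns the plain sample-mean estimate rather than the confidence lower bound lowBound, so it corresponds to the relaxed variant discussed later):

```python
import numpy as np

def mc_sq_error(A, V, s, rng):
    """Sketch of Algorithm 3 (MCSQERROR), mean-estimate variant: Monte Carlo
    estimate of ||A - A V V^T||_F^2 = ||A||_F^2 - ||A V||_F^2 via
    length-squared row sampling. V must have orthonormal columns."""
    sq = np.sum(A * A, axis=1)
    total = sq.sum()                       # ||A||_F^2
    p = sq / total                         # length-squared sampling distribution
    idx = rng.choice(A.shape[0], size=s, p=p)
    # importance-weighted squared magnitudes of the sampled rows' projections
    wgt_mag_sq = np.sum((A[idx] @ V) ** 2, axis=1) / p[idx]
    return total - wgt_mag_sq.mean()       # unbiased estimate of the residual error
```

Each weighted term has expectation ||AV||²_F under LS sampling, which is why the sample mean is unbiased; when V is a full orthonormal basis every term equals ||A||²_F exactly and the estimate is zero regardless of s.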
Note that exact error checking would require an expensive O(k²mn) total cost, where k is the final subspace rank, so we instead use a Monte Carlo error estimate as specified in Algorithm 3. We also employ Algorithm 3 for the error estimates used in node prioritization. With Monte Carlo instead of exact error computations, the total cost for error checking decreases to O(k²n log m), a significant practical reduction.\n\n² Quantized alludes to each node being represented by a single point that is added to the subspace basis.\n\nThe other main contributions to runtime are: 1) k cosine tree node splits for a total of O(kmn), 2) O(k) single-vector Gram-Schmidt orthonormalizations at O(km) each for a total of O(k²m), and 3) final SVD extraction at O(kmn). Total runtime is therefore O(kmn), with the final projection onto the subspace being the costliest step since the O(kmn) from node splitting is a very loose worst-case bound. We now state the QUIC-SVD error guarantee.\n\nTheorem 1. Given a matrix A ∈ R^{m×n} and ε, δ ∈ [0, 1], the algorithm QUIC-SVD returns an SVD U, Σ, V such that Â = UΣV^T satisfies ||A − Â||²_F ≤ ε||A||²_F with probability at least 1 − δ.\n\nProof sketch. The algorithm terminates after mcSqErr ≤ ε||A||²_F with a call to EXTRACTSVD. From Lemma 1 we know that EXTRACTSVD returns an SVD that reconstructs to A's projection onto V (i.e., Â = AVV^T). Thus, we have only to show that mcSqErr in the terminal iteration is an upper bound on the error ||A − Â||²_F with probability at least 1 − δ. Note that intermediate error checks do not affect the success probability, since they only ever tell us to continue expanding the subspace, which is never a failure. 
From the Pythagorean theorem, ||A − AVV^T||²_F = ||A||²_F − ||AVV^T||²_F, and, since rotations do not affect lengths, ||AVV^T||²_F = ||AV||²_F. The call to MCSQERROR (step 3(e)) performs a Monte Carlo estimate of ||AV||²_F in order to estimate ||A||²_F − ||AV||²_F. It is easily verified that the length-squared-weighted sample mean used by MCSQERROR produces an unbiased estimate of ||AV||²_F. By using a valid confidence interval to generate a 1 − δ lower bound on ||AV||²_F from the sample mean and variance (e.g., Theorem 1 of [9] or similar), MCSQERROR is guaranteed to return an upper bound on ||A||²_F − ||AV||²_F with probability at least 1 − δ, which establishes the theorem.\n\nRelaxed error control. Though the QUIC-SVD procedure specified in Algorithm 4 provides a strong error guarantee, in practice its error checking routine is overconservative and is invoked more frequently than necessary. For practical usage, we therefore approximate the strict error checking of Algorithm 4 by making three modifications:\n\n1. Set mcSqErr to the mean, rather than the lower bound, of the MCSQERROR estimate.\n2. At each error check, estimate mcSqErr with several repeated Monte Carlo evaluations (i.e., calls to MCSQERROR), terminating only if they all result in mcSqErr ≤ ε||A||²_F.\n3. In each iteration, use a linear extrapolation from past decreases in error to estimate the number of additional node splits required to achieve the error target. Perform this projected number of splits before checking error again, thus eliminating needless intermediate error checks.\n\nAlthough these modifications forfeit the strict guarantee of Theorem 1, they are principled approximations that more aggressively accelerate the computation while still keeping error well under control (this will be demonstrated empirically). 
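The Pythagorean identity used in the proof sketch above is easy to sanity-check numerically (our own snippet):

```python
import numpy as np

# Pythagorean identity: for V with orthonormal columns,
# ||A - A V V^T||_F^2 = ||A||_F^2 - ||A V V^T||_F^2 = ||A||_F^2 - ||A V||_F^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 7))
V, _ = np.linalg.qr(rng.standard_normal((7, 3)))  # orthonormal basis, 7 x 3

lhs = np.linalg.norm(A - A @ V @ V.T, 'fro') ** 2
rhs = np.linalg.norm(A, 'fro') ** 2 - np.linalg.norm(A @ V, 'fro') ** 2
print(abs(lhs - rhs))  # agrees to floating-point precision
```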
Changes 1 and 2 are based on the fact that, because mcSqErr is an unbiased estimate generated by a sample mean, it obeys the Central Limit Theorem and thus approaches a normal distribution centered on the true squared error. Under such a symmetric distribution, the probability that a single evaluation of mcSqErr will exceed the true error is 0.5. The probability that, in a series of x evaluations, at least one of them will exceed the true error is approximately 1 − 0.5^x (1 minus the probability that they all come in below the true error). The probability that at least one of our mcSqErr evaluations results in an upper bound on the true error (i.e., the probability that our error check is correct) thus goes quickly to 1. In our experiments, we use x = 3, corresponding to a success probability of approximately 0.9 (i.e., δ ≈ 0.1).\n\nChange 3 exploits the fact that the rate at which error decreases is typically monotonically non-increasing. Thus, extrapolating the rate of error decrease from past error evaluations yields a conservative estimate of the number of splits required to achieve the error target. Naturally, we have to impose limits to guard against outlier cases where the estimated number is unreasonably high. Our experiments limit the size of the split jumps to be no more than 100.\n\n4 Performance\n\nWe report the results of two sets of experiments, one comparing the sample efficiency of cosine trees to previous LRMA sampling methods, and the other evaluating the composite speed and error performance of QUIC-SVD. 
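The repeated-evaluation arithmetic above is simple to verify (our own snippet; x = 3 gives 1 − 0.5³ = 0.875, the "approximately 0.9" success probability quoted above):

```python
# Probability that at least one of x independent mcSqErr evaluations
# comes in above the true error, assuming each does so with probability 0.5.
def check_success_prob(x):
    return 1 - 0.5 ** x

print(check_success_prob(3))  # 0.875
```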
Due to space considerations we give results for only two datasets, and due to the need to compute the exact SVD as a baseline we limit ourselves to medium-sized matrices. Nonetheless, these results are illustrative of the more general performance of the algorithm.\n\nFigure 1: Relative squared error vs. subspace rank for various subspace discovery methods. LS is length-squared, RLS is residual length-squared, RP is random projection, and CT is cosine tree; Opt is the error of the exact SVD. (a) madelon kernel (2000 × 2000); (b) declaration (4656 × 3923).\n\nSample efficiency. Because the runtime of our algorithm is O(kmn), where k is the final dimension of the projection subspace, it is critical that we use a sampling method that achieves the error target with the minimum possible subspace rank k. We therefore compare our cosine tree sampling method to the previous sampling methods proposed in the LRMA literature. Figure 1 shows results for the various sampling methods on two matrices, one a 2000 × 2000 Gaussian kernel matrix produced by the Madelon dataset from the NIPS 2003 Workshop on Feature Extraction (madelon kernel), and the other a 4656 × 3923 scan of the US Declaration of Independence (declaration). Plotted is the relative squared error of the input matrix's projection onto the subspaces generated by each method at each subspace rank. 
Also shown is the optimal error produced by the exact SVD at each rank. Both graphs show cosine trees dominating the other methods in terms of rank efficiency. This dominance has been confirmed by many other empirical results we lack space to report here. It is particularly interesting how closely the cosine tree error can track that of the exact SVD. This would seem to give some justification to the principle of grouping points according to their degree of mutual parallelism, and validates our use of cosine trees as the sampling mechanism for QUIC-SVD.\n\nSpeedup and error. In the second set of experiments we evaluate the runtime and error performance of QUIC-SVD. Figure 2 shows results for the madelon kernel and declaration matrices. On the top row we show how speedup over exact SVD varies with the target error ε. Speedups range from 831 at ε = 0.0025 to over 3,600 at ε = 0.023 for madelon kernel, and from 118 at ε = 0.01 to nearly 20,000 at ε = 0.03 for declaration. On the bottom row we show the actual error of the algorithm in comparison to the target error. While the actual error is most often slightly above the target, it nevertheless hugs the target line quite closely, never exceeding the target by more than 10%. Overall, the several-order-of-magnitude speedups and controlled error shown by QUIC-SVD would seem to make it an attractive option for any algorithm computing costly SVDs.\n\nFigure 2: Speedup and actual relative error vs. ε for QUIC-SVD on madelon kernel and declaration. (a) speedup, madelon kernel; (b) speedup, declaration; (c) relative error, madelon kernel; (d) relative error, declaration.\n\n5 Conclusion\n\nWe have presented a fast approximate SVD algorithm, QUIC-SVD, and demonstrated several-order-of-magnitude speedups with controlled error on medium-sized datasets. This algorithm differs from previous related work in that it addresses the whole-matrix SVD, not low-rank matrix approximation, it uses a new efficient sampling procedure based on cosine trees, and it uses empirical Monte Carlo error estimates to adaptively minimize needed sample sizes, rather than fixing a loose sample size a priori. In addition to theoretical justifications, the empirical performance of QUIC-SVD argues for its effectiveness and utility. We note that a refined version of QUIC-SVD is forthcoming. The new version is greatly simplified, and features even greater speed with a deterministic error guarantee. More work is needed to explore the SVD-using methods to which QUIC-SVD can be applied, particularly with an eye to how the introduction of controlled error in the SVD will affect the quality of the methods using it. 
We expect there will be many opportunities to enable new applications through the scalability of this approximation.\n\nReferences\n\n[1] S. Friedland, A. Niknejad, M. Kaveh, and H. Zare. Fast Monte-Carlo Low Rank Approximations for Matrices. In Proceedings of the Int. Conf. on System of Systems Engineering, 2006.\n[2] A. Deshpande and S. Vempala. Adaptive Sampling and Fast Low-Rank Matrix Approximation. In 10th International Workshop on Randomization and Computation (RANDOM 2006), 2006.\n[3] A. M. Frieze, R. Kannan, and S. Vempala. Fast Monte-Carlo Algorithms for Finding Low-Rank Approximations. In IEEE Symposium on Foundations of Computer Science, pages 370–378, 1998.\n[4] P. Drineas, R. Kannan, and M. W. Mahoney. Fast Monte Carlo Algorithms for Matrices II: Computing a Low-Rank Approximation to a Matrix. SIAM Journal on Computing, 36(1):158–183, 2006.\n[5] P. Drineas, E. Drinea, and P. S. Huggins. An Experimental Evaluation of a Monte-Carlo Algorithm for Singular Value Decomposition. Lecture Notes in Computer Science, 2563:279–296, 2003.\n[6] T. Sarlós. Improved Approximation Algorithms for Large Matrices via Random Projections. In 47th IEEE Symposium on Foundations of Computer Science (FOCS), pages 143–152, 2006.\n[7] D. Achlioptas, F. McSherry, and B. Schölkopf. Sampling Techniques for Kernel Methods. In Advances in Neural Information Processing Systems (NIPS) 17, 2002.\n[8] M. P. Holmes, A. G. Gray, and C. L. Isbell, Jr. Ultrafast Monte Carlo for Kernel Estimators and Generalized Statistical Summations. In Advances in Neural Information Processing Systems (NIPS) 21, 2008.\n[9] J. Audibert, R. Munos, and C. Szepesvári. Variance estimates and exploration function in multi-armed bandits. 
Technical report, CERTIS, 2007.\n", "award": [], "sourceid": 1009, "authors": [{"given_name": "Michael", "family_name": "Holmes", "institution": null}, {"given_name": "Charles", "family_name": "Isbell", "institution": null}, {"given_name": "Alexander", "family_name": "Gray", "institution": null}]}