{"title": "Compressive spectral embedding: sidestepping the SVD", "book": "Advances in Neural Information Processing Systems", "page_first": 550, "page_last": 558, "abstract": "Spectral embedding based on the Singular Value Decomposition (SVD) is a widely used preprocessing step in many learning tasks, typically leading to dimensionality reduction by projecting onto a number of dominant singular vectors and rescaling the coordinate axes (by a predefined function of the singular value). However, the number of such vectors required to capture problem structure grows with problem size, and even partial SVD computation becomes a bottleneck. In this paper, we propose a low-complexity it compressive spectral embedding algorithm, which employs random projections and finite order polynomial expansions to compute approximations to SVD-based embedding. For an m times n matrix with T non-zeros, its time complexity is O((T+m+n)log(m+n)), and the embedding dimension is O(log(m+n)), both of which are independent of the number of singular vectors whose effect we wish to capture. To the best of our knowledge, this is the first work to circumvent this dependence on the number of singular vectors for general SVD-based embeddings. The key to sidestepping the SVD is the observation that, for downstream inference tasks such as clustering and classification, we are only interested in using the resulting embedding to evaluate pairwise similarity metrics derived from the euclidean norm, rather than capturing the effect of the underlying matrix on arbitrary vectors as a partial SVD tries to do. 
Our numerical results on network datasets demonstrate the efficacy of the proposed method, and motivate further exploration of its application to large-scale inference tasks.", "full_text": "Compressive spectral embedding: sidestepping the SVD\n\nDinesh Ramasamy\n\nUpamanyu Madhow\n\ndineshr@ece.ucsb.edu\n\nmadhow@ece.ucsb.edu\n\nECE Department, UC Santa Barbara\n\nECE Department, UC Santa Barbara\n\nAbstract\n\nSpectral embedding based on the Singular Value Decomposition (SVD) is a widely used \u201cpreprocessing\u201d step in many learning tasks, typically leading to dimensionality reduction by projecting onto a number of dominant singular vectors and rescaling the coordinate axes (by a predefined function of the singular value). However, the number of such vectors required to capture problem structure grows with problem size, and even partial SVD computation becomes a bottleneck. In this paper, we propose a low-complexity compressive spectral embedding algorithm, which employs random projections and finite order polynomial expansions to compute approximations to SVD-based embedding. For an m \u00d7 n matrix with T non-zeros, its time complexity is O((T + m + n) log(m + n)), and the embedding dimension is O(log(m + n)), both of which are independent of the number of singular vectors whose effect we wish to capture. To the best of our knowledge, this is the first work to circumvent this dependence on the number of singular vectors for general SVD-based embeddings. The key to sidestepping the SVD is the observation that, for downstream inference tasks such as clustering and classification, we are only interested in using the resulting embedding to evaluate pairwise similarity metrics derived from the \u21132-norm, rather than capturing the effect of the underlying matrix on arbitrary vectors as a partial SVD tries to do. 
Our numerical results on network datasets demonstrate the efficacy of the proposed method, and motivate further exploration of its application to large-scale inference tasks.\n\n1 Introduction\n\nInference tasks encountered in natural language processing, graph inference and manifold learning employ the singular value decomposition (SVD) as a first step to reduce dimensionality while retaining useful structure in the input. Such spectral embeddings go under various guises: Principal Component Analysis (PCA), Latent Semantic Indexing (natural language processing), Kernel Principal Component Analysis, commute time and diffusion embeddings of graphs, to name a few. In this paper, we present a compressive approach for accomplishing SVD-based dimensionality reduction, or embedding, without actually performing the computationally expensive SVD step.\n\nThe setting is as follows. The input is represented in matrix form. This matrix could represent the adjacency matrix or the Laplacian of a graph, the probability transition matrix of a random walker on the graph, a bag-of-words representation of documents, the action of a kernel on a set of l points {x(p) \u2208 R^d : p = 1, . . . , l} (kernel PCA) [1][2], such as\n\nA(p, q) = e^{\u2212\u2016x(p) \u2212 x(q)\u2016^2/(2\u03b1^2)} (or) A(p, q) = I(\u2016x(p) \u2212 x(q)\u2016 < \u03b1), 1 \u2264 p, q \u2264 l, (1)\n\nwhere I(\u00b7) denotes the indicator function, or matrices derived from K-nearest-neighbor graphs constructed from {x(p)}. We wish to compute a transformation of the rows of this m \u00d7 n matrix A which succinctly captures the global structure of A via Euclidean distances (or similarity metrics derived from the \u21132-norm, such as normalized correlations). A common approach is to compute a partial SVD of A, \u2211_{l=1}^{k} \u03c3_l u_l v_l^T, k \u226a n, and to use it to embed the rows of A into a k-dimensional space using the rows of E = [f(\u03c3_1)u_1 f(\u03c3_2)u_2 \u00b7\u00b7\u00b7 f(\u03c3_k)u_k], for some function f(\u00b7). The embedding of the variable corresponding to the l-th row of the matrix A is the l-th row of E. For example, f(x) = x corresponds to Principal Component Analysis (PCA): the k-dimensional rows of E are projections of the n-dimensional rows of A along the first k principal components, {v_l, l = 1, . . . , k}. Other important choices include f(x) = constant, used to cut graphs [3], and f(x) = 1/\u221a(1 \u2212 x) for the commute time embedding of graphs [4]. Inference tasks such as (unsupervised) clustering and (supervised) classification are performed using \u21132-based pairwise similarity metrics on the embedded coordinates (rows of E) instead of the ambient data (rows of A).\n\nBeyond the obvious benefit of dimensionality reduction from n to k, embeddings derived from the leading partial SVD can often be interpreted as denoising, since the \u201cnoise\u201d in matrices arising from real-world data manifests itself via the smaller singular vectors of A (e.g., see [5], which analyzes graph adjacency matrices). 
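For reference, the conventional (non-compressive) embedding E just described can be obtained from an off-the-shelf partial SVD. The sketch below is illustrative only (the function name and the choice of SciPy's `svds` routine are our assumptions, not the authors' code):

```python
import numpy as np
from scipy.sparse.linalg import svds

def svd_embedding(A, k, f):
    """Conventional spectral embedding: compute the k leading singular
    triplets of A and rescale the l-th coordinate axis by f(sigma_l).
    The rows of the returned E embed the rows of A into R^k."""
    U, s, Vt = svds(A, k=k)        # partial SVD: k leading triplets
    order = np.argsort(-s)         # svds does not guarantee ordering
    return U[:, order] * f(s[order])
```

With f(s) = s this is PCA; the point of the paper is that this `svds` call costs Omega(kT) and is exactly the step the compressive algorithm avoids.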
This is often cited as a motivation for choosing PCA over \u201cisotropic\u201d dimensionality reduction techniques such as random embeddings, which, under the setting of the Johnson-Lindenstrauss (JL) lemma, can also preserve structure.\n\nThe number of singular vectors k needed to capture the structure of an m \u00d7 n matrix grows with its size, and two bottlenecks emerge as we scale: (a) the computational effort required to extract a large number of singular vectors for the partial SVD, using conventional iterative methods such as Lanczos or simultaneous iteration, or approximate algorithms like Nystrom [6], [7] and Randomized SVD [8], becomes prohibitive (scaling as \u2126(kT), where T is the number of non-zeros in A); (b) the resulting k-dimensional embedding becomes unwieldy for use in subsequent inference steps.\n\nApproach and Contributions: In this paper, we tackle these scalability bottlenecks by focusing on what embeddings are actually used for: computing \u21132-based pairwise similarity metrics typically used for supervised or unsupervised learning. For example, K-means clustering uses pairwise Euclidean distances, and SVM-based classification uses pairwise inner products. We therefore ask the following question: \u201cIs it possible to compute an embedding which captures the pairwise Euclidean distances between the rows of the spectral embedding E = [f(\u03c3_1)u_1 \u00b7\u00b7\u00b7 f(\u03c3_k)u_k], while sidestepping the computationally expensive partial SVD?\u201d We answer this question in the affirmative by presenting a compressive algorithm which directly computes a low-dimensional embedding.\n\nThere are two key insights that drive our algorithm:\n\u2022 By approximating f(\u03c3) by a low-order (L \u226a min{m, n}) polynomial, we can compute the embedding iteratively using matrix-vector products of the form Aq or A^T q.\n\u2022 The iterations can be computed compressively: by virtue of the celebrated JL lemma, the embedding geometry is approximately captured by a small number d = O(log(m + n)) of randomly picked starting vectors.\n\nThe number of passes over A and over A^T, and the time complexity of the algorithm, are L, L and O(L(T + m + n) log(m + n)) respectively. These are all independent of the number of singular vectors k whose effect we wish to capture via the embedding. This is in stark contrast to embedding directly based on the partial SVD. Our algorithm lends itself to parallel implementation as a sequence of 2L matrix-vector products interlaced with vector additions, run in parallel across d = O(log(m + n)) randomly chosen starting vectors. This approach significantly reduces both computational complexity and embedding dimensionality relative to partial SVD. A freely downloadable Python implementation of the proposed algorithm that exploits this inherent parallelism can be found in [9].\n\n2 Related work\n\nAs discussed in Section 3.1, the concept of compressive measurements forms a key ingredient in our algorithm, and is based on the JL lemma [10]. 
The latter, which provides probabilistic guarantees on approximate preservation of the Euclidean geometry for a finite collection of points under random projections, forms the basis for many other applications, such as compressive sensing [11].\n\nWe now mention a few techniques for exact and approximate SVD computation, before discussing algorithms that sidestep the SVD as we do. The time complexity of the full SVD of an m \u00d7 n matrix is O(mn^2) (for m > n). Partial SVDs are computed using iterative methods for eigendecompositions of symmetric matrices derived from A, such as AA^T and [0 A^T; A 0] [12]. The complexity of standard iterative eigensolvers such as simultaneous iteration [13] and the Lanczos method scales as \u2126(kT) [12], where T denotes the number of non-zeros of A.\n\nThe leading k singular value, singular vector triplets {(\u03c3_l, u_l, v_l), l = 1, . . . , k} minimize the matrix reconstruction error under a rank-k constraint: they are a solution to the optimization problem arg min \u2016A \u2212 \u2211_{l=1}^{k} \u03c3_l u_l v_l^T\u2016_F^2, where \u2016\u00b7\u2016_F denotes the Frobenius norm. Approximate SVD algorithms strive to reduce this error while also placing constraints on the computational budget and/or the number of passes over A. A commonly employed approximate eigendecomposition algorithm is the Nystrom method [6], [7], based on random sampling of s columns of A, which has time complexity O(ksn + s^3). A number of variants of the Nystrom method for kernel matrices like (1) have been proposed in the literature. These aim to improve accuracy using preprocessing steps such as K-means clustering [14] or random projection trees [15]. Methods to reduce the complexity of the Nystrom algorithm to O(ksn + k^3) [16], [17] enable Nystrom sketches that see more columns of A. The complexity of all of these grows as \u2126(ksn). Other randomized algorithms, involving iterative computations, include the Randomized SVD [8]. 
Since all of these algorithms set out to recover the k leading eigenvectors (exact or otherwise), their complexity scales as \u2126(kT).\n\nWe now turn to algorithms that sidestep SVD computation. In [18], [19], vertices of a graph are embedded based on diffusion of probability mass in random walks on the graph, using the power iteration run independently on random starting vectors, and stopping \u201cprior to convergence.\u201d While this approach is specialized to probability transition matrices (unlike our general framework) and does not provide explicit control on the nature of the embedding as we do, a feature in common with the present paper is that the time complexity of the algorithm and the dimensionality of the resulting embedding are independent of the number of eigenvectors k captured by it. A parallel implementation of this algorithm was considered in [20]; similar parallelization directly applies to our algorithm. Another specific application that falls within our general framework is the commute time embedding on a graph, based on the normalized adjacency matrix and weighing function f(x) = 1/\u221a(1 \u2212 x) [4], [21]. Approximate commute time embeddings have been computed using Spielman-Teng solvers [22], [23] and the JL lemma in [24]. The complexity of the latter algorithm and the dimensionality of the resulting embedding are comparable to ours, but the method is specially designed for the normalized adjacency matrix and the weighing function f(x) = 1/\u221a(1 \u2212 x). Our more general framework would, for example, provide the flexibility of suppressing small eigenvectors from contributing to the embedding (e.g., by setting f(x) = I(x > \u01eb)/\u221a(1 \u2212 x)).\n\nThus, while randomized projections are extensively used in the embedding literature, to the best of our knowledge, the present paper is the first to develop a general compressive framework for spectral embeddings derived from the SVD. 
It is interesting to note that methods similar to ours have been used in a different context, to estimate the empirical distribution of eigenvalues of a large Hermitian matrix [25], [26]. These methods use a polynomial approximation of indicator functions f(\u03bb) = I(a \u2264 \u03bb \u2264 b) and random projections to compute an approximate histogram of the number of eigenvectors across different bands of the spectrum: [a, b] \u2286 [\u03bb_min, \u03bb_max].\n\n3 Algorithm\n\nWe first present the algorithm for a symmetric n \u00d7 n matrix S. Later, in Section 3.5, we show how to handle a general m \u00d7 n matrix by considering a related (m + n) \u00d7 (m + n) symmetric matrix. Let \u03bb_l denote the eigenvalues of S sorted in descending order and v_l their corresponding unit-norm eigenvectors (chosen to be orthogonal in case of repeated eigenvalues). For any function g(x) : R \u21a6 R, we denote by g(S) the n \u00d7 n symmetric matrix g(S) = \u2211_{l=1}^{n} g(\u03bb_l) v_l v_l^T. We now develop an O(n log n) algorithm to compute a d = O(log n) dimensional embedding which approximately captures pairwise Euclidean distances between the rows of the embedding E = [f(\u03bb_1)v_1 f(\u03bb_2)v_2 \u00b7\u00b7\u00b7 f(\u03bb_n)v_n].\n\nRotations are inconsequential: We first observe that a rotation of basis does not alter \u21132-based similarity metrics. Since V = [v_1 \u00b7\u00b7\u00b7 v_n] satisfies V V^T = V^T V = I_n, pairwise distances between the rows of E are equal to corresponding pairwise distances between the rows of E V^T = \u2211_{l=1}^{n} f(\u03bb_l) v_l v_l^T = f(S). We use this observation to compute embeddings of the rows of f(S) rather than those of E.\n\n3.1 Compressive embedding\n\nSuppose now that we know f(S). 
This constitutes an n-dimensional embedding, and similarity queries between two \u201cvertices\u201d (we refer to the variables corresponding to rows of S as vertices, as we would for matrices derived from graphs) require O(n) operations. However, we can reduce this time to O(log n) by using the JL lemma, which informs us that pairwise distances can be approximately captured by compressive projection onto d = O(log n) dimensions.\n\nSpecifically, for d > (4 + 2\u03b2) log n / (\u01eb^2/2 \u2212 \u01eb^3/3), let \u2126 denote an n \u00d7 d matrix with i.i.d. entries drawn uniformly at random from {\u00b11/\u221ad}. According to the JL lemma, pairwise distances between the rows of f(S)\u2126 approximate pairwise distances between the rows of f(S) with high probability. In particular, the following statement holds with probability at least 1 \u2212 n^{\u2212\u03b2}: (1 \u2212 \u01eb)\u2016u \u2212 v\u2016^2 \u2264 \u2016(u \u2212 v)\u2126\u2016^2 \u2264 (1 + \u01eb)\u2016u \u2212 v\u2016^2, for any two rows u, v of f(S).\n\nThe key take-aways are that (a) we can reduce the embedding dimension to d = O(log n), since we are only interested in pairwise similarity measures, and (b) we do not need to compute f(S); we only need to compute f(S)\u2126. We now discuss how to accomplish the latter efficiently.\n\n3.2 Polynomial approximation of embedding\n\nDirect computation of E\u2032 = f(S)\u2126 from the eigenvectors and eigenvalues of S, as f(S) = \u2211 f(\u03bb_l) v_l v_l^T would suggest, is expensive (O(n^3)). However, we now observe that computation of \u03c8(S)\u2126 is easy when \u03c8(\u00b7) is a polynomial. In this case, \u03c8(S) = \u2211_{p=0}^{L} b_p S^p for some b_p \u2208 R, so that \u03c8(S)\u2126 can be computed as a sequence of L matrix-vector products interlaced with vector additions, run in parallel for each of the d columns of \u2126. Therefore, it requires only LdT + O(Ldn) flops. 
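The two ingredients above, a {\u00b11/\u221ad} JL projection and evaluation of a polynomial in S through repeated matrix-vector products, can be sketched as follows (a minimal illustration in the monomial basis; function names are ours, and a scipy.sparse matrix can be substituted for S):

```python
import numpy as np

def jl_projection(n, d, rng):
    """n x d JL matrix with i.i.d. entries drawn from {+1/sqrt(d), -1/sqrt(d)}."""
    return rng.choice([-1.0, 1.0], size=(n, d)) / np.sqrt(d)

def poly_times_projection(S, coeffs, Omega):
    """Compute psi(S) @ Omega for psi(x) = sum_p coeffs[p] * x**p without
    ever forming psi(S): one matrix-vector product with S per degree,
    applied jointly to all d columns of Omega."""
    Q = Omega.copy()              # Q holds S^p @ Omega
    out = coeffs[0] * Omega
    for b in coeffs[1:]:
        Q = S @ Q                 # sparse matvecs dominate the cost: T flops each
        out = out + b * Q
    return out
```

Algorithm 1 below does the same thing in the numerically stabler Legendre basis rather than the monomial basis used here.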
Our strategy is to approximate E\u2032 = f(S)\u2126 by E\u0303 = f\u0303_L(S)\u2126, where f\u0303_L(x) is an L-th order polynomial approximation of f(x). We defer the details of computing a \u201cgood\u201d polynomial approximation to Section 3.4. For now, we assume that one such approximation f\u0303_L(\u00b7) is available and give bounds on the loss in fidelity as a result of this approximation.\n\n3.3 Performance guarantees\n\nThe spectral norm of the \u201cerror matrix\u201d Z = f(S) \u2212 f\u0303_L(S) = \u2211_{r=1}^{n} (f(\u03bb_r) \u2212 f\u0303_L(\u03bb_r)) v_r v_r^T satisfies \u2016Z\u2016 = \u03b4 = max_l |f(\u03bb_l) \u2212 f\u0303_L(\u03bb_l)| \u2264 max_x |f(x) \u2212 f\u0303_L(x)|, where the spectral norm of a matrix B, denoted by \u2016B\u2016, refers to the induced \u21132-norm. For symmetric matrices, \u2016B\u2016 \u2264 \u03b1 \u21d0\u21d2 |\u03bb_l| \u2264 \u03b1 \u2200l, where \u03bb_l are the eigenvalues of B. Letting i_p denote the unit vector along the p-th coordinate of R^n, the distance between the p-th and q-th rows of f\u0303_L(S) can be written as\n\n\u2016f\u0303_L(S)(i_p \u2212 i_q)\u2016 = \u2016f(S)(i_p \u2212 i_q) \u2212 Z(i_p \u2212 i_q)\u2016 \u2264 \u2016E^T(i_p \u2212 i_q)\u2016 + \u03b4\u221a2. (2)\n\nSimilarly, we have that \u2016f\u0303_L(S)(i_p \u2212 i_q)\u2016 \u2265 \u2016E^T(i_p \u2212 i_q)\u2016 \u2212 \u03b4\u221a2. Thus pairwise distances between the rows of f\u0303_L(S) approximate those between the rows of E. However, the distortion term \u03b4\u221a2 is additive and must be controlled by carefully choosing f\u0303_L(\u00b7), as discussed in Section 4. Applying the JL lemma [10] to the rows of f\u0303_L(S), we have that when d > O(\u01eb^{\u22122} log n) and \u2126 has i.i.d. entries drawn uniformly at random from {\u00b11/\u221ad}, the embedding E\u0303 = f\u0303_L(S)\u2126 captures pairwise distances between the rows of f\u0303_L(S) up to a multiplicative distortion of 1 \u00b1 \u01eb with high probability:\n\n\u2016E\u0303^T(i_p \u2212 i_q)\u2016 = \u2016\u2126^T f\u0303_L(S)(i_p \u2212 i_q)\u2016 \u2264 \u221a(1 + \u01eb) \u2016f\u0303_L(S)(i_p \u2212 i_q)\u2016.\n\nUsing (2), we can show that \u2016E\u0303^T(i_p \u2212 i_q)\u2016 \u2264 \u221a(1 + \u01eb)(\u2016E^T(i_p \u2212 i_q)\u2016 + \u03b4\u221a2). Similarly, \u2016E\u0303^T(i_p \u2212 i_q)\u2016 \u2265 \u221a(1 \u2212 \u01eb)(\u2016E^T(i_p \u2212 i_q)\u2016 \u2212 \u03b4\u221a2). We state this result in Theorem 1.\n\nTheorem 1. Let f\u0303_L(x) denote an L-th order polynomial such that \u03b4 = max_l |f(\u03bb_l) \u2212 f\u0303_L(\u03bb_l)| \u2264 max_x |f(x) \u2212 f\u0303_L(x)|, and let \u2126 be an n \u00d7 d matrix with entries drawn independently and uniformly at random from {\u00b11/\u221ad}, where d is an integer satisfying d > (4 + 2\u03b2) log n / (\u01eb^2/2 \u2212 \u01eb^3/3). Let g : R^n \u2192 R^d denote the mapping from the i-th row of E = [f(\u03bb_1)v_1 \u00b7\u00b7\u00b7 f(\u03bb_n)v_n] to the i-th row of E\u0303 = f\u0303_L(S)\u2126. The following statement is true with probability at least 1 \u2212 n^{\u2212\u03b2}:\n\n\u221a(1 \u2212 \u01eb)(\u2016u \u2212 v\u2016 \u2212 \u03b4\u221a2) \u2264 \u2016g(u) \u2212 g(v)\u2016 \u2264 \u221a(1 + \u01eb)(\u2016u \u2212 v\u2016 + \u03b4\u221a2)\n\nfor any two rows u, v of E. 
Furthermore, there exists an algorithm to compute each of the d = O(log n) columns of E\u0303 in O(L(T + n)) flops, independently of its other columns, which makes L passes over S (T is the number of non-zeros in S).\n\n3.4 Choosing the polynomial approximation\n\nWe restrict attention to matrices which satisfy \u2016S\u2016 \u2264 1, which implies that |\u03bb_l| \u2264 1. We observe that we can trivially center and scale the spectrum of any matrix to satisfy this assumption when we have the bounds \u03bb_l \u2264 \u03c3_max and \u03bb_l \u2265 \u03c3_min, via the rescaling and centering operation S\u2032 = 2S/(\u03c3_max \u2212 \u03c3_min) \u2212 (\u03c3_max + \u03c3_min) I_n/(\u03c3_max \u2212 \u03c3_min), and by modifying f(x) to f\u2032(x) = f(x(\u03c3_max \u2212 \u03c3_min)/2 + (\u03c3_max + \u03c3_min)/2).\n\nIn order to compute a polynomial approximation of f(x), we need to define the notion of a \u201cgood\u201d approximation. We showed in Section 3.3 that the errors introduced by the polynomial approximation can be summarized by furnishing a bound on the spectral norm of the error matrix Z = f(S) \u2212 f\u0303_L(S): since \u2016Z\u2016 = \u03b4 = max_l |f(\u03bb_l) \u2212 f\u0303_L(\u03bb_l)|, what matters is how well we approximate the function f(\u00b7) at the eigenvalues {\u03bb_l} of S. Indeed, if we know the eigenvalues, we can minimize \u2016Z\u2016 by minimizing max_l |f(\u03bb_l) \u2212 f\u0303_L(\u03bb_l)|. This is not a particularly useful approach, since computing the eigenvalues is expensive. However, we can use our prior knowledge of the domain from which the matrix S comes to penalize deviations from f(\u03bb) differently for different values of \u03bb. For example, if we know the distribution p(\u03bb) of the eigenvalues of S, we can minimize the average error \u2206_L = \u222b_{\u22121}^{1} p(\u03bb)|f(\u03bb) \u2212 f\u0303_L(\u03bb)|^2 d\u03bb. 
In our examples, for the sake of concreteness, we assume that the eigenvalues are uniformly distributed over [\u22121, 1] and give a procedure to compute an L-th order polynomial approximation of f(x) that minimizes \u2206_L = (1/2)\u222b_{\u22121}^{1} |f(x) \u2212 f\u0303_L(x)|^2 dx.\n\nA numerically stable procedure to generate finite order polynomial approximations of a function over [\u22121, 1], with the objective of minimizing \u222b_{\u22121}^{1} |f(x) \u2212 f\u0303_L(x)|^2 dx, is via the Legendre polynomials p(r, x), r = 0, 1, . . . , L. They satisfy the recursion p(r, x) = (2 \u2212 1/r) x p(r \u2212 1, x) \u2212 (1 \u2212 1/r) p(r \u2212 2, x) and are orthogonal: \u222b_{\u22121}^{1} p(k, x) p(l, x) dx = 2 I(k = l)/(2k + 1). Therefore we set f\u0303_L(x) = \u2211_{r=0}^{L} a(r) p(r, x), where a(r) = (r + 1/2)\u222b_{\u22121}^{1} p(r, x) f(x) dx. We give a method in Algorithm 1 that uses the Legendre recursion to compute p(r, S)\u2126, r = 0, 1, . . . , L, using Ld matrix-vector products and vector additions. The coefficients a(r) are used to compute f\u0303_L(S)\u2126 by adding weighted versions of p(r, S)\u2126.\n\nAlgorithm 1 Proposed algorithm to compute an approximate d-dimensional eigenvector embedding of an n \u00d7 n symmetric matrix S (such that \u2016S\u2016 \u2264 1) using the n \u00d7 d random projection matrix \u2126.\n1: Procedure FASTEMBEDEIG(S, f(x), L, \u2126):\n2: //* Compute polynomial approximation f\u0303_L(x) which minimizes \u222b_{\u22121}^{1} |f(x) \u2212 f\u0303_L(x)|^2 dx *//\n3: for r = 0, . . . , L do\n4: a(r) \u2190 (r + 1/2)\u222b_{\u22121}^{1} f(x) p(r, x) dx //* p(r, x): order-r Legendre polynomial *//\n5: Q(0) \u2190 \u2126, Q(\u22121) \u2190 0, E\u0303 \u2190 a(0)Q(0)\n6: for r = 1, 2, . . . , L do\n7: Q(r) \u2190 (2 \u2212 1/r) S Q(r \u2212 1) \u2212 (1 \u2212 1/r) Q(r \u2212 2) //* Q(r) = p(r, S)\u2126 *//\n8: E\u0303 \u2190 E\u0303 + a(r)Q(r) //* E\u0303 now holds f\u0303_r(S)\u2126 *//\n9: return E\u0303 //* E\u0303 = f\u0303_L(S)\u2126 *//\n\nAs described in Section 4, if we have prior knowledge of the distribution of eigenvalues (as we do for many commonly encountered large matrices), then we can \u201cboost\u201d the performance of the generic Algorithm 1, which is based on the assumption of eigenvalues uniformly distributed over [\u22121, 1].\n\n3.5 Embedding general matrices\n\nWe complete the algorithm description by generalizing to any m \u00d7 n matrix A (not necessarily symmetric) such that \u2016A\u2016 \u2264 1. The approach is to utilize Algorithm 1 to compute an approximate d-dimensional embedding of the symmetric matrix S = [0 A^T; A 0]. Let {(\u03c3_l, u_l, v_l) : l = 1, . . . , min{m, n}} be an SVD of A = \u2211_l \u03c3_l u_l v_l^T (\u2016A\u2016 \u2264 1 \u21d0\u21d2 \u03c3_l \u2264 1). Consider the following spectral mapping of the rows of A to the rows of E_row = [f(\u03c3_1)u_1 \u00b7\u00b7\u00b7 f(\u03c3_m)u_m] and of the columns of A to the rows of E_col = [f(\u03c3_1)v_1 \u00b7\u00b7\u00b7 f(\u03c3_n)v_n]. It can be shown that the unit-norm orthogonal eigenvectors of S take the form [v_l; u_l]/\u221a2 and [v_l; \u2212u_l]/\u221a2, l = 1, . . . , min{m, n}, and that their corresponding eigenvalues are \u03c3_l and \u2212\u03c3_l respectively. The remaining |m \u2212 n| eigenvalues of S are equal to 0. Therefore, we call E\u0303_all \u2190 FASTEMBEDEIG(S, f\u2032(x), L, \u2126) with f\u2032(x) = f(x)I(x \u2265 0) \u2212 f(\u2212x)I(x < 0), where \u2126 is an (m + n) \u00d7 d, d = O(log(m + n)) matrix (entries drawn independently and uniformly at random from {\u00b11/\u221ad}). Let E\u0303_col and E\u0303_row denote the first n and last m rows of E\u0303_all, respectively. 
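Algorithm 1 and the symmetric-augmentation wrapper of Section 3.5 can be rendered compactly in Python. The sketch below uses dense NumPy matrices and Gauss-Legendre quadrature for the coefficients a(r); it is illustrative and not the authors' released implementation [9] (a scipy.sparse matrix can be dropped in for S):

```python
import numpy as np

def legendre_coeffs(f, L, quad_pts=200):
    """a[r] = (r + 1/2) * integral_{-1}^{1} f(x) p(r, x) dx, evaluated by
    Gauss-Legendre quadrature, so that sum_r a[r] p(r, x) minimizes the
    squared error against f on [-1, 1]."""
    x, w = np.polynomial.legendre.leggauss(quad_pts)
    P = np.zeros((L + 1, quad_pts))     # P[r] = p(r, x) at the nodes
    P[0] = 1.0
    if L >= 1:
        P[1] = x
    for r in range(2, L + 1):           # Legendre recursion from Section 3.4
        P[r] = (2 - 1 / r) * x * P[r - 1] - (1 - 1 / r) * P[r - 2]
    fx = f(x)
    return np.array([(r + 0.5) * np.sum(w * fx * P[r]) for r in range(L + 1)])

def fast_embed_eig(S, f, L, Omega):
    """Algorithm 1 (FASTEMBEDEIG): approximate f(S) @ Omega using only
    L matrix products with the symmetric matrix S (||S|| <= 1)."""
    a = legendre_coeffs(f, L)
    Q_prev = np.zeros_like(Omega)       # p(-1, S) Omega (its weight is 0 at r = 1)
    Q = Omega.copy()                    # p(0, S) Omega
    E = a[0] * Q
    for r in range(1, L + 1):
        Q, Q_prev = (2 - 1 / r) * (S @ Q) - (1 - 1 / r) * Q_prev, Q
        E = E + a[r] * Q                # E now holds the order-r approximation
    return E

def fast_embed_general(A, f, L, Omega):
    """Section 3.5: embed rows and columns of a general m x n matrix A
    (||A|| <= 1) via S = [0 A^T; A 0] and the odd extension
    f'(x) = f(x) I(x >= 0) - f(-x) I(x < 0)."""
    m, n = A.shape
    S = np.block([[np.zeros((n, n)), A.T], [A, np.zeros((m, m))]])
    f_odd = lambda x: np.where(x >= 0, f(np.abs(x)), -f(np.abs(x)))
    E_all = fast_embed_eig(S, f_odd, L, Omega)   # Omega is (m + n) x d
    return E_all[:n], E_all[n:]                  # column embedding, row embedding
```

Each column of Omega can be processed independently, which is the parallelism the paper exploits.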
From Theorem 1, we know that, with overwhelming probability, pairwise distances between any two rows of E\u0303_row approximate those between the corresponding rows of E_row. Similarly, pairwise distances between any two rows of E\u0303_col approximate those between the corresponding rows of E_col.\n\n4 Implementation considerations\n\nWe now briefly go over implementation considerations before presenting numerical results in Section 5.\n\nSpectral norm estimates: In order to ensure that the eigenvalues of S are within [\u22121, 1] as we have assumed, we scale the matrix by its spectral norm (\u2016S\u2016 = max_l |\u03bb_l|). To this end, we obtain a tight lower bound (and a good approximation) on the spectral norm using power iteration (20 iterates on 6 log n randomly chosen starting vectors), and then scale this up by a small factor (1.01) for our estimate (typically an upper bound) of \u2016S\u2016.\n\nPolynomial approximation order L: The error in approximating f(\u03bb) by f\u0303_L(\u03bb), as measured by \u2206_L = \u222b_{\u22121}^{1} |f(x) \u2212 f\u0303_L(x)|^2 dx, is a non-increasing function of the polynomial order L. A reduction in \u2206_L often corresponds to a reduction in the \u03b4 that appears as a bound on the distortion in Theorem 1. \u201cSmooth\u201d functions generally admit a lower order approximation for the same target error \u2206_L, and hence yield considerable savings in algorithm complexity, which scales linearly with L.\n\nPolynomial approximation method: The rate at which \u03b4 decreases as we increase L depends on the weight function p(\u03bb) used to compute f\u0303_L(\u03bb) (by minimizing \u2206_L = \u222b p(\u03bb)|f(\u03bb) \u2212 f\u0303_L(\u03bb)|^2 d\u03bb). The choice p(\u03bb) \u221d 1 yields the Legendre recursion used in Algorithm 1, whereas p(\u03bb) \u221d 1/\u221a(1 \u2212 \u03bb^2) corresponds to the Chebyshev recursion, which is known to result in fast convergence. 
We defer to future work a detailed study of the impact of alternative choices for p(\u03bb) on \u03b4.\n\nDenoising by cascading: In large-scale problems, it may be necessary to drive the contribution from certain singular vectors to zero. In many settings, singular vectors with smaller singular values correspond to noise. The number of such singular values can scale as fast as O(min{m, n}). Therefore, when we place nulls (zeros) in f(\u03bb), it is desirable to ensure that these nulls remain pronounced after we approximate f(\u03bb) by f\u0303_L(\u03bb). We do this by computing (g\u0303_{L/b}(S))^b \u2126, where g\u0303_{L/b}(\u03bb) is an (L/b)-th order approximation of g(\u03bb) = f^{1/b}(\u03bb). The small values in the polynomial approximation of f^{1/b}(\u03bb) which correspond to f(\u03bb) = 0 (the nulls which we have set) are driven further toward zero when we pass them through the x^b non-linearity, so the nulls become more pronounced.\n\n5 Numerical results\n\nWhile the proposed approach is particularly useful for large problems in which exact eigendecomposition is computationally infeasible, for the purpose of comparison, our results are restricted to smaller settings where the exact solution can be computed. We compute the exact partial eigendecomposition using the ARPACK library (called from MATLAB). 
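The spectral-norm scaling step of Section 4 (power iteration on several random starting vectors, scaled up by a small safety factor) might look like the following sketch (function and parameter names are ours):

```python
import numpy as np

def spectral_norm_estimate(S, n_iter=20, n_vec=None, safety=1.01, rng=None):
    """Power iteration on several random starting vectors: the largest
    absolute Rayleigh quotient is a tight lower bound on ||S||, which the
    safety factor turns into a (typical) upper bound, as in Section 4."""
    if rng is None:
        rng = np.random.default_rng()
    n = S.shape[0]
    if n_vec is None:
        n_vec = max(1, int(round(6 * np.log(n))))   # 6 log n starting vectors
    Q = rng.standard_normal((n, n_vec))
    for _ in range(n_iter):                          # 20 iterates by default
        Q = S @ Q
        Q = Q / np.linalg.norm(Q, axis=0)            # renormalize each column
    rayleigh = np.abs(np.sum(Q * (S @ Q), axis=0))   # |q^T S q| per column
    return safety * rayleigh.max()
```

Dividing S by this estimate places its eigenvalues inside [\u22121, 1] (up to the small probability that power iteration has not yet converged), which is the assumption Algorithm 1 requires.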
For a given choice of weighing function f(\u03bb), the associated embedding E = [f(\u03bb_1)v_1 \u00b7\u00b7\u00b7 f(\u03bb_n)v_n] is compared with the compressive embedding E\u0303 returned by Algorithm 1. The latter was implemented in Python using SciPy\u2019s sparse matrix-multiplication routines and is available for download from [9].\n\n[Figure 1: DBLP collaboration network normalized correlations. (a) Effect of dimensionality d of embedding; (b) effect of cascading: b = 1 (left) and b = 2 (right). Each panel plots the 1, 5, 25, 50, 75, 95 and 99 percentile curves: panel (a) shows the change in normalized inner product versus d, and panel (b) shows the compressive-embedding normalized correlation versus the eigenvector-embedding normalized correlation.]\n\nWe consider two real-world undirected graphs from [27] for our evaluation, and compute embeddings for the normalized adjacency matrix A\u0303 (= D^{\u22121/2} A D^{\u22121/2}, where D is a diagonal matrix with the row sums of the adjacency matrix A; the eigenvalues of A\u0303 lie in [\u22121, 1]). 
We study the accuracy of the embeddings by comparing pairwise normalized correlations between the i-th and j-th rows of E, given by \u27e8E(i, :), E(j, :)\u27e9/(\u2016E(i, :)\u2016\u2016E(j, :)\u2016), with those predicted by the approximate embedding, \u27e8E\u0303(i, :), E\u0303(j, :)\u27e9/(\u2016E\u0303(i, :)\u2016\u2016E\u0303(j, :)\u2016) (E(i, :) is short-hand for the i-th row of E).\n\nDBLP collaboration network [27]: This is an undirected graph on n = 317080 vertices with 1049866 edges. We compute the leading 500 eigenvectors of the normalized adjacency matrix A\u0303. The smallest of the five hundred eigenvalues is 0.98, so we set f(\u03bb) = I(\u03bb \u2265 0.98) and S = A\u0303 in Algorithm 1, and compare the resulting embedding E\u0303 with E = [v_1 \u00b7\u00b7\u00b7 v_500]. Using this dataset, we demonstrate the dependence of the quality of the embedding E\u0303 returned by the proposed algorithm on two parameters: (i) the number of random starting vectors d, which gives the dimensionality of the embedding, and (ii) the boosting/cascading parameter b.\n\nDependence on the number of random projections d: In Figure (1a), d ranges from 1 to 120 \u2248 9 log n, and we plot the 1-st, 5-th, 25-th, 50-th, 75-th, 95-th and 99-th percentile values of the deviation between the compressive normalized correlation (from the rows of E\u0303) and the corresponding exact normalized correlation (rows of E). The deviation decreases with increasing d, corresponding to \u21132-norm concentration (JL lemma), but this payoff saturates for large values of d as polynomial approximation errors start to dominate. From the 5-th and 95-th percentile curves, we see that a significant fraction (90%) of pairwise normalized correlations in E\u0303 lie within \u00b10.2 of their corresponding values in E when d = 80 \u2248 6 log n. 
For Figure 1a, we use L = 180 matrix-vector products for each randomly picked starting vector and set the cascading parameter b = 2 for the algorithm in Section 4.

Dependence on the cascading parameter b: In Section 4 we described how cascading can help suppress the contribution to the embedding Ẽ of the eigenvectors whose eigenvalues lie in regions where we have set f(λ) = 0. We illustrate the importance of this boosting procedure by comparing the quality of the embedding Ẽ for b = 1 and b = 2, keeping the other parameters of the algorithm in Section 4 fixed (L = 180 matrix-vector products for each of d = 80 randomly picked starting vectors). We report the results in Figure 1b, where we plot percentile values of the compressive normalized correlation (from the rows of Ẽ) against the corresponding exact normalized correlation (rows of E). For b = 1, the polynomial approximation of f(λ) does not sufficiently suppress the eigenvectors whose eigenvalues fall where f(λ) = 0. As a result, the 50th-percentile curve (green) deviates (is biased) from the dotted y = x reference line (Figure 1b, left). This bias disappears for b = 2 (Figure 1b, right).

The running time of our algorithm on a standard workstation was about two orders of magnitude smaller than that of a partial SVD using off-the-shelf sparse eigensolvers (e.g., the 80-dimensional embedding of the leading 500 eigenvectors of the DBLP graph took 1 minute, whereas their exact computation took 105 minutes). A more detailed comparison of running times is beyond the scope of this paper, but it is clear that the promised gains in computational complexity are realized in practice.

Application to graph clustering for the Amazon co-purchasing network [27]: This is an undirected graph on n = 334863 vertices with 925872 edges. We illustrate the potential downstream benefits of our algorithm by applying K-means clustering to embeddings (exact and compressive) of this network.
For the purpose of our comparisons, we compute the first 500 eigenvectors of Ã explicitly using an exact eigensolver, and use an 80-dimensional compressive embedding Ẽ which captures the effect of these, with f(λ) = I(λ ≥ λ500), where λ500 is the 500th eigenvalue. We compare this against the usual spectral embedding using the first 80 eigenvectors of Ã: E = [v1 ··· v80]. We keep the dimension fixed at 80 in the comparison because K-means complexity scales linearly with it, and it quickly becomes the bottleneck. Indeed, our ability to embed a large number of eigenvectors directly into a low-dimensional space (d ≈ 6 log n) has the added benefit of dimensionality reduction within the subspace of interest (in this case, the span of the largest 500 eigenvectors).

We consider 25 instances of K-means clustering with K = 200 throughout, reporting the median of a commonly used graph clustering score, modularity [28] (larger values translate to better clustering solutions). The median modularity for clustering based on our embedding Ẽ is 0.87. This is significantly better than that for E, which yields a median modularity of 0.835. In addition, the computational cost for Ẽ is one-fifth that for E (1.5 minutes versus 10 minutes). When we replace the exact eigenvector embedding E with an approximate eigendecomposition using Randomized SVD [8] (parameters: power iterates q = 5 and excess dimensionality l = 10), the time taken reduces from 10 minutes to 17 seconds, but this comes at the expense of inference quality: median modularity drops to 0.748. On the other hand, the median modularity increases to 0.845 when we consider an exact partial SVD embedding with 120 eigenvectors. This indicates that our compressive embedding yields better clustering quality because it is able to concisely capture more eigenvectors (500 in this example, compared to 80 and 120 with conventional partial SVD).
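Clustering quality above is scored with modularity [28]. For concreteness, here is a minimal dense-matrix sketch of the modularity computation (the experiments would use sparse matrices, with labels produced by K-means on the embedding rows; the function name is ours):

```python
import numpy as np

def modularity(A, labels):
    """Newman modularity Q of a hard clustering of an undirected graph.

    A: dense symmetric adjacency matrix; labels: cluster id per vertex.
    Q = (1/2m) * sum_ij (A_ij - k_i k_j / 2m) * [c_i == c_j],
    where k_i is the degree of vertex i and 2m is the total degree.
    """
    k = A.sum(axis=1)                          # vertex degrees
    two_m = k.sum()                            # 2m = twice the edge count
    same = labels[:, None] == labels[None, :]  # same-cluster indicator
    return float(np.sum((A - np.outer(k, k) / two_m) * same) / two_m)

# Two disjoint triangles, clustered into their natural communities:
# the perfect split of two equal components gives Q = 0.5.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    A[i, j] = A[j, i] = 1.0
labels = np.array([0, 0, 0, 1, 1, 1])
Q = modularity(A, labels)
```

Larger Q means more intra-cluster edge weight than expected under a random degree-preserving rewiring, which is why it serves as the clustering score in the comparison above.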
It is worth pointing out that, even when the eigenvectors are known, the number of dominant eigenvectors k that yields the best inference performance is often unknown a priori, and is treated as a hyper-parameter. For compressive spectral embedding Ẽ, an elegant approach for implicitly optimizing over k is to use the embedding function f(λ) = I(λ ≥ c), with c as a hyper-parameter.

6 Conclusion

We have shown that random projections and polynomial expansions provide a powerful approach for spectral embedding of large matrices: for an m × n matrix A with T non-zeros, our O((T + m + n) log(m + n)) algorithm computes an O(log(m + n))-dimensional compressive embedding that provably approximates pairwise distances between points in the desired spectral embedding. Numerical results for several real-world data sets show that our method provides good approximations for embeddings based on partial SVD, while incurring much lower complexity. Moreover, our method can also approximate spectral embeddings which depend on the entire SVD, since its complexity does not depend on the number of dominant vectors whose effect we wish to model. A glimpse of this potential is provided by the example of K-means-based clustering for estimating sparse cuts of the Amazon graph, where our method yields much better performance (as measured by graph metrics) than a partial SVD with significantly higher complexity. This motivates further investigation into applications of this approach to improving downstream inference tasks in a variety of large-scale problems.

Acknowledgments
This work is supported in part by DARPA GRAPHS (BAA-12-01) and by Systems on Nanoscale Information fabriCs (SONIC), one of the six SRC STARnet Centers, sponsored by MARCO and DARPA. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.

References

[1] B. 
Schölkopf, A. Smola, and K.-R. Müller, "Kernel principal component analysis," in Artificial Neural Networks — ICANN'97, ser. Lecture Notes in Computer Science, W. Gerstner, A. Germond, M. Hasler, and J.-D. Nicoud, Eds. Springer Berlin Heidelberg, 1997, pp. 583–588.

[2] S. Mika, B. Schölkopf, A. J. Smola, K.-R. Müller, M. Scholz, and G. Rätsch, "Kernel PCA and de-noising in feature spaces," in Advances in Neural Information Processing Systems, 1999.

[3] S. White and P. Smyth, "A spectral clustering approach to finding communities in graphs," in SDM, vol. 5. SIAM, 2005.

[4] F. Göbel and A. A. Jagers, "Random walks on graphs," Stochastic Processes and their Applications, 1974.

[5] R. R. Nadakuditi and M. E. J. Newman, "Graph spectra and the detectability of community structure in networks," Physical Review Letters, 2012.

[6] C. Fowlkes, S. Belongie, F. Chung, and J. Malik, "Spectral grouping using the Nyström method," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 2, 2004.

[7] P. Drineas and M. W. Mahoney, "On the Nyström method for approximating a Gram matrix for improved kernel-based learning," Journal of Machine Learning Research, 2005.

[8] N. Halko, P. G. Martinsson, and J. A. Tropp, "Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions," SIAM Review, 2011.

[9] "Python implementation of FastEmbed." [Online]. Available: https://bitbucket.org/dineshkr/fastembed/src/NIPS2015

[10] D. Achlioptas, "Database-friendly random projections," in Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, ser. PODS '01, 2001.

[11] E. Candès and M. 
Wakin, "An introduction to compressive sampling," IEEE Signal Processing Magazine, March 2008.

[12] L. N. Trefethen and D. Bau, Numerical Linear Algebra. SIAM, 1997.

[13] S. F. McCormick and T. Noe, "Simultaneous iteration for the matrix eigenvalue problem," Linear Algebra and its Applications, vol. 16, no. 1, pp. 43–56, 1977.

[14] K. Zhang, I. W. Tsang, and J. T. Kwok, "Improved Nyström low-rank approximation and error analysis," in Proceedings of the 25th International Conference on Machine Learning, ser. ICML '08. ACM, 2008.

[15] D. Yan, L. Huang, and M. I. Jordan, "Fast approximate spectral clustering," in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '09. ACM, 2009.

[16] M. Li, J. T. Kwok, and B.-L. Lu, "Making large-scale Nyström approximation possible," in ICML, 2010.

[17] S. Kumar, M. Mohri, and A. Talwalkar, "Ensemble Nyström method," in Advances in Neural Information Processing Systems, 2009.

[18] F. Lin and W. W. Cohen, "Power iteration clustering," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010.

[19] F. Lin, "Scalable methods for graph-based unsupervised and semi-supervised learning," Ph.D. dissertation, Carnegie Mellon University, 2012.

[20] W. Yan, U. Brahmakshatriya, Y. Xue, M. Gilder, and B. Wise, "PIC: Parallel power iteration clustering for big data," Journal of Parallel and Distributed Computing, 2013.

[21] L. Lovász, "Random walks on graphs: A survey," Combinatorics, Paul Erdős is Eighty, vol. 2, no. 1, pp. 1–46, 1993.

[22] D. A. Spielman and S.-H. 
Teng, "Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems," in Proceedings of the Thirty-sixth Annual ACM Symposium on Theory of Computing, ser. STOC '04. New York, NY, USA: ACM, 2004.

[23] D. Spielman and S. Teng, "Nearly linear time algorithms for preconditioning and solving symmetric, diagonally dominant linear systems," SIAM Journal on Matrix Analysis and Applications, vol. 35, Jan. 2014.

[24] D. Spielman and N. Srivastava, "Graph sparsification by effective resistances," SIAM Journal on Computing, 2011.

[25] R. N. Silver, H. Roeder, A. F. Voter, and J. D. Kress, "Kernel polynomial approximations for densities of states and spectral functions," Journal of Computational Physics, vol. 124, no. 1, pp. 115–130, Mar. 1996.

[26] E. Di Napoli, E. Polizzi, and Y. Saad, "Efficient estimation of eigenvalue counts in an interval," arXiv:1308.4275 [cs], Aug. 2013.

[27] J. Yang and J. Leskovec, "Defining and evaluating network communities based on ground-truth," in 2012 IEEE 12th International Conference on Data Mining (ICDM), Dec. 2012.

[28] S. Fortunato, "Community detection in graphs," Physics Reports, vol. 486, no. 3-5, Feb. 2010.