{"title": "Improved Distributed Principal Component Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 3113, "page_last": 3121, "abstract": "We study the distributed computing setting in which there are multiple servers, each holding a set of points, who wish to compute functions on the union of their point sets. A key task in this setting is Principal Component Analysis (PCA), in which the servers would like to compute a low dimensional subspace capturing as much of the variance of the union of their point sets as possible. Given a procedure for approximate PCA, one can use it to approximately solve problems such as $k$-means clustering and low rank approximation. The essential properties of an approximate distributed PCA algorithm are its communication cost and computational efficiency for a given desired accuracy in downstream applications. We give new algorithms and analyses for distributed PCA which lead to improved communication and computational costs for $k$-means clustering and related problems. Our empirical study on real world data shows a speedup of orders of magnitude, preserving communication with only a negligible degradation in solution quality. 
Some of these techniques we develop, such as input-sparsity subspace embeddings with high correctness probability with a dimension and sparsity independent of the error probability, may be of independent interest.", "full_text": "Improved Distributed Principal Component Analysis\n\nMaria-Florina Balcan\n\nSchool of Computer Science\nCarnegie Mellon University\nninamf@cs.cmu.edu\n\nVandana Kanchanapally\nSchool of Computer Science\n\nGeorgia Institute of Technology\n\nvvandana@gatech.edu\n\nYingyu Liang\n\nDepartment of Computer Science\n\nPrinceton University\n\nyingyul@cs.princeton.edu\n\nDavid Woodruff\n\nAlmaden Research Center\n\nIBM Research\n\ndpwoodru@us.ibm.com\n\nAbstract\n\nWe study the distributed computing setting in which there are multiple servers,\neach holding a set of points, who wish to compute functions on the union of their\npoint sets. A key task in this setting is Principal Component Analysis (PCA), in\nwhich the servers would like to compute a low dimensional subspace capturing as\nmuch of the variance of the union of their point sets as possible. Given a proce-\ndure for approximate PCA, one can use it to approximately solve problems such\nas k-means clustering and low rank approximation. The essential properties of an\napproximate distributed PCA algorithm are its communication cost and computa-\ntional ef\ufb01ciency for a given desired accuracy in downstream applications. 
We give new algorithms and analyses for distributed PCA which lead to improved communication and computational costs for k-means clustering and related problems. Our empirical study on real world data shows a speedup of orders of magnitude, preserving communication with only a negligible degradation in solution quality. Some of these techniques we develop, such as a general transformation from a constant success probability subspace embedding to a high success probability subspace embedding with a dimension and sparsity independent of the success probability, may be of independent interest.\n\n1 Introduction\n\nSince data is often partitioned across multiple servers [20, 7, 18], there is an increased interest in computing on it in the distributed model. A basic tool for distributed data analysis is Principal Component Analysis (PCA). The goal of PCA is to find an r-dimensional (affine) subspace that captures as much of the variance of the data as possible. Hence, it can reveal low-dimensional structure in very high dimensional data. Moreover, it can serve as a preprocessing step to reduce the data dimension in various machine learning tasks, such as Non-Negative Matrix Factorization (NNMF) [15] and Latent Dirichlet Allocation (LDA) [3].\n\nIn the distributed model, approximate PCA was used by Feldman et al. [9] for solving a number of shape fitting problems such as k-means clustering, where the approximation is in the form of a coreset, and has the property that local coresets can be easily combined across servers into a global coreset, thereby providing an approximate PCA to the union of the data sets. Designing small coresets therefore leads to communication-efficient protocols. Coresets have the nice property that their size typically does not depend on the number n of points being approximated.
A beautiful property of the coresets developed in [9] is that for approximate PCA their size also only depends linearly on the dimension d, whereas previous coresets depended quadratically on d [8]. This gives the best known communication protocols for approximate PCA and k-means clustering.\n\nDespite this recent exciting progress, several important questions remain. First, can we improve the communication further as a function of the number of servers, the approximation error, and other parameters of the downstream applications (such as the number k of clusters in k-means clustering)? Second, while preserving optimal or nearly-optimal communication, can we improve the computational costs of the protocols? We note that in the protocols of Feldman et al. each server has to run a singular value decomposition (SVD) on her local data set, while additional work needs to be performed to combine the outputs of each server into a global approximate PCA. Third, are these algorithms practical and do they scale well with large-scale datasets? In this paper we give answers to the above questions. To state our results more precisely, we first define the model and the problems.\n\nCommunication Model. In the distributed setting, we consider a set of $s$ nodes $V = \{v_i, 1 \le i \le s\}$, each of which can communicate with a central coordinator $v_0$. On each node $v_i$, there is a local data matrix $P_i \in \mathbb{R}^{n_i \times d}$ having $n_i$ data points in $d$ dimensions ($n_i > d$). The global data $P \in \mathbb{R}^{n \times d}$ is then the concatenation of the local data matrices, i.e., $P^\top = [P_1^\top, P_2^\top, \dots, P_s^\top]$ and $n = \sum_{i=1}^s n_i$. Let $p_i$ denote the $i$-th row of $P$. Throughout the paper, we assume that the data points are centered to have zero mean, i.e., $\sum_{i=1}^n p_i = 0$. Uncentered data requires a rank-one modification to the algorithms, whose communication and computation costs are dominated by those in the other steps.\n\nApproximate PCA and $\ell_2$-Error Fitting. For a matrix $A = [a_{ij}]$, let $\|A\|_F^2 = \sum_{i,j} a_{ij}^2$ be its Frobenius norm, and let $\sigma_i(A)$ be the $i$-th singular value of $A$. Let $A^{(t)}$ denote the matrix that contains the first $t$ columns of $A$. Let $L_X$ denote the linear subspace spanned by the columns of $X$. For a point $p$, let $\pi_L(p)$ be its projection onto subspace $L$ and let $\pi_X(p)$ be shorthand for $\pi_{L_X}(p)$. For a point $p \in \mathbb{R}^d$ and a subspace $L \subseteq \mathbb{R}^d$, we denote the squared distance between $p$ and $L$ by\n\n$d^2(p, L) := \min_{q \in L} \|p - q\|_2^2 = \|p - \pi_L(p)\|_2^2.$\n\nDefinition 1. The linear (or affine) $r$-Subspace $k$-Clustering on $P \in \mathbb{R}^{n \times d}$ is\n\n$\min_{\mathcal{L}} d^2(P, \mathcal{L}) := \sum_{i=1}^n \min_{L \in \mathcal{L}} d^2(p_i, L) \quad (1)$\n\nwhere $P$ is an $n \times d$ matrix whose rows are $p_1, \dots, p_n$, and $\mathcal{L} = \{L_j\}_{j=1}^k$ is a set of $k$ centers, each of which is an $r$-dimensional linear (or affine) subspace.\n\nPCA is a special case when $k = 1$ and the center is an $r$-dimensional subspace. This optimal $r$-dimensional subspace is spanned by the top $r$ right singular vectors of $P$, also known as the principal components, and can be found using the singular value decomposition (SVD). Another special case of the above is k-means clustering when the centers are points ($r = 0$).
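As a small illustration of objective (1), the following numpy sketch (our own helper, not from the paper) evaluates the clustering cost when each center is a linear subspace given by an orthonormal basis:

```python
import numpy as np

def subspace_clustering_cost(P, centers):
    """Objective (1): each row p of P pays its squared distance to the nearest
    center; a linear r-dimensional center L with orthonormal basis X (d x r)
    has d^2(p, L) = ||p - X X^T p||^2."""
    per_center = [((P - P @ X @ X.T) ** 2).sum(axis=1) for X in centers]
    return np.minimum.reduce(per_center).sum()
```

With $k = 1$ this is exactly the PCA objective, which is minimized by taking $X$ to be the top $r$ right singular vectors of $P$.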
Constrained versions of this problem include NNMF, where the $r$-dimensional subspace should be spanned by positive vectors, and LDA, which assumes a prior distribution defining a probability for each $r$-dimensional subspace. We will primarily be concerned with relative-error approximation algorithms, for which we would like to output a set $\mathcal{L}'$ of $k$ centers for which $d^2(P, \mathcal{L}') \le (1 + \epsilon) \min_{\mathcal{L}} d^2(P, \mathcal{L})$.\n\nFor approximate distributed PCA, the following protocol is implicit in [9]: each server $i$ computes its top $O(r/\epsilon)$ principal components $Y_i$ of $P_i$ and sends them to the coordinator. The coordinator stacks the $O(r/\epsilon) \times d$ matrices $Y_i$ on top of each other, forming an $O(sr/\epsilon) \times d$ matrix $Y$, computes the top $r$ principal components of $Y$, and returns these to the servers. This provides a relative-error approximation to the PCA problem. We refer to this algorithm as Algorithm disPCA.\n\nOur Contributions. Our results are summarized as follows.\n\nImproved Communication: We improve the communication cost for using distributed PCA for k-means clustering and similar $\ell_2$-fitting problems. The best previous approach is to use Corollary 4.5 in [9], which shows that given a data matrix $P$, if we project the rows onto the space spanned by the top $O(k/\epsilon^2)$ principal components and solve the k-means problem in this subspace, we obtain a $(1 + \epsilon)$-approximation. In the distributed setting, this would require first running Algorithm disPCA with parameter $r = O(k/\epsilon^2)$, and thus communication at least $O(skd/\epsilon^3)$ to compute the $O(k/\epsilon^2)$ global principal components.
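A minimal numpy sketch of the disPCA protocol just described (function names are ours, not the paper's; each server transmits only the small matrix returned by `local_summary`):

```python
import numpy as np

def local_summary(P_i, t1):
    # Local stage: keep the top-t1 singular values and right singular
    # vectors; the server sends the t1 x d matrix Sigma_i^(t1) (V_i^(t1))^T.
    _, S, Vt = np.linalg.svd(P_i, full_matrices=False)
    return S[:t1, None] * Vt[:t1]

def dispca(local_data, t1, t2):
    # Global stage: the coordinator stacks the s local summaries into Y and
    # returns the top-t2 right singular vectors of Y as the global subspace.
    Y = np.vstack([local_summary(P_i, t1) for P_i in local_data])
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    return Vt[:t2].T  # d x t2, orthonormal columns
```

When the points exactly span an $r$-dimensional subspace, running with $t_1 = t_2 = r$ recovers the global principal subspace exactly; in general $t_1$ must be taken larger, as quantified by Theorem 2.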
Then one can solve a distributed k-means problem in this subspace, and an $\alpha$-approximation in it translates to an overall $\alpha(1 + \epsilon)$ approximation.\n\nOur Theorem 3 shows that it suffices to run Algorithm disPCA while only incurring $O(skd/\epsilon^2)$ communication to compute the $O(k/\epsilon^2)$ global principal components, preserving the k-means solution cost up to a $(1 + \epsilon)$-factor. Our communication is thus a $1/\epsilon$ factor better, and illustrates that for downstream applications it is sometimes important to \u201copen up the box\u201d rather than to directly use the guarantees of a generic PCA algorithm (which would give $O(skd/\epsilon^3)$ communication). One feature of this approach is that by using the distributed k-means algorithm in [2] on the projected data, the coordinator can sample points from the servers proportional to their local k-means cost solutions, which reduces the communication roughly by a factor of $s$ compared to each server sending its local k-means coreset to the coordinator. Furthermore, before applying the above approach, one can first run any other dimension reduction to dimension $d'$ so that the k-means cost is preserved up to certain accuracy. For example, if we want a $1 + \epsilon$ approximation factor, we can set $d' = O(\log n/\epsilon^2)$ by a Johnson-Lindenstrauss transform; if we want a larger $2 + \epsilon$ approximation factor, we can set $d' = O(k/\epsilon^2)$ using [4]. In this way the parameter $d$ in the above communication cost bound can be replaced by $d'$. Note that unlike these dimension reductions, our algorithm for projecting onto principal components is deterministic and does not incur error probability.\n\nImproved Computation: We turn to the computational cost of Algorithm disPCA, which to the best of our knowledge has not been addressed.
A major bottleneck is that each server is computing a singular value decomposition (SVD) of its point set $P_i$, which takes $\min(n_i d^2, n_i^2 d)$ time. We change Algorithm disPCA to instead have each server first sample an oblivious subspace embedding (OSE) [22, 5, 19, 17] matrix $H_i$, and instead run the algorithm on the point set defined by the rows of $H_i P_i$. Using known OSEs, one can choose $H_i$ to have only a single non-zero entry per column, and thus $H_i P_i$ can be computed in $\mathrm{nnz}(P_i)$ time. Moreover, the number of rows of $H_i$ is $O(d^2/\epsilon^2)$, which may be significantly less than the original number of rows $n_i$. This number of rows can be further reduced to $O(d \log^{O(1)} d/\epsilon^2)$ if one is willing to spend $O(\mathrm{nnz}(P_i) \log^{O(1)} d/\epsilon)$ time [19]. We note that the number of non-zero entries of $H_i P_i$ is no more than that of $P_i$.\n\nOne technical issue is that each of the $s$ servers is locally performing a subspace embedding, which succeeds with only constant probability. If we want a single non-zero entry per column of $H_i$, to achieve success probability $1 - O(1/s)$ so that we can union bound over all $s$ servers succeeding, we naively would need to increase the number of rows of $H_i$ by a factor linear in $s$. We give a general technique, which takes a subspace embedding that succeeds with constant probability as a black box, and show how to perform a procedure which applies it $O(\log 1/\delta)$ times independently and from these applications finds one which is guaranteed to succeed with probability $1 - \delta$. Thus, in this setting the servers can compute a subspace embedding of their data in $\mathrm{nnz}(P_i)$ time, for which the number of non-zero entries of $H_i P_i$ is no larger than that of $P_i$, and without incurring this additional factor of $s$. This may be of independent interest.\n\nIt may still be expensive to perform the SVD of $H_i P_i$ and for the coordinator to perform an SVD on $Y$ in Algorithm disPCA.
We therefore replace the SVD computation with a randomized approximate SVD computation with spectral norm error. Our contribution here is to analyze the error in distributed PCA and k-means after performing these speedups.\n\nEmpirical Results: Our speedups result in significant computational savings. The randomized techniques we use reduce the time by orders of magnitude on medium and large-scale data sets, while preserving the communication cost. Although the theory predicts a new small additive error because of our speedups, in our experiments the solution quality was only negligibly affected.\n\nRelated Work: A number of algorithms for approximate distributed PCA have been proposed [21, 14, 16, 9], but either without theoretical guarantees, or without considering communication. Most closely related to our work are [9, 12]. [9] observes that the top singular vectors of the local data are its summary and the union of these summaries is a summary of the global data, i.e., Algorithm disPCA. [12] studies algorithms in the arbitrary partition model, in which each server holds a matrix $P_i$ and $P = \sum_{i=1}^s P_i$. More details and more related work can be found in the appendix.\n\n2 Tradeoff between Communication and Solution Quality\n\nAlgorithm disPCA for distributed PCA is suggested in [21, 9], and consists of a local stage and a global stage. In the local stage, each node performs SVD on its local data matrix, and communicates the first $t_1$ singular values $\Sigma_i^{(t_1)}$ and the first $t_1$ right singular vectors $V_i^{(t_1)}$ to the central coordinator. Then in the global stage, the coordinator concatenates the matrices $\Sigma_i^{(t_1)} (V_i^{(t_1)})^\top$ to form a matrix $Y$, and performs SVD on it to get the first $t_2$ right singular vectors.\n\nTo get some intuition, consider the easy case when the data points actually lie in an $r$-dimensional subspace. We can run Algorithm disPCA with $t_1 = t_2 = r$.
Since $P_i$ has rank $r$, its projection onto the subspace spanned by its first $t_1 = r$ right singular vectors, $\hat{P}_i = U_i \Sigma_i^{(r)} (V_i^{(r)})^\top$, is identical to $P_i$. Then we only need to do PCA on $\hat{P}$, the concatenation of the $\hat{P}_i$. Observing that $\hat{P} = \tilde{U} Y$ where $\tilde{U}$ is orthonormal, it suffices to compute the SVD of $Y$, and only $\Sigma_i^{(r)} (V_i^{(r)})^\top$ needs to be communicated.\n\n[Figure 1: The key points of Algorithm disPCA. Each local data matrix $P_i$ is summarized by a local PCA as $Y_i = \Sigma_i^{(t_1)} (V_i^{(t_1)})^\top$; the coordinator stacks the $Y_i$ into $Y$, and a global PCA on $Y$ yields $V^{(t_2)}$.]\n\nIn the general case when the data may have rank higher than $r$, it turns out that one needs to set $t_1$ sufficiently large, so that $\hat{P}_i$ approximates $P_i$ well enough and does not introduce too much error into the final solution. In particular, the following close projection property of the SVD is useful:\n\nLemma 1. Suppose $A$ has SVD $A = U \Sigma V^\top$ and let $\hat{A} = A V^{(t)} (V^{(t)})^\top$ denote its SVD truncation. If $t = O(r/\epsilon)$, then for any $d \times r$ matrix $X$ with orthonormal columns,\n\n$0 \le \|AX - \hat{A}X\|_F^2 \le \epsilon d^2(A, L_X)$, and $0 \le \|AX\|_F^2 - \|\hat{A}X\|_F^2 \le \epsilon d^2(A, L_X).$\n\nThis means that the projections of $\hat{A}$ and $A$ on any $r$-dimensional subspace are close, when the projected dimension $t$ is sufficiently large compared to $r$. Now, note that the difference between $\|P - P X X^\top\|_F^2$ and $\|\hat{P} - \hat{P} X X^\top\|_F^2$ is only related to $\|PX\|_F^2 - \|\hat{P}X\|_F^2 = \sum_i [\|P_i X\|_F^2 - \|\hat{P}_i X\|_F^2]$, each term of which is bounded by the lemma. So we can use $\hat{P}$ as a proxy for $P$ in the PCA task. Again, computing PCA on $\hat{P}$ is equivalent to computing the SVD of $Y$, as done in Algorithm disPCA. These lead to the following theorem, which is implicit in [9], stating that the algorithm can produce a $(1 + \epsilon)$-approximation for the distributed PCA problem.\n\nTheorem 2. Suppose Algorithm disPCA takes parameters $t_1 \ge r + \lceil 4r/\epsilon \rceil - 1$ and $t_2 = r$. Then\n\n$\|P - P V^{(r)} (V^{(r)})^\top\|_F^2 \le (1 + \epsilon) \min_X \|P - P X X^\top\|_F^2$\n\nwhere the minimization is over $d \times r$ orthonormal matrices $X$. The communication is $O(\frac{srd}{\epsilon})$ words.\n\n2.1 Guarantees for Distributed $\ell_2$-Error Fitting\n\nAlgorithm disPCA can also be used as a pre-processing step for applications such as $\ell_2$-error fitting. In this section, we prove the correctness of Algorithm disPCA as pre-processing for these applications. In particular, we show that by setting $t_1, t_2$ sufficiently large, the objective value of any solution merely changes when the original data $P$ is replaced by the projected data $\tilde{P} = P V^{(t_2)} (V^{(t_2)})^\top$. Therefore, the projected data serves as a proxy for the original data, i.e., any distributed algorithm can be applied on the projected data to get a solution on the original data. As the dimension is lower, the communication cost is reduced. Formally,\n\nTheorem 3. Let $t_1 = t_2 = O(rk/\epsilon^2)$ in Algorithm disPCA for $\epsilon \in (0, 1/3)$.
Then there exists a constant $c_0 \ge 0$ such that for any set of $k$ centers $\mathcal{L}$ in $r$-Subspace $k$-Clustering,\n\n$(1 - \epsilon) d^2(P, \mathcal{L}) \le d^2(\tilde{P}, \mathcal{L}) + c_0 \le (1 + \epsilon) d^2(P, \mathcal{L}).$\n\nThe theorem implies that any $\alpha$-approximate solution $\mathcal{L}$ on the projected data $\tilde{P}$ is a $(1 + 3\epsilon)\alpha$-approximation on the original data $P$. To see this, let $\mathcal{L}^*$ denote the optimal solution. Then\n\n$(1 - \epsilon) d^2(P, \mathcal{L}) \le d^2(\tilde{P}, \mathcal{L}) + c_0 \le \alpha d^2(\tilde{P}, \mathcal{L}^*) + c_0 \le \alpha (1 + \epsilon) d^2(P, \mathcal{L}^*)$\n\nwhich leads to $d^2(P, \mathcal{L}) \le (1 + 3\epsilon) \alpha d^2(P, \mathcal{L}^*)$. In other words, the distributed PCA step only introduces a small multiplicative approximation factor of $(1 + 3\epsilon)$.\n\nAlgorithm 1 Distributed k-means clustering\nInput: $\{P_i\}_{i=1}^s$, $k \in \mathbb{N}_+$ and $\epsilon \in (0, 1/2)$, a non-distributed $\alpha$-approximation algorithm $A_\alpha$.\n1: Run Algorithm disPCA with $t_1 = t_2 = O(k/\epsilon^2)$ to get $V$, and send $V$ to all nodes.\n2: Run the distributed k-means clustering algorithm in [2] on $\{P_i V V^\top\}_{i=1}^s$, using $A_\alpha$ as a subroutine, to get $k$ centers $\mathcal{L}$.\nOutput: $\mathcal{L}$.\n\nThe key to proving the theorem is the close projection property of the algorithm (Lemma 4): for any low dimensional subspace spanned by $X$, the projections of $P$ and $\tilde{P}$ on the subspace are close. In particular, we choose $X$ to be the orthonormal basis of the subspace spanning the centers.
Then the difference between the objective values of $P$ and $\tilde{P}$ can be decomposed into two terms depending only on $\|PX - \tilde{P}X\|_F^2$ and $\|PX\|_F^2 - \|\tilde{P}X\|_F^2$ respectively, which are small as shown by the lemma. The complete proof of Theorem 3 is provided in the appendix.\n\nLemma 4. Let $t_1 = t_2 = O(k/\epsilon)$ in Algorithm disPCA. Then for any $d \times k$ matrix $X$ with orthonormal columns,\n\n$0 \le \|PX - \tilde{P}X\|_F^2 \le \epsilon d^2(P, L_X)$, and $0 \le \|PX\|_F^2 - \|\tilde{P}X\|_F^2 \le \epsilon d^2(P, L_X).$\n\nProof Sketch: We first introduce some auxiliary variables for the analysis, which act as intermediate connections between $P$ and $\tilde{P}$. Imagine we perform two kinds of projections: first project $P_i$ to $\hat{P}_i = P_i V_i^{(t_1)} (V_i^{(t_1)})^\top$, then project $\hat{P}_i$ to $\bar{P}_i = \hat{P}_i V^{(t_2)} (V^{(t_2)})^\top$. Let $\hat{P}$ denote the vertical concatenation of the $\hat{P}_i$ and let $\bar{P}$ denote the vertical concatenation of the $\bar{P}_i$. These variables are designed so that the difference between $P$ and $\hat{P}$ and that between $\hat{P}$ and $\bar{P}$ are easily bounded.\n\nOur proof then proceeds by first bounding these differences, and then bounding that between $P$ and $\tilde{P}$. In the following we sketch the proof for the second statement, while the other statement can be proved by a similar argument. See the appendix for details.\n\n$\|PX\|_F^2 - \|\tilde{P}X\|_F^2 = \left[ \|PX\|_F^2 - \|\hat{P}X\|_F^2 \right] + \left[ \|\hat{P}X\|_F^2 - \|\bar{P}X\|_F^2 \right] + \left[ \|\bar{P}X\|_F^2 - \|\tilde{P}X\|_F^2 \right].$\n\nThe first term is just $\sum_{i=1}^s \left[ \|P_i X\|_F^2 - \|\hat{P}_i X\|_F^2 \right]$, each term of which can be bounded by Lemma 1, since $\hat{P}_i$ is the SVD truncation of $P_i$. The second term can be bounded similarly. The more difficult part is the third term. Note that $\bar{P}_i = \hat{P}_i Z$ and $\tilde{P}_i = P_i Z$ where $Z := V^{(t_2)} (V^{(t_2)})^\top X$, leading to $\|\bar{P}X\|_F^2 - \|\tilde{P}X\|_F^2 = \sum_{i=1}^s \left[ \|\hat{P}_i Z\|_F^2 - \|P_i Z\|_F^2 \right]$. Although $Z$ is not orthonormal as required by Lemma 1, we prove a generalization (Lemma 7 in the appendix) which can be applied to show that the third term is indeed small.\n\nApplication to k-Means Clustering: To see the implication, consider the k-means clustering problem. We can first perform any other possible dimension reduction to dimension $d'$ so that the k-means cost is preserved up to accuracy $\epsilon$, then run Algorithm disPCA, and finally run any distributed k-means clustering algorithm on the projected data to get a good approximate solution. For example, in the first step we can set $d' = O(\log n/\epsilon^2)$ using a Johnson-Lindenstrauss transform, or we can perform no reduction and simply use the original data.\n\nAs a concrete example, we can use the original data ($d' = d$), then run Algorithm disPCA, and finally run the distributed clustering algorithm in [2], which uses any non-distributed $\alpha$-approximation algorithm as a subroutine and computes a $(1 + \epsilon)\alpha$-approximate solution. The resulting algorithm is presented in Algorithm 1.\n\nTheorem 5. With probability at least $1 - \delta$, Algorithm 1 outputs a $(1 + \epsilon)^2 \alpha$-approximate solution for distributed k-means clustering. The total communication cost of Algorithm 1 is $O(\frac{sk}{\epsilon^2})$ vectors in $\mathbb{R}^d$ plus $O\left( \frac{1}{\epsilon^4} \left( \frac{k^2}{\epsilon^2} + \log \frac{1}{\delta} \right) + sk \log \frac{sk}{\delta} \right)$ vectors in $\mathbb{R}^{O(k/\epsilon^2)}$.\n\n3 Fast Distributed PCA\n\nSubspace Embeddings. One can significantly improve the running time of the distributed PCA algorithms by using subspace embeddings, while keeping similar guarantees as in Lemma 4, which suffice for $\ell_2$-error fitting.
More precisely, a subspace embedding matrix $H \in \mathbb{R}^{\ell \times n}$ for a matrix $A \in \mathbb{R}^{n \times d}$ has the property that for all vectors $y \in \mathbb{R}^d$, $\|HAy\|_2 = (1 \pm \epsilon) \|Ay\|_2$. Suppose each node $v_i$ independently chooses a random subspace embedding matrix $H_i$ for its local data $P_i$. Then, they run Algorithm disPCA on the embedded data $\{H_i P_i\}_{i=1}^s$ instead of on the original data $\{P_i\}_{i=1}^s$.\n\nThe work of [22] pioneered subspace embeddings. The recent fast sparse subspace embeddings [5] and their optimizations [17, 19] are particularly suitable for large scale sparse data sets, since their running time is linear in the number of non-zero entries in the data matrix, and they also preserve the sparsity of the data. The algorithm takes as input an $n \times d$ matrix $A$ and a parameter $\ell$, and outputs an $\ell \times d$ embedded matrix $A' = HA$ (the embedding matrix $H$ does not need to be built explicitly). The embedded matrix is constructed as follows: initialize $A' = 0$; for each row of $A$, multiply it by $+1$ or $-1$ with equal probability, then add it to a row of $A'$ chosen uniformly at random.\n\nThe success probability of such an embedding is constant, while we need it to be $1 - \delta$ where $\delta = \Theta(1/s)$. Known results which preserve the number of non-zero entries of $H$ to be 1 per column increase the dimension of $H$ by a factor of $s$. To avoid this, we propose an approach to boost the success probability by computing $O(\log \frac{1}{\delta})$ independent embeddings, each with only constant success probability, and then running a cross validation style procedure to find one which succeeds with probability $1 - \delta$.
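A compact numpy sketch of this sparse embedding, together with the cross-validation style selection just outlined (helper names and the constant in the singular-value test are ours; the paper's exact procedure is Algorithm 4 in its appendix):

```python
import numpy as np

def sparse_embed(A, l, rng):
    """CountSketch-style embedding: the implicit l x n matrix H has one +-1
    entry per column, so H @ A costs nnz(A) time and H is never built."""
    n, d = A.shape
    target = rng.integers(0, l, size=n)     # output row hit by each input row
    sign = rng.choice((-1.0, 1.0), size=n)  # random +-1 per input row
    A_emb = np.zeros((l, d))
    np.add.at(A_emb, target, sign[:, None] * A)
    return A_emb

def choose_embedding(A, l, eps, reps, rng):
    """Compute `reps` independent constant-probability embeddings and keep one
    that agrees with at least half of the others on the singular-value test."""
    sketches = [sparse_embed(A, l, rng) for _ in range(reps)]
    svds = [np.linalg.svd(S, full_matrices=False) for S in sketches]
    for j, (_, s_j, vt_j) in enumerate(svds):
        votes = 0
        for jp, (_, s_jp, vt_jp) in enumerate(svds):
            if jp == j:
                continue
            # Singular values of Sigma_j V_j^T V_j' Sigma_j'^{-1} near 1
            # certify ||H_j A x|| ~ ||H_j' A x|| for all x.
            M = (s_j[:, None] * vt_j) @ vt_jp.T / s_jp[None, :]
            if np.all(np.abs(np.linalg.svd(M, compute_uv=False) - 1.0) <= eps):
                votes += 1
        if 2 * votes >= reps - 1:
            return sketches[j]
    return sketches[0]  # no candidate passed; fall back to the first sketch
```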
More precisely, we compute the SVD of all embedded matrices $H_j A = U_j \Sigma_j V_j^\top$, and find a $j \in [r]$ such that for at least half of the indices $j' \ne j$, all singular values of $\Sigma_j V_j^\top V_{j'} \Sigma_{j'}^{-1}$ are in $[1 \pm O(\epsilon)]$ (see Algorithm 4 in the appendix). The reason why such an embedding $H_j A$ succeeds with high probability is as follows. Any two successful embeddings $H_j A$ and $H_{j'} A$, by definition, satisfy $\|H_j A x\|_2 = (1 \pm O(\epsilon)) \|H_{j'} A x\|_2$ for all $x$, which we show is equivalent to passing the test on the singular values. Since with probability at least $1 - \delta$, a $9/10$ fraction of the embeddings are successful, it follows that the one we choose is successful with probability $1 - \delta$.\n\nRandomized SVD. The exact SVD of an $n \times d$ matrix is impractical when $n$ or $d$ is large. Here we show that the randomized SVD algorithm from [11] can be applied to speed up the computation without compromising the quality of the solution much. We need to use their specific form of randomized SVD since the error is with respect to the spectral norm, rather than the Frobenius norm, and so can be much smaller as needed by our applications.\n\nThe algorithm first probes the row space of the $\ell \times d$ input matrix $A$ with an $\ell \times 2t$ random matrix $\Omega$ and orthogonalizes the image of $\Omega$ to get a basis $Q$ (i.e., QR-factorize $A^\top \Omega$); it then projects the data onto this basis and computes the SVD factorization of the smaller matrix $AQ$. It also performs $q$ power iterations to push the basis towards the top $t$ singular vectors.\n\nFast Distributed PCA for $\ell_2$-Error Fitting. We modify Algorithm disPCA by first having each node compute a subspace embedding locally, and then replacing each SVD invocation with a randomized SVD invocation. We thus arrive at Algorithm 2.
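The randomized SVD routine just described can be sketched as follows (a hedged numpy sketch of the method of [11] with our own parameter names, not the authors' implementation):

```python
import numpy as np

def randomized_svd(A, t, q, rng):
    """Approximate rank-t SVD of A (l x d) with spectral-norm error control:
    probe the row space with a random l x 2t matrix, orthogonalize, run q
    power iterations, then factorize the small projected matrix A Q."""
    l, d = A.shape
    Omega = rng.standard_normal((l, 2 * min(t, d)))
    Q, _ = np.linalg.qr(A.T @ Omega)  # basis for the (approximate) row space
    for _ in range(q):                # power iterations sharpen the basis
        Q, _ = np.linalg.qr(A @ Q)
        Q, _ = np.linalg.qr(A.T @ Q)
    U, S, Wt = np.linalg.svd(A @ Q, full_matrices=False)
    Vt = Wt @ Q.T                     # lift right factors back to R^d
    return U[:, :t], S[:t], Vt[:t]
```

In Algorithm 2 this routine stands in for the exact SVD both on each embedded local matrix and on the stacked matrix at the coordinator.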
For $\ell_2$-error fitting problems, by combining the approximation guarantees of the randomized techniques with those of distributed PCA, we are able to prove:\n\nTheorem 6. Suppose Algorithm 2 takes $\epsilon \in (0, 1/2]$, $t_1 = t_2 = O(\max\{\frac{k}{\epsilon^2}, \log \frac{s}{\delta}\})$, $\ell = O(\frac{d^2}{\epsilon^2})$, and $q = O(\max\{\log \frac{d}{\epsilon}, \log \frac{sk}{\delta \epsilon}\})$ as input, and sets the failure probability of each local subspace embedding to $\delta' = \delta/2s$. Let $\tilde{P} = P V V^\top$. Then with probability at least $1 - \delta$, there exists a constant $c_0 \ge 0$, such that for any set of $k$ points $\mathcal{L}$,\n\n$(1 - \epsilon) d^2(P, \mathcal{L}) - \epsilon \|PX\|_F^2 \le d^2(\tilde{P}, \mathcal{L}) + c_0 \le (1 + \epsilon) d^2(P, \mathcal{L}) + \epsilon \|PX\|_F^2$\n\nwhere $X$ is an orthonormal matrix whose columns span $\mathcal{L}$. The total communication is $O(skd/\epsilon^2)$ and the total time is $O\left( \mathrm{nnz}(P) + s \left[ \frac{d^3 k}{\epsilon^4} + \frac{k^2 d^2}{\epsilon^6} \right] \log \frac{d}{\epsilon} \right)$.\n\nProof Sketch: It suffices to show that $\tilde{P}$ enjoys the close projection property as in Lemma 4, i.e., $\|PX - \tilde{P}X\|_F^2 \approx 0$ and $\|PX\|_F^2 - \|\tilde{P}X\|_F^2 \approx 0$ for any orthonormal matrix $X$ whose columns span a low dimensional subspace. Note that Algorithm 2 is just running Algorithm disPCA (with randomized SVD) on $TP$ where $T = \mathrm{diag}(H_1, H_2, \dots, H_s)$, so we first show that $T\tilde{P}$ enjoys this property. But now the exact SVD is replaced with randomized SVD, for which we need to use the spectral error bound to argue that the error introduced is small.
More precisely, for a matrix $A$ and its SVD truncation $\hat{A}$ computed by randomized SVD, it is guaranteed that the spectral norm of $A - \hat{A}$ is small; then $\|(A - \hat{A})X\|_F$ is small for any $X$ with small Frobenius norm, in particular for the orthonormal basis spanning a low dimensional subspace. This then suffices to guarantee that $T\tilde{P}$ enjoys the close projection property. Given this, it suffices to show that $\tilde{P}$ enjoys this property as $T\tilde{P}$ does, which follows from the definition of a subspace embedding.\n\nAlgorithm 2 Fast Distributed PCA for $\ell_2$-Error Fitting\nInput: $\{P_i\}_{i=1}^s$; parameters $t_1, t_2$ for Algorithm disPCA; $\ell, q$ for the randomized techniques.\n1: for each node $v_i \in V$ do\n2: Compute subspace embedding $P'_i = H_i P_i$.\n3: end for\n4: Run Algorithm disPCA on $\{P'_i\}_{i=1}^s$ to get $V$, where the SVD is randomized.\nOutput: $V$.\n\n4 Experiments\n\nOur focus is to show that the randomized techniques used in Algorithm 2 reduce the time taken significantly without compromising the quality of the solution. We perform experiments for three tasks: rank-r approximation, k-means clustering, and principal component regression (PCR).\n\nDatasets. We choose the following real world datasets from the UCI repository [1] for our experiments. For low rank approximation and k-means clustering, we choose two medium size datasets, NewsGroups (18774 × 61188) and MNIST (70000 × 784), and two large-scale Bag-of-Words datasets: NYTimes news articles (BOWnytimes) (300000 × 102660) and PubMed abstracts (BOWpubmed) (8200000 × 141043). We use r = 10 for rank-r approximation and k = 10 for k-means clustering. For PCR, we use MNIST and further choose YearPredictionMSD (515345 × 90), CTslices (53500 × 386), and a large dataset MNIST8m (800000 × 784).\n\nExperimental Methodology. The algorithms are evaluated on a star network.
The number of nodes is $s = 25$ for the medium-size datasets, and $s = 100$ for the larger ones. We distribute the data over the nodes using a weighted partition, where each point is assigned to a node with probability proportional to the node's weight, chosen from the power law with parameter $\alpha = 2$.\n\nFor each projection dimension, we first construct the projected data using distributed PCA. For low rank approximation, we report the ratio between the cost of the obtained solution and that of the solution computed by SVD on the global data. For k-means, we run the algorithm in [2] (with Lloyd's method as a subroutine) on the projected data to get a solution. Then we report the ratio between the cost of the above solution and that of a solution obtained by running Lloyd's method directly on the global data. For PCR, we perform regression on the projected data to get a solution. Then we report the ratio between the error of the above solution and that of a solution obtained by PCR directly on the global data. We stop the algorithm if it takes more than 24 hours. For each projection dimension and each algorithm with randomness, the average ratio over 5 runs is reported.\n\nResults. Figure 2 shows the results for low rank approximation. We observe that the error of the fast distributed PCA is comparable to that of the exact solution computed directly on the global data. This is also observed for distributed PCA with one or none of subspace embedding and randomized SVD. Furthermore, the error of the fast PCA is comparable to that of normal PCA, which means that the speedup techniques barely affect the accuracy of the solution. The second row shows the computational time, which suggests a significant decrease in the time taken to run the fast distributed PCA. For example, on NewsGroups, the time of the fast distributed PCA improves over that of normal distributed PCA by a factor between 10 and 100.
On the large dataset BOWpubmed, normal PCA takes too long to finish and no results are reported, while the speedup versions produce good results in reasonable time. The randomized techniques thus give a substantial performance improvement while keeping the solution quality almost the same.

Figures 3 and 4 show the results for k-means clustering and PCR, respectively. As for low rank approximation, we observe that the distributed solutions are almost as good as those computed directly on the global data, and that the speedup techniques barely affect the solution quality. We again observe a large decrease in running time due to the speedup techniques.

Acknowledgments This work was supported in part by NSF grants CCF-0953192, CCF-1451177, CCF-1101283, and CCF-1422910, ONR grant N00014-09-1-0751, and AFOSR grant FA9550-09-1-0538. David Woodruff would like to acknowledge the XDATA program of the Defense Advanced Research Projects Agency (DARPA), administered through Air Force Research Laboratory contract FA8750-12-C0323, for supporting this work.

Figure 2: Low rank approximation. Panels (a)-(d), first row: error (normalized by baseline) vs. projection dimension on NewsGroups, MNIST, BOWnytimes, and BOWpubmed; panels (e)-(h), second row: time vs. projection dimension on the same datasets.

Figure 3: k-means clustering. Panels (a)-(d), first row: cost (normalized by baseline) vs. projection dimension on NewsGroups, MNIST, BOWnytimes, and BOWpubmed; panels (e)-(h), second row: time vs. projection dimension on the same datasets.

Figure 4: PCR. Panels (a)-(d), first row: error (normalized by baseline) vs. projection dimension on MNIST, YearPredictionMSD, CTslices, and MNIST8m; panels (e)-(h), second row: time vs. projection dimension on the same datasets.

References
[1] K. Bache and M. Lichman. UCI machine learning repository, 2013.
[2] M.-F. Balcan, S. Ehrlich, and Y. Liang. Distributed k-means and k-median clustering on general communication topologies. In Advances in Neural Information Processing Systems, 2013.
[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003.
[4] C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas. Stochastic dimensionality reduction for k-means clustering. CoRR, abs/1110.2897, 2011.
[5] K. L. Clarkson and D. P. Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the 45th Annual ACM Symposium on Theory of Computing, 2013.
[6] M. Cohen, S. Elder, C. Musco, C. Musco, and M. Persu. Dimensionality reduction for k-means clustering and low rank approximation. arXiv preprint arXiv:1410.6801, 2014.
[7] J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, et al. Spanner: Google's globally-distributed database. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation, 2012.
[8] D. Feldman and M. Langberg. A unified framework for approximating and clustering data. In Proceedings of the Annual ACM Symposium on Theory of Computing, 2011.
[9] D. Feldman, M. Schmidt, and C. Sohler. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms, 2013.
[10] M. Ghashami and J. M. Phillips. Relative errors for deterministic low-rank matrix approximations. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms, 2014.
[11] N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 2011.
[12] R. Kannan, S. S. Vempala, and D. P. Woodruff. Principal component analysis and higher correlations for distributed data. In Proceedings of the Conference on Learning Theory, 2014.
[13] N. Karampatziakis and P. Mineiro. Combining structured and unstructured randomness in large scale PCA. CoRR, abs/1310.6304, 2013.
[14] Y.-A. Le Borgne, S. Raybaud, and G. Bontempi. Distributed principal component analysis for wireless sensor networks. Sensors, 2008.
[15] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, 2001.
[16] S. V. Macua, P. Belanovic, and S. Zazo. Consensus-based distributed principal component analysis in wireless sensor networks. In Proceedings of the IEEE International Workshop on Signal Processing Advances in Wireless Communications, 2010.
[17] X. Meng and M. W. Mahoney. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In Proceedings of the Annual ACM Symposium on Theory of Computing, 2013.
[18] S. Mitra, M. Agrawal, A. Yadav, N. Carlsson, D. Eager, and A. Mahanti. Characterizing web-based video sharing workloads. ACM Transactions on the Web, 2011.
[19] J. Nelson and H. L. Nguyên. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In Proceedings of the IEEE Annual Symposium on Foundations of Computer Science, 2013.
[20] C. Olston, J. Jiang, and J. Widom. Adaptive filters for continuous queries over distributed data streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2003.
[21] Y. Qu, G. Ostrouchov, N. Samatova, and A. Geist. Principal component analysis for dimension reduction in massive distributed data sets. In Proceedings of the IEEE International Conference on Data Mining, 2002.
[22] T. Sarlós. Improved approximation algorithms for large matrices via random projections. In Proceedings of the IEEE Symposium on Foundations of Computer Science, 2006.