{"title": "Communication-Optimal Distributed Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 3727, "page_last": 3735, "abstract": "Clustering large datasets is a fundamental problem with a number of applications in machine learning. Data is often collected on different sites and clustering needs to be performed in a distributed manner with low communication. We would like the quality of the clustering in the distributed setting to match that in the centralized setting for which all the data resides on a single site. In this work, we study both graph and geometric clustering problems in two distributed models: (1) a point-to-point model, and (2) a model with a broadcast channel. We give protocols in both models which we show are nearly optimal by proving almost matching communication lower bounds. Our work highlights the surprising power of a broadcast channel for clustering problems; roughly speaking, to cluster n points or n vertices in a graph distributed across s servers, for a worst-case partitioning the communication complexity in a point-to-point model is n*s, while in the broadcast model it is n + s. We implement our algorithms and demonstrate this phenomenon on real life datasets, showing that our algorithms are also very efficient in practice.", "full_text": "Communication-Optimal Distributed Clustering\u2217\n\nJiecao Chen\n\nIndiana University\n\nBloomington, IN 47401\njiecchen@indiana.edu\n\nHe Sun\n\nUniversity of Bristol\nBristol, BS8 1UB, UK\nh.sun@bristol.ac.uk\n\nDavid P. Woodruff\n\nIBM Research Almaden\n\nSan Jose, CA 95120\n\ndpwoodru@us.ibm.com\n\nAbstract\n\nQin Zhang\n\nIndiana University\n\nBloomington, IN 47401\nqzhangcs@indiana.edu\n\nClustering large datasets is a fundamental problem with a number of applications\nin machine learning. Data is often collected on different sites and clustering needs\nto be performed in a distributed manner with low communication. 
We would\nlike the quality of the clustering in the distributed setting to match that in the\ncentralized setting for which all the data resides on a single site. In this work, we\nstudy both graph and geometric clustering problems in two distributed models:\n(1) a point-to-point model, and (2) a model with a broadcast channel. We give\nprotocols in both models which we show are nearly optimal by proving almost\nmatching communication lower bounds. Our work highlights the surprising power\nof a broadcast channel for clustering problems; roughly speaking, to spectrally\ncluster n points or n vertices in a graph distributed across s servers, for a worst-case\npartitioning the communication complexity in a point-to-point model is n \u00b7 s, while\nin the broadcast model it is n + s. A similar phenomenon holds for the geometric\nsetting as well. We implement our algorithms and demonstrate this phenomenon\non real life datasets, showing that our algorithms are also very ef\ufb01cient in practice.\n\n1\n\nIntroduction\n\nClustering is a fundamental task in machine learning with widespread applications in data mining,\ncomputer vision, and social network analysis. Example applications of clustering include grouping\nsimilar webpages by search engines, \ufb01nding users with common interests in a social network, and\nidentifying different objects in a picture or video. For these applications, one can model the objects\nthat need to be clustered as points in Euclidean space Rd, where the similarities of two objects are\nrepresented by the Euclidean distance between the two points. Then the task of clustering is to choose\nk points as centers, so that the total distance between all input points to their corresponding closest\ncenter is minimized. 
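To make the center-based objective just described concrete, here is a toy sketch (ours, not the paper's implementation) of the cost of a fixed set of centers under the three common aggregation choices: sum of distances, sum of squared distances, and maximum distance.

```python
import math

def clustering_cost(points, centers, objective="means"):
    """Cost of assigning each point to its nearest center.

    objective: "median" sums distances, "means" sums squared
    distances, "center" takes the maximum distance.
    """
    dists = [min(math.dist(p, c) for c in centers) for p in points]
    if objective == "median":
        return sum(dists)
    if objective == "means":
        return sum(d * d for d in dists)
    if objective == "center":
        return max(dists)
    raise ValueError(objective)

# toy example with k = 2 centers
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
centers = [(0.5, 0.0), (10.5, 0.0)]
```

Each point is distance 0.5 from its nearest center here, so the three objectives evaluate to 2.0, 1.0 and 0.5 respectively.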
Depending on different distance objective functions, three typical problems\nhave been studied: k-means, k-median, and k-center.\nThe other popular approach for clustering is to model the input data as vertices of a graph, and the\nsimilarity between two objects is represented by the weight of the edge connecting the corresponding\nvertices. For this scenario, one is asked to partition the vertices into clusters so that the \u201chighly\nconnected\u201d vertices belong to the same cluster. A widely-used approach for graph clustering is\nspectral clustering, which embeds the vertices of a graph into the points in Rk through the bottom k\neigenvectors of the graph\u2019s Laplacian matrix, and applies k-means on the embedded points.\n\n\u2217Full version appears on arXiv, 2017, under the same title.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fBoth the spectral clustering and the geometric clustering algorithms mentioned above have been\nwidely used in practice, and have been the subject of extensive theoretical and experimental studies\nover the decades. However, these algorithms are designed for the centralized setting, and are not\napplicable in the setting of large-scale datasets that are maintained remotely by different sites. In\nparticular, collecting the information from all the remote sites and performing a centralized clustering\nalgorithm is infeasible due to high communication costs, and new distributed clustering algorithms\nwith low communication cost need to be developed.\nThere are several natural communication models, and we focus on two of them: (1) a point-to-point\nmodel, and (2) a model with a broadcast channel. In the former, sometimes referred to as the message-\npassing model, there is a communication channel between each pair of users. 
This may be impractical,\nand the so-called coordinator model can often be used in place; in the coordinator model there is a\ncentralized site called the coordinator, and all communication goes through the coordinator. This\naffects the total communication by a factor of two, since the coordinator can forward a message from\none server to another and therefore simulate a point-to-point protocol. There is also an additional\nadditive O(log s) bits per message, where s is the number of sites, since a server must specify to the\ncoordinator where to forward its message. In the model with a broadcast channel, sometimes referred\nto as the blackboard model, the coordinator has the power to send a single message which is received\nby all s sites at once. This can be viewed as a model for single-hop wireless networks.\nIn both models we study the total number of bits communicated among all sites. Although the\nblackboard model is at least as powerful as the message-passing model, it is often unclear how to\nexploit its power to obtain better bounds for speci\ufb01c problems. Also, for a number of problems the\ncommunication complexity is the same in both models, such as computing the sum of s length-n bit\nvectors modulo two, where each site holds one bit vector [18], or estimating large moments [20].\nStill, for other problems like set disjointness it can save a factor of s in the communication [5].\nOur contributions. We present algorithms for graph clustering: for any n-vertex graph whose\n\nedges are arbitrarily partitioned across s sites, our algorithms have communication cost (cid:101)O(ns)\nin the message passing model, and have communication cost (cid:101)O(n + s) in the blackboard model,\nwhere the (cid:101)O notation suppresses polylogarithmic factors. 
The algorithm in the message passing\n\nmodel has each site send a spectral sparsi\ufb01er of its local data to the coordinator, who then merges\nthem in order to obtain a spectral sparsi\ufb01er of the union of the datasets, which is suf\ufb01cient for\nsolving the graph clustering problem. Our algorithm in the blackboard model is technically more\ninvolved, as we show a particular recursive sampling procedure for building a spectral sparsi\ufb01er\ncan be ef\ufb01ciently implemented using a broadcast channel. It is unclear if other natural ways of\nbuilding spectral sparsi\ufb01ers can be implemented with low communication in the blackboard model.\nOur algorithms demonstrate the surprising power of the blackboard model for clustering problems.\nSince our algorithms compute sparsi\ufb01ers, they also have applications to solving symmetric diagonally\ndominant linear systems in a distributed model. Any such system can be converted into a system\ninvolving a Laplacian (see, e.g., [1]), from which a spectral sparsi\ufb01er serves as a good preconditioner.\nNext we show that \u2126(ns) bits of communication is necessary in the message passing model to even\nrecover a constant fraction of a cluster, and \u2126(n + s) bits of communication is necessary in the\nblackboard model. This shows the optimality of our algorithms up to poly-logarithmic factors.\nWe then study clustering problems in constant-dimensional Euclidean space. We show for any c > 1,\ncomputing a c-approximation for k-median, k-means, or k-center correctly with constant probability\nin the message passing model requires \u2126(sk) bits of communication. We then strengthen this lower\nbound, and show even for bicriteria clustering algorithms, which may output a constant factor more\nclusters and a constant factor approximation, our \u2126(sk) bit lower bound still holds. Our proofs are\nbased on communication and information complexity. 
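The merging step in the message-passing protocol sketched above rests on the additivity of Laplacians: if the edge set is split as E = E1 ∪ ··· ∪ Es across sites, then L_G = Σ_i L_{G_i}, so summing the sites' (1 + ε)-sparsifiers yields a (1 + ε)-sparsifier of the whole graph. A minimal numpy illustration of the additivity on a toy graph (no actual sparsification is performed; the graph and partition are ours):

```python
import numpy as np

def laplacian(n, edges):
    """Weighted Laplacian L = D - A for an n-vertex graph."""
    L = np.zeros((n, n))
    for u, v, w in edges:
        L[u, u] += w
        L[v, v] += w
        L[u, v] -= w
        L[v, u] -= w
    return L

n = 4
E1 = [(0, 1, 1.0), (1, 2, 2.0)]      # edges held by site 1
E2 = [(2, 3, 1.0), (0, 3, 0.5)]      # edges held by site 2

# Laplacian of the union equals the sum of the sites' Laplacians
L_union = laplacian(n, E1 + E2)
assert np.allclose(L_union, laplacian(n, E1) + laplacian(n, E2))
```

Because the quadratic form x⊤L_G x is a sum of per-site quadratic forms, per-site multiplicative guarantees carry over to the sum, which is exactly why the coordinator can simply take the union of the received sparsifier edges.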
Our results imply that existing algorithms [3] for k-median and k-means with Õ(sk) bits of communication, as well as the folklore parallel guessing algorithm for k-center with Õ(sk) bits of communication, are optimal up to poly-logarithmic factors. For the blackboard model, we present an algorithm for k-median and k-means that achieves an O(1)-approximation using Õ(s + k) bits of communication. This again separates the models.
We give empirical results which show that using spectral sparsifiers preserves the quality of spectral clustering surprisingly well in real-world datasets. For example, when we partition a graph with over 70 million edges (the Sculpture dataset) into 30 sites, only 6% of the input edges are communicated in the blackboard model and 8% are communicated in the message passing model, while the values of the normalized cut (the objective function of spectral clustering) given in those two models are at most 2% larger than the ones given by the centralized algorithm, and the visualized results are almost identical. This is strong evidence that spectral sparsifiers can be a powerful tool in practical, distributed computation. When the number of sites is large, the blackboard model incurs significantly less communication than the message passing model, e.g., in the Twomoons dataset when there are 90 sites, the message passing model communicates 9 times as many edges as communicated in the blackboard model, illustrating the strong separation between these models that our theory predicts.
Related work. There is a rich literature on spectral and geometric clustering algorithms from various aspects (see, e.g., [2, 16, 17, 19]). Balcan et al. [3, 4] and Feldman et al. [9] study distributed k-means ([3] also studies k-median). Very recently Guha et al. [10] studied distributed k-median/center/means with outliers. Cohen et al.
[7] study dimensionality reduction techniques for the input data matrices that can be used for distributed k-means. The main takeaway is that there is no previous work which develops protocols for spectral clustering in the common message passing and blackboard models, and lower bounds are lacking as well. For geometric clustering, while upper bounds exist (e.g., [3, 4, 9]), no provable lower bounds in either model existed, and our main contribution is to show that previous algorithms are optimal. We also develop a new protocol in the blackboard model.

2 Preliminaries

Let G = (V, E, w) be an undirected graph with n vertices, m edges, and weight function V × V → R≥0. The set of neighbors of a vertex v is represented by N(v), and its degree is dv = Σ_{u∼v} w(u, v). The maximum degree of G is defined to be Δ(G) = max_v {dv}. For any set S ⊆ V, let μ(S) ≜ Σ_{v∈S} dv. For any sets S, T ⊆ V, we define w(S, T) ≜ Σ_{u∈S, v∈T} w(u, v) to be the total weight of edges crossing S and T. For two sets X and Y, the symmetric difference of X and Y is defined as X△Y ≜ (X \ Y) ∪ (Y \ X).
For any matrix A ∈ R^{n×n}, let λ1(A) ≤ ··· ≤ λn(A) = λmax(A) be the eigenvalues of A. For any two matrices A, B ∈ R^{n×n}, we write A ⪯ B to represent that B − A is positive semi-definite (PSD). Notice that this condition implies that x⊤Ax ≤ x⊤Bx for any x ∈ R^n. Sometimes we also use a weaker notation (1 − ε)A ⪯_r B ⪯_r (1 + ε)A to indicate that (1 − ε)x⊤Ax ≤ x⊤Bx ≤ (1 + ε)x⊤Ax for all x in the row span of A.
Graph Laplacian.
The Laplacian matrix of G is an n × n matrix LG defined by LG = DG − AG, where AG is the adjacency matrix of G defined by AG(u, v) = w(u, v), and DG is the n × n diagonal matrix with DG(v, v) = dv for any v ∈ V [G]. Alternatively, we can write LG with respect to a signed edge-vertex incidence matrix: we assign every edge e = {u, v} an arbitrary orientation, and let BG(e, v) = 1 if v is e's head, BG(e, v) = −1 if v is e's tail, and BG(e, v) = 0 otherwise. We further define a diagonal matrix WG ∈ R^{m×m}, where WG(e, e) = we for any edge e ∈ E[G]. Then, we can write LG as LG = BG⊤ WG BG. The normalized Laplacian matrix of G is defined by 𝓛G ≜ DG^{−1/2} LG DG^{−1/2} = I − DG^{−1/2} AG DG^{−1/2}. We sometimes drop the subscript G when the underlying graph is clear from the context.
Spectral sparsification. For any undirected and weighted graph G = (V, E, w), we say a subgraph H of G with proper reweighting of the edges is a (1 + ε)-spectral sparsifier if

(1 − ε)LG ⪯ LH ⪯ (1 + ε)LG.    (1)

By definition, it is easy to show that, if we decompose the edge set of a graph G = (V, E) into E1, . . . , Eℓ for a constant ℓ and Hi is a spectral sparsifier of Gi = (V, Ei) for any 1 ≤ i ≤ ℓ, then the graph formed by the union of edge sets from the Hi is a spectral sparsifier of G. It is known that, for any undirected graph G of n vertices, there is a (1 + ε)-spectral sparsifier of G with O(n/ε²) edges, and it can be constructed in almost-linear time [13]. We will show that a spectral sparsifier preserves the cluster structure of a graph.
Models of computation. We will study distributed clustering in two models for distributed data: the message passing model and the blackboard model. The message passing model represents those distributed computation systems with point-to-point communication, and the blackboard model represents those where messages can be broadcast to all parties.
More precisely, in the message passing model there are s sites P1, . . . , Ps, and one coordinator. These sites can talk to the coordinator through a two-way private channel. In fact, this is referred to as the coordinator model in Section 1, where it is shown to be equivalent to the point-to-point model up to small factors. The input is initially distributed at the s sites. The computation is in terms of rounds: at the beginning of each round, the coordinator sends a message to some of the s sites, and then each of those sites that have been contacted by the coordinator sends a message back to the coordinator. At the end, the coordinator outputs the answer. In the alternative blackboard model, the coordinator is simply a blackboard where these s sites P1, . . . , Ps can share information; in other words, if one site sends a message to the coordinator/blackboard then all the other s − 1 sites can see this information without further communication. The order for the sites to speak is decided by the contents of the blackboard.
For both models we measure the communication cost as the total number of bits sent through the channels. The two models are now standard in multiparty communication complexity (see, e.g., [5, 18, 20]). They are similar to the congested clique model [14] studied in the distributed computing community; the main difference is that in our models we do not impose any bandwidth limitation on each channel but instead consider the total number of bits communicated.

3 Distributed graph clustering

In this section we study distributed graph clustering.
We assume that the vertex set of the input graph G = (V, E) can be partitioned into k clusters, where vertices in each cluster S are highly connected to each other, and there are fewer edges between S and V \ S. To formalize this notion, we define the conductance of a vertex set S by φG(S) ≜ w(S, V \ S)/μ(S). Generalizing the Cheeger constant, we define the k-way expansion constant of graph G by ρ(k) ≜ min_{partition A1, . . . , Ak} max_{1≤i≤k} φG(Ai). Notice that a graph G has k clusters if the value of ρ(k) is small.
Lee et al. [12] relate the value of ρ(k) to λk(LG) by the following higher-order Cheeger inequality:

λk(LG)/2 ≤ ρ(k) ≤ O(k²)·√λk(LG).

Based on this, a large gap between λk+1(LG) and ρ(k) implies (i) the existence of a k-way partition {Si}_{i=1}^k with smaller value of φG(Si) ≤ ρ(k), and (ii) any (k + 1)-way partition of G contains a subset with high conductance ρ(k + 1) ≥ λk+1(LG)/2. Hence, a large gap between λk+1(LG) and ρ(k) ensures that G has exactly k clusters.
In the following, we assume that Υ ≜ λk+1(LG)/ρ(k) = Ω(k³), as this assumption was used in the literature for studying graph clustering in the centralized setting [17].
Both algorithms presented in the section are based on the following spectral clustering algorithm: (i) compute the k eigenvectors f1, . . . , fk of LG associated with λ1(LG), . . . , λk(LG); (ii) embed every vertex v to a point in R^k through the embedding F(v) = (1/√dv) · (f1(v), . . . , fk(v)); (iii) run k-means on the embedded points {F(v)}_{v∈V}, and group the vertices of G into k clusters according to the output of k-means.

3.1 The message passing model

We assume the edges of the input graph G = (V, E) are arbitrarily allocated among s sites P1, ··· , Ps, and we use Ei to denote the edge set maintained by site Pi. Our proposed algorithm consists of two steps: (i) every Pi computes a linear-sized (1 + c)-spectral sparsifier Hi of Gi ≜ (V, Ei), for a small constant c ≤ 1/10, and sends the edge set of Hi, denoted by E′i, to the coordinator; (ii) the coordinator runs a spectral clustering algorithm on the union of received graphs H ≜ (V, ∪_{i=1}^s E′i).
The theorem below summarizes the performance of this algorithm, and shows the approximation guarantee of this algorithm is as good as the provable guarantee of spectral clustering known in the centralized setting [17].
Theorem 3.1. Let G = (V, E) be an n-vertex graph with Υ = Ω(k³), and suppose the edges of G are arbitrarily allocated among s sites. Assume S1, ··· , Sk is an optimal partition that achieves ρ(k). Then, the algorithm above computes a partition A1, . . . , Ak satisfying vol(Ai△Si) = O(k³ · Υ⁻¹ · vol(Si)) for any 1 ≤ i ≤ k. The total communication cost of this algorithm is Õ(ns) bits.
Our proposed algorithm is very easy to implement, and the next theorem shows that the communication cost of our algorithm is optimal up to a logarithmic factor.
Theorem 3.2. Let G be an undirected graph with n vertices, and suppose the edges of G are distributed among s sites. Then, any algorithm that correctly outputs a constant fraction of a cluster in G requires Ω(ns) bits of communication.
This holds even if each cluster has constant expansion.
As a remark, it is easy to see that this lower bound also holds for constructing spectral sparsifiers: for any n × n PSD matrix A whose entries are arbitrarily distributed among s sites, any distributed algorithm that constructs a (1 + Θ(1))-spectral sparsifier of A requires Ω(ns) bits of communication. This follows since such a spectral sparsifier can be used to solve the spectral clustering problem. Spectral sparsification has played an important role in designing fast algorithms in different areas, e.g., machine learning and numerical linear algebra. Hence our lower bound result for constructing spectral sparsifiers may have applications to studying other distributed learning algorithms.

3.2 The blackboard model

Next we present a graph clustering algorithm with Õ(n + s) bits of communication cost in the blackboard model. Our result is based on the observation that a spectral sparsifier preserves the structure of clusters, which was used for proving Theorem 3.1. So it suffices to design a distributed algorithm for constructing a spectral sparsifier in the blackboard model.
Our distributed algorithm is based on constructing a chain of coarse sparsifiers [15], which is described as follows: for any input PSD matrix K with λmax(K) ≤ λu and all the non-zero eigenvalues of K at least λℓ, we define d = ⌈log2(λu/λℓ)⌉ and construct a chain of d + 1 matrices

[K(0), K(1), . . . , K(d)],    (2)

where γ(i) = λu/2^i and K(i) = K + γ(i)I. Notice that in the chain above every K(i − 1) is obtained by adding weights to the diagonal entries of K(i), and K(i − 1) approximates K(i) as long as the weights added to the diagonal entries are small.
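As a quick sanity check, the chain (2) and the coarse approximation relations it satisfies (formalized below) can be verified numerically on a toy Laplacian. The `is_psd` helper and the path-graph example are ours, purely for illustration; this is not the distributed construction.

```python
import numpy as np

def is_psd(M, tol=1e-9):
    """Check M is (numerically) positive semi-definite."""
    return np.linalg.eigvalsh(M).min() >= -tol

# Laplacian of a toy path graph on 4 vertices
K = np.array([[ 1., -1.,  0.,  0.],
              [-1.,  2., -1.,  0.],
              [ 0., -1.,  2., -1.],
              [ 0.,  0., -1.,  1.]])

eigs = np.linalg.eigvalsh(K)
lam_u = eigs[-1]                    # upper bound on lambda_max(K)
lam_l = eigs[eigs > 1e-9].min()     # smallest non-zero eigenvalue
d = int(np.ceil(np.log2(lam_u / lam_l)))

gamma = [lam_u / 2**i for i in range(d + 1)]
chain = [K + g * np.eye(4) for g in gamma]   # K(0), ..., K(d)

# K(l) <= K(l-1) <= 2 K(l) for every level of the chain
for l in range(1, d + 1):
    assert is_psd(chain[l - 1] - chain[l])
    assert is_psd(2 * chain[l] - chain[l - 1])

# K(0) <= 2*gamma(0)*I <= 2 K(0): K(0) is dominated by its diagonal shift
assert is_psd(2 * gamma[0] * np.eye(4) - chain[0])
assert is_psd(2 * chain[0] - 2 * gamma[0] * np.eye(4))
```

The second pair of checks is what makes the recursion bottom out: K(0) is within a factor 2 of a multiple of the identity, so it can be approximated without knowing anything about the graph.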
We will construct this chain recursively, so that K(0) has heavy diagonal entries and can be approximated by a diagonal matrix. Moreover, since K is the Laplacian matrix of a graph G, it is easy to see that d = O(log n) as long as the edge weights of G are polynomially upper-bounded in n.
Lemma 3.3 ([15]). The chain (2) satisfies the following relations: (1) K ⪯_r K(d) ⪯_r 2K; (2) K(ℓ) ⪯ K(ℓ − 1) ⪯ 2K(ℓ) for all ℓ ∈ {1, . . . , d}; (3) K(0) ⪯ 2γ(0)I ⪯ 2K(0).
Based on Lemma 3.3, we will construct a chain of matrices

[K̃(0), K̃(1), . . . , K̃(d)]    (3)

in the blackboard model, such that every K̃(ℓ) is a spectral sparsifier of K(ℓ), and every K̃(ℓ + 1) can be constructed from K̃(ℓ). The basic idea behind our construction is to use the relations among different K(ℓ) shown in Lemma 3.3 and the fact that, for any K = B⊤B, sampling rows of B with respect to their leverage scores can be used to obtain a matrix approximating K.
Theorem 3.4. Let G be an undirected graph on n vertices, where the edges of G are allocated among s sites, and the edge weights are polynomially upper bounded in n. Then, a spectral sparsifier of G can be constructed with Õ(n + s) bits of communication in the blackboard model. That is, the chain (3) can be constructed with Õ(n + s) bits of communication in the blackboard model.
Proof. Let K = B⊤B be the Laplacian matrix of the underlying graph G, where B ∈ R^{m×n} is the edge-vertex incidence matrix of G. We will prove that every K̃(i + 1) can be constructed based on K̃(i) with Õ(n + s) bits of communication. This implies that K̃(d), a (1 + ε)-spectral sparsifier of K, can be constructed with Õ(n + s) bits of communication, as the length of the chain d = O(log n).
First of all, notice that λu ≤ 2n, and the value of n can be obtained with communication cost Õ(n + s) (different sites sequentially write the new IDs of the vertices on the blackboard). In the following we assume that λu is the upper bound of λmax that we actually obtained in the blackboard.
Base case of ℓ = 0: By definition, K(0) = K + λu · I, and (1/2) · K(0) ⪯ γ(0) · I ⪯ K(0), due to Statement 3 of Lemma 3.3. Let ⊕ denote appending the rows of one matrix to another. We define B_{γ(0)} = B ⊕ √γ(0) · I, and write K(0) = K + γ(0) · I = B_{γ(0)}⊤ B_{γ(0)}. By defining τi = bi⊤ (K(0))⁺ bi for each row of B_{γ(0)}, we have τi ≤ bi⊤ (γ(0) · I)⁺ bi ≤ 2 · τi. Let τ̃i = bi⊤ (γ(0) · I)⁺ bi be the leverage score of bi approximated using γ(0) · I, and let τ̃ be the vector of approximate leverage scores, with the leverage scores of the n rows corresponding to √γ(0) · I rounded up to 1. Then, with high probability sampling O(ε⁻²n log n) rows of B will give a matrix K̃(0) such that (1 − ε)K(0) ⪯ K̃(0) ⪯ (1 + ε)K(0). Notice that, as every row of B corresponds to an edge of G, the approximate leverage scores τ̃i for different edges can be computed locally by the different sites maintaining the edges, and the sites only need to send the information of the sampled edges to the blackboard, hence the communication cost is Õ(n + s) bits.
Induction step: We assume that (1 − ε)K(ℓ) ⪯_r K̃(ℓ) ⪯_r (1 + ε)K(ℓ), and the blackboard maintains the matrix K̃(ℓ). This implies that (1 − ε)/(1 + ε) · K(ℓ) ⪯_r 1/(1 + ε) · K̃(ℓ) ⪯_r K(ℓ). Combining this with Statement 2 of Lemma 3.3, we have that

(1 − ε)/(2(1 + ε)) · K(ℓ + 1) ⪯_r 1/(2(1 + ε)) · K̃(ℓ) ⪯ K(ℓ + 1).

We apply the same sampling procedure as in the base case, and obtain a matrix K̃(ℓ + 1) such that (1 − ε)K(ℓ + 1) ⪯_r K̃(ℓ + 1) ⪯_r (1 + ε)K(ℓ + 1). Notice that, since K̃(ℓ) is written on the blackboard, the probabilities used for sampling individual edges can be computed locally by the different sites, and in each round only the sampled edges will be sent to the blackboard in order to obtain K̃(ℓ + 1). Hence, the total communication cost in each iteration is Õ(n + s) bits. Combining this with the fact that the chain length d = O(log n) proves the theorem.
Combining Theorem 3.4 and the fact that a spectral sparsifier preserves the structure of clusters, we obtain a distributed algorithm in the blackboard model with total communication cost Õ(n + s) bits, and the performance of our algorithm is the same as in the statement of Theorem 3.1.
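The base-case sampling step above can be sketched in a few lines. This is a simplified single-machine sketch (ours, not the paper's distributed implementation), assuming unit edge weights; the approximate leverage scores are computed against γ·I, and sampled rows are reweighted by 1/p_i so the estimate is unbiased. With the toy parameters below every sampling probability caps at 1, so the output coincides with K(0) exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sparsifier(B, gamma, eps=0.5, C=4.0):
    """One round of leverage-score sampling for K(0) = B^T B + gamma*I.

    Approximate leverage scores are taken against gamma*I, i.e.
    tau_i ~ ||b_i||^2 / gamma; the n identity rows are kept whole.
    """
    m, n = B.shape
    tau = (B ** 2).sum(axis=1) / gamma           # approx. leverage scores
    p = np.minimum(1.0, C * np.log(n) / eps**2 * tau)
    keep = rng.random(m) <= p
    B_s = B[keep] / np.sqrt(p[keep])[:, None]    # reweight sampled rows
    return B_s.T @ B_s + gamma * np.eye(n)

# toy incidence matrix of a triangle graph (unit weights)
B = np.array([[ 1., -1.,  0.],
              [ 0.,  1., -1.],
              [-1.,  0.,  1.]])
K0 = B.T @ B + 1.0 * np.eye(3)
K0_tilde = sample_sparsifier(B, gamma=1.0)
```

On realistic inputs the probabilities are well below 1 and the number of kept rows is O(ε⁻²n log n); only those rows (edges) would need to be written to the blackboard.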
Notice that Ω(n + s) bits of communication are needed for graph clustering in the blackboard model, since the output of a clustering algorithm contains Ω(n) bits of information and each site needs to communicate at least one bit. Hence the communication cost of our proposed algorithm is optimal up to a poly-logarithmic factor.

4 Distributed geometric clustering

We now consider geometric clustering, including k-median, k-means and k-center. Let P be a set of points of size n in a metric space with distance function d(·,·), and let k ≤ n be an integer. In the k-center problem we want to find a set C (|C| = k) such that max_{p∈P} d(p, C) is minimized, where d(p, C) = min_{c∈C} d(p, c). In k-median and k-means we replace the objective function max_{p∈P} d(p, C) with Σ_{p∈P} d(p, C) and Σ_{p∈P} (d(p, C))², respectively.

4.1 The message passing model

As mentioned, for constant-dimensional Euclidean space and a constant c > 1, there are algorithms that c-approximate k-median and k-means using Õ(sk) bits of communication [3]. For k-center, the folklore parallel guessing algorithms (see, e.g., [8]) achieve a 2.01-approximation using Õ(sk) bits of communication.
The following theorem states that the above upper bounds are tight up to logarithmic factors. Due to space constraints we defer the proof to the full version of this paper. The proof uses tools from multiparty communication complexity. We in fact can prove a stronger statement: any algorithm that can differentiate whether we have k points or k + 1 points in total in the message passing model needs Ω(sk) bits of communication.
Theorem 4.1. For any c > 1, computing a c-approximation for k-median, k-means or k-center correctly with probability 0.99 in the message passing model needs Ω(sk) bits of communication.
A number of works on clustering consider bicriteria solutions (e.g., [11, 6]).
An algorithm is a (c1, c2)-approximation (c1, c2 > 1) if, when the optimal solution costs W using k centers, the output of the algorithm costs at most c1W using at most c2k centers. We can show that for k-median and k-means, the Ω(sk) lower bound holds even for algorithms with bicriteria approximations. The proof of the following theorem can be found in the full version of this paper.
Theorem 4.2. For any c ∈ [1, 1.01], computing a (7.1 − 6c, c)-bicriteria-approximation for k-median or k-means correctly with probability 0.99 in the message passing model needs Ω(sk) bits of communication.

4.2 The blackboard model

We can show that there is an algorithm that achieves an O(1)-approximation using Õ(s + k) bits of communication for k-median and k-means. Due to space constraints we defer the description of the algorithm to the full version of this paper. For k-center, it is straightforward to implement the parallel guessing algorithm in the blackboard model using Õ(s + k) bits of communication.
Theorem 4.3. There are algorithms that compute O(1)-approximations for k-median, k-means and k-center correctly with probability 0.9 in the blackboard model using Õ(s + k) bits of communication.

5 Experiments

In this section we present experimental results for spectral graph clustering in the message passing and blackboard models. We will compare the following three algorithms. (1) Baseline: each site sends all the data to the coordinator directly; (2) MsgPassing: our algorithm in the message passing model (Section 3.1); (3) Blackboard: our algorithm in the blackboard model (Section 3.2).
Besides giving the visualized results of these algorithms on various datasets, we also measure the qualities of the results via the normalized cut, defined as

ncut(A1, . . . , Ak) = (1/2) Σ_{i∈[k]} w(Ai, V \ Ai)/vol(Ai),

which is a standard objective function to be minimized for spectral clustering algorithms.
We implemented the algorithms using multiple languages, including Matlab, Python and C++. Our experiments were conducted on an IBM NeXtScale nx360 M4 server, which is equipped with 2 Intel Xeon E5-2652 v2 8-core processors, 32GB RAM and 250GB local storage.
Datasets. We test the algorithms on the following real and synthetic datasets.

• Twomoons: this dataset contains n = 14,000 coordinates in R². We consider each point to be a vertex. For any two vertices u, v, we add an edge with weight w(u, v) = exp{−‖u − v‖²₂/σ²} with σ = 0.1 when one vertex is among the 7000-nearest points of the other. This construction results in a graph with about 110,000,000 edges.
• Gauss: this dataset contains n = 10,000 points in R². There are 4 clusters in this dataset, each generated using a Gaussian distribution. We construct a complete graph as the similarity graph. For any two vertices u, v, we define the weight w(u, v) = exp{−‖u − v‖²₂/σ²} with σ = 1. The resulting graph has about 100,000,000 edges.
• Sculpture: a photo of The Greek Slave. We use an 80 × 150 version of this photo where each pixel is viewed as a vertex. To construct a similarity graph, we map each pixel to a point in R⁵, i.e., (x, y, r, g, b), where the latter three coordinates are the RGB values. For any two vertices u, v, we put an edge between u, v with weight w(u, v) = exp{−‖u − v‖²₂/σ²} with σ = 0.5 if one of u, v is among the 5000-nearest points of the other. This results in a graph with about 70,000,000 edges.

In the distributed model edges are randomly partitioned across s sites.
Results on clustering quality. We visualize the clustered results for the Twomoons dataset in Figure 1.
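The normalized-cut objective defined above can be computed directly from a weighted adjacency matrix. A small self-contained sketch (a toy graph of our own, not one of the datasets), following the definition with the 1/2 factor:

```python
import numpy as np

def ncut(W, labels):
    """Normalized cut of a partition, given a symmetric weighted
    adjacency matrix W and one cluster label per vertex."""
    total = 0.0
    for c in np.unique(labels):
        in_c = labels == c
        cut = W[in_c][:, ~in_c].sum()   # w(A_i, V \ A_i)
        vol = W[in_c].sum()             # vol(A_i) = sum of degrees in A_i
        total += cut / vol
    return 0.5 * total

# two unit-weight triangles joined by a single light edge
W = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    W[u, v] = W[v, u] = 1.0
W[2, 3] = W[3, 2] = 0.1
labels = np.array([0, 0, 0, 1, 1, 1])
# ncut(W, labels) = 0.5 * (0.1/6.1 + 0.1/6.1), i.e. about 0.0164
```

Splitting along the light edge gives a small ncut, which is exactly the behavior the objective rewards.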
It can be seen that Baseline, MsgPassing and Blackboard give results of very similar quality. For simplicity, we only present the visualization for s = 15; similar results were observed when we varied the value of s.

We also compare the normalized cut (ncut) values of the clustering results of the different algorithms. The results are presented in Figure 2. On all datasets, the ncut values of the different algorithms are very close. The ncut value of MsgPassing decreases slightly as we increase the value of s, while the ncut value of Blackboard is independent of s.

Figure 1: Visualization of the results on Twomoons ((a) Baseline, (b) MsgPassing, (c) Blackboard). In the message passing model each site samples 5n edges; in the blackboard model all sites jointly sample 10n edges and the chain has length 18.

Figure 2: Comparisons of normalized cuts on (a) Twomoons, (b) Gauss, (c) Sculpture. In the message passing model, each site samples 5n edges; in each round of the algorithm in the blackboard model, all sites jointly sample 10n edges (in Twomoons and Gauss) or 20n edges (in Sculpture) and the chain has length 18.

Results on communication costs. We compare the communication costs of the different algorithms in Figure 3. We observe that while achieving clustering quality similar to that of Baseline, both MsgPassing and Blackboard are significantly more communication-efficient (by one or two orders of magnitude in our experiments). We also notice that the value of s does not affect the communication cost of Blackboard, while the communication cost of MsgPassing grows almost linearly with s; when s is large, MsgPassing uses significantly more communication than Blackboard.

Figure 3: Comparisons of communication costs ((a, d) Twomoons, (b, e) Gauss, (c, f) Sculpture).
In the message passing model, each site samples 5n edges; in each round of the algorithm in the blackboard model, all sites jointly sample 10n (in Twomoons and Gauss) or 20n (in Sculpture) edges and the chain has length 18.

Acknowledgement: Jiecao Chen and Qin Zhang are supported in part by NSF CCF-1525024 and IIS-1633215. D.W. acknowledges support from the XDATA program of the Defense Advanced Research Projects Agency (DARPA), under Air Force Research Laboratory contract FA8750-12-C-0323.