{"title": "A Practical Algorithm for Distributed Clustering and Outlier Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 2248, "page_last": 2256, "abstract": "We study the classic k-means/median clustering, which are fundamental problems in unsupervised learning, in the setting where data are partitioned across multiple sites, and where we are allowed to discard a small portion of the data by labeling them as outliers. We propose a simple approach based on constructing small summary for the original dataset. The proposed method is time and communication efficient, has good approximation guarantees, and can identify the global outliers effectively. \nTo the best of our knowledge, this is the first practical algorithm with theoretical guarantees for distributed clustering with outliers. Our experiments on both real and synthetic data have demonstrated the clear superiority of our algorithm against all the baseline algorithms in almost all metrics.", "full_text": "A Practical Algorithm for Distributed Clustering and\n\nOutlier Detection\u2217\n\nJiecao Chen\n\nIndiana University Bloomington\n\nBloomington, IN\n\njiecchen@indiana.edu\n\nErfan Sadeqi Azer\n\nIndiana University Bloomington\n\nBloomington, IN\n\nesadeqia@indiana.edu\n\nQin Zhang\n\nIndiana University Bloomington\n\nBloomington, IN\n\nqzhangcs@indiana.edu\n\nAbstract\n\nWe study the classic k-means/median clustering, which are fundamental problems\nin unsupervised learning, in the setting where data are partitioned across multiple\nsites, and where we are allowed to discard a small portion of the data by labeling\nthem as outliers. We propose a simple approach based on constructing small\nsummary for the original dataset. The proposed method is time and communication\nef\ufb01cient, has good approximation guarantees, and can identify the global outliers\neffectively. 
To the best of our knowledge, this is the first practical algorithm with theoretical guarantees for distributed clustering with outliers. Our experiments on both real and synthetic data have demonstrated the clear superiority of our algorithm against all the baseline algorithms in almost all metrics.

1 Introduction

The rise of big data has brought the design of distributed learning algorithms to the forefront. For example, in many practical settings large quantities of data are collected and stored at different locations, while we want to learn properties of the union of the data. For many machine learning tasks, in order to speed up the computation we need to partition the data across a number of machines for a joint computation. In a different dimension, since real-world data often contain background noise or extreme values, it is desirable to perform the computation on the "clean data" obtained by discarding a small portion of the input. Sometimes these outliers are interesting by themselves; for example, in the study of statistical data of a population, outliers may represent those people who deserve special attention. In this paper we study clustering with outliers, a fundamental problem in unsupervised learning, in the distributed model where data are partitioned across multiple sites, which need to communicate to arrive at a consensus on the cluster centers and the labeling of outliers.

For many clustering applications it is common to model data objects as points in R^d, and the similarity between two objects is represented as the Euclidean distance of the two corresponding points.
In this paper we assume for simplicity that each point can be sent using one unit of communication. Note that when d is large, we can apply standard dimension reduction tools (for example, the Johnson-Lindenstrauss lemma) before running our algorithms.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

*A full version of this paper is available at https://arxiv.org/abs/1805.09495

We focus on the two well-studied objective functions (k, t)-means and (k, t)-median, defined in Definition 1. It is worthwhile to mention that our algorithms also work for other metrics as long as the distance oracles are given.

Definition 1 ((k, t)-means/median) Let X be a set of points, and k, t be two parameters. For the (k, t)-median problem we aim to compute a set of centers C ⊆ R^d of size at most k and a set of outliers O ⊆ X of size at most t so that the objective function ∑_{p∈X\O} d(p, C) is minimized. For (k, t)-means we simply replace the objective function with ∑_{p∈X\O} d²(p, C).

Computation Model. We study the clustering problems in the coordinator model, a well-adopted model for distributed learning Balcan et al. (2013); Chen et al. (2016); Guha et al. (2017); Diakonikolas et al. (2017). In this model we have s sites and a central coordinator; each site can communicate with the coordinator. The input data points are partitioned among the s sites, which, together with the coordinator, want to jointly compute some function on the global data. The data partition can be either adversarial or random.
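To make Definition 1 concrete: once a center set C is fixed, the best choice of outliers O is simply the t points farthest from C, so the objective can be evaluated directly. A minimal sketch (the function and argument names are ours, not from the paper):

```python
import math

def kt_objective(points, centers, t, squared=False):
    """Evaluate the (k, t)-median/means objective of Definition 1 for a
    FIXED set of centers: drop the t points farthest from the centers
    (the optimal outlier choice once C is fixed) and sum the (squared)
    distances of the remaining points to their nearest centers."""
    # distance of every point to its nearest center, in increasing order
    dists = sorted(min(math.dist(p, c) for c in centers) for p in points)
    kept = dists[:len(dists) - t] if t > 0 else dists  # discard t farthest as outliers
    return sum(x * x for x in kept) if squared else sum(kept)
```

Minimizing this quantity over all center sets C of size at most k gives the (k, t)-median (or, with squared=True, the (k, t)-means) optimum.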
The former models the case where the data points are independently collected at different locations, while the latter is common in the scenario where the system uses a dispatcher to randomly partition the incoming data stream into multiple workers/sites for parallel processing (and then aggregates the information at a central server/coordinator).

In this paper we focus on the one-round communication model (also called the simultaneous communication model), where each site sends a sketch of its local dataset to the coordinator, and then the coordinator merges these sketches and extracts the answer. This model is arguably the most practical one, since multi-round communication incurs a large system overhead.

Our goals for computing (k, t)-means/median in the coordinator model are the following: (1) to minimize the clustering objective functions; (2) to accurately identify the set of global outliers; and (3) to minimize the computation time and the communication cost of the system. We will elaborate on how to quantify the quality of outlier detection in Section 4.

Our Contributions. A natural way of performing distributed clustering in the simultaneous communication model is to use the two-level clustering framework (see e.g., Guha et al. (2003, 2017)). In this framework each site performs the first level clustering on its local dataset X, getting a subset X′ ⊆ X with each point being assigned a weight; we call X′ the summary of X. The site then sends X′ to the coordinator, and the coordinator performs the second level clustering on the union of the s summaries. We note that the second level clustering is required to output at most k centers and t outliers, while the summary returned by the first level clustering can possibly have more than (k + t) weighted points.
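The two-level framework just described can be sketched as a small skeleton, with the first- and second-level algorithms passed in as placeholders (all names here are ours):

```python
def two_level_clustering(site_datasets, build_summary, cluster_weighted):
    """Skeleton of the two-level framework. Each site builds a weighted
    summary of its local data (first level); the coordinator merges the
    summaries and clusters the union (second level). The two callables
    stand in for, e.g., a summary construction such as Algorithm 1 and a
    centralized (k, t)-clustering such as k-means--."""
    summaries = [build_summary(X) for X in site_datasets]  # at the s sites
    merged = [wp for S in summaries for wp in S]           # one round of communication
    return cluster_weighted(merged)                        # at the coordinator
```

The communication cost of this skeleton is exactly the total size of the s summaries, which is why the summary size matters below.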
The size of the summary contributes to the communication cost as well as the running time of the second level clustering.

The main contribution of this paper is to propose a simple and practical summary construction at the sites with the following properties.

1. It is extremely fast: it runs in time O(max{k, log n} · n), where n is the size of the dataset.
2. The summary has small size: O(k log n + t) for an adversarial data partition and O(k log n + t/s) for a random data partition.
3. When coupled with a second level (centralized) clustering algorithm that γ-approximates (k, t)-means/median, we obtain an O(γ)-approximation algorithm for distributed (k, t)-means/median.2
4. It can be used to effectively identify the global outliers.

We emphasize that both the first and the second properties are essential to make the distributed clustering algorithm scalable on large datasets. Our extensive set of experiments has demonstrated the clear superiority of our algorithm against all the baseline algorithms in almost all metrics. To the best of our knowledge, this is the first practical algorithm with theoretical guarantees for distributed clustering with outliers.

Related Work. Clustering is a fundamental problem in computer science and has been studied for more than fifty years. A comprehensive review of the work on k-means/median is beyond the scope of this paper, and we will focus on the literature for centralized/distributed k-means/median clustering with outliers and distributed k-means/median clustering.

2 We say an algorithm γ-approximates a problem if it outputs a solution whose cost is at most γ times that of the optimal solution.

In the centralized setting, several O(1)-approximation or (O(1), O(1))-approximation3 algorithms have been proposed Charikar et al. (2001); Chen (2009). These algorithms make use of linear programming and need time at least Ω(n³), which is prohibitive on large datasets.
Feldman and Schulman (2012) studied (k, t)-median via coresets, but the running time of their algorithm includes a term O(n(k + t)^(k+t)), which is not practical.

Chawla and Gionis (2013) proposed for (k, t)-means an algorithm called k-means--, an iterative procedure that can be viewed as a generalization of Lloyd's algorithm Lloyd (1982). Like Lloyd's algorithm, the centers that k-means-- outputs are not necessarily among the original input points; we thus cannot use it for the summary construction in the first level clustering at the sites, because some of the points in the summary will be the outliers we report at the end. However, we have found that k-means-- is a good choice for the second level clustering: it outputs exactly k centers and t outliers, and its clustering quality is decent on the datasets we have tested, though it does not have any worst-case theoretical guarantees.

Recently Gupta et al. (2017) proposed a local-search based (O(1), O(k log n))-approximation algorithm for (k, t)-means. The running time of their algorithm is Õ(k²n²),4 which is again not quite scalable. The authors mentioned that one can use the k-means++ algorithm Arthur and Vassilvitskii (2007) as a seeding step to improve the running time to Õ(k²(k + t)² + nt). We note that, first, this running time is still worse than ours. And second, since in the first level clustering we only need a summary (all we need is a set of weighted points that can be fed into the second level clustering at the coordinator), we can in fact directly use k-means++ with a budget of O(k log n + t) centers for constructing a summary. We will use this approach as a baseline algorithm in our experimental studies.

In the past few years there has been a growing interest in studying k-means/median clustering in distributed models Ene et al. (2011); Bahmani et al. (2012); Balcan et al. (2013); Liang et al. (2014); Cohen et al. (2015); Chen et al. (2016).
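For reference, the k-means++ seeding baseline discussed above (D²-sampling run with an enlarged budget of O(k log n + t) centers, so that probable outliers get picked up as singleton centers) can be sketched as follows. This is a simplified, hypothetical rendering, not the authors' code; `budget` stands for the O(k log n + t) allowance:

```python
import math
import random

def kmeanspp_summary(points, budget, seed=0):
    """D^2-sampling seeding (Arthur & Vassilvitskii 2007) with an enlarged
    budget of centers; returns the chosen centers (points given as tuples).
    Each new center is drawn with probability proportional to its squared
    distance to the nearest already-chosen center."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    # squared distance of each point to its nearest chosen center
    d2 = [math.dist(p, centers[0]) ** 2 for p in points]
    while len(centers) < min(budget, len(points)):
        total = sum(d2)
        if total == 0:          # every point coincides with some center
            break
        r = rng.uniform(0, total)
        acc, idx = 0.0, 0
        for i, w in enumerate(d2):
            acc += w
            if acc >= r and w > 0:
                idx = i
                break
        centers.append(points[idx])
        d2 = [min(d2[i], math.dist(p, points[idx]) ** 2) for i, p in enumerate(points)]
    return centers
```

In the baseline, each sampled center would additionally carry the number of local points nearest to it as its weight before being sent to the coordinator.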
In the case of allowing outliers, Guha et al. (2017) gave the first theoretical study of distributed (k, t)-means/median. However, their algorithms need Θ(n²) running time at the sites and are thus again not quite practical on large-scale datasets. In a concurrent work, Li and Guo (2018) further reduced the value of the objective function, but their proposed method does not output the outliers.

We note that the k-means‖ algorithm proposed by Bahmani et al. (2012) can be extended (again by increasing the budget of centers from k to O(k log n + t)) and used as a baseline algorithm for comparison. The main issue with k-means‖ is that it needs O(log n) rounds of communication, which holds back its overall performance.

2 The Summary Construction

In this section we present our summary construction for (k, t)-median/means in the centralized model. In Section 3 we will show how to use this summary construction to solve the problems in the distributed model.
Table 1 lists the notations we are going to use.

X : input dataset
n : n = |X|, size of the dataset
k : number of centers
t : number of outliers
κ : κ = max{k, log n}
O* : outliers chosen by OPT
σ : clustering mapping σ : X → X
d(y, X) : d(y, X) = min_{x∈X} d(y, x)
φ_X(σ) : φ_X(σ) = ∑_{x∈X} d(x, σ(x))
φ(X, Y) : φ(X, Y) = ∑_{y∈Y} d(y, X)
OPT^med_{k,t}(X) : min_{O⊆X, |O|≤t, |C|≤k} ∑_{p∈X\O} d(p, C)
OPT^mea_{k,t}(X) : min_{O⊆X, |O|≤t, |C|≤k} ∑_{p∈X\O} d²(p, C)

Table 1: List of Notations

3 We say a solution is an (a, b)-approximation if the cost of the solution is a · C while excluding b · t points, where C is the cost of the optimal solution excluding t points.
4 Õ(·) hides some logarithmic factors.

Algorithm 1: Summary-Outliers(X, k, t)
Input: dataset X, number of centers k, number of outliers t
Output: a weighted dataset Q as a summary of X
1 i ← 0, X_i ← X, Q ← ∅
2 fix a β such that 0.25 ≤ β < 0.5
3 κ ← max{log n, k}
4 let σ : X → X be a mapping to be constructed, and α be a constant to be determined in the analysis
5 while |X_i| > 8t do
6     construct a set S_i of size ακ by random sampling (with replacement) from X_i
7     for each point in X_i, compute the distance to its nearest point in S_i
8     let ρ_i be the smallest radius s.t. |B(S_i, X_i, ρ_i)| ≥ β|X_i|; let C_i ← B(S_i, X_i, ρ_i)
9     for each x ∈ C_i, choose the point y ∈ S_i that minimizes d(x, y) and assign σ(x) ← y
10    X_{i+1} ← X_i \ C_i
11    i ← i + 1
12 r ← i
13 for each x ∈ X_r, assign σ(x) ← x
14 for each x ∈ X_r ∪ (∪_{i=0}^{r−1} S_i), assign weight w_x ← |σ^{−1}(x)| and add (x, w_x) to Q
15 return Q

2.1 The Algorithm

Our algorithm is presented in Algorithm 1. It works for both the k-means and k-median objective functions. We note that Algorithm 1 is partly inspired by the algorithm for clustering without outliers proposed in Mettu and Plaxton (2002). But since we now have to handle outliers, the design and analysis of our algorithm require new ideas.

For a set S and a scalar value ρ, define B(S, X, ρ) = {x ∈ X | d(x, S) ≤ ρ}. Algorithm 1 works in rounds indexed by i. Let X_0 = X be the initial set of input points. The idea is to sample a set of points S_i of size αk for a constant α (assuming k ≥ log n) from X_i, and grow a ball of radius ρ_i centered at each s ∈ S_i. Let C_i be the set of points in the union of these balls. The radius ρ_i is chosen such that at least a constant fraction of the points of X_i are in C_i.

Define X_{i+1} = X_i \ C_i. In the i-th round, we add the αk points in S_i to the set of centers, and assign the points in C_i to their nearest centers in S_i. We then recurse on the rest of the points X_{i+1}, and stop once the number of points left unclustered becomes at most 8t. Let r be the final value of i. Define the weight of each point x in ∪_{i=0}^{r−1} S_i to be the number of points in X that are assigned to x, and the weight of each point in X_r to be 1. Our summary Q consists of the points in X_r ∪ (∪_{i=0}^{r−1} S_i) together with their weights.

2.2 The Analysis

We now analyze the performance of Algorithm 1.
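Before the analysis, the construction above can be rendered in code. This is a simplified sketch, not the authors' implementation: the smallest radius covering a β-fraction is realized here as the β-quantile of the nearest-sample distances, and the defaults α = 2 and β = 0.45 are simply an in-range choice:

```python
import math
import random
from collections import Counter

def summary_outliers(X, k, t, alpha=2, beta=0.45, seed=0):
    """Sketch of Algorithm 1 (Summary-Outliers). Each round samples about
    alpha*kappa points from the remaining data, covers the beta-fraction
    closest to the samples, assigns covered points to their nearest sample,
    and recurses on the rest; the final <= 8t uncovered points enter the
    summary with weight 1. Returns a list of (point, weight) pairs."""
    rng = random.Random(seed)
    n = len(X)
    kappa = max(k, max(1, int(math.log2(n))))
    sigma = {}                        # covered point index -> its center
    remaining = list(range(n))
    centers = []
    while len(remaining) > 8 * t:
        S = [X[i] for i in rng.choices(remaining, k=alpha * kappa)]
        centers.extend(S)
        # distance of each remaining point to its nearest sampled point
        dist = {i: min(math.dist(X[i], s) for s in S) for i in remaining}
        order = sorted(remaining, key=lambda i: dist[i])
        m = max(1, int(beta * len(remaining)))   # cover a beta-fraction
        covered, remaining = order[:m], order[m:]
        for i in covered:
            sigma[i] = min(S, key=lambda s: math.dist(X[i], s))
    weights = Counter(sigma.values())  # weight = number of points mapped to a center
    Q = [(c, weights[c]) for c in dict.fromkeys(centers) if weights[c] > 0]
    Q += [(X[i], 1) for i in remaining]  # leftovers: candidate outliers, weight 1
    return Q
```

Note that the total weight of Q always equals |X|, since every point is either assigned to a sampled center or kept as a weight-1 leftover.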
The analysis will be conducted for the (k, t)-median objective function, while the results also hold for (k, t)-means; we will discuss this briefly at the end of this section. Due to space constraints, all missing proofs in this section can be found in the supplementary material.

We start by introducing the following concept. Note that the summary constructed by Algorithm 1 is fully determined by the mapping function σ (σ is also constructed in Algorithm 1).

Definition 2 (Information Loss) For a summary Q constructed by Algorithm 1, we define the information loss of Q as

loss(Q) = φ_X(σ).

That is, the sum of the distances of moving each point x ∈ X to the corresponding center σ(x) (we can view each outlier as a center itself).

We will prove the following theorem, which says that the information loss of the summary Q constructed by Algorithm 1 is bounded by the optimal (k, t)-median clustering cost on X.

Theorem 1 Algorithm 1 outputs a summary Q such that with probability (1 − 1/n²) we have loss(Q) = O(OPT^med_{k,t}(X)). The running time of Algorithm 1 is bounded by O(max{log n, k} · n), and the size of the outputted summary Q is bounded by O(k log n + t).

The proof of this theorem relies on building an upper bound on φ_X(σ) and a lower bound on OPT^med_{k,t}(X). Namely, φ_X(σ) = O(∑_i ρ_i|D_i|) and OPT^med_{k,t}(X) = Ω(∑_i ρ_i|D_i|), where D_i = C_i \ O*, where C_i is constructed in the i-th round of Algorithm 1 and O* is the set of outliers returned by the optimal algorithm. See the detailed proof in the supplementary material.

As a consequence of Theorem 1, we obtain by triangle inequality arguments the following corollary, which directly characterizes the quality of the summary for the task of (k, t)-median.
We include a proof in the supplementary material for completeness.

Corollary 1 If we run a γ-approximation algorithm for (k, t)-median on Q, we can obtain a set of centers C and a set of outliers O such that φ(X\O, C) = O(γ · OPT^med_{k,t}(X)) with probability (1 − 1/n²).

The running time. We now analyze the running time of Algorithm 1. In the i-th iteration, the sampling step at Line 6 can be done in O(|X_i|) time. The nearest-center assignments at Lines 7 and 9 can be done in |S_i| · |X_i| = O(κ|X_i|) time. Line 8 can be done by first sorting the distances in increasing order and then scanning the sorted list until we get enough points; in this way the running time is bounded by |X_i| log |X_i| = O(κ|X_i|). Thus the total running time can be bounded by

∑_{i=0,1,...,r−1} O(κ|X_i|) = O(κn) = O(max{log n, k} · n),

where the first equality holds since the size of X_i decreases geometrically, and the second equality is due to the definition of κ.

Finally, we comment that we can get a similar result for (k, t)-means by appropriately adjusting various constant parameters in the proof. Please refer to the supplementary material for a more detailed discussion.

2.3 An Augmentation

In the case when t ≫ k, which is typically the case in practice since the number of centers k does not scale with the size of the dataset while the number of outliers t does, we add an augmentation procedure to Algorithm 1 to achieve a better practical performance. The pseudocode can be found in the supplementary materials and the full version of this paper.

The augmentation is as follows: after computing the set of outliers X_r and the set of centers S = ∪_{i=0}^{r−1} S_i in Algorithm 1, we sample randomly from X\(X_r ∪ S) an additional set of center points S′ of size |X_r| − |S|.
That is, we try to make the number of centers and the number of outliers in the summary balanced. We then reassign each point in the set X\X_r to its nearest center in S ∪ S′. Denote the new mapping by π. Finally, we include the points in X_r and S ∪ S′, together with their weights, in the summary Q.

It is clear that the augmentation procedure preserves the size of the summary asymptotically. And by including more centers we have loss(Q) ≤ φ_X(π) ≤ φ_X(σ), where σ is the mapping returned by Algorithm 1. The running time increases to O(tn) due to the reassignment step, but our algorithm is still much faster than all the baseline algorithms, as we shall see in Section 4.

3 Distributed Clustering with Outliers

In this section we discuss distributed (k, t)-median/means using the summary constructed by Algorithm 1. Our main result is the following theorem, which is based on the work by Guha et al. (2003, 2017). The proof of this theorem can be found in the supplementary material.

Algorithm 2: Distributed-Median(A_1, . . . , A_s, k, t)
Input: for each i ∈ [s], Site i gets input dataset A_i, where (A_1, . . . , A_s) is a random partition of X
Output: a (k, t)-median clustering for X = ∪_{i∈[s]} A_i
1 for each i ∈ [s], Site i constructs a summary Q_i by running Summary-Outliers(A_i, k, 2t/s) (Algorithm 1) and sends Q_i to the coordinator
2 the coordinator then performs a second level clustering on Q = Q_1 ∪ Q_2 ∪ . . . ∪ Q_s using an off-the-shelf (k, t)-median algorithm, and returns the resulting clustering

Theorem 2 Suppose Algorithm 2 uses a γ-approximation algorithm for (k, t)-median in the second level clustering (Line 2). We have with probability (1 − 1/n) that:

• it outputs a set of centers C ⊆ R^d and a set of outliers O ⊆ X such that φ(X\O, C) ≤ O(γ) · OPT^med_{k,t}(X);
• it uses one round of communication whose cost is bounded by O(sk log n + t);
• the running time at the i-th site is bounded by O(max{log n, k} · |A_i|), and the running time at the coordinator is that of the second level clustering.

We note that in Mettu and Plaxton (2002) it was shown that under some mild assumptions, Ω(kn) time is necessary for any O(1)-approximate randomized algorithm to compute k-median on n points with non-negligible success probability (e.g., 1/100). Thus the running time of our algorithm is optimal up to a log n factor under the same assumptions.

In the case that the dataset is adversarially partitioned, the total communication increases to O(s(k log n + t)). This is because all of the t outliers may go to the same site, and thus 2t/s in Line 1 needs to be replaced by t.

Finally, we comment that the result above also holds for the summary constructed using the augmented version (Sec. 2.3), except, as discussed in Section 2, that the local running time at the i-th site increases to O(t|A_i|).

4 Experiments

4.1 Experimental Setup

4.1.1 Datasets and Algorithms

Due to space constraints, we only present the experimental results for two datasets (kddFull and kddSp). One can find results for a number of other datasets in our supplementary materials and the full paper.

• kddFull. This dataset is from the 1999 kddcup competition and contains instances describing connections of sequences of tcp packets. There are about 4.9M data points. We only consider the 34 numerical features of this dataset. We also normalize each feature so that it has zero mean and unit standard deviation.
There are 23 classes in this dataset; 98.3% of the points belong to 3 classes (normal 19.6%, neptune 21.6%, and smurf 56.8%). We consider the small clusters as outliers; there are 45747 outliers.

• kddSp. This dataset contains about 10% of the points of kddFull (released by the original provider). This dataset is also normalized, and there are 8752 outliers.

We comment that finding appropriate k and t values for the task of clustering with outliers is a separate problem, and is not part of the topic of this paper. In all our experiments, k and t are naturally suggested by the datasets we use.

We compare the performance of the following algorithms, each of which is implemented using the MPI framework and run in the coordinator model. The data are randomly partitioned among the sites.

• ball-grow. Algorithm 2 proposed in this paper, with the augmented version of Algorithm 1 for the summary construction. As mentioned, we use k-means-- as the second level clustering at Line 2. We fix α = 2 and β = 0.45 in the subroutine Algorithm 1.

• rand. Each site constructs a summary by randomly sampling points from its local dataset. Each sampled point p is assigned a weight equal to the number of points in the local dataset that are closer to p than to any other point in the summary. The coordinator then collects all weighted samples from all sites and feeds them to k-means-- for a second level clustering.

• k-means++. Each site constructs a summary of its local dataset using the k-means++ algorithm Arthur and Vassilvitskii (2007), and sends it to the coordinator. The coordinator feeds the union of all summaries to k-means-- for a second level clustering.

• k-means‖. An MPI implementation of the k-means‖ algorithm proposed by Bahmani et al. (2012) for distributed k-means clustering.
To adapt their algorithm to the outlier version, we increase the parameter k in the algorithm to O(k + t), and then feed the outputted centers to k-means-- for a second level clustering.

4.1.2 Measurements

Let C and O be the sets of centers and outliers respectively returned by a tested algorithm. To evaluate the quality of the clustering results we use two metrics: (a) ℓ1-loss (for (k, t)-median): ∑_{p∈X\O} d(p, C); (b) ℓ2-loss (for (k, t)-means): ∑_{p∈X\O} d²(p, C).

To measure the performance of outlier detection we use three metrics. Let S be the set of points fed into the second level clustering k-means-- in each algorithm, and let O* be the set of actual outliers (i.e., the ground truth). We use the following metrics: (a) preRec: the proportion of actual outliers that are included in the returned summary, defined as |S ∩ O*| / |O*|; (b) recall: the proportion of actual outliers that are returned by k-means--, defined as |O ∩ O*| / |O*|; (c) prec: the proportion of points in O that are actually outliers, defined as |O ∩ O*| / |O|.

4.1.3 Computation Environments

All algorithms are implemented in C++ with Boost.MPI support. We use Armadillo Sanderson (2010) as the numerical linear algebra library, and the -O3 flag is enabled when compiling the code. All experiments are conducted on a PowerEdge R730 server equipped with 2 x Intel Xeon E5-2667 v3 3.2GHz CPUs. This server has 8 cores/16 threads per CPU, 192GB memory and a 1.6TB SSD.

4.2 Experimental Results

We now present our experimental results. All results are the average of 10 runs. In our supplementary material, results for more datasets can be found; all the conclusions remain the same.

4.2.1 Quality

We first compare the quality of the summaries returned by ball-grow, rand and k-means‖.
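The three outlier-detection metrics of Section 4.1.2 can be computed directly from sets of point identifiers; a minimal sketch (function and argument names are ours):

```python
def outlier_metrics(summary_points, reported_outliers, true_outliers):
    """Outlier-detection metrics of Section 4.1.2:
    preRec = |S ∩ O*| / |O*|  (true outliers that survive into the summary S)
    recall = |O ∩ O*| / |O*|  (true outliers among the reported outliers O)
    prec   = |O ∩ O*| / |O|   (reported outliers that are truly outliers)"""
    S, O, Ostar = map(set, (summary_points, reported_outliers, true_outliers))
    return {
        "preRec": len(S & Ostar) / len(Ostar),
        "recall": len(O & Ostar) / len(Ostar),
        "prec": len(O & Ostar) / len(O),
    }
```

Since O is chosen from S by the second level clustering, recall can never exceed preRec, which is why preRec is reported separately: it upper-bounds how well any second level clustering could do on a given summary.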
Note that the size of the summary returned by ball-grow is determined by the parameters k and t, and we cannot control the exact size. In k-means‖, the summary size is determined by the sample ratio, and again we cannot control the exact size. On the other hand, the summary sizes of rand and k-means++ can be fully controlled. To be fair, we manually tune those parameters so that the sizes of the summaries returned by the different algorithms are roughly the same (the difference is less than 10%). In this set of experiments, each dataset is randomly partitioned into 20 sites.

Table 2 presents the experimental results on the kddSp and kddFull datasets. We observe that ball-grow gives better ℓ1-loss and ℓ2-loss than k-means‖ and k-means++, and rand performs the worst among all.

For outlier detection, rand fails completely. In both kddFull and kddSp, ball-grow outperforms k-means++ and k-means‖ in almost all metrics. k-means‖ slightly outperforms k-means++.

4.2.2 Communication Costs

We next compare the communication cost of the different algorithms. Figure 1a presents the experimental results. The communication cost is measured by the number of points exchanged between the coordinator and all sites. In this set of experiments we only change the number of partitions (i.e., the number of sites s). The summaries returned by all algorithms have almost the same size.

dataset | algo      | summarySize | ℓ1-loss | ℓ2-loss | preRec | prec   | recall
kddSp   | ball-grow | 3.37e+4     | 8.00e+5 | 3.46e+6 | 0.6102 | 0.5586 | 0.5176
kddSp   | k-means++ | 3.37e+4     | 8.38e+5 | 4.95e+6 | 0.3660 | 0.3676 | 0.1787
kddSp   | k-means‖  | 3.30e+4     | 8.18e+5 | 4.19e+6 | 0.2921 | 0.3641 | 0.1552
kddSp   | rand      | 3.37e+4     | 8.85e+5 | 1.06e+7 | 0.0698 | 0.5076 | 0.0374
kddFull | ball-grow | 1.83e+5     | 7.38e+6 | 3.54e+7 | 0.7754 | 0.5992 | 0.5803
kddFull | k-means++ | 1.83e+5     | 8.21e+6 | 4.65e+7 | 0.2188 | 0.2828 | 0.1439
kddFull | k-means‖  | (does not stop after 8 hours)
kddFull | rand      | 1.83e+5     | 9.60e+6 | 1.11e+8 | 0.0378691 | 0.6115 | 0.0241

Table 2: Clustering quality. k = 3; t = 8752 for kddSp and t = 45747 for kddFull.

Figure 1: experiments on the kddSp dataset. (a) communication cost; (b) running time (log10 scale); (c) running time, #sites = 20.

We observe that the communication costs of ball-grow, k-means++ and rand are almost independent of the number of sites. Indeed, ball-grow, k-means++ and rand all run in one round, and their communication cost is simply the size of the union of the s summaries. k-means‖ incurs significantly more communication, and its cost grows almost linearly with the number of sites. This is because k-means‖ grows its summary in multiple rounds; in each round, the coordinator needs to collect messages from all sites and broadcast the union of those messages. When there are 20 sites, k-means‖ incurs 20 times more communication cost than its competitors.

4.2.3 Running Time

We finally compare the running time of the different algorithms. All experiments in this part are conducted on the kddSp dataset, since k-means‖ does not scale to kddFull; similar results can also be observed on other datasets. The running time we report is only the time used to construct the input (i.e., the union of the s summaries) for the second level clustering; we do not include the running time of the second level clustering since it is always the same for all tested algorithms (i.e., k-means--).

Figure 1b shows the running time when we change the number of sites while fixing the size of the summary produced by each site. We observe that k-means‖ uses significantly more time than ball-grow, k-means++ and rand. This is predictable because k-means‖ runs in multiple rounds and communicates more than its competitors.
ball-grow uses significantly less time than the others, typically 1/25 of the time of k-means‖, 1/7 of k-means++, and 1/2 of rand. The reason that ball-grow is even faster than rand is that ball-grow only needs to compute weights for about half of the points in the constructed summary. As can be expected, when we increase the number of sites, the total running time of each algorithm decreases.

We also investigate how the size of the summary affects the running time. Note that for ball-grow the summary size is controlled by the parameter t. We fix k = 3 and vary t, resulting in different summary sizes for ball-grow. For the other algorithms, we tune the parameters so that they output summaries of sizes similar to those of ball-grow. Figure 1c shows that when the size of the summary increases, the running time increases almost linearly for all algorithms.

Acknowledgments

Jiecao Chen, Erfan Sadeqi Azer and Qin Zhang are supported in part by NSF CCF-1525024, NSF CCF-1844234 and IIS-1633215.

References

Arthur, D. and Vassilvitskii, S. (2007). k-means++: the advantages of careful seeding. In SODA, pages 1027–1035.

Bahmani, B., Moseley, B., Vattani, A., Kumar, R., and Vassilvitskii, S. (2012). Scalable k-means++. PVLDB, 5(7), 622–633.

Balcan, M., Ehrlich, S., and Liang, Y. (2013). Distributed k-means and k-median clustering on general communication topologies. In NIPS, pages 1995–2003.

Baldi, P., Sadowski, P., and Whiteson, D. (2014). Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5.

Charikar, M., Khuller, S., Mount, D. M., and Narasimhan, G. (2001). Algorithms for facility location problems with outliers. In SODA, pages 642–651.

Chawla, S. and Gionis, A. (2013). k-means--: A unified approach to clustering and outlier detection. In SDM, pages 189–197.

Chen, J., Sun, H., Woodruff, D. P., and Zhang, Q. (2016).
Communication-optimal distributed clustering. In NIPS, pages 3720–3728.

Chen, K. (2009). On coresets for k-median and k-means clustering in metric and Euclidean spaces and their applications. SIAM J. Comput., 39(3), 923–947.

Cohen, M. B., Elder, S., Musco, C., Musco, C., and Persu, M. (2015). Dimensionality reduction for k-means clustering and low rank approximation. In STOC, pages 163–172.

Diakonikolas, I., Grigorescu, E., Li, J., Natarajan, A., Onak, K., and Schmidt, L. (2017). Communication-efficient distributed learning of discrete distributions. In NIPS, pages 6394–6404.

Ene, A., Im, S., and Moseley, B. (2011). Fast clustering using mapreduce. In SIGKDD, pages 681–689.

Feldman, D. and Schulman, L. J. (2012). Data reduction for weighted and outlier-resistant clustering. In SODA, pages 1343–1354.

Guha, S., Meyerson, A., Mishra, N., Motwani, R., and O'Callaghan, L. (2003). Clustering data streams: Theory and practice. IEEE Trans. Knowl. Data Eng., 15(3), 515–528.

Guha, S., Li, Y., and Zhang, Q. (2017). Distributed partial clustering. In SPAA, pages 143–152.

Gupta, S., Kumar, R., Lu, K., Moseley, B., and Vassilvitskii, S. (2017). Local search methods for k-means with outliers. PVLDB, 10(7), 757–768.

Li, S. and Guo, X. (2018). Distributed k-clustering for data with heavy noise. arXiv preprint arXiv:1810.07852.

Liang, Y., Balcan, M., Kanchanapally, V., and Woodruff, D. P. (2014). Improved distributed principal component analysis. In NIPS, pages 3113–3121.

Lloyd, S. P. (1982). Least squares quantization in PCM. IEEE Trans. Information Theory, 28(2), 129–136.

Mettu, R. R. and Plaxton, C. G. (2002). Optimal time bounds for approximate clustering. In UAI, pages 344–351.

Sanderson, C. (2010).
Armadillo: An open source C++ linear algebra library for fast prototyping and computationally intensive experiments.", "award": [], "sourceid": 1141, "authors": [{"given_name": "Jiecao", "family_name": "Chen", "institution": "Indiana University Bloomington"}, {"given_name": "Erfan", "family_name": "Sadeqi Azer", "institution": "Indiana University"}, {"given_name": "Qin", "family_name": "Zhang", "institution": "Indiana University Bloomington"}]}