{"title": "Fast and Accurate k-means For Large Datasets", "book": "Advances in Neural Information Processing Systems", "page_first": 2375, "page_last": 2383, "abstract": "", "full_text": "Fast and Accurate k-llleans For Large Datasets\n\nMichael Shindler\nSchool of EECS\n\nOregon State University\n\nAlex Wong\n\nDepartment of Computer Science\n\nUC Los Angeles\n\nshindler@eecs.oregonstate.edu\n\nalexw@seas.ucla.edu\n\nAdam Meyerson\n\nGoogle, Inc.\n\nMountain View, CA\n\nawmeyerson@google.com\n\nAbstract\n\nClustering is a popular problem with many applications. We consider the k-means\nproblem in the situation where the data is too large to be stored in main memory\nand must be accessed sequentially, such as from a disk, and where we must use\nas little memory as possible. Our algorithm is based on recent theoretical results,\nwith significant improvements to make it practical. Our approach greatly simpli(cid:173)\nfies a recently developed algorithm, both in design and in analysis, and eliminates\nlarge constant factors in the approximation guarantee, the memory requirements,\nand the running time. We then incorporate approximate nearest neighbor search to\ncompute k-means in o(nk) (where n is the number of data points; note that com(cid:173)\nputing the cost, given a solution, takes 8(nk) time). We show that our algorithm\ncompares favorably to existing algorithms - both theoretically and experimentally,\nthus providing state-of-the-art performance in both theory and practice.\n\n1 Introduction\nIn the k-means\nWe design improved algorithms for Euclidean k-means in the streaming model.\nproblem, we are given a set of n points in space. Our goal is to select k points in this space to\ndesignate asfacilities (sometimes called centers or means); the overall cost of the solution is the sum\nof the squared distances from each point to its nearest facility. 
The goal is to minimize this cost; unfortunately the problem is NP-hard to optimize, although both heuristic [21] and approximation algorithm techniques [20, 25, 7] exist. In the streaming model, we require that the point set be read sequentially, and that our algorithm stores very few points at any given time. Many problems which are easy to solve in the standard batch-processing model require more complex techniques in the streaming model (a survey of streaming results is available [3]); nonetheless there are a number of existing streaming approximations for Euclidean k-means. We present a new algorithm for the problem based on [9] with several significant improvements; we are able to prove a faster worst-case running time and a better approximation factor. In addition, we compare our algorithm empirically with the previous state-of-the-art results of [2] and [4] on publicly available large data sets. Our algorithm outperforms them both.\n\nThe notion of clustering has widespread applicability, such as in data mining, pattern recognition, compression, and machine learning. The k-means objective is one of the most popular formalisms, and in particular Lloyd's algorithm [21] has significant usage [5, 7, 19, 22, 23, 25, 27, 28]. Many of the applications for k-means have experienced a large growth in data that has overtaken the amount of memory typically available to a computer. This is expressed in the streaming model, where an algorithm must make one (or very few) passes through the data, reflecting cases where random access to the data is unavailable, such as a very large file on a hard disk. Note that the data size, despite being large, is still finite.\n\nOur algorithm is based on the recent work of [9]. 
They \"guess\" the cost of the optimum, then run the\nonline facility location algorithm of [24] until either the total cost of the solution exceeds a constant\ntimes the guess or the total number of facilities exceeds some computed value 1\\,. They then declare\nthe end of a phase, increase the guess, consolidate the facilities via matching, and continue with\nthe next point. When the stream has been exhausted, the algorithm has some I\\, facilities, which are\nthen consolidated down to k. They then run a ball k-means step (similar to [25]) by maintaining\nsamples of the points assigned to each facility and moving the facilities to the centers of mass of\nthese samples. The algorithm uses O(k logn) memory, runs in O(nk logon) time, and obtains an\n0(1) worst-case approximation. Provided that the original data set was o--separable (see section 1.2\nfor the definition), they use ball k-means to improve the approximation factor to 1 + 0(0-2 ).\nFrom a practical standpoint, the main issue with [9] is that the constants hidden in the asymptotic\nnotation are quite large. The approximation factor is in the hundreds, and the 0 (k log n) memory\nrequirement has sufficiently high constants that there are actually more than n facilities for many of\nthe data sets analyzed in previous papers. Further, these constants are encoded into the algorithm\nitself, making it difficult to argue that the performance should improve for non-worst-case inputs.\n1.1 Our Contributions\nWe substantially simplify the algorithm of [9]. We improve the manner by which the algorithm\ndetermines better facility cost as the stream is processed, removing unnecessary checks and allowing\nthe user to parametrize what remains. We show that our changes result in a better approximation\nguarantee than the previous work. 
We also develop a variant that computes a solution in o(nk) time and show experimentally that both algorithms outperform previous techniques.\n\nWe remove the end-of-phase condition based on the total cost, ending phases only when the number of facilities exceeds κ. While we require κ ∈ Ω(k log n), we do not require any particular constants in the expression (in fact we will use κ = k log n in our experiments). We also simplify the transition between phases, observing that it is quite simple to bound the number of phases by log OPT (where OPT is the optimum k-means cost), and that in practice this number of phases is usually quite a bit less than n.\n\nWe show that despite our modifications, the worst-case approximation factor is still constant. Our proof is based on a much tighter bound on the cost incurred per phase, along with a more flexible definition of the \"critical phase\" by which the algorithm should terminate. Our proofs establish that the algorithm converges for any κ > k; of course, there are inherent tradeoffs between κ and the approximation bound. For appropriately chosen constants our approximation factor will be roughly 17, substantially less than the factor claimed in [9] prior to the ball k-means step.\n\nIn addition, we apply approximate nearest-neighbor algorithms to compute the facility assignment of each point. The running time of our algorithm is dominated by repeated nearest-neighbor calculations, and an appropriate technique can change our running time from Θ(nk log n) to Θ(n(log k + log log n)), an improvement for most values of k. Of course, this hurts our accuracy somewhat, but we are able to show that we take only a constant-factor loss in approximation. Note that our final running time is actually faster than the Θ(nk) time needed to compute the k-means cost of a given set of facilities!\n\nIn addition to our theoretical improvements, we perform a number of empirical tests using realistic data. 
This allows us to compare our algorithm to the previous streaming k-means results [4, 2].\n\n1.2 Previous Work\nA simple local search heuristic for the k-means problem was proposed in 1957 by Lloyd [21]. The algorithm begins with k arbitrarily chosen points as facilities. At each stage, it allocates the points into clusters (each point assigned to its closest facility) and then computes the center of mass for each cluster. These become the new facilities for the next phase, and the process repeats until it is stable. Unfortunately, Lloyd's algorithm has no provable approximation bound, and arbitrarily bad examples exist. Furthermore, the worst-case running time is exponential [29]. Despite these drawbacks, Lloyd's algorithm (frequently known simply as k-means) remains common in practice.\n\nThe best polynomial-time approximation for k-means is by Kanungo, Mount, Netanyahu, Piatko, Silverman, and Wu [20]. Their algorithm uses local search (similar to the k-median algorithm of [8]), and is a 9 + ε approximation. However, Lloyd's observed runtime is superior, and this is a high priority for real applications.\n\nOstrovsky, Rabani, Schulman and Swamy [25] observed that the value of k is typically selected such that the data is \"well-clusterable\" rather than being arbitrary. They defined the notion of σ-separability, where the input to k-means is said to be σ-separable if reducing the number of facilities from k to k - 1 would increase the cost of the optimum solution by a factor 1/σ². They designed an algorithm with approximation ratio 1 + O(σ²). Subsequently, Arthur and Vassilvitskii [7] showed that the same procedure produces an O(log k) approximation for arbitrary instances of k-means.\n\nThere are two basic approaches to the streaming version of the k-means problem. Our approach is based on solving k-means as we go (thus at each point in the algorithm, our memory contains a current set of facilities). 
This type of approach was pioneered in 2000 by Guha, Mishra, Motwani, and O'Callaghan [17]. Their algorithm reads the data in blocks, clustering each using some non-streaming approximation, and then gradually merges these blocks when enough of them arrive. An improved result for k-median was given by Charikar, O'Callaghan, and Panigrahy in 2003 [11], producing an O(1) approximation using O(k log² n) space. Their work was based on guessing a lower bound on the optimum k-median cost and running O(log n) parallel versions of the online facility location algorithm of Meyerson [24] with facility cost based on the guessed lower bound. When these parallel calls exceeded the approximation bounds, they would be terminated and the guessed lower bound on the optimum k-median cost would increase. The recent paper of Braverman, Meyerson, Ostrovsky, Roytman, Shindler, and Tagiku [9] extended the result of [11] to k-means and improved the space bound to O(k log n) by proving high-probability bounds on the performance of online facility location. This result also added a ball k-means step (as in [25]) to substantially improve the approximation factor under the assumption that the original data was σ-separable.\n\nAnother recent result for streaming k-means, due to Ailon, Jaiswal, and Monteleoni [4], is based on a divide and conquer approach, similar to the k-median algorithm of Guha, Meyerson, Mishra, Motwani, and O'Callaghan [16]. It uses the result of Arthur and Vassilvitskii [7] as a subroutine, finding 3k log k centers for each block. Their experiments showed that this algorithm is an improvement over an online variant of Lloyd's algorithm and was comparable to the batch version of Lloyd's.\n\nThe other approach to streaming k-means is based on coresets: selecting a weighted subset of the original input points such that any k-means solution on the subset has roughly the same cost as on the original point set. 
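Concretely, such a weighted subset stands in for the full data through a weighted version of the k-means cost, in which each retained point counts with its weight; a minimal sketch (names are ours, for illustration):

```python
def weighted_kmeans_cost(weighted_points, facilities):
    # weighted_points: iterable of (point, weight) pairs standing in for
    # the original data; each squared distance is counted weight times.
    total = 0.0
    for p, w in weighted_points:
        total += w * min(sum((a - b) ** 2 for a, b in zip(p, f))
                         for f in facilities)
    return total
```

A coreset is then a weighted subset for which this quantity is close to the cost on the full data for every choice of k facilities.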
At any point in the algorithm, the memory should contain a weighted representative sample of the points. This approach was first used in a non-streaming setting for a variety of clustering problems by Badoiu, Har-Peled, and Indyk [10], and in the streaming setting by Har-Peled and Mazumdar [18]; the time and memory bounds were subsequently improved through a series of papers [14, 13], with the current best theoretical bounds by Chen [12]. A practical implementation of the coreset paradigm is due to Ackermann, Lammersen, Martens, Raupach, Sohler, and Swierkot [2]. Their approach was shown empirically to be fast and accurate on a variety of benchmarks.\n\n2 Algorithm and Theory\nBoth our algorithm and that of [9] are based on the online facility location algorithm of [24]. For the facility location problem, the number of clusters is not part of the input (as it is for k-means); rather, a facility cost is given. An algorithm for this problem may have as many clusters as it desires in its output, simply by denoting some point as a facility. The solution cost is then the sum of the resulting k-means cost (\"service cost\") and the total paid for facilities.\n\nOur algorithm runs the online facility location algorithm of [24] with a small facility cost until we have more than κ ∈ Θ(k log n) facilities. It then increases the facility cost, re-evaluates the current facilities, and continues with the stream. This repeats until the entire stream is read. The details of the algorithm are given as Algorithm 1.\n\nThe major differences between our algorithm and that of [9] are as follows. We ignore the overall service cost in determining when to end a phase and raise our facility cost f. Further, the number of facilities which must open to end a phase can be any κ ∈ Θ(k log n); the constants do not depend directly on the competitive ratio of online facility location (as they did in [9]). 
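The core loop of Algorithm 1 — open an arriving point as a new facility with probability δ/f, and once more than κ facilities are open, multiply f by β and re-cluster the weighted facilities by the same rule — can be sketched in Python. This is an illustrative simplification under our own naming, not the paper's C/C++ implementation; it omits the center-of-mass moves and the final batch and ball k-means steps:

```python
import math
import random

def d2(p, q):
    # squared Euclidean distance
    return sum((a - b) ** 2 for a, b in zip(p, q))

def streaming_phase_sketch(stream, k, n, kappa, beta, rand=random.random):
    f = 1.0 / (k * (1 + math.log(n)))   # initial facility cost
    K = {}                              # facility -> weight (points served)
    for x in stream:
        delta = min((d2(x, y) for y in K), default=float("inf"))
        if rand() < delta / f:          # open x as a new facility
            K[x] = 1
        else:                           # serve x from its nearest facility
            nearest = min(K, key=lambda c: d2(x, c))
            K[nearest] += 1
        while len(K) > kappa:           # end of phase: raise f, re-cluster
            f *= beta
            items = list(K.items())
            Khat = dict(items[:1])      # keep the first facility
            for y, w in items[1:]:
                delta = min(d2(y, z) for z in Khat)
                if rand() < w * delta / f:
                    Khat[y] = w
                else:
                    nearest = min(Khat, key=lambda c: d2(y, c))
                    Khat[nearest] += w
            K = Khat
    return K   # at most kappa weighted facilities for a batch k-means solver
```

At every moment only the (at most κ) weighted facilities are held in memory; a batch k-means algorithm then reduces them to k.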
Finally, we omit the somewhat complicated end-of-phase analysis of [9], which used matching to guarantee that the number of facilities decreased substantially with each phase and allowed bounding the number of phases by n/(k log n). We observe that our number of phases will be bounded by log_β OPT; while this is not technically bounded in terms of n, in practice this term should be smaller than the linear number of phases implied in previous work.\n\nAlgorithm 1 Fast streaming k-means (data stream, k, κ, β)\n1: Initialize f = 1/(k(1 + log n)) and an empty set K\n2: while some portion of the stream remains unread do\n3:    while |K| ≤ κ = Θ(k log n) and some portion of the stream is unread do\n4:       Read the next point x from the stream\n5:       Measure δ = min_{y ∈ K} d(x, y)²\n6:       if probability δ/f event occurs then\n7:          set K ← K ∪ {x}\n8:       else\n9:          assign x to its closest facility in K\n10:   if stream not exhausted then\n11:      while |K| > κ do\n12:         Set f ← βf\n13:         Move each x ∈ K to the center-of-mass of its points\n14:         Let w_x be the number of points assigned to x ∈ K\n15:         Initialize K̂ containing the first facility from K\n16:         for each x ∈ K do\n17:            Measure δ = min_{y ∈ K̂} d(x, y)²\n18:            if probability w_x δ/f event occurs then\n19:               set K̂ ← K̂ ∪ {x}\n20:            else\n21:               assign x to its closest facility in K̂\n22:         Set K ← K̂\n23: Run batch k-means algorithm on weighted points K\n24: Perform ball k-means (as per [9]) on the resulting set of clusters\n\nWe will give a theoretical analysis of our modified algorithm to obtain a constant approximation bound. Our constant is substantially smaller than those implicit in [9], with most of the loss occurring in the final non-streaming k-means algorithm used to consolidate κ means down to k. The analysis will follow from the theorems stated below; proofs of these theorems are deferred to the appendix.\nTheorem 1. 
Suppose that our algorithm completes the data stream when the facility cost is f. Then the overall solution prior to the final re-clustering has expected service cost at most κf, and the probability of being within 1 + ε of the expected service cost is at least 1 - 1/poly(n).\n\nTheorem 2. With probability at least 1 - 1/poly(n), the algorithm will either halt with f ≤ Θ(C*/κ)β, where C* is the optimum k-means cost, or it will halt within one phase of exceeding this value. Furthermore, for large values of κ and β, the hidden constant in the Θ(·) approaches 4.\n\nNote that while the worst-case bound of roughly 4 proven here may not seem particularly strong, unlike the previous work of [9], the worst-case performance is not directly encoded into the algorithm. In practice, we would expect the performance of online facility location to be substantially better than worst-case (in fact, if the ordering of points in the stream is non-adversarial there is a proof to this effect in [24]); in addition, the analysis assumes that distances add (i.e. the triangle inequality is tight), which will not be true in practice (especially for points in low-dimensional space). We also assumed that using more than k facilities does not substantially help the optimum service cost (also unlikely to be true for real data). Combining these, it would be unsurprising if our service cost was actually better than optimum at the end of the data stream (of course, we used many more facilities than optimum, so it is not precisely a fair comparison). The following theorem summarizes the worst-case performance of the algorithm; its proof is direct from Theorems 1 and 2.\n\nTheorem 3. 
The cost of our algorithm's final κ-means solution is at most O(C*), where C* is the cost of the optimum k-means solution, with probability 1 - 1/poly(n). If κ is a large constant times k log n and β > 2 is fairly large, then the cost of our algorithm's solution will approach C* · 4β²/(β - 1); the extra β factor is due to \"overshooting\" the best facility cost f.\n\nWe note that if we run the streaming part of the algorithm M times in parallel, we can take the solution with the smallest final facility cost. This improves the approximation factor to roughly 4β^(1+1/M)/(β - 1), which approaches 4β/(β - 1) in the limit. Of course, increasing κ can substantially increase the memory requirement, and increasing M can increase both memory and running time requirements.\n\nWhen the algorithm terminates, we have a set of κ weighted means which we must reduce to k means. A theoretically sound approach involves mapping these means back to randomly selected points from the original set (these can be maintained in a streaming manner) and then approximating k-means on κ points using a non-streaming algorithm. The overall approximation ratio will be twice the ratio established by our algorithm (we lose a factor of two by mapping back to the original points) plus the approximation ratio for the non-streaming algorithm. If we use the algorithm of [20] along with a large κ, we will get an approximation factor of twice 4 plus 9 + ε, or roughly 17. Ball k-means can then reduce the approximation factor to 1 + O(σ²) if the inputs were σ-separable (as in [25] and [9]; the hidden constant will be reduced by our more accurate algorithm).\n\n3 Approximate Nearest Neighbor\nThe most time-consuming step in our algorithm is measuring δ in lines 5 and 17. 
This requires as many as κ distance computations; there are a number of results enabling fast computation of approximate nearest neighbors, and applying these results will improve our running time. If we can assume that errors in nearest neighbor computation are independent from one point to the next (and that the expected result is good), our analysis from the previous section applies. Unfortunately, many of the algorithms construct a random data structure to store the facilities, then use this structure to resolve all queries; this type of approach implies that errors are not independent from one query to the next. Nonetheless we can obtain a constant approximation for sufficiently large choices of β.\n\nFor our empirical result, we will use a very simple approximate nearest-neighbor algorithm based on random projection. This has reasonable performance in expectation, but is not independent from one step to the next. While the theoretical results from this particular approach are not very strong, it works very well in our experiments. For this implementation, a vector w is created, with each of its d coordinates chosen independently and uniformly at random from [0, 1). We store our facilities sorted by their inner product with w. When a new point x arrives, instead of taking O(κ) time to determine its (exact) nearest neighbor, we instead use O(log κ) time to find the two facilities whose inner products with w bracket x · w. We determine the (exact) closer of these two facilities; this determines the value of δ in lines 5 and 17 and the \"closest\" facility in lines 9 and 21.\nTheorem 4. 
If our approximate nearest neighbor computation finds a facility whose distance is at most ν times the distance to the closest facility in expectation, then the approximation ratio increases by a constant factor.\n\nWe defer the explanation of how we obtain the stronger theoretical result to the appendix.\n\n4 Empirical Evaluation\nA comparison of algorithms on real data sets gives a great deal of insight as to their relative performance. Real data is not worst-case, implying that neither the asymptotic performance nor the running-time bounds claimed in theoretical results are necessarily tight. Of course, empirical evaluation depends heavily on the data sets selected for the experiments.\n\nWe selected data sets which have been used previously to demonstrate streaming algorithms. A number of the data sets analyzed in previous work were not particularly large, probably so that batch-processing algorithms would terminate quickly on those inputs. The main motivation for streaming is very large data sets, so we are more interested in sets that might be difficult to fit in main memory and focused on the largest examples. We looked to [2], and used the two biggest data sets they considered. These were the BigCross dataset1 and the Census1990 dataset2. All the other data sets in [2, 4] were either subsets of these or were well under a half million points.\n\n1 The BigCross dataset is 11,620,300 points in 57-dimensional space; it is available from [1].\n2 The Census1990 dataset is 2,458,285 points in 68 dimensions; it is available from [15].\n\nA necessary input for each of these algorithms is the desired number of clusters. Previous work chose k seemingly arbitrarily; typical values were of the form {5, 10, 15, 20, 25}. While this input provides a well-defined geometry problem, it fails to capture any information about how k-means is used in practice and need not lead to separable data. 
Instead, we want to select k such that the best k-means solution is much cheaper than the best (k - 1)-means solution. Since k-means is NP-hard, we cannot solve large instances to optimality. For the Census dataset we ran several iterations of the algorithm of [25] for each of many values of k. We took the best observed cost for each value of k, and found the four values of k minimizing the ratio of k-means cost to (k - 1)-means cost.\n\nThis was not possible for the larger BigCross dataset. Instead, we ran a modified version of our algorithm; at the end of a phase, it adjusts the facility cost and restarts the stream. This avoids the problem of compounding the approximation factor at the end of a phase. As with Census, we ran this for consecutive values of k and chose the best ratios of observed values; we chose two, rather than four, so that we could finish our experiments in a reasonable amount of time. Our approach to selecting k is closer to what is done in practice, and is more likely to yield meaningful results.\n\nWe do not compare to the algorithm of [9]. First, its memory use is not configurable, so it does not fit into the common baseline that we will define shortly. Second, the memory requirements and runtime, while asymptotically nice, have large leading constants that make it impractical. In fact, it was an attempt to implement this algorithm that initially motivated the work on this paper.\n\n4.1 Implementation Discussion\nThe divide and conquer (\"D&C\") algorithm [4] can use its available memory in two possible ways. First, it can use the entire amount to read from the stream, writing the results of computing their 3k log k means to disk; when the stream is exhausted, this file is treated as a stream, until an iteration produces a file that fits entirely into main memory. 
Alternately, the available memory could be partitioned into layers; the first layer would be filled by reading from the stream, and the weighted facilities produced would be stored in the second. When any layer is full, it can be clustered and the result placed in a higher layer, replacing the use of files and disk. Upon completion of the stream, any remaining points are gathered and clustered to produce the k final means. When larger amounts of memory are available, the latter method is preferred. With smaller amounts, however, this isn't always possible, and when it is possible, it can produce worse actual running times than a disk-based approach. As our goal is to judge streaming algorithms under low memory conditions, we used the first approach, which is more fitting to such a constraint.\n\nEach algorithm3 was programmed in C/C++, compiled with g++, and run under Ubuntu Linux (10.04 LTS) on an HP Pavilion p6520f desktop PC, with an AMD Athlon II X4 635 processor running at 2.9 GHz and with 6 GB main memory (although nowhere near the entirety of this was used by any algorithm). For StreamKM++, the authors' implementation [2], also in C, was used instead. With all algorithms, the reported cost is determined by taking the resulting k facilities and computing the k-means cost across the entire dataset. The time to compute this cost is not included in the reported running times of the algorithms. Each test case was run 10 times and the average costs and running times were reported.\n\n4.2 Experimental Design\nOur goal is to compare the algorithms at a common basepoint. Instead of just comparing for the same dataset and cluster count, we further constrained each to use the same amount of memory (in terms of number of points stored in random access). The memory constraints were chosen to reflect the usage of small amounts of memory that are close to the algorithms' designers' specifications, where possible. 
Ailon et al. [4] suggest √(nk) memory for the batch process; this memory availability is marked in the charts by an asterisk. The suggestion from [2] for a coreset of size 200k was not run for all algorithms, as the amount of memory necessary for computing a coreset of this size is much larger than the other cases, and our goal is to compare the algorithms at a small memory limit. This does produce a drop in solution quality compared to running the algorithm at their suggested parameters, although their approach remains competitive. Finally, our algorithm suggests memory of κ = k log n or a small constant times the same.\n\nIn each case, the memory constraint dictates the parameters; for the divide and conquer algorithm, this is simply the batch size. The coreset size is also dictated by the available memory. Our algorithm is a little more parametrizable; when M memory is available, we allowed κ = M/5 and each facility to have four samples.\n\n3 Visit http://web.engr.oregonstate.edu/~shindler/ to access code for our algorithms\n\nFigure 1: Census Data, k=8, cost\nFigure 2: Census Data, k=8, time\nFigure 3: Census Data, k=12, cost\nFigure 4: Census Data, k=12, time\nFigure 5: BigCross Data, k=13, cost\nFigure 6: BigCross Data, k=13, time\nFigure 7: BigCross Data, k=24, cost\nFigure 8: BigCross Data, k=24, time\n\n4.3 Discussion of Results\nWe see that our algorithms are much faster than the D&C algorithm, while having a comparable (and often better) solution quality. 
We find that we compare well to StreamKM++ in average results, with a closer standard deviation and a better sketch of the data produced. Furthermore, our algorithm stands to gain the most by improved solutions to batch k-means, due to the better representative sample present after the stream is processed.\n\nThe prohibitively high running time of the divide-and-conquer algorithm [4] is due to the many repeated instances of running their k-means# algorithm on each batch of the given size. For sufficiently large memory, this is not problematic, as very few batches will need this treatment. Unfortunately, with very small locally available memory, there will be an immense number of repeated calls, and the overall running time will suffer greatly. In particular, the observed running time was much worse than the other approaches. For the Census dataset with k = 12, for example, the slowest run of our algorithm (20 minutes) and the fastest run of the D&C algorithm (125 minutes) occurred on the same test case. It is because of this discrepancy that we present the chart of algorithm running times as a log-plot. Furthermore, due to the prohibitively high running time on the smaller data set, we omitted the divide-and-conquer algorithm for the experiment with the larger set.\n\nThe decline in accuracy for StreamKM++ at very low memory can be partially explained by the Θ(k² log⁸ n) points' worth of memory needed for a strong guarantee in previous theory work [12]. However, the fact that the algorithm is able to achieve a good approximation in practice while using far less than that amount of memory suggests that improved provable bounds for coreset algorithms may be on the horizon. 
We should note that the performance of the algorithm declines sharply as the memory difference with the authors' specification grows, but gains accuracy as the memory grows.\n\nAll three algorithms can be described as computing a weighted sketch of the data, and then solving k-means on that sketch. The final approximation ratios can be described as α(1 + ε), where α is the loss from the final batch algorithm. The coreset ε is a direct function of the memory allowed to the algorithm, and can be made arbitrarily small. However, the memory needed to provably reduce ε to a small constant is quite substantial, and while StreamKM++ does produce a good resulting clustering, it is not immediately clear that the discovery of better batch k-means algorithms would improve their solution quality. Our algorithm's ε represents the ratio of the cost of our κ-means solution to the cost of the optimum k-means solution. The provable value is a large constant, but since κ is much larger than k, we would expect better performance in practice, and we observe this effect in our experiments.\n\nFor our algorithm, the observed value of 1 + ε has been typically between 1 and 3, whereas the D&C approach did not yield one better than 24, and was high (low thousands) for the very low memory conditions. The coreset algorithm was the worst, with even the best values in the tens to hundreds of billions. The low ratio for our algorithm also suggests that our κ facilities are a good sketch of the overall data, and thus our observed accuracy can be expected to improve as more accurate batch k-means algorithms are discovered.\n\nAcknowledgments\n\nWe are grateful to Christian Sohler's research group for providing their code for the StreamKM++ algorithm. We also thank Jennifer Wortman Vaughan, Thomas G. Dietterich, Daniel Sheldon, Andrea Vattani, and Christian Sohler for helpful feedback on drafts of this paper. 
This work was done while all the authors were at UCLA; at that time, Adam Meyerson and Michael Shindler were partially supported by NSF CIF Grant CCF-1016540.\n\nReferences\n\n[1] http://www.cs.uni-paderborn.de/en/fachgebiete/ag-bloemer/research/clustering/streamkmpp.\n\n[2] Marcel R. Ackermann, Christian Lammersen, Marcus Martens, Christoph Raupach, Christian Sohler, and Kamil Swierkot. StreamKM++: A clustering algorithm for data streams. In ALENEX, 2010.\n\n[3] Charu C. Aggarwal, editor. Data Streams: Models and Algorithms. Springer, 2007.\n\n[4] Nir Ailon, Ragesh Jaiswal, and Claire Monteleoni. Streaming k-means approximation. In NIPS, 2009.\n\n[5] Khaled Alsabti, Sanjay Ranka, and Vineet Singh. An efficient k-means clustering algorithm. In HPDM, 1998.\n\n[6] Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Communications of the ACM, January 2008.\n\n[7] David Arthur and Sergei Vassilvitskii. k-means++: The Advantages of Careful Seeding. In SODA, 2007.\n\n[8] Vijay Arya, Naveen Garg, Rohit Khandekar, Adam Meyerson, Kamesh Munagala, and Vinayaka Pandit. Local search heuristic for k-median and facility location problems. In STOC, 2001.\n\n[9] Vladimir Braverman, Adam Meyerson, Rafail Ostrovsky, Alan Roytman, Michael Shindler, and Brian Tagiku. Streaming k-means on Well-Clusterable Data. In SODA, 2011.\n\n[10] Mihai Badoiu, Sariel Har-Peled, and Piotr Indyk. Approximate clustering via core-sets. In STOC, 2002.\n\n[11] Moses Charikar, Liadan O'Callaghan, and Rina Panigrahy. Better streaming algorithms for clustering problems. In STOC, 2003.\n\n[12] Ke Chen. On coresets for k-median and k-means clustering in metric and euclidean spaces and their applications. SIAM J. Comput., 2009.\n\n[13] Dan Feldman, Morteza Monemizadeh, and Christian Sohler. A PTAS for k-means clustering based on weak coresets. 
In SCG, 2007.\n\n[14] Gereon Frahling and Christian Sohler. Coresets in dynamic geometric data streams. In STOC, 2005.\n\n[15] A. Frank and A. Asuncion. UCI machine learning repository, 2010.\n\n[16] Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani, and Liadan O'Callaghan. Clustering data streams: Theory and practice. In IEEE TKDE, 2003.\n\n[17] Sudipto Guha, Nina Mishra, Rajeev Motwani, and Liadan O'Callaghan. Clustering data streams. In FOCS, 2000.\n\n[18] Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In STOC, 2004.\n\n[19] Anil Kumar Jain, M. Narasimha Murty, and Patrick Joseph Flynn. Data clustering: a review. ACM Computing Surveys, 31(3), September 1999.\n\n[20] Tapas Kanungo, David Mount, Nathan Netanyahu, Christine Piatko, Ruth Silverman, and Angela Wu. A local search approximation algorithm for k-means clustering. In SCG, 2002.\n\n[21] Stuart Lloyd. Least Squares Quantization in PCM. In Special issue on quantization, IEEE Transactions on Information Theory, 1982.\n\n[22] James MacQueen. Some methods for classification and analysis of multivariate observations. In Berkeley Symposium on Mathematical Statistics and Probability, 1967.\n\n[23] Joel Max. Quantizing for minimum distortion. IEEE Transactions on Information Theory, 1960.\n\n[24] Adam Meyerson. Online facility location. In FOCS, 2001.\n\n[25] Rafail Ostrovsky, Yuval Rabani, Leonard Schulman, and Chaitanya Swamy. The Effectiveness of Lloyd-Type Methods for the k-Means Problem. In FOCS, 2006.\n\n[26] Rina Panigrahy. Entropy based nearest neighbor search in high dimensions. In SODA, 2006.\n\n[27] Dan Pelleg and Andrew Moore. Accelerating exact k-means algorithms with geometric reasoning. In KDD, 1999.\n\n[28] Steven J. Phillips. Acceleration of k-means and related clustering problems. In ALENEX, 2002.\n\n[29] Andrea Vattani. k-means requires exponentially many iterations even in the plane. 
Discrete & Computational Geometry, June 2011.", "award": [], "sourceid": 4362, "authors": [{"given_name": "Michael", "family_name": "Shindler", "institution": null}, {"given_name": "Alex", "family_name": "Wong", "institution": null}, {"given_name": "Adam", "family_name": "Meyerson", "institution": null}]}