{"title": "Learning the k in k-means", "book": "Advances in Neural Information Processing Systems", "page_first": 281, "page_last": 288, "abstract": "", "full_text": "Learning the k in k-means\n\nGreg Hamerly, Charles Elkan\n{ghamerly,elkan}@cs.ucsd.edu\n\nDepartment of Computer Science and Engineering\n\nUniversity of California, San Diego\nLa Jolla, California\n92093-0114\n\nAbstract\n\nWhen clustering a dataset, the right number k of clusters to use is often\nnot obvious, and choosing k automatically is a hard algorithmic prob-\nlem. In this paper we present an improved algorithm for learning k while\nclustering. The G-means algorithm is based on a statistical test for the\nhypothesis that a subset of data follows a Gaussian distribution. G-means\nruns k-means with increasing k in a hierarchical fashion until the test ac-\ncepts the hypothesis that the data assigned to each k-means center are\nGaussian. Two key advantages are that the hypothesis test does not limit\nthe covariance of the data and does not compute a full covariance matrix.\nAdditionally, G-means only requires one intuitive parameter, the stand-\nard statistical signi\ufb01cance level \u03b1. We present results from experiments\nshowing that the algorithm works well, and better than a recent method\nbased on the BIC penalty for model complexity. In these experiments,\nwe show that the BIC is ineffective as a scoring function, since it does\nnot penalize strongly enough the model\u2019s complexity.\n\n1 Introduction and related work\n\nClustering algorithms are useful tools for data mining, compression, probability density es-\ntimation, and many other important tasks. However, most clustering algorithms require the\nuser to specify the number of clusters (called k), and it is not always clear what is the best\nvalue for k. Figure 1 shows examples where k has been improperly chosen. 
Choosing k is\noften an ad hoc decision based on prior knowledge, assumptions, and practical experience.\nChoosing k is made more dif\ufb01cult when the data has many dimensions, even when clusters\nare well-separated.\n\nCenter-based clustering algorithms (in particular k-means and Gaussian expectation-\nmaximization) usually assume that each cluster adheres to a unimodal distribution, such\nas Gaussian. With these methods, only one center should be used to model each subset\nof data that follows a unimodal distribution. If multiple centers are used to describe data\ndrawn from one mode, the centers are a needlessly complex description of the data, and in\nfact the multiple centers capture the truth about the subset less well than one center.\n\nIn this paper we present a simple algorithm called G-means that discovers an appropriate\nk using a statistical test for deciding whether to split a k-means center into two centers.\nWe describe examples and present experimental results that show that the new algorithm\n\n\fFigure 1: Two clusterings where k was improperly chosen. Dark crosses are k-means\ncenters. On the left, there are too few centers; \ufb01ve should be used. On the right, too many\ncenters are used; one center is suf\ufb01cient for representing the data. In general, one center\nshould be used to represent one Gaussian cluster.\n\nis successful. This technique is useful and applicable for many clustering algorithms other\nthan k-means, but here we consider only the k-means algorithm for simplicity.\n\nSeveral algorithms have been proposed previously to determine k automatically. Like our\nmethod, most previous methods are wrappers around k-means or some other clustering\nalgorithm for \ufb01xed k. Wrapper methods use splitting and/or merging rules for centers to\nincrease or decrease k as the algorithm proceeds.\n\nPelleg and Moore [14] proposed a regularization framework for learning k, which they call\nX-means. 
The algorithm searches over many values of k and scores each clustering model using the so-called Bayesian Information Criterion [10]:

BIC(C|X) = L(X|C) - (p/2) log n

where L(X|C) is the log-likelihood of the dataset X according to model C, p = k(d + 1) is the number of parameters in the model C with dimensionality d and k cluster centers, and n is the number of points in the dataset. X-means chooses the model with the best BIC score on the data. Aside from the BIC, other scoring functions are also available.

Bischof et al. [1] use a minimum description length (MDL) framework, where the description length is a measure of how well the data are \ufb01t by the model. Their algorithm starts with a large value for k and removes centers (reduces k) whenever that choice reduces the description length. Between steps of reducing k, they use the k-means algorithm to optimize the model \ufb01t to the data.

With hierarchical clustering algorithms, other methods may be employed to determine the best number of clusters. One is to build a merging tree (\u201cdendrogram\u201d) of the data based on a cluster distance metric, and search for areas of the tree that are stable with respect to inter- and intra-cluster distances [9, Section 5.1]. This method of estimating k is best applied with domain-speci\ufb01c knowledge and human intuition.

2 The Gaussian-means (G-means) algorithm

The G-means algorithm starts with a small number of k-means centers, and grows the number of centers. Each iteration of the algorithm splits into two those centers whose data appear not to come from a Gaussian distribution. Between each round of splitting, we run k-means on the entire dataset and all the centers to re\ufb01ne the current solution.
We can initialize with just k = 1, or we can choose some larger value of k if we have some prior knowledge about the range of k.

G-means repeatedly makes decisions based on a statistical test for the data assigned to each center. If the data currently assigned to a k-means center appear to be Gaussian, then we want to represent that data with only one center. However, if the same data do not appear to be Gaussian, then we want to use multiple centers to model the data properly. The algorithm will run k-means multiple times (up to k times when \ufb01nding k centers), so the time complexity is at most O(k) times that of k-means.

Algorithm 1 G-means(X, \u03b1)
1: Let C be the initial set of centers (usually C \u2190 {\u00afx}).
2: C \u2190 kmeans(C, X).
3: Let {xi | class(xi) = j} be the set of datapoints assigned to center cj.
4: Use a statistical test to detect if each {xi | class(xi) = j} follows a Gaussian distribution (at con\ufb01dence level \u03b1).
5: If the data look Gaussian, keep cj. Otherwise replace cj with two centers.
6: Repeat from step 2 until no more centers are added.

The k-means algorithm implicitly assumes that the datapoints in each cluster are spherically distributed around the center. Less restrictively, the Gaussian expectation-maximization algorithm assumes that the datapoints in each cluster have a multidimensional Gaussian distribution with a covariance matrix that may or may not be \ufb01xed, or shared. The Gaussian distribution test that we present below is valid for either covariance matrix assumption. The test also accounts for the number of datapoints n tested by incorporating n in the calculation of the critical value of the test (see Equation 2).
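Since the G-means loop itself is short, a rough Python sketch of Algorithm 1 may help; this is our own illustration, not the authors' code. The callback `split_test` is a hypothetical stand-in for the statistical test of Section 2.1 (it should return True when a cluster looks Gaussian), and the children here are seeded with a small random offset rather than the principal-component method discussed later.

```python
import numpy as np

def kmeans(X, C, iters=100):
    """Lloyd's algorithm: refine the centers C on the dataset X."""
    for _ in range(iters):
        # assign every point to its nearest center
        dist = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # move each center to the mean of its points (keep empty centers)
        newC = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                         for j in range(len(C))])
        if np.allclose(newC, C):
            break
        C = newC
    return C, labels

def g_means(X, split_test, alpha=0.0001, seed=0):
    """Algorithm 1: grow k until every cluster passes the Gaussian test."""
    rng = np.random.default_rng(seed)
    C = X.mean(axis=0, keepdims=True)        # step 1: start with one center
    while True:
        C, labels = kmeans(X, C)             # step 2: refine all centers
        new_centers, added = [], False
        for j in range(len(C)):              # steps 3-5: test each cluster
            Xj = X[labels == j]
            if len(Xj) == 0 or split_test(Xj, C[j], alpha):
                new_centers.append(C[j])     # looks Gaussian: keep the center
            else:                            # otherwise replace it with two children
                m = 0.01 * rng.standard_normal(X.shape[1])
                new_centers.extend([C[j] + m, C[j] - m])
                added = True
        C = np.array(new_centers)
        if not added:                        # step 6: stop when nothing was split
            return C
```

The O(k) bound on the number of k-means invocations is visible in the structure: each pass over the while-loop runs k-means once and can only add centers.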
This prevents the G-means algorithm from making bad decisions about clusters with few datapoints.

2.1 Testing clusters for Gaussian \ufb01t

To specify the G-means algorithm fully we need a test to detect whether the data assigned to a center are sampled from a Gaussian. The hypotheses are

\u2022 H0: The data around the center are sampled from a Gaussian.
\u2022 H1: The data around the center are not sampled from a Gaussian.

If we accept the null hypothesis H0, then we believe that the one center is suf\ufb01cient to model its data, and we should not split the cluster into two sub-clusters. If we reject H0 and accept H1, then we want to split the cluster.

The test we use is based on the Anderson-Darling statistic. This one-dimensional test has been shown empirically to be the most powerful normality test that is based on the empirical cumulative distribution function (ECDF). Given a list of values xi that have been converted to mean 0 and variance 1, let x(i) be the ith ordered value. Let zi = F(x(i)), where F is the N(0, 1) cumulative distribution function. Then the statistic is

A^2(Z) = -(1/n) sum_{i=1}^{n} (2i - 1) [log(zi) + log(1 - z(n+1-i))] - n    (1)

Stephens [17] showed that for the case where \u00b5 and \u03c3 are estimated from the data (as in clustering), we must correct the statistic according to

A^2_*(Z) = A^2(Z)(1 + 4/n - 25/n^2)    (2)

Given a subset of data X in d dimensions that belongs to center c, the hypothesis test proceeds as follows:

1. Choose a signi\ufb01cance level \u03b1 for the test.
2. Initialize two centers, called \u201cchildren\u201d of c. See the text for good ways to do this.
3. Run k-means on these two centers in X. This can be run to completion, or to some early stopping point if desired. Let c1, c2 be the child centers chosen by k-means.
4. Let v = c1 - c2 be a d-dimensional vector that connects the two centers.
This is the direction that k-means believes to be important for clustering. Then project X onto v: x'_i = <x_i, v> / ||v||^2. X' is a 1-dimensional representation of the data projected onto v. Transform X' so that it has mean 0 and variance 1.

5. Let zi = F(x'(i)). If A^2_*(Z) is in the range of non-critical values at con\ufb01dence level \u03b1, then accept H0, keep the original center, and discard {c1, c2}. Otherwise, reject H0 and keep {c1, c2} in place of the original center.

A primary contribution of this work is simplifying the test for Gaussian \ufb01t by projecting the data to one dimension where the test is simple to apply. The authors of [5] also use this approach for online dimensionality reduction during clustering. The one-dimensional representation of the data allows us to consider only the data along the direction that k-means has found to be important for separating the data. This is related to the problem of projection pursuit [7], where here k-means searches for a direction in which the data appears non-Gaussian.

We must choose the signi\ufb01cance level of the test, \u03b1, which is the desired probability of making a Type I error (i.e. incorrectly rejecting H0). It is appropriate to use a Bonferroni adjustment to reduce the chance of making Type I errors over multiple tests. For example, if we want a 0.01 chance of making a Type I error in 100 tests, we should apply a Bonferroni adjustment to make each test use \u03b1 = 0.01/100 = 0.0001. To \ufb01nd k \ufb01nal centers the G-means algorithm makes k statistical tests, so the Bonferroni correction does not need to be extreme. In our tests, we always use \u03b1 = 0.0001.

We consider two ways to initialize the two child centers. Both approaches initialize with c \u00b1 m, where c is a center and m is chosen. The \ufb01rst method chooses m as a random d-dimensional vector such that ||m|| is small compared to the distortion of the data.
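For concreteness, the corrected statistic of Equations 1 and 2 together with the projection of step 4 can be sketched in Python as follows. This is our own illustration, not code from the paper; the function names are assumptions, and the critical value is supplied by the caller (1.8692 is the \u03b1 = 0.0001 value quoted in Section 2.2).

```python
import math
import numpy as np

def normal_cdf(x):
    # F, the N(0, 1) cumulative distribution function, via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def anderson_darling(values):
    """Corrected Anderson-Darling statistic A^2_* (Equations 1 and 2)."""
    x = np.sort(np.asarray(values, dtype=float))
    x = (x - x.mean()) / x.std()                  # mean 0, variance 1
    n = len(x)
    z = np.array([normal_cdf(v) for v in x])      # z_i = F(x_(i))
    i = np.arange(1, n + 1)
    # A^2(Z) = -(1/n) sum (2i-1)[log(z_i) + log(1 - z_(n+1-i))] - n
    a2 = -np.mean((2 * i - 1) * (np.log(z) + np.log(1.0 - z[::-1]))) - n
    return a2 * (1.0 + 4.0 / n - 25.0 / n ** 2)   # Stephens' correction

def looks_gaussian(X, c1, c2, critical=1.8692):
    """Steps 4-5: project X onto v = c1 - c2 and test the 1-d projection."""
    v = c1 - c2
    projected = X @ v / (v @ v)                   # x'_i = <x_i, v> / ||v||^2
    return anderson_darling(projected) <= critical
```

Note that `z[::-1]` indexes z(n+1-i) for 1-based i, matching Equation 1 term by term.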
A second method \ufb01nds the main principal component s of the data (having eigenvalue \u03bb), and chooses m = s * sqrt(2\u03bb/\u03c0). This deterministic method places the two centers in their expected locations under H0. The principal component calculations require O(nd^2 + d^3) time and O(d^2) space, but since we only want the main principal component, we can use fast methods like the power method, which takes time that is at most linear in the ratio of the two largest eigenvalues [4]. In this paper we use principal-component-based splitting.

2.2 An example

Figure 2 shows a run of the G-means algorithm on a synthetic dataset with two true clusters and 1000 points, using \u03b1 = 0.0001. The critical value for the Anderson-Darling test is 1.8692 for this con\ufb01dence level. Starting with one center, after one iteration of G-means, we have 2 centers and the A^2_* statistic is 38.103. This is much larger than the critical value, so we reject H0 and accept this split. On the next iteration, we split each new center and repeat the statistical test. The A^2_* values for the two splits are 0.386 and 0.496, both of which are well below the critical value. Therefore we accept H0 for both tests, and discard these splits. Thus G-means gives a \ufb01nal answer of k = 2.

2.3 Statistical power

Figure 3 shows the power of the Anderson-Darling test, as compared to the BIC. Lower is better for both plots. We run 1000 tests for each data point plotted for both plots. In the left

Figure 2: An example of running G-means for three iterations on a 2-dimensional dataset with two true clusters and 1000 points. Starting with one center (left plot), G-means splits into two centers (middle). The test for normality is signi\ufb01cant, so G-means rejects H0 and keeps the split. After splitting each center again (right), the test values are not signi\ufb01cant, so G-means accepts H0 for both tests and does not accept these splits.
The middle plot is the G-means answer. See the text for further details.

Figure 3: A comparison of the power of the Anderson-Darling test versus the BIC. For the AD test we \ufb01x the signi\ufb01cance level (\u03b1 = 0.0001), while the BIC\u2019s signi\ufb01cance level depends on n. The left plot shows the probability of incorrectly splitting (Type I error) one true 2-d cluster that is 5% elliptical. The right plot shows the probability of incorrectly not splitting two true clusters separated by 5\u03c3 (Type II error). Both plots are functions of n. Both plots show that the BIC over\ufb01ts (splits clusters) when n is small.

plot, for each test we generate n datapoints from a single true Gaussian distribution, and then plot the frequency with which BIC and G-means will choose k = 2 rather than k = 1 (i.e. commit a Type I error). BIC tends to over\ufb01t by choosing too many centers when the data is not strictly spherical, while G-means does not. This is consistent with the tests of real-world data in the next section. While G-means commits more Type II errors when n is small, this prevents it from over\ufb01tting the data.

The BIC can be considered a likelihood ratio test, but with a signi\ufb01cance level that cannot be \ufb01xed. The signi\ufb01cance level instead varies depending on n and \u2206k (the change in the number of model parameters between two models). As n or \u2206k decrease, the signi\ufb01cance level increases (the BIC becomes weaker as a statistical test) [10]. Figure 3 shows this effect for varying n. In [11] the authors show that penalty-based methods require problem-speci\ufb01c tuning and don\u2019t generalize as well as other methods, such as cross validation.

3 Experiments

Table 1 shows the results from running G-means and X-means on many large synthetic datasets.
On synthetic datasets with spherically distributed clusters, G-means and X-means do equally well at \ufb01nding the correct k and maximizing the BIC statistic, so we don\u2019t show these results here. Most real-world data is not spherical, however.

Table 1: Results for many synthetic datasets. We report distortion relative to the optimum distortion for the correct clustering (closer to one is better), and time is reported relative to k-means run with the correct k. For BIC, larger values are better, but it is clear that \ufb01nding the correct clustering does not always coincide with \ufb01nding a larger BIC. Items with a star are where X-means always chose the largest number of centers we allowed.

dataset | d | method | k found | distortion(\u00d7 optimal) | BIC(\u00d710^4) | time(\u00d7 k-means)
synthetic k=5 | 2 | G-means | 9.1\u00b1 9.9 | 0.89\u00b1 0.23 | -0.19\u00b1 2.70 | 13.2
synthetic k=5 | 2 | X-means | 18.1\u00b1 3.2 | 0.37\u00b1 0.12 | 0.70\u00b1 0.93 | 2.8
synthetic k=20 | 2 | G-means | 20.1\u00b1 0.6 | 0.99\u00b1 0.01 | 0.21\u00b1 0.18 | 2.1
synthetic k=20 | 2 | X-means | 70.5\u00b1 11.6 | 9.45\u00b1 28.02 | 14.83\u00b1 3.50 | 1.2
synthetic k=80 | 2 | G-means | 80.0\u00b1 0.2 | 1.00\u00b1 0.01 | 1.84\u00b1 0.12 | 2.2
synthetic k=80 | 2 | X-means | 171.7\u00b1 23.7 | 48.49\u00b1 70.04 | 40.16\u00b1 6.59 | 1.8
synthetic k=5 | 8 | G-means | 5.0\u00b1 0.0 | 1.00\u00b1 0.00 | -0.74\u00b1 0.16 | 4.6
synthetic k=5 | 8 | X-means | *20.0\u00b1 0.0 | 0.47\u00b1 0.03 | -2.28\u00b1 0.20 | 11.0
synthetic k=20 | 8 | G-means | 20.0\u00b1 0.1 | 0.99\u00b1 0.00 | -0.18\u00b1 0.17 | 2.6
synthetic k=20 | 8 | X-means | *80.0\u00b1 0.0 | 0.47\u00b1 0.01 | 14.36\u00b1 0.21 | 4.0
synthetic k=80 | 8 | G-means | 80.2\u00b1 0.5 | 0.99\u00b1 0.00 | 1.45\u00b1 0.20 | 2.9
synthetic k=80 | 8 | X-means | 229.2\u00b1 36.8 | 0.57\u00b1 0.06 | 52.28\u00b1 9.26 | 6.5
synthetic k=5 | 32 | G-means | 5.0\u00b1 0.0 | 1.00\u00b1 0.00 | -3.36\u00b1 0.21 | 4.4
synthetic k=5 | 32 | X-means | *20.0\u00b1 0.0 | 0.76\u00b1 0.00 | -27.92\u00b1 0.22 | 29.9
synthetic k=20 | 32 | G-means | 20.0\u00b1 0.0 | 1.00\u00b1 0.00 | -2.73\u00b1 0.22 | 2.3
synthetic k=20 | 32 | X-means | *80.0\u00b1 0.0 | 0.76\u00b1 0.01 | -11.13\u00b1 0.23 | 21.2
synthetic k=80 | 32 | G-means | 80.0\u00b1 0.0 | 1.00\u00b1 0.00 | -1.10\u00b1 0.16 | 2.8
synthetic k=80 | 32 | X-means | 171.5\u00b1 10.9 | 0.84\u00b1 0.01 | 11.78\u00b1 2.74 | 53.3

Figure 4: 2-d synthetic dataset with 5 true clusters. On the left, G-means correctly chooses 5 centers and deals well with non-spherical data. On the right, the BIC causes X-means to over\ufb01t the data, choosing 20 unevenly distributed clusters.

The synthetic datasets used here each have 5000 datapoints in d = 2/8/32 dimensions. The true ks are 5, 20, and 80. For each synthetic dataset type, we generate 30 datasets with the true center means chosen uniformly randomly from the unit hypercube, and choosing \u03c3 so that no two clusters are closer than 3\u03c3 apart. Each cluster is also given a transformation to make it non-spherical, by multiplying the data by a randomly chosen scaling and rotation matrix. We run G-means starting with one center. We allow X-means to search between 2 and 4k centers (where here k is the true number of clusters).

The G-means algorithm clearly does better at \ufb01nding the correct k on non-spherical data. Its results are closer to the true distortions and the correct ks. The BIC statistic that X-means uses has been formulated to maximize the likelihood for spherically-distributed data. Thus it overestimates the number of true clusters in non-spherical data.
This is especially evident when the number of points per cluster is small, as in datasets with 80 true clusters.

Figure 5: NIST and Pendigits datasets: correspondence between each digit (row) and each cluster (column) found by G-means. G-means did not have the labels, yet it found meaningful clusters corresponding with the labels.

Because of this overestimation, X-means often hits our limit of 4k centers. Figure 4 shows an example of over\ufb01tting on a dataset with 5 true clusters. X-means chooses k = 20 while G-means \ufb01nds all 5 true cluster centers. Also of note is that X-means does not distribute centers evenly among clusters; some clusters receive one center, but others receive many.

G-means runs faster than X-means for 8 and 32 dimensions, which we expect, since the kd-tree structures which make X-means fast in low dimensions take time exponential in d, making them slow for more than 8 to 12 dimensions. All our code is written in Matlab; X-means is written in C.

3.1 Discovering true clusters in labeled data

We tested these algorithms on two real-world datasets for handwritten digit recognition: the NIST dataset [12] and the Pendigits dataset [2]. The goal is to cluster the data without knowledge of the labels and measure how well the clustering captures the true labels. Both datasets have 10 true classes (digits 0-9). NIST has 60000 training examples and 784 dimensions (28\u00d728 pixels). We use 6000 randomly chosen examples and we reduce the dimension to 50 by random projection (following [3]). The Pendigits dataset has 7984 examples and 16 dimensions; we did not change the data in any way.

We cluster each dataset with G-means and X-means, and measure performance by comparing the cluster labels Lc with the true labels Lt.
We de\ufb01ne the partition quality (PQ) as

PQ = ( sum_{i=1}^{kt} sum_{j=1}^{kc} p(i, j)^2 ) / ( sum_{i=1}^{kt} p(i)^2 )

where kt is the true number of classes, and kc is the number of clusters found by the algorithm. This metric is maximized when Lc induces the same partition of the data as Lt; in other words, when all points in each cluster have the same true label, and the estimated k is the true k. The p(i, j) term is the frequency-based probability that a datapoint will be labeled i by Lt and j by Lc. This quality is normalized by the sum of true probabilities, squared. This statistic is related to the Rand statistic for comparing partitions [8].

For the NIST dataset, G-means \ufb01nds 31 clusters in 30 seconds with a PQ score of 0.177. X-means \ufb01nds 715 clusters in 4149 seconds, and 369 of these clusters contain only one point, indicating an overestimation problem with the BIC. X-means receives a PQ score of 0.024. For the Pendigits dataset, G-means \ufb01nds 69 clusters in 30 seconds, with a PQ score of 0.196; X-means \ufb01nds 235 clusters in 287 seconds, with a PQ score of 0.057. Figure 5 shows Hinton diagrams of the G-means clusterings of both datasets, showing that G-means succeeds at identifying the true clusters concisely, without aid of the labels. The confusions between different digits in the NIST dataset (seen in the off-diagonal elements) are common for other researchers using more sophisticated techniques, see [3].

4 Discussion and conclusions

We have introduced the new G-means algorithm for learning k based on a statistical test for determining whether datapoints are a random sample from a Gaussian distribution with arbitrary dimension and covariance matrix. The splitting uses dimension reduction and a powerful test for Gaussian \ufb01tness. G-means uses this statistical test as a wrapper around k-means to discover the number of clusters automatically.
The only parameter supplied\nto the algorithm is the signi\ufb01cance level of the statistical test, which can easily be set in\na standard way. The G-means algorithm takes linear time and space (plus the cost of the\nsplitting heuristic and test) in the number of datapoints and dimension, since k-means is\nitself linear in time and space. Empirically, the G-means algorithm works well at \ufb01nding\nthe correct number of clusters and the locations of genuine cluster centers, and we have\nshown it works well in moderately high dimensions.\n\nClustering in high dimensions has been an open problem for many years. Recent research\nhas shown that it may be preferable to use dimensionality reduction techniques before clus-\ntering, and then use a low-dimensional clustering algorithm such as k-means, rather than\nclustering in the high dimension directly.\nIn [3] the author shows that using a simple,\ninexpensive linear projection preserves many of the properties of data (such as cluster dis-\ntances), while making it easier to \ufb01nd the clusters. Thus there is a need for good-quality,\nfast clustering algorithms for low-dimensional data. Our work is a step in this direction.\n\nAdditionally, recent image segmentation algorithms such as normalized cut [16, 13] are\nbased on eigenvector computations on distance matrices. These \u201cspectral\u201d clustering al-\ngorithms still use k-means as a post-processing step to \ufb01nd the actual segmentation and\nthey require k to be speci\ufb01ed. Thus we expect G-means will be useful in combination with\nspectral clustering.\n\nReferences\n\n[1] Horst Bischof, Ale\u02c7s Leonardis, and Alexander Selb. MDL principle for robust vector quantisation. Pattern analysis and applications, 2:59\u201372,\n\n1999.\n\n[2] C.L. Blake and C.J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/\u223cmlearn/MLRepository.html.\n[3] Sanjoy Dasgupta. 
Experiments with random projection. In Uncertainty in Arti\ufb01cial Intelligence: Proceedings of the Sixteenth Conference (UAI-2000), pages 143\u2013151, San Francisco, CA, 2000. Morgan Kaufmann Publishers.

[4] Gianna M. Del Corso. Estimating an eigenvector by the power method with a random start. SIAM Journal on Matrix Analysis and Applications, 18(4):913\u2013937, 1997.

[5] Chris Ding, Xiaofeng He, Hongyuan Zha, and Horst Simon. Adaptive dimension reduction for clustering high dimensional data. In Proceedings of the 2nd IEEE International Conference on Data Mining, 2002.

[6] Fredrik Farnstrom, James Lewis, and Charles Elkan. Scalability for clustering algorithms revisited. SIGKDD Explorations, 2(1):51\u201357, 2000.

[7] Peter J. Huber. Projection pursuit. Annals of Statistics, 13(2):435\u2013475, June 1985.

[8] L. Hubert and P. Arabie. Comparing partitions. Journal of Classi\ufb01cation, 2:193\u2013218, 1985.

[9] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264\u2013323, 1999.

[10] Robert E. Kass and Larry Wasserman. A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90(431):928\u2013934, 1995.

[11] Michael J. Kearns, Yishay Mansour, Andrew Y. Ng, and Dana Ron. An experimental and theoretical comparison of model selection methods. In Computational Learning Theory (COLT), pages 21\u201330, 1995.

[12] Yann LeCun, L\u00e9on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.

[13] Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. Neural Information Processing Systems, 14, 2002.

[14] Dan Pelleg and Andrew Moore. X-means: Extending K-means with ef\ufb01cient estimation of the number of clusters.
In Proceedings of the 17th International Conf. on Machine Learning, pages 727\u2013734. Morgan Kaufmann, San Francisco, CA, 2000.

[15] Peter Sand and Andrew Moore. Repairing faulty mixture models using density estimation. In Proceedings of the 18th International Conf. on Machine Learning. Morgan Kaufmann, San Francisco, CA, 2001.

[16] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888\u2013905, 2000.

[17] M. A. Stephens. EDF statistics for goodness of \ufb01t and some comparisons. Journal of the American Statistical Association, 69(347):730\u2013737, September 1974.
", "award": [], "sourceid": 2526, "authors": [{"given_name": "Greg", "family_name": "Hamerly", "institution": null}, {"given_name": "Charles", "family_name": "Elkan", "institution": null}]}