{"title": "Cluster Stability for Finite Samples", "book": "Advances in Neural Information Processing Systems", "page_first": 1297, "page_last": 1304, "abstract": null, "full_text": "Cluster Stability for Finite Samples\n\nOhad Shamir\u2020 and Naftali Tishby\u2020\u2021\n\n\u2020 School of Computer Science and Engineering\n\u2021 Interdisciplinary Center for Neural Computation\n\nThe Hebrew University\nJerusalem 91904, Israel\n\n{ohadsh,tishby}@cs.huji.ac.il\n\nAbstract\n\nOver the past few years, the notion of stability in data clustering has received\ngrowing attention as a cluster validation criterion in a sample-based framework.\nHowever, recent work has shown that as the sample size increases, any clustering\nmodel will usually become asymptotically stable. This led to the conclusion that\nstability is lacking as a theoretical and practical tool. The discrepancy between\nthis conclusion and the success of stability in practice has remained an open ques-\ntion, which we attempt to address. Our theoretical approach is that stability, as\nused by cluster validation algorithms, is similar in certain respects to measures\nof generalization in a model-selection framework. In such cases, the model cho-\nsen governs the convergence rate of generalization bounds. By arguing that these\nrates are more important than the sample size, we are led to the prediction that\nstability-based cluster validation algorithms should not degrade with increasing\nsample size, despite the asymptotic universal stability. This prediction is substan-\ntiated by a theoretical analysis as well as some empirical results. We conclude that\nstability remains a meaningful cluster validation criterion over \ufb01nite samples.\n\n1 Introduction\n\nClustering is one of the most common tools of unsupervised data analysis. Despite its widespread\nuse and an immense amount of literature, distressingly little is known about its theoretical founda-\ntions [14]. 
In this paper, we focus on sample-based clustering, where it is assumed that the data to be clustered are actually a sample from some underlying distribution.

A major problem in such a setting is assessing cluster validity. In other words, we might wish to know whether the clustering we have found actually corresponds to a meaningful clustering of the underlying distribution, and is not just an artifact of the sampling process. This problem relates to the issue of model selection, such as determining the number of clusters in the data or tuning parameters of the clustering algorithm. In the past few years, cluster stability has received growing attention as a criterion for addressing this problem. Informally, this criterion states that if the clustering algorithm, repeatedly applied over independent samples, results in 'similar' clusterings, then these clusterings are statistically significant. Based on this idea, several cluster validity methods have been proposed (see [9] and references therein), and were shown to be relatively successful for various data sets in practice.

However, in recent work, it was proven that under mild conditions, stability is asymptotically fully determined by the behavior of the objective function which the clustering algorithm attempts to optimize. In particular, the existence of a unique optimal solution for some model choice implies stability as the sample size increases to infinity. This happens regardless of how well the model fits the data. From this, it was concluded that stability is not a well-suited tool for model selection in clustering. This left open, however, the question of why stability is observed to be useful in practice.

In this paper, we attempt to explain why stability measures should have much wider relevance than what might be concluded from these results. Our underlying approach is to view stability as a measure of generalization, in a learning-theoretic sense. 
When we have a 'good' model, which is stable over independent samples, then inferring its fit to the underlying distribution should be easy. In other words, stability should 'work' because stable models generalize better, and models which generalize better should fit the underlying distribution better. We emphasize that this idea in itself is not novel, appearing explicitly and under various guises in many aspects of machine learning. The novelty in this paper lies mainly in the predictions that are drawn from it for clustering stability.

The viewpoint above places emphasis on the nature of stability for finite samples. Since generalization is meaningless when the sample is infinite, it should come as no surprise that stability displays similar behavior. On finite samples, the generalization uncertainty is virtually always strictly positive, with different model choices leading to different convergence rates towards zero for increasing sample size. Based on the link between stability and generalization, we predict that on realistic data, all risk-minimizing models asymptotically become stable, but the rates of convergence to this ultimate stability differ. In other words, an appropriate scaling of the stability measures will make them independent of the actual sample size used. Using this intuition, we characterize a mild set of conditions, applicable in principle to a wide class of clustering settings, which ensure the relevance of cluster stability for arbitrarily large sample sizes. We then prove that the stability measure used in previous work to show negative asymptotic results on stability actually allows us to discern the 'correct' model, regardless of how large the sample is, for a certain simple setting. 
Our results are further validated by some experiments on synthetic and real world data.

2 Definitions and notation

We assume that the data sample to be clustered, S = {x1, .., xm}, is produced by sampling instances i.i.d. from an underlying distribution D, supported on a subset X of R^n. A clustering CD for some D ⊆ X is a function from D × D to {0, 1}, defining an equivalence relation on D with a finite number of equivalence classes (namely, CD(xi, xj) = 1 if xi and xj belong to the same cluster, and 0 otherwise). For a clustering CX of the instance space, and a finite sample S, let CX|S denote the functional restriction of CX to S × S.

A clustering algorithm A is a function from any finite sample S ⊆ X to some clustering CX of the instance space (see footnote 1). We assume the algorithm is driven by optimizing an objective function, and has some user-defined parameters Θ. In particular, Ak denotes the algorithm A with the number of clusters chosen to be k.

Following [2], we define the stability of a clustering algorithm A on finite samples of size m as:

stab(A, D, m) = E_{S1,S2} dD(A(S1), A(S2)),   (1)

where S1 and S2 are samples of size m, drawn i.i.d. from D, and dD is some 'dissimilarity' function between clusterings of X, to be specified later.

Let ℓ denote a loss function from any clustering CS of a finite set S ⊆ X to [0, 1]. ℓ may or may not correspond to the objective function the clustering algorithm attempts to optimize, and may involve a global quality measure rather than some average over individual instances. 
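As a concrete (and entirely toy) illustration of Eq. (1), the following sketch estimates stab(A, D, m) by Monte Carlo. A median-threshold rule stands in for a real clustering algorithm, and the 'dissimilarity' dD is taken to be the pair-disagreement probability that the paper adopts later, in Eq. (2). All names and constants here are our own assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_algorithm(sample):
    """A stand-in clustering 'algorithm' A: threshold the real line at the
    sample median, so A(S) is a clustering of the whole instance space."""
    t = np.median(sample)
    return lambda x: (x > t).astype(int)

def pair_distance(clus1, clus2, pairs):
    """Plug-in dissimilarity dD: fraction of fresh instance pairs that are
    co-clustered under one clustering but split under the other."""
    x, y = pairs[:, 0], pairs[:, 1]
    return float(np.mean((clus1(x) == clus1(y)) != (clus2(x) == clus2(y))))

def stab(draw, m, trials=200):
    """Monte-Carlo estimate of stab(A, D, m) = E dD(A(S1), A(S2)):
    cluster two independent size-m samples, then compare them on m
    independently drawn instance pairs, and average over trials."""
    vals = [pair_distance(toy_algorithm(draw(m)), toy_algorithm(draw(m)),
                          draw(2 * m).reshape(m, 2)) for _ in range(trials)]
    return float(np.mean(vals))
```

For a bimodal distribution, the median threshold concentrates as m grows, so this toy stability estimate should shrink with the sample size, which is exactly the kind of behavior the paper analyzes.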
For a fixed sample size, we say that ℓ obeys the bounded differences property (see [11]) if for any clustering CS it holds that |ℓ(CS) − ℓ(CS′)| ≤ a, where a is a constant, and CS′ is obtained from CS by replacing at most one instance of S by any other instance from X, and clustering it arbitrarily.

A hypothesis class H is defined as some set of clusterings of X. The empirical risk of a clustering CX ∈ H on a sample S of size m is ℓ(CX|S). The expected risk of CX, with respect to samples S of size m, will be defined as E_S ℓ(CX|S). The problem of generalization is how to estimate the expected risk, based on the empirical data.

(1) Many clustering algorithms, such as spectral clustering, do not induce a natural clustering on X based on a clustering of a sample. In that case, we view the algorithm as a two-stage process, in which the clustering of the sample is extended to X through some uniform extension operator (such as assigning instances to the 'nearest' cluster in some appropriate sense).

3 A Bayesian framework for relating stability and generalization

The relationship between generalization and various notions of stability is long known, but has been dealt with mostly in a supervised learning setting (see [3], [5], [8] and references therein). In the context of unsupervised data clustering, several papers have explored the relevance of statistical stability and generalization, separately and together (such as [1], [4], [14], [12]). However, there are not many theoretical results quantitatively characterizing the relationship between the two in this setting. 
The aim of this section is to informally motivate our approach, of viewing stability and generalization in clustering as closely related.

Relating the two is very natural in a Bayesian setting, where clustering stability implies an 'unsurprising' posterior given a prior, which is based on clustering another sample. Under this paradigm, we might consider 'soft clustering' algorithms which return a distribution over a measurable hypothesis class H, rather than a specific clustering. This distribution typically reflects the likelihood of a clustering hypothesis, given the data and prior assumptions. Extending our notation, we have that for any sample S, A(S) is now a distribution over H. The empirical risk of such a distribution, with respect to a sample S′, is defined as ℓ(A(S)|S′) = E_{CX∼A(S)} ℓ(CX|S′).

In this setting, consider for example the following simple procedure to derive a clustering hypothesis distribution, as well as a generalization bound: Given a sample of size 2m drawn i.i.d. from D, we randomly split it into two samples S1, S2, each of size m, and use A to cluster each of them separately. Then we have the following:

Theorem 1. For the procedure defined above, assume ℓ obeys the bounded differences property with parameter 1/m. Define the clustering distance dD(P, Q) in Eq. (1), between two distributions P, Q over the hypothesis class H, as the Kullback-Leibler divergence DKL[Q||P] (see footnote 2). Then for a fixed confidence parameter δ ∈ (0, 1), it holds with probability at least 1 − δ over the draw of samples S1 and S2 of size m, that

E_S ℓ(A(S2)|S) − ℓ(A(S2)|S2) ≤ √( (dD(A(S1), A(S2)) + ln(m/δ) + 2) / (2m − 1) ).

The theorem is a straightforward variant of the PAC-Bayesian theorem [10]. 
Since the loss function is not necessarily an empirical average, we need to utilize McDiarmid's bound for random variables with bounded differences, instead of Hoeffding's bound. Other than that, the proof is identical, and is therefore omitted.

This theorem implies that the more stable the Bayesian algorithm is, the tighter the expected generalization bounds we can achieve. In fact, the 'expected' magnitude of the high-probability bound we will get (over drawing S1 and S2 and performing the procedure described above) is:

E_{S1,S2} √( (dD(A(S1), A(S2)) + ln(m/δ) + 2) / (2m − 1) )
  ≤ √( (E_{S1,S2} dD(A(S1), A(S2)) + ln(m/δ) + 2) / (2m − 1) )
  = √( (stab(A, D, m) + ln(m/δ) + 2) / (2m − 1) ),

where the inequality is Jensen's inequality applied to the concave square root function.

Note that the only model-dependent quantity in the expression above is stab(A, D, m). Therefore, carrying out model selection by attempting to minimize these types of generalization bounds is closely related to minimizing stab(A, D, m). In general, the generalization bound might converge to 0 as m → ∞, but this is immaterial for the purpose of model selection. The important factor is the relative values of the measure, over different choices of the algorithm parameters Θ. 
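To get a feel for the magnitudes involved, the expected-bound expression above can be evaluated directly; the stab values plugged in below are purely illustrative, not taken from the paper:

```python
import math

def expected_bound(stab, m, delta=0.05):
    """Evaluates sqrt((stab + ln(m/delta) + 2) / (2m - 1)), the expected
    magnitude of the high-probability bound derived above."""
    return math.sqrt((stab + math.log(m / delta) + 2) / (2 * m - 1))

# Two hypothetical models: for any fixed m, the more stable one receives
# the tighter bound, even though both bounds vanish as m grows.
for m in (100, 10000, 1000000):
    stable, unstable = 5.0 / math.sqrt(m), 50.0 / math.sqrt(m)
    print(m, round(expected_bound(stable, m), 4),
          round(expected_bound(unstable, m), 4))
```

The absolute values of both bounds go to zero, so only their relative ordering at a given m carries model-selection information, which is the point being made here.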
In other words, the important quantity is the relative convergence rate of this bound for different choices of Θ, governed by stab(A, D, m).

This informal discussion only exemplifies the relationship between generalization and stability, since the setting and the definition of dD here differ from those we will focus on later in the paper. Although these ideas can be generalized, they go beyond the scope of this paper, and we leave this for future work.

(2) We define DKL[Q||P] = ∫_X Q(X) ln(Q(X)/P(X)) dX, and DKL[q||p] for q, p ∈ [0, 1] is defined as the divergence of Bernoulli distributions with parameters q and p.

4 Effective model selection for arbitrarily large sample sizes

From now on, following [2], we will define the clustering distance function dD of Eq. (1) as:

dD(A(S1), A(S2)) = Pr_{x1,x2∼D} ( A(S1)(x1, x2) ≠ A(S2)(x1, x2) ).   (2)

In other words, the clustering distance is the probability that two independently drawn instances from D will be in the same cluster under one clustering, and in different clusters under the other clustering.

In [2], it is essentially proven that if there exists a unique optimizer of the clustering algorithm's objective function, to which the algorithm converges for asymptotically large samples, then stab(A, D, m) converges to 0 as m → ∞, regardless of the parameters of A. From this, it was concluded that using stability as a tool for cluster validity is problematic, since for large enough samples it would always be approximately zero, for any algorithm parameters chosen.

However, using the intuition gleaned from the results of the previous section, the different convergence rates of the stability measure (for different algorithm parameters) should be more important than their absolute values or the sample size. The key technical result needed to substantiate this intuition is the following theorem:

Theorem 2. 
Let X, Y be two random variables bounded in [0, 1], and with strictly positive expected values. Assume E[X]/E[Y] ≥ 1 + c for some positive constant c. Letting X1, ..., Xm and Y1, ..., Ym be m independent identical copies of X and Y respectively, define X̂ = (1/m) Σ_{i=1}^m Xi and Ŷ = (1/m) Σ_{i=1}^m Yi. Then it holds that:

Pr(X̂ ≤ Ŷ) ≤ exp( −(1/8) m E[X] (c/(1+c))^4 ) + exp( −(1/4) m E[X] (c/(1+c))^2 ).

The importance of this theorem becomes apparent when X̂, Ŷ are taken to be empirical estimators of stab(A, D, m) for two different algorithm parameter sets Θ, Θ′. For example, suppose that according to our stability measure (see Eq. (1)), a cluster model with k clusters is more stable than a model with k′ clusters, where k ≠ k′, for sample size m (e.g., stab(Ak, D, m) < stab(Ak′, D, m)). These stability measures might be arbitrarily close to zero. Assume that with high probability over the choice of samples S1 and S2 of size m, we can show that dD(Ak(S1), Ak(S2)) ≤ 1/√m, while dD(Ak′(S1), Ak′(S2)) ≥ 1.01/√m. We cannot compute these exactly, since the definition of dD involves an expectation over the unknown distribution D (see Eq. (2)). However, we can estimate them by drawing another sample S3 of m instance pairs, and computing a sample mean to estimate Eq. (2). According to Thm. 2, since dD(Ak(S1), Ak(S2)) and dD(Ak′(S1), Ak′(S2)) have slightly different convergence rates (c ≥ 0.01), which are slower than Θ(1/m), we can discern which number of clusters is more stable, with a high probability which actually improves as m increases.

Therefore, we can use Thm. 
2 as a guideline for when a stability estimator might be useful for arbitrarily large sample sizes. Namely, we need to show it is an expected value of some random variable, with at least slightly different convergence rates for different model selections, and with at least some of them dominating Θ(1/m). We would expect these conditions to hold under quite general settings, since most stability measures are based on empirically estimating the mean of some random variable. Moreover, a central-limit theorem argument leads us to expect an asymptotic form of Ω(1/√m), with the exact constants dependent on the model. This convergence rate is slow enough for the theorem to apply. The difficult step, however, is showing that the differing convergence rates can be detected empirically, without knowledge of D. In the example above, this reduces to showing that with high probability over S1 and S2, dD(Ak(S1), Ak(S2)) and dD(Ak′(S1), Ak′(S2)) will indeed differ by some constant ratio independent of m.

Proof of Thm. 2. Using a relative entropy variant of Hoeffding's bound [7], we have that for any 0 < b < 1 and 1 < a < 1/E[Y], it holds that:

Pr( X̂ ≤ bE[X] ) ≤ exp( −m DKL[ bE[X] || E[X] ] ),
Pr( Ŷ ≥ aE[Y] ) ≤ exp( −m DKL[ aE[Y] || E[Y] ] ).

By substituting the bound DKL[p||q] ≥ (p − q)^2 / (2 max{p, q}) into the two inequalities, we get:

Pr( X̂ ≤ bE[X] ) ≤ exp( −(1/2) m E[X] (1 − b)^2 ),   (3)
Pr( Ŷ ≥ aE[Y] ) ≤ exp( −(1/2) m E[Y] (a + 1/a − 2) ),   (4)

which hold whenever 0 < b < 1 and a > 1. Let b = 1 − (1 − E[Y]/E[X])^2 / 2, and a = bE[X]/E[Y]. It is easily verified that b < 1 and a > 1. 
Substituting these values into the r.h.s. of Eq. (3), and into both sides of Eq. (4), and after some algebra, we get:

Pr( X̂ ≤ bE[X] ) ≤ exp( −(1/8) m E[X] (c/(1+c))^4 ),
Pr( Ŷ ≥ bE[X] ) ≤ exp( −(1/4) m E[X] (c/(1+c))^2 ).

As a result, by the union bound, we have that Pr(X̂ ≤ Ŷ) is at most the sum of the r.h.s. of the last two inequalities, hence proving the theorem.

As a proof of concept, we show that for a certain setting, the stability measure used by [2], as defined above, is meaningful for arbitrarily large sample sizes, even when this measure converges to zero for any choice of the required number of clusters. The result is a simple counter-example to the claim that this phenomenon makes cluster stability a problematic tool.

The setting we analyze is a mixture distribution of three well-separated unequal Gaussians in R, where an empirical estimate of stability, using a centroid-based clustering algorithm, is utilized to discern whether the data contain 2, 3 or 4 clusters. We prove that with high probability, this empirical estimation process will discern k = 3 as much more stable than both k = 2 and k = 4 (by an amount depending on the separation between the Gaussians). The result is robust enough to hold even if in addition one performs normalization procedures to account for the fact that a higher number of clusters entails more degrees of freedom for the clustering algorithm (see [9]).

We emphasize that the simplicity of this setting is merely for the sake of analytical convenience. The proof itself relies on a general and intuitive characteristic of what constitutes a 'wrong' model (namely, having cluster boundaries in areas of high density), rather than any specific feature of this setting. 
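Before turning to the formal setting, the mechanism behind Thm. 2 can be illustrated numerically: two Bernoulli 'disagreement' variables whose means both vanish like 1/√m, but whose ratio stays bounded away from 1, are ranked correctly by their empirical averages with probability improving in m. The numbers below are a synthetic illustration of this, not the paper's experiment:

```python
import numpy as np

def misordering_rate(m, ratio=1.5, trials=2000, seed=0):
    """Empirical Pr(Xhat <= Yhat) for Bernoulli variables X, Y with
    E[X] = ratio/sqrt(m) and E[Y] = 1/sqrt(m): both means vanish with m,
    but their ratio stays bounded away from 1 (c = ratio - 1 in Thm. 2)."""
    rng = np.random.default_rng(seed)
    x_hat = rng.binomial(m, ratio / np.sqrt(m), size=trials) / m
    y_hat = rng.binomial(m, 1.0 / np.sqrt(m), size=trials) / m
    return float(np.mean(x_hat <= y_hat))

# As Thm. 2 predicts, the chance of ranking the two models incorrectly
# decays with m even though both empirical means tend to zero.
rates = [misordering_rate(m) for m in (100, 1600, 25600)]
```

The Ω(1/√m) decay of the means is exactly slow enough that m·E[X] grows like √m, so the exponential bounds of Thm. 2 still shrink as m increases.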
We are currently working on generalizing this result, using a more involved analysis.

In this setting, by the results of [2], stab(Ak, D, m) will converge to 0 as m → ∞ for k = 2, 3, 4. The next two lemmas, however, show that the stability measure for k = 3 (the 'correct' model order) is smaller than the other two, by a substantial ratio independent of m, and that this will be discerned, with high probability, based on the empirical estimates of dD(Ak(S1), Ak(S2)). The proofs are technical, and appear in the supplementary material to this paper.

Lemma 1. For some µ > 0, let D be a Gaussian mixture distribution on R, with density function

p(x) = (2/(3√(2π))) exp(−(x − µ)^2/2) + (1/(6√(2π))) exp(−x^2/2) + (1/(6√(2π))) exp(−(x + µ)^2/2).

Assume µ ≫ 1, so that the Gaussians are well separated. Let Ak be a centroid-based clustering algorithm, which is given a sample and a required number of clusters k, and returns a set of k centroids, minimizing the k-means objective function (sum of squared Euclidean distances between each instance and its nearest centroid). Then the following holds, with o(1) signifying factors which converge to 0 as m → ∞:

stab(A2, D, m) ≥ ((1 − o(1))/√m) exp(−µ^2/32),   stab(A4, D, m) ≥ (0.4 − o(1))/(7√m),
stab(A3, D, m) ≤ ((1.1 + o(1))/√m) exp(−µ^2/8).

Lemma 2. 
For the setting described in Lemma 1, it holds that over the draw of independent sample pairs (S1, S2), (S′1, S′2), (S″1, S″2) (each of size m from D), the ratio between dD(A2(S′1), A2(S′2)) and dD(A3(S1), A3(S2)), as well as the ratio between dD(A4(S″1), A4(S″2)) and dD(A3(S1), A3(S2)), is larger than 2 with probability of at least:

1 − (4 + o(1)) ( exp(−µ^2/32) + exp(−µ^2/16) ).

It should be noted that the asymptotic notation is merely to get rid of second-order terms, and is not an essential feature. Also, the constants are by no means the tightest possible. With these lemmas, we can prove that a direct estimation of stab(A, D, m), based on a random sample, allows us to discern the more stable model with high probability, for arbitrarily large sample sizes.

Theorem 3. For the setting described in Lemma 1, define the following unbiased estimator θ̂_{k,4m} of stab(Ak, D, m): Given a sample of size 4m, split it randomly into 3 disjoint subsets S1, S2, S3 of size m, m and 2m respectively. Estimate dD(Ak(S1), Ak(S2)) by computing

(1/m) Σ_{i=1}^m 1{ Ak(S1)(xi, x_{m+i}) ≠ Ak(S2)(xi, x_{m+i}) },

where (x1, .., x_{2m}) is a random permutation of S3, and return this value as an estimate of stab(Ak, D, m). If three samples of size 4m each are drawn i.i.d. from D, and are used to calculate θ̂_{2,4m}, θ̂_{3,4m}, θ̂_{4,4m}, then

Pr( θ̂_{3,4m} ≥ min{ θ̂_{2,4m}, θ̂_{4,4m} } ) ≤ exp(−Ω(µ^2)) + exp(−Ω(√m)).

Proof. 
Using Lemma 2, we have that:

Pr( min{ dD(A2(S′1), A2(S′2)), dD(A4(S″1), A4(S″2)) } < 2 dD(A3(S1), A3(S2)) ) ≤ exp(−Ω(µ^2)).   (5)

Denoting the event above as B, and assuming it does not occur, we have that the estimators θ̂_{2,4m}, θ̂_{3,4m}, θ̂_{4,4m} are each an empirical average over an additional sample of size m, and the expected value of θ̂_{3,4m} is at most half the expected values of the other two. Moreover, by Lemma 1, the expected value of dD(A3(S1), A3(S2)) is Ω(1/√m). Invoking Thm. 2, we have that:

Pr( θ̂_{3,4m} ≥ min{ θ̂_{2,4m}, θ̂_{4,4m} } | ¬B ) ≤ exp(−Ω(√m)).   (6)

Combining Eq. (5) and Eq. (6) yields the required result.

5 Experiments

In order to further substantiate our analysis above, some experiments were run on synthetic and real world data, with the goal of performing model selection over the number of clusters k. Our first experiment simulated the setting discussed in section 4 (see figure 1). We tested 3 different Gaussian mixture distributions (with µ = 5, 7, 8), and sample sizes m ranging from 2^5 to 2^22. For each distribution and sample size, we empirically estimated θ̂2, θ̂3 and θ̂4 as described in section 4, using the k-means algorithm, and repeated this procedure over 1000 trials. Our results show that although these empirical estimators converge towards zero, their convergence rates differ, with approximately constant ratios between them. Scaling the graphs by √m results in approximately constant and differing stability measures for each µ. 
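The estimation procedure behind these curves (the estimator of Thm. 3) is easy to simulate in miniature. The sketch below is an illustration under our own assumptions, not the paper's implementation: a naive 1-D Lloyd's algorithm with a deterministic quantile initialization stands in for the centroid-based algorithm, the mixture weights follow Lemma 1 as reconstructed above, and µ = 7, m = 2000 and the trial count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(7)

def draw_mixture(n, mu=7.0):
    """Three-Gaussian mixture on R: weight 2/3 at +mu, 1/6 each at 0 and
    -mu, unit variances (the Lemma 1 setting, with mu chosen here)."""
    centers = np.array([mu, 0.0, -mu])
    comp = rng.choice(3, size=n, p=[2 / 3, 1 / 6, 1 / 6])
    return rng.normal(centers[comp], 1.0)

def kmeans_centroids(x, k, iters=60):
    """Naive 1-D Lloyd's algorithm from a deterministic quantile init."""
    c = np.quantile(x, np.linspace(0.05, 0.95, k))
    for _ in range(iters):
        lab = np.argmin(np.abs(x[:, None] - c[None, :]), axis=1)
        for j in range(k):
            if np.any(lab == j):
                c[j] = x[lab == j].mean()
    return c

def theta_hat(k, m):
    """Thm. 3-style estimate: cluster two independent size-m samples, then
    measure co-clustering disagreement on m fresh instance pairs."""
    c1 = kmeans_centroids(draw_mixture(m), k)
    c2 = kmeans_centroids(draw_mixture(m), k)
    pairs = draw_mixture(2 * m).reshape(m, 2)
    a1 = np.argmin(np.abs(pairs[:, :, None] - c1), axis=2)
    a2 = np.argmin(np.abs(pairs[:, :, None] - c2), axis=2)
    return np.mean((a1[:, 0] == a1[:, 1]) != (a2[:, 0] == a2[:, 1]))

# Averaged over trials, k = 3 should come out as the most stable model:
# the k = 2 and k = 4 solutions place a cluster boundary in regions of
# higher density, so their boundaries jitter more between samples.
est = {k: float(np.mean([theta_hat(k, 2000) for _ in range(20)]))
       for k in (2, 3, 4)}
```

All three estimates shrink as m grows, yet their ordering is preserved, which is the scaled-stability behavior the experiments report.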
Moreover, the failure rate does not increase with sample size, and decreases rapidly to a negligible level as the Gaussians become better separated, exactly in line with Thm. 3. Notice that although in the previous section we assumed a large separation between the Gaussians for analytical convenience, good results are obtained even when this separation is quite small.

For the other experiments, we used the stability-based cluster validation algorithm proposed in [9], which was found to compare favorably with similar algorithms, and has the desirable property of

Figure 1: Empirical validation of the results in section 4. In each row, the leftmost sub-figure is the actual distribution, the middle sub-figure is a log-log plot of the estimators θ̂2, θ̂3, θ̂4 (averaged over 1000 trials), as a function of the sample size, and on the right is the failure rate as a function of the sample size (percentage of trials where θ̂3 was not the smallest of the three).

Figure 2: Performance of the stability-based algorithm of [9] on 3 data sets. In each row, the leftmost sub-figure is a sample representing the distribution, the middle sub-figure is a log-log plot of the computed stability indices (averaged over 100 trials), and on the right is the failure rate (in detecting the most stable model over repeated trials). In the phoneme data set, the algorithm selects 3 clusters as the most stable model, since the vowels tend to group into a single cluster. 
The 'failures' are all due to trials where k = 4 was deemed more stable.

producing a clear quantitative stability measure, bounded in [0, 1]. Lower values correspond to models with higher stability. The synthetic data sets selected (see figure 2) were a mixture of 5 Gaussians, and 2 segmented rings. We also experimented on the Phoneme data set [6], which consists of 4,500 log-periodograms of 5 phonemes uttered by English speakers, to which we applied PCA projection on 3 principal components as a pre-processing step. The advantage of this data set is its clear low-dimensional representation relative to its size, allowing us to get nearer to the asymptotic convergence rates of the stability measures. All experiments used the k-means algorithm, except for the ring data set, which used the spectral clustering algorithm proposed in [13].

Complementing our theoretical analysis, the experiments clearly demonstrate that regardless of the actual stability measures per fixed sample size, they seem to eventually follow roughly constant and differing convergence rates, with no substantial degradation in performance. 
In other words, when stability works well for small sample sizes, it should also work at least as well for larger sample sizes. The universal asymptotic convergence to zero does not seem to be a problem in that regard.

6 Conclusions

In this paper, we propose a principled approach for analyzing the utility of stability for cluster validation in large finite samples. This approach stems from viewing stability as a measure of generalization in a statistical setting. It leads us to predict that, in contrast to what might be concluded from previous work, cluster stability does not necessarily degrade with increasing sample size. This prediction is substantiated both theoretically and empirically.

The results also provide some guidelines (via Thm. 2) for when a stability measure might be relevant for arbitrarily large sample sizes, despite asymptotic universal stability. They also suggest that by appropriate scaling, stability measures would become insensitive to the actual sample size used. These guidelines do not presume a specific clustering framework. However, we have proven their fulfillment rigorously only for a certain stability measure and clustering setting. The proof can be generalized in principle, but only at the cost of a more involved analysis. We are currently working on deriving more general theorems on when these guidelines apply.

Acknowledgements: This work has been partially supported by the NATO SfP Programme and the PASCAL Network of Excellence.

References

[1] Shai Ben-David. A framework for statistical clustering with constant time approximation algorithms for k-median clustering. In Proceedings of COLT 2004, pages 415-426.

[2] Shai Ben-David, Ulrike von Luxburg, and Dávid Pál. A sober look at clustering stability. In Proceedings of COLT 2006, pages 5-19.

[3] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499-526, 2002.

[4] Joachim M. Buhmann and Marcus Held. Model selection in clustering by uniform convergence bounds. In Advances in Neural Information Processing Systems 12, pages 216-222, 1999.

[5] Andrea Caponnetto and Alexander Rakhlin. Stability properties of empirical risk minimization over Donsker classes. Journal of Machine Learning Research, 6:2565-2583, 2006.

[6] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer, 2001.

[7] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13-30, March 1963.

[8] Samuel Kutin and Partha Niyogi. Almost-everywhere algorithmic stability and generalization error. In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence (UAI), pages 275-282, 2002.

[9] Tilman Lange, Volker Roth, Mikio L. Braun, and Joachim M. Buhmann. Stability-based validation of clustering solutions. Neural Computation, 16(6):1299-1323, June 2004.

[10] D. A. McAllester. PAC-Bayesian stochastic model selection. Machine Learning Journal, 51(1):5-21, 2003.

[11] C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, volume 141 of London Mathematical Society Lecture Note Series, pages 148-188. Cambridge University Press, 1989.

[12] Alexander Rakhlin and Andrea Caponnetto. Stability of k-means clustering. In Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA, 2007.

[13] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000.

[14] Ulrike von Luxburg and Shai Ben-David. Towards a statistical theory of clustering. 
Technical report, PASCAL workshop on clustering, London, 2005.", "award": [], "sourceid": 95, "authors": [{"given_name": "Ohad", "family_name": "Shamir", "institution": null}, {"given_name": "Naftali", "family_name": "Tishby", "institution": null}]}