{"title": "On U-processes and clustering performance", "book": "Advances in Neural Information Processing Systems", "page_first": 37, "page_last": 45, "abstract": "Many clustering techniques aim at optimizing empirical criteria that are of the form of a U-statistic of degree two. Given a measure of dissimilarity between pairs of observations, the goal is to minimize the within cluster point scatter over a class of partitions of the feature space. It is the purpose of this paper to define a general statistical framework, relying on the theory of U-processes, for studying the performance of such clustering methods. In this setup, under adequate assumptions on the complexity of the subsets forming the partition candidates, the excess of clustering risk is proved to be of the order O(1/\\sqrt{n}). Based on recent results related to the tail behavior of degenerate U-processes, it is also shown how to establish tighter rate bounds. Model selection issues, related to the number of clusters forming the data partition in particular, are also considered.", "full_text": "On U-processes and clustering performance\n\nSt\u00e9phan Cl\u00e9men\u00e7on\u2217\n\nLTCI UMR Telecom ParisTech/CNRS No. 5141\nInstitut Telecom, Paris, 75634 Cedex 13, France\n\nstephan.clemencon@telecom-paristech.fr\n\nAbstract\n\nMany clustering techniques aim at optimizing empirical criteria that are of the\nform of a U-statistic of degree two. Given a measure of dissimilarity between\npairs of observations, the goal is to minimize the within cluster point scatter over\na class of partitions of the feature space. It is the purpose of this paper to define\na general statistical framework, relying on the theory of U-processes, for studying\nthe performance of such clustering methods. In this setup, under adequate\nassumptions on the complexity of the subsets forming the partition candidates, the\nexcess of clustering risk is proved to be of the order $O_P(1/\\sqrt{n})$. 
Based on recent\nresults related to the tail behavior of degenerate U-processes, it is also shown how\nto establish tighter rate bounds. Model selection issues, related to the number of\nclusters forming the data partition in particular, are also considered.\n\n1\n\nIntroduction\n\nIn cluster analysis, the objective is to segment a dataset into subgroups, such that data points in the\nsame subgroup are more similar to each other (in a sense that will be speci\ufb01ed) than to those in\nother subgroups. Given the wide range of applications of the clustering paradigm, numerous data\nsegmentation procedures have been introduced in the machine-learning literature (see Chapter 14 in\n[HTF09] and Chapter 8 in [CFZ09] for recent overviews of \u201doff-the-shelf\u201d clustering techniques).\nWhereas the design of clustering algorithms is still receiving much attention in machine-learning\n(see [WT10] and the references therein for instance), the statistical study of their performance,\nwith the notable exception of the celebrated K-means approach, see [Har78, Pol81, Pol82, BD04]\nand more recently [BDL08] in the functional data analysis setting, may appear to be not suf\ufb01ciently\nwell-documented in contrast, as pointed out in [vLBD05, BvL09]. Indeed, in the K-means situation,\nthe speci\ufb01c form of the criterion (and of its expectation, the clustering risk), as well as that of the\ncells de\ufb01ning the clusters and forming a partition of the feature space (Voronoi cells), permits to\nuse, in a straightforward manner, results of the theory of empirical processes in order to control\nthe performance of empirical clustering risk minimizers. 
Unfortunately, this center-based approach\ndoes not carry over into more general situations, where the dissimilarity measure is not a square\nhilbertian norm anymore, unless one loses the possibility to interpret the clustering criterion as a\nfunction of pairwise dissimilarities between the observations (cf K-medians).\n\n\u2217http://www.tsi.enst.fr/~clemenco/.\n\nIt is the goal of this paper to establish a general statistical framework for investigating clustering\nperformance. The present analysis is based on the observation that many statistical criteria for\nmeasuring clustering accuracy are (symmetric) U-statistics (of degree two), functions of a matrix\nof dissimilarities between pairs of data points. Such statistics have recently received a good deal of\nattention in the machine-learning literature, insofar as empirical performance measures of predictive\nrules in problems such as statistical ranking (when viewed as pairwise classification), see [CLV08],\nor learning on graphs ([BB06]), are precisely functionals of this type, generalizing sample mean\nstatistics. By means of uniform deviation results for U-processes, the Empirical Risk Minimization\nparadigm (ERM) can be extended to situations where natural estimates of the risk are U-statistics.\nIn this way, we establish here a rate bound of order $O_P(1/\\sqrt{n})$ for the excess of clustering risk\nof empirical minimizers under adequate complexity assumptions on the cells forming the partition\ncandidates (the bias term is neglected in the present analysis). A linearization technique, combined\nwith sharper tail results in the case of degenerate U-processes, is also used in order to show that\ntighter rate bounds can be obtained. 
Finally, it is shown how to use the upper bounds established\nin this analysis in order to deal with the problem of automatic model selection, that of selecting the\nnumber of clusters in particular, through complexity penalization.\nThe paper is structured as follows. In section 2, the notations are set out, a formal description\nof cluster analysis, from the \u201cpairwise dissimilarity\u201d perspective, is given and the main theoretical\nconcepts involved in the present analysis are briefly recalled. In section 3, an upper bound for the\nperformance of empirical minimization of the clustering risk is established in the context of general\ndissimilarity measures. Section 4 shows how to refine the rate bound previously obtained by means\nof a recent inequality for degenerate U-processes, while section 5 deals with automatic selection of\nthe optimal number of clusters. Technical proofs are deferred to the Appendix section.\n\n2 Theoretical background\n\nIn this section, after a brief description of the probabilistic framework of the study, the general\nformulation of the clustering objective, based on the notion of dissimilarity between pairs of observations,\nis recalled and the connection of the problem of investigating clustering performance with\nthe theory of U-statistics and U-processes is highlighted. Concepts pertaining to this theory and\ninvolved in the subsequent analysis are next recalled.\n\n2.1 Probabilistic setup and first notations\n\nHere and throughout, $(X_1, \\ldots, X_n)$ denotes a sample of i.i.d. random vectors, valued in a high-dimensional\nfeature space $\\mathcal{X}$, typically a subset of the Euclidean space $\\mathbb{R}^d$ with $d \\gg 1$, with common\nprobability distribution $\\mu(dx)$. With no loss of generality, we assume that the feature space $\\mathcal{X}$\ncoincides with the support of the distribution $\\mu(dx)$. The indicator function of any event $E$ will be\ndenoted by $\\mathbb{I}\\{E\\}$, the usual $l_p$ norm on $\\mathbb{R}^d$ by $\\|x\\|_p = (\\sum_{i=1}^d |x_i|^p)^{1/p}$ when $1 \\leq p < \\infty$ and by\n$\\|x\\|_\\infty = \\max_{1 \\leq i \\leq d} |x_i|$ in the case $p = \\infty$, with $x = (x_1, \\ldots, x_d) \\in \\mathbb{R}^d$. When well-defined, the\nexpectation and the variance of a r.v. $Z$ are denoted by $\\mathbb{E}[Z]$ and $\\mathrm{Var}(Z)$ respectively. Finally, we\ndenote by $x_+ = \\max(0, x)$ the positive part of any real number $x$.\n\n2.2 Cluster analysis\n\nThe goal of clustering techniques is to partition the data $(X_1, \\ldots, X_n)$ into a given finite number\nof groups, $K \\ll n$ say, so that the observations lying in a same group are more similar to each\nother than to those in other groups. When equipped with a (borelian) measure of dissimilarity\n$D : \\mathcal{X}^2 \\to \\mathbb{R}_+^*$, the clustering task can be rigorously cast as the problem of minimizing the criterion\n\n$$\\widehat{W}_n(\\mathcal{P}) = \\frac{2}{n(n-1)} \\sum_{k=1}^K \\sum_{1 \\leq i < j \\leq n} D(X_i, X_j) \\cdot \\mathbb{I}\\{(X_i, X_j) \\in \\mathcal{C}_k^2\\}, \\qquad (1)$$\n\n[...] $> 0$ for all $t > 0$. This\nincludes the so-termed \u201cstandard K-means\u201d setup, where the dissimilarity measure coincides with\nthe square Euclidean norm (in this case, $p = 2$ and $\\Phi(t) = t^2$ for $t \\geq 0$). Notice that the expectation\nof the r.v. (1) is equal to the following quantity:\n\n$$W(\\mathcal{P}) = \\sum_{k=1}^K \\mathbb{E}\\left[D(X, X') \\cdot \\mathbb{I}\\{(X, X') \\in \\mathcal{C}_k^2\\}\\right], \\qquad (2)$$\n\nwhere $(X, X')$ denotes a pair of independent r.v.\u2019s drawn from $\\mu(dx)$. It will be referred to as the\nclustering risk of the partition $\\mathcal{P}$, while its statistical counterpart (1) will be called the empirical\nclustering risk. Optimal partitions of the feature space $\\mathcal{X}$ are defined as those that minimize $W(\\mathcal{P})$.\n\nRemark 1 (MAXIMIZATION FORMULATION) It is well-known that minimizing the empirical clustering\nrisk (1) is equivalent to maximizing the between-cluster point scatter, which is given by\n$1/(n(n-1)) \\cdot \\sum_{k \\neq l} \\sum_{i, j} D(X_i, X_j) \\cdot \\mathbb{I}\\{(X_i, X_j) \\in \\mathcal{C}_k \\times \\mathcal{C}_l\\}$, the sum of these two statistics being\nindependent from the partition $\\mathcal{P} = \\{\\mathcal{C}_k : 1 \\leq k \\leq K\\}$ considered.\n\nSuppose we are given a (hopefully sufficiently rich) class $\\Pi$ of partitions of the feature space $\\mathcal{X}$.\nHere we consider minimizers of the empirical risk $\\widehat{W}_n$ over $\\Pi$, i.e. partitions $\\widehat{\\mathcal{P}}_n^*$ in $\\Pi$ such that\n\n$$\\widehat{W}_n(\\widehat{\\mathcal{P}}_n^*) = \\min_{\\mathcal{P} \\in \\Pi} \\widehat{W}_n(\\mathcal{P}). \\qquad (3)$$\n\nThe design of practical algorithms for computing (approximately) empirical clustering risk minimizers\nis beyond the scope of this paper (refer to [HTF09] for an overview of \u201coff-the-shelf\u201d clustering\nmethods). Here, focus is on the performance of such empirically defined rules.\n\n2.3 U-statistics and U-processes\n\nThe subsequent analysis crucially relies on the fact that the quantity (1) that one seeks to optimize\nis a U-statistic. For clarity\u2019s sake, we recall the definition of this class of statistics, generalizing\nsample means.\n\nDefinition 1 (U-STATISTIC OF DEGREE TWO.) Let $X_1, \\ldots, X_n$ be independent copies of a\nrandom vector $X$ drawn from a probability distribution $\\mu(dx)$ on the space $\\mathcal{X}$ and $K : \\mathcal{X}^2 \\to \\mathbb{R}$ be\na symmetric function such that $K(X_1, X_2)$ is square integrable. By definition, the functional\n\n$$U_n = \\frac{2}{n(n-1)} \\sum_{1 \\leq i < j \\leq n} K(X_i, X_j) \\qquad (4)$$\n\n[...] $> 0\\}$ for\n$(x, x') \\in \\mathbb{R}^2$ in this case). 
Although the dependence structure induced by the summation over all\npairs of observations makes its study more difficult than that of basic sample means, this estimator\nhas nice properties. It is well-known folklore in mathematical statistics that it is the most efficient\nestimator among all unbiased estimators of the parameter $\\theta$ (i.e. that with minimum variance),\nsee [vdV98]. Precisely, when non degenerate, it is asymptotically normal with limiting variance\n$4 \\cdot \\mathrm{Var}(K^{(1)}(X))$ (refer to Chapter 5 in [Ser80] for an account of asymptotic analysis of U-statistics).\nAs shall be seen in section 4, the reduced variance property of U-statistics is crucial when it comes\nto establishing tight rate bounds.\nGoing back to the U-statistic of degree two (1) estimating (2), observe that its symmetric kernel is:\n\n$$\\forall (x, x') \\in \\mathcal{X}^2, \\quad K_{\\mathcal{P}}(x, x') = \\sum_{k=1}^K D(x, x') \\cdot \\mathbb{I}\\{(x, x') \\in \\mathcal{C}_k^2\\}. \\qquad (5)$$\n\nAssuming that $\\mathbb{E}[D^2(X_1, X_2) \\cdot \\mathbb{I}\\{(X_1, X_2) \\in \\mathcal{C}_k^2\\}] < \\infty$ for all $k \\in \\{1, \\ldots, K\\}$ and placing\nourselves in the situation where $K \\geq 1$ is less than $\\mathcal{X}$\u2019s cardinality, the U-statistic (1) is always non\ndegenerate, except in the (sole) case where $\\mathcal{X}$ is made of $K$ elements exactly and all $\\mathcal{P}$\u2019s cells are\nsingletons. Indeed, for all $x \\in \\mathcal{X}$, denoting by $k(x)$ the index of $\\{1, \\ldots, K\\}$ such that $x \\in \\mathcal{C}_{k(x)}$,\nwe have:\n\n$$K^{(1)}_{\\mathcal{P}}(x) \\overset{def}{=} \\mathbb{E}[K_{\\mathcal{P}}(x, X)] = \\int_{x' \\in \\mathcal{C}_{k(x)}} D(x, x') \\mu(dx'). \\qquad (6)$$\n\nAs $\\mu$\u2019s support coincides with $\\mathcal{X}$ and the separation property is fulfilled by $D$, the quantity\nabove is zero iff $\\mathcal{C}_{k(x)} = \\{x\\}$. In the non degenerate case, notice finally that the asymptotic\nvariance of $\\sqrt{n}\\{\\widehat{W}_n(\\mathcal{P}) - W(\\mathcal{P})\\}$ is equal to $4 \\cdot \\mathrm{Var}(D(X, \\mathcal{C}_{k(X)}))$, where we set $D(x, \\mathcal{C}) =\n\\int_{x' \\in \\mathcal{C}} D(x, x') \\mu(dx')$ for all $x \\in \\mathcal{X}$ and any measurable set $\\mathcal{C} \\subset \\mathcal{X}$.\nBy definition, a U-process is a collection of U-statistics; one may refer to [dlPG99] for an account\nof the theory of U-processes. Echoing the role played by the theory of empirical processes in the\nstudy of the ERM principle in binary classification, the control of the fluctuations of the U-process\n\n$$\\left\\{\\widehat{W}_n(\\mathcal{P}) - W(\\mathcal{P}) : \\mathcal{P} \\in \\Pi\\right\\}$$\n\nindexed by a set $\\Pi$ of partition candidates will naturally lie at the heart of the present analysis. As\nshall be seen below, this can be achieved mainly by means of the Hoeffding representations of\nU-statistics, see [Hoe48].\n\n3 A bound for the excess of clustering risk\n\nHere we establish an upper bound for the performance of an empirical minimizer of the clustering\nrisk over a class $\\Pi_K$ of partitions of $\\mathcal{X}$ with $K \\geq 1$ cells, $K$ being fixed here and supposed to be\nsmaller than $\\mathcal{X}$\u2019s cardinality. We denote by $W_K^*$ the clustering risk minimum over all partitions of\n$\\mathcal{X}$ with $K$ cells. 
The following global suprema of empirical Rademacher averages, characterizing\nthe complexity of the cells forming the partition candidates, shall be involved in the subsequent rate\nanalysis: $\\forall n \\geq 2$,\n\n$$A_{K,n} = \\sup_{\\mathcal{C} \\in \\mathcal{P},\\, \\mathcal{P} \\in \\Pi_K} \\left| \\frac{1}{\\lfloor n/2 \\rfloor} \\sum_{i=1}^{\\lfloor n/2 \\rfloor} \\epsilon_i D(X_i, X_{i+\\lfloor n/2 \\rfloor}) \\cdot \\mathbb{I}\\{(X_i, X_{i+\\lfloor n/2 \\rfloor}) \\in \\mathcal{C}^2\\} \\right|, \\qquad (7)$$\n\nwhere $\\epsilon = (\\epsilon_i)_{i \\geq 1}$ is a Rademacher sequence, independent from the $X_i$\u2019s, see [Kol06].\nThe following theorem reveals that the clustering performance of the empirical minimizer (3) is of\nthe order $O_P(1/\\sqrt{n})$, when neglecting the bias term (depending on the richness of $\\Pi_K$ solely).\n\nTheorem 1 Consider a class $\\Pi_K$ of partitions with $K \\geq 1$ cells and suppose that:\n\n\u2022 there exists $B < \\infty$ such that for all $\\mathcal{P}$ in $\\Pi_K$, any $\\mathcal{C}$ in $\\mathcal{P}$, $\\sup_{(x,x') \\in \\mathcal{C}^2} D(x, x') \\leq B$,\n\u2022 the expectation of the Rademacher average $A_{K,n}$ is of the order $O(n^{-1/2})$.\n\nLet $\\delta > 0$. For any empirical clustering risk minimizer $\\widehat{\\mathcal{P}}_n^*$, we have with probability at least $1 - \\delta$: $\\forall n \\geq 2$,\n\n$$W(\\widehat{\\mathcal{P}}_n^*) - W_K^* \\leq 4K \\mathbb{E}[A_{K,n}] + 2BK \\sqrt{\\frac{2 \\log(1/\\delta)}{n}} + \\left( \\inf_{\\mathcal{P} \\in \\Pi_K} W(\\mathcal{P}) - W_K^* \\right) \\leq c(B, \\delta) \\cdot \\frac{K}{\\sqrt{n}} + \\left( \\inf_{\\mathcal{P} \\in \\Pi_K} W(\\mathcal{P}) - W_K^* \\right), \\qquad (8)$$\n\nfor some constant $c(B, \\delta) < \\infty$, independent from $n$ and $K$.\nThe key for proving (8) is to express the U-statistic $\\widehat{W}_n(\\mathcal{P})$ in terms of sums of i.i.d. r.v.\u2019s, as that\ninvolved in the Rademacher average (7):\n\n$$\\widehat{W}_n(\\mathcal{P}) = \\frac{1}{n!} \\sum_{\\sigma \\in S_n} \\frac{1}{\\lfloor n/2 \\rfloor} \\sum_{i=1}^{\\lfloor n/2 \\rfloor} K_{\\mathcal{P}}(X_{\\sigma(i)}, X_{\\sigma(i+\\lfloor n/2 \\rfloor)}), \\qquad (9)$$\n\nwhere the average is taken over $S_n$, the symmetric group of order $n$. The main point lies in the fact\nthat standard techniques in empirical process theory can be then used to control $\\widehat{W}_n(\\mathcal{P}) - W(\\mathcal{P})$\nuniformly over $\\Pi_K$ under adequate hypotheses, see the proof in the Appendix for technical details.\nWe underline that, naturally, the complexity assumption is also a crucial ingredient of the result\nstated above, and more generally of clustering consistency results, see Example 1 in [BvL09]. We\nalso point out that the ERM approach is by no means the sole method to obtain error bounds in the\nclustering context. Just like in binary classification (see [KN02]), one may use a notion of stability\nof a clustering algorithm to establish such results, see [vL09, ST09] and the references therein. Refer\nto [vLBD06, vLBD08] for error bounds proved through the stability approach. Before showing how\nthe bound for the excess of risk stated above can be improved, a few remarks are in order.\n\nRemark 2 (ON THE COMPLEXITY ASSUMPTION.) We point out that standard entropy metric arguments\ncan be used in order to bound the expected value of the Rademacher average $A_{K,n}$, see [BBL05]\nfor instance. In particular, if the set of functions $\\mathcal{F}_{\\Pi_K} = \\{(x, x') \\in \\mathcal{X}^2 \\mapsto D(x, x') \\cdot \\mathbb{I}\\{(x, x') \\in\n\\mathcal{C}^2\\} : \\mathcal{C} \\in \\mathcal{P},\\, \\mathcal{P} \\in \\Pi_K\\}$ is a VC major class with finite VC dimension $V$ (see [Dud99]), then\n$\\mathbb{E}[A_{K,n}] \\leq c\\sqrt{V/n}$ for some universal constant $c < \\infty$. This covers a wide variety of situations,\nincluding the case where $D(x, x') = \\|x - x'\\|_p^\\beta$ and the class of sets $\\{\\mathcal{C} : \\mathcal{C} \\in \\mathcal{P},\\, \\mathcal{P} \\in \\Pi_K\\}$ is of\nfinite VC dimension.\n\nRemark 3 (K-MEANS.) In the standard K-means approach, the dissimilarity measure is\n$D(x, x') = \\|x - x'\\|_2^2$ and partition candidates are indexed by a collection $c$ of distinct \u201ccenters\u201d\n$c_1, \\ldots, c_K$ in $\\mathcal{X}$: $\\mathcal{P}_c = \\{\\mathcal{C}_1, \\ldots, \\mathcal{C}_K\\}$ with $\\mathcal{C}_k = \\{x \\in \\mathcal{X} : \\|x - c_k\\|_2 = \\min_{1 \\leq l \\leq K} \\|x - c_l\\|_2\\}$\nfor $1 \\leq k \\leq K$ (with adequate distance-tie breaking). One may easily check that for this specific\ncollection of partitions $\\Pi_K$ and this choice for the dissimilarity measure, the class $\\mathcal{F}_{\\Pi_K}$ is a VC\nmajor class with finite VC dimension, see section 19.1 in [DGL96] for instance. Additionally, it\nshould be noticed that in most practical clustering procedures, center candidates are picked in a\ndata-driven fashion, being taken as the averages of the observations lying in each cluster/cell. In this\nrespect, the M-estimation problem formulated here can be considered to a certain extent as closer to\nwhat is actually achieved by K-means clustering techniques in practice, than the usual formulation\nof the K-means problem (as an optimization problem over $c = (c_1, \\ldots, c_K)$ namely).\n\nRemark 4 (WEIGHTED CLUSTERING CRITERIA.) Notice that, in practice, the measure $D$ involved\nin (1) may depend on the data. For scaling purpose, one could assign data-dependent weights $\\omega =\n(\\omega_i)_{1 \\leq i \\leq d}$ in a coordinatewise manner, leading to $\\widehat{D}(x, x') = \\sum_{i=1}^d (x_i - x'_i)^2 / \\widehat{\\sigma}_i^2$ for instance,\nwhere $\\widehat{\\sigma}_i^2$ denotes the sample variance related to the $i$-th coordinate. Although the criterion reflecting\nthe performance is not a U-statistic anymore, the theory we develop here can be straightforwardly\nused for investigating clustering accuracy in such a case. Indeed, it is easy to control the difference\nbetween the latter and the U-statistic (1) with $D(x, x') = \\sum_{i=1}^d (x_i - x'_i)^2 / \\sigma_i^2$, the $\\sigma_i^2$\u2019s denoting\nthe theoretical variances of $\\mu$\u2019s marginals, under adequate moment assumptions.\n\n4 Tighter bounds for empirical clustering risk minimizers\n\nWe now show that one may refine the rate bound established above, by considering another representation\nof the U-statistic (1), its Hoeffding\u2019s decomposition (see [Ser80]): for all partition $\\mathcal{P}$,\n\n$$\\widehat{W}_n(\\mathcal{P}) - W(\\mathcal{P}) = 2L_n(\\mathcal{P}) + M_n(\\mathcal{P}), \\qquad (10)$$\n\n$L_n(\\mathcal{P}) = (1/n) \\sum_{i=1}^n \\sum_{\\mathcal{C} \\in \\mathcal{P}} H^{(1)}_{\\mathcal{C}}(X_i)$ being a simple average of i.i.d. r.v.\u2019s with, for $(x, x') \\in \\mathcal{X}^2$,\n\n$$H_{\\mathcal{C}}(x, x') = D(x, x') \\cdot \\mathbb{I}\\{(x, x') \\in \\mathcal{C}^2\\} \\quad \\text{and} \\quad H^{(1)}_{\\mathcal{C}}(x) = D(x, \\mathcal{C}) \\cdot \\mathbb{I}\\{x \\in \\mathcal{C}\\} - D(\\mathcal{C}, \\mathcal{C}),$$\n\nwhere $D(\\mathcal{C}, \\mathcal{C}) = \\int_{x \\in \\mathcal{C}} D(x, \\mathcal{C}) \\mu(dx)$ and $\\mathbb{E}[H_{\\mathcal{C}}(x, X)] = D(x, \\mathcal{C}) \\cdot \\mathbb{I}\\{x \\in \\mathcal{C}\\}$, and $M_n(\\mathcal{P})$ being a\ndegenerate U-statistic based on the $X_i$\u2019s with kernel given by $\\sum_{\\mathcal{C} \\in \\mathcal{P}} H^{(2)}_{\\mathcal{C}}(x, x')$, where\n\n$$H^{(2)}_{\\mathcal{C}}(x, x') = H_{\\mathcal{C}}(x, x') - H^{(1)}_{\\mathcal{C}}(x) - H^{(1)}_{\\mathcal{C}}(x') - D(\\mathcal{C}, \\mathcal{C}),$$\n\nfor all $(x, x') \\in \\mathcal{X}^2$. The leading term in (10) is the (centered) sample mean $2L_n(\\mathcal{P})$, of the\norder $O_P(\\sqrt{1/n})$, while the second term is of the order $O_P(1/n)$. Hence, provided this holds true\nuniformly over $\\mathcal{P}$, the main contribution to the rate bound should arise from the quantity\n\n$$\\sup_{\\mathcal{P} \\in \\Pi_K} |2L_n(\\mathcal{P})| \\leq 2K \\sup_{\\mathcal{C} \\in \\mathcal{P},\\, \\mathcal{P} \\in \\Pi_K} \\left| (1/n) \\sum_{i=1}^n H^{(1)}_{\\mathcal{C}}(X_i) - D(\\mathcal{C}, \\mathcal{C}) \\right|,$$\n\nwhich thus leads to consider the following suprema of empirical Rademacher averages:\n\n$$R_{K,n} = \\sup_{\\mathcal{C} \\in \\mathcal{P},\\, \\mathcal{P} \\in \\Pi_K} \\frac{1}{n} \\left| \\sum_{i=1}^n \\epsilon_i D(X_i, \\mathcal{C}) \\cdot \\mathbb{I}\\{X_i \\in \\mathcal{C}\\} \\right|. \\qquad (11)$$\n\nThis supremum clearly has smaller mean and variance than (7). We also introduce the quantities:\n\n$$Z_\\epsilon = \\sup_{\\mathcal{C} \\in \\mathcal{P},\\, \\mathcal{P} \\in \\Pi_K} \\left| \\sum_{i,j} \\epsilon_i \\epsilon_j H^{(2)}_{\\mathcal{C}}(X_i, X_j) \\right|, \\quad U_\\epsilon = \\sup_{\\mathcal{C} \\in \\mathcal{P},\\, \\mathcal{P} \\in \\Pi_K} \\; \\sup_{\\alpha : \\sum_j \\alpha_j^2 \\leq 1} \\sum_{i,j} \\epsilon_i \\alpha_j H^{(2)}_{\\mathcal{C}}(X_i, X_j),$$\n\n$$M = \\sup_{\\mathcal{C} \\in \\mathcal{P},\\, \\mathcal{P} \\in \\Pi_K} \\; \\sup_{1 \\leq j \\leq n} \\left| \\sum_i \\epsilon_i H^{(2)}_{\\mathcal{C}}(X_i, X_j) \\right|.$$\n\nTheorem 2 Consider a class $\\Pi_K$ of partitions with $K$ cells and suppose that:\n\n\u2022 there exists $B < \\infty$ such that $\\sup_{(x,x') \\in \\mathcal{C}^2} D(x, x') \\leq B$ for all $\\mathcal{P} \\in \\Pi_K$, $\\mathcal{C} \\in \\mathcal{P}$.\n\nLet $\\delta > 0$. For any empirical clustering risk minimizer $\\widehat{\\mathcal{P}}_n^*$, with probability at least $1 - \\delta$: $\\forall n \\geq 2$,\n\n$$W(\\widehat{\\mathcal{P}}_n^*) - W_K^* \\leq 4K \\mathbb{E}[R_{K,n}] + 2BK \\sqrt{\\frac{\\log(2/\\delta)}{n}} + K \\kappa(n, \\delta) + \\left( \\inf_{\\mathcal{P} \\in \\Pi_K} W(\\mathcal{P}) - W_K^* \\right), \\qquad (12)$$\n\nwhere we set for some universal constant $C < \\infty$, independent from $n$, $N$ and $K$:\n\n$$\\kappa(n, \\delta) = C \\left( \\mathbb{E}[Z_\\epsilon] + \\sqrt{\\log(1/\\delta)}\\, \\mathbb{E}[U_\\epsilon] + (n + \\mathbb{E}[M]) / \\log(1/\\delta) \\right) / n^2. \\qquad (13)$$\n\nThe result above relies on the moment inequality for degenerate U-processes proved in [CLV08].\n\nRemark 5 (LOCALIZATION.) The same argument can be used to decompose $\\Lambda_n(\\mathcal{P}) - \\Lambda(\\mathcal{P})$,\nwhere $\\Lambda_n(\\mathcal{P}) = \\widehat{W}_n(\\mathcal{P}) - W_K^*$ is an estimate of the excess of risk $\\Lambda(\\mathcal{P}) = W(\\mathcal{P}) - W_K^*$, and, by\nmeans of concentration inequalities, to obtain next a sharp upper bound that involves the modulus\nof continuity of the variance of the Rademacher average indexed by the convex hull of the set of\nfunctions $\\{\\sum_{\\mathcal{C} \\in \\mathcal{P}} D(x, \\mathcal{C}) \\cdot \\mathbb{I}\\{x \\in \\mathcal{C}\\} - \\sum_{\\mathcal{C}^* \\in \\mathcal{P}^*} D(x, \\mathcal{C}^*) \\cdot \\mathbb{I}\\{x \\in \\mathcal{C}^*\\} : \\mathcal{P} \\in \\Pi_K\\}$, following in\nthe footsteps of recent advances in binary classification, see [Kol06] and subsection 5.3 in [BBL05].\nOwing to space limitations, this will be dealt with in a forthcoming article.\n\n5 Model selection - choosing the number of clusters\n\nA crucial issue in data segmentation is to determine the number $K$ of cells that exhibits the most the\nclustering phenomenon in the data. A variety of automatic procedures for choosing a good value for\n$K$ have been proposed in the literature, based on data splitting, resampling or sampling techniques\n([PFvN89, TWH01, ST08]). 
Here we consider a complexity regularization method that avoids\nrecourse to such techniques and uses a data-dependent penalty term based on the analysis\ncarried out above.\nSuppose that we have a sequence $\\Pi_1, \\Pi_2, \\ldots$ of collections of partitions of the feature space $\\mathcal{X}$\nsuch that, for all $K \\geq 1$, the elements of $\\Pi_K$ are made of $K$ cells and fulfill the assumptions of\nTheorem 1. In order to avoid overfitting, consider the (data-driven) complexity penalty given by\n\n$$\\mathrm{pen}(n, K) = 3K \\mathbb{E}_\\epsilon[A_{K,n}] + \\frac{27BK \\log K}{n} + \\sqrt{(2B \\log K)/n}$$\n\nand the minimizer $\\widehat{\\mathcal{P}}_{\\widehat{K},n}$ of the penalized empirical clustering risk, with\n\n$$\\widehat{K} = \\arg\\min_{K \\geq 1} \\left\\{ \\widehat{W}_n(\\widehat{\\mathcal{P}}_{K,n}) + \\mathrm{pen}(n, K) \\right\\} \\quad \\text{and} \\quad \\widehat{W}_n(\\widehat{\\mathcal{P}}_{K,n}) = \\min_{\\mathcal{P} \\in \\Pi_K} \\widehat{W}_n(\\mathcal{P}). \\qquad (14)$$\n\nThe next result shows that the partition thus selected nearly achieves the performance that would be\nobtained with the help of an oracle, revealing the value of the index $K$ that minimizes $\\mathbb{E}[W(\\widehat{\\mathcal{P}}_{K,n})] - W^*$,\nwith $W^* = \\inf_{\\mathcal{P}} W(\\mathcal{P})$.\n\nTheorem 3 (AN ORACLE INEQUALITY) Suppose that, for all $K \\geq 1$, the assumptions of Theorem\n1 are fulfilled. Then, we have:\n\n$$\\mathbb{E}\\left[ W(\\widehat{\\mathcal{P}}_{\\widehat{K},n}) \\right] - W^* \\leq \\min_{K \\geq 1} \\left\\{ W_K^* - W^* + \\mathrm{pen}(n, K) \\right\\} + \\frac{\\pi^2}{6} \\left( 2B \\sqrt{\\frac{2}{n}} + \\frac{18B}{n} \\right). \\qquad (15)$$\n\nOf course, the penalty could be slightly refined using the results of Section 4. Due to space limitations,\nsuch an analysis is not carried out here and is left to the reader.\n\n6 Conclusion\n\nWhereas, until now, the theoretical analysis of clustering performance was mainly limited to the\nK-means situation (but not only, cf [BvL09] for instance), this paper establishes bounds for the\nsuccess of empirical clustering risk minimization in a general \u201cpairwise dissimilarity\u201d framework,\nrelying on the theory of U-processes. The excess of risk of empirical minimizers of the clustering\nrisk is proved to be of the order $O_P(n^{-1/2})$ under mild assumptions on the complexity of the cells\nforming the partition candidates. It is also shown how to refine slightly this upper bound through\na linearization technique and the use of recent inequalities for degenerate U-processes. Although\nthe improvement displayed here can appear as not very significant at first glance, our approach\nsuggests that much sharper data-dependent bounds could be established this way. To the best of\nour knowledge, the present analysis is the first to state results of this nature. As regards complexity\nregularization, while focus is here on the choice of the number of clusters, the argument used in this\npaper also paves the way for investigating more general model selection issues, including choices\nrelated to the geometry/complexity of the cells of the partition considered.\n\nAppendix - Technical proofs\n\nProof of Theorem 1\n\nWe may classically write:\n\n$$W(\\widehat{\\mathcal{P}}_n^*) - W_K^* \\leq 2 \\sup_{\\mathcal{P} \\in \\Pi_K} |\\widehat{W}_n(\\mathcal{P}) - W(\\mathcal{P})| + \\left( \\inf_{\\mathcal{P} \\in \\Pi_K} W(\\mathcal{P}) - W_K^* \\right) \\leq 2K \\sup_{\\mathcal{C} \\in \\mathcal{P},\\, \\mathcal{P} \\in \\Pi_K} |U_n(\\mathcal{C}) - u(\\mathcal{C})| + \\left( \\inf_{\\mathcal{P} \\in \\Pi_K} W(\\mathcal{P}) - W_K^* \\right), \\qquad (16)$$\n\nwhere $U_n(\\mathcal{C})$ denotes the U-statistic with kernel $H_{\\mathcal{C}}(x, x') = D(x, x') \\cdot \\mathbb{I}\\{(x, x') \\in \\mathcal{C}^2\\}$ based on\nthe sample $X_1, \\ldots, X_n$ and $u(\\mathcal{C})$ its expectation. Therefore, mimicking the argument of Corollary\n3 in [CLV08], based on the so-termed first Hoeffding\u2019s representation of U-statistics (see Lemma\nA.1 in [CLV08]), we may straightforwardly derive the result below.\n\nProposition 1 (UNIFORM DEVIATIONS) Suppose that Theorem 1\u2019s assumptions are fulfilled. Let\n$\\delta > 0$. With probability at least $1 - \\delta$, we have: $\\forall n \\geq 2$,\n\n$$\\sup_{\\mathcal{C} \\in \\mathcal{P},\\, \\mathcal{P} \\in \\Pi_K} |U_n(\\mathcal{C}) - u(\\mathcal{C})| \\leq 2\\mathbb{E}[A_{K,n}] + B \\sqrt{\\frac{2 \\log(1/\\delta)}{n}}. \\qquad (17)$$\n\nPROOF. The argument follows in the footsteps of Corollary 3\u2019s proof in [CLV08]. It is based on the\nso-termed first Hoeffding\u2019s representation of U-statistics (9), which provides an immediate control\nof the moment generating function of the supremum $\\sup_{\\mathcal{C}} |U_n(\\mathcal{C}) - u(\\mathcal{C})|$ by that of the norm of an\nempirical process, namely $\\sup_{\\mathcal{C}} |A_n(\\mathcal{C}) - u(\\mathcal{C})|$, where, for all $\\mathcal{C} \\in \\mathcal{P}$ and $\\mathcal{P} \\in \\Pi_K$:\n\n$$A_n(\\mathcal{C}) = \\frac{1}{\\lfloor n/2 \\rfloor} \\sum_{i=1}^{\\lfloor n/2 \\rfloor} D(X_i, X_{i+\\lfloor n/2 \\rfloor}) \\cdot \\mathbb{I}\\{(X_i, X_{i+\\lfloor n/2 \\rfloor}) \\in \\mathcal{C}^2\\}.$$\n\nLemma 1 (see Lemma A.1 in [CLV08]) Let $\\Psi : \\mathbb{R} \\to \\mathbb{R}$ be convex and nondecreasing. We have:\n\n$$\\mathbb{E}\\left[ \\exp\\left( \\lambda \\cdot \\sup_{\\mathcal{C}} |U_n(\\mathcal{C}) - u(\\mathcal{C})| \\right) \\right] \\leq \\mathbb{E}\\left[ \\exp\\left( \\lambda \\cdot \\sup_{\\mathcal{C}} |A_n(\\mathcal{C}) - u(\\mathcal{C})| \\right) \\right]. \\qquad (18)$$\n\nNow, using standard symmetrization and randomization tricks, one obtains that: $\\forall \\lambda > 0$,\n\n$$\\mathbb{E}\\left[ \\exp\\left( \\lambda \\cdot \\sup_{\\mathcal{C}} |A_n(\\mathcal{C}) - u(\\mathcal{C})| \\right) \\right] \\leq \\mathbb{E}\\left[ \\exp\\left( 2\\lambda \\cdot A_{K,n} \\right) \\right]. \\qquad (19)$$\n\nObserving that the value of $A_{K,n}$ cannot change by more than $2B/n$ when one of the\n$(\\epsilon_i, X_i, X_{i+\\lfloor n/2 \\rfloor})$\u2019s is changed, while the others are kept fixed, the standard bounded differences\ninequality argument applies and yields:\n\n$$\\mathbb{E}\\left[ \\exp\\left( 2\\lambda \\cdot A_{K,n} \\right) \\right] \\leq \\exp\\left( 2\\lambda \\cdot \\mathbb{E}[A_{K,n}] + \\frac{\\lambda^2 B^2}{2n} \\right). \\qquad (20)$$\n\nNext, Markov\u2019s inequality with $\\lambda = n(t - 2\\mathbb{E}[A_{K,n}])/B^2$ gives: $\\mathbb{P}\\{\\sup_{\\mathcal{C}} |A_n(\\mathcal{C}) - u(\\mathcal{C})| > t\\} \\leq\n\\exp(-n(t - 2\\mathbb{E}[A_{K,n}])^2/(2B^2))$. The desired result is then immediate. $\\square$\nThe rate bound is finally established by combining bounds (16) and (17).\n\nProof of Theorem 2 (Sketch of)\n\nThe theorem can be proved by using the decomposition (10), applying the argument above in order\nto control $\\sup_{\\mathcal{P}} |L_n(\\mathcal{P})|$ and the lemma below to handle the degenerate part. The latter is based on a\nrecent moment inequality for degenerate U-processes, proved in [CLV08]. Due to space limitations,\ntechnical details are left to the reader.\n\nLemma 2 (see Theorem 11 in [CLV08]) Suppose that Theorem 2\u2019s assumptions are fulfilled. There\nexists a universal constant $C < \\infty$ such that for all $\\delta \\in (0, 1)$, we have with probability at least\n$1 - \\delta$: $\\forall n \\geq 2$,\n\n$$\\sup_{\\mathcal{P} \\in \\Pi_K} |M_n(\\mathcal{P})| \\leq K \\kappa(n, \\delta).$$\n\nProof of Theorem 3\n\nThe proof mimics the argument of Theorem 8.1 in [BBL05]. 
We thus obtain that: $\\forall K \\geq 1$,\n\n$$\\mathbb{E}\\left[ W(\\widehat{\\mathcal{P}}_{\\widehat{K},n}) \\right] - W^* \\leq \\mathbb{E}\\left[ W(\\widehat{\\mathcal{P}}_{K,n}) \\right] - W^* + \\mathrm{pen}(n, K) + \\sum_{k \\geq 1} \\mathbb{E}\\left[ \\left( \\sup_{\\mathcal{P} \\in \\Pi_k} \\{W(\\mathcal{P}) - \\widehat{W}_n(\\mathcal{P})\\} - \\mathrm{pen}(n, k) \\right)_+ \\right].$$\n\nReproducing the argument of Theorem 1\u2019s proof, one may easily show that: $\\forall k \\geq 1$,\n\n$$\\mathbb{E}\\left[ \\sup_{\\mathcal{P} \\in \\Pi_k} \\{W(\\mathcal{P}) - \\widehat{W}_n(\\mathcal{P})\\} \\right] \\leq 2k \\mathbb{E}[A_{k,n}].$$\n\nThus, for all $k \\geq 1$, the quantity $\\mathbb{P}\\{\\sup_{\\mathcal{P} \\in \\Pi_k} \\{W(\\mathcal{P}) - \\widehat{W}_n(\\mathcal{P})\\} \\geq \\mathrm{pen}(n, k) + 2\\delta\\}$ is bounded by\n\n$$\\mathbb{P}\\left\\{ \\sup_{\\mathcal{P} \\in \\Pi_k} \\{W(\\mathcal{P}) - \\widehat{W}_n(\\mathcal{P})\\} \\geq \\mathbb{E}\\left[ \\sup_{\\mathcal{P} \\in \\Pi_k} \\{W(\\mathcal{P}) - \\widehat{W}_n(\\mathcal{P})\\} \\right] + \\sqrt{(2B \\log k)/n} + \\delta \\right\\} + \\mathbb{P}\\left\\{ 3k \\mathbb{E}_\\epsilon[A_{k,n}] \\leq 2k \\mathbb{E}[A_{k,n}] - \\frac{27Bk \\log k}{n} - \\delta \\right\\}.$$\n\nBy virtue of the bounded differences inequality (jumps being bounded by $2B/n$), the first term is\nbounded by $\\exp(-n\\delta^2/(2B^2))/k^2$, while the second term is bounded by $\\exp(-n\\delta/(9Bk))/k^3$, as\nshown by Lemma 8.2 in [BBL05] (see the third inequality therein). Integrating over $\\delta$, one obtains:\n\n$$\\mathbb{E}\\left[ \\left( \\sup_{\\mathcal{P} \\in \\Pi_k} \\{W(\\mathcal{P}) - \\widehat{W}_n(\\mathcal{P})\\} - \\mathrm{pen}(n, k) \\right)_+ \\right] \\leq \\left( 2B\\sqrt{2/n} + 18B/n \\right)/k^2.$$\n\nSumming next the bounds thus obtained over $k$ leads to the oracle inequality stated in the theorem.\n\nReferences\n\n[BB06] G. Biau and L. Bleakley. Statistical Inference on Graphs. Statistics & Decisions, 24:209\u2013232, 2006.\n[BBL05] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of Classification: A Survey of Some Recent Advances. ESAIM: Probability and Statistics, 9:323\u2013375, 2005.\n[BD04] S. Ben-David. A framework for statistical clustering with a constant time approximation algorithms for k-median clustering. In Proceedings of COLT\u201904, Lecture Notes in Computer Science, Volume 3120/2004, 415\u2013426, 2004.\n[BDL08] G. Biau, L. Devroye, and G. Lugosi. On the Performance of Clustering in Hilbert Space. IEEE Trans. Inform. Theory, 54(2):781\u2013790, 2008.\n[BvL09] S. Bubeck and U. von Luxburg. Nearest neighbor clustering: A baseline method for consistent clustering with arbitrary objective functions. Journal of Machine Learning Research, 10:657\u2013698, 2009.\n[CFZ09] B. Clarke, E. Fokou\u00e9, and H. Zhang. Principles and Theory for Data-Mining and Machine-Learning. Springer, 2009.\n[CLV08] S. Cl\u00e9men\u00e7on, G. Lugosi, and N. Vayatis. Ranking and empirical risk minimization of U-statistics. The Annals of Statistics, 36(2):844\u2013874, 2008.\n[DGL96] L. Devroye, L. Gy\u00f6rfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.\n[dlPG99] V. de la Pena and E. Gin\u00e9. Decoupling: from Dependence to Independence. Springer, 1999.\n[Dud99] R.M. Dudley. Uniform Central Limit Theorems. Cambridge University Press, 1999.\n[Har78] J.A. Hartigan. Asymptotic distributions for clustering criteria. The Annals of Statistics, 6:117\u2013131, 1978.\n[Hoe48] W. Hoeffding. A class of statistics with asymptotically normal distribution. Ann. Math. Stat., 19:293\u2013325, 1948.\n[HTF09] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning (2nd ed.), pages 520\u2013528. Springer, 2009.\n[KN02] S. Kutin and P. Niyogi. Almost-everywhere algorithmic stability and generalization error. In Proceedings of the 18th Conference in Uncertainty in Artificial Intelligence, 2002.\n[Kol06] V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization (with discussion). The Annals of Statistics, 34:2593\u20132706, 2006.\n[PFvN89] R. Peck, L. Fisher, and J. van Ness. Bootstrap confidence intervals for the number of clusters in cluster analysis. J. Am. Stat. Assoc., 84:184\u2013191, 1989.\n[Pol81] D. Pollard. Strong consistency of k-means clustering. The Annals of Statistics, 9:135\u2013140, 1981.\n[Pol82] D. Pollard. A central limit theorem for k-means clustering. The Annals of Probability, 10:919\u2013926, 1982.\n[Ser80] R.J. Serfling. Approximation theorems of mathematical statistics. Wiley, 1980.\n[ST08] O. Shamir and N. Tishby. Model selection and stability in k-means clustering. In Proceedings of the 21st Annual Conference on Learning Theory, 2008.\n[ST09] O. Shamir and N. Tishby. On the reliability of clustering stability in the large sample regime. In Advances in Neural Information Processing Systems 21, 2009.\n[TWH01] R. Tibshirani, G. Walther, and T. Hastie. Estimating the number of clusters in a data set via the gap statistic. J. Royal Stat. Soc., 63(2):411\u2013423, 2001.\n[vdV98] A. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.\n[vL09] U. von Luxburg. Clustering stability: An overview. Foundations and Trends in Machine Learning, 2(3):235\u2013274, 2009.\n[vLBD05] U. von Luxburg and S. Ben-David. Towards a statistical theory of clustering. In Pascal workshop on Statistics and Optimization of Clustering, 2005.\n[vLBD06] U. von Luxburg and S. Ben-David. A sober look at clustering stability. In Proceedings of the 19th Conference on Learning Theory, 2006.\n[vLBD08] U. von Luxburg and S. Ben-David. Relating clustering stability to properties of cluster boundaries. In Proceedings of the 21st Conference on Learning Theory, 2008.\n[WT10] D. M. Witten and R. Tibshirani. A framework for feature selection in clustering. J. Amer. Stat. Assoc., 105(490):713\u2013726, 2010.\n", "award": [], "sourceid": 42, "authors": [{"given_name": "St\u00e9phan", "family_name": "Cl\u00e9men\u00e7con", "institution": null}]}