{"title": "Stability-Based Model Selection", "book": "Advances in Neural Information Processing Systems", "page_first": 633, "page_last": 642, "abstract": null, "full_text": "Stability-Based Model Selection\n\nTilman Lange, Mikio L. Braun, Volker Roth, Joachim M. Buhmann\n\n(lange,braunm,roth,jb)@cs.uni-bonn.de\n\nInstitute of Computer Science, Dept. III,\n\nUniversity of Bonn\n\nR\u00a8omerstra\u00dfe 164, 53117 Bonn, Germany\n\nAbstract\n\nModel selection is linked to model assessment, which is the problem of\ncomparing different models, or model parameters, for a speci\ufb01c learning\ntask. For supervised learning, the standard practical technique is cross-\nvalidation, which is not applicable for semi-supervised and unsupervised\nsettings. In this paper, a new model assessment scheme is introduced\nwhich is based on a notion of stability. The stability measure yields an\nupper bound to cross-validation in the supervised case, but extends to\nsemi-supervised and unsupervised problems. In the experimental part,\nthe performance of the stability measure is studied for model order se-\nlection in comparison to standard techniques in this area.\n\n1 Introduction\n\nOne of the fundamental problems of learning theory is model assessment: Given a speci\ufb01c\ndata set, how can one practically measure the generalization performance of a model trained\nto the data. In supervised learning, the standard technique is cross-validation. It consists in\nusing only a subset of the data for training, and then testing on the remaining data in order to\nestimate the expected risk of the predictor. For semi-supervised and unsupervised learning,\nthere exist no standard techniques for estimating the generalization of an algorithm, since\nthere is no expected risk. Furthermore, in unsupervised learning, the problem of model\norder selection arises, i.e. estimating the \u201ccorrect\u201d number of clusters. 
This number is part of the input for supervised and semi-supervised problems, but it is not available in unsupervised problems.

We present a common point of view which provides a unified framework for model assessment in these seemingly unrelated areas of machine learning. The main idea is that an algorithm generalizes well if the solution computed on one data set has small disagreement with the solution computed on another data set. This idea is independent of the amount of label information supplied to the problem, and the challenge is to define disagreement in a meaningful way, without relying on additional assumptions, e.g. mixture densities. The main emphasis lies on developing model assessment procedures for semi-supervised and unsupervised clustering, because a definitive answer to the question of model assessment has not been given in these areas.

In section 3, we derive a stability measure for solutions to learning problems, which allows us to characterize generalization in terms of the stability of solutions on different data sets. For supervised learning, this stability measure is an upper bound on the 2-fold cross-
validation error, and can thus be understood as a natural extension of cross-validation to semi-supervised and unsupervised problems.

For the experiments (section 4), we have chosen the model order selection problem in the unsupervised setting, which is one of the most relevant areas of application, as argued above. We compare the stability measure to other techniques from the literature.

2 Related Work

For supervised learning problems, several notions of stability have been introduced ([10], [3]). The focus of these works lies on deriving theoretical generalization bounds for supervised learning. In contrast, this work aims at developing practical procedures for model assessment which are also applicable in semi-supervised and unsupervised settings. Furthermore, the definition of stability developed in this paper does not build upon the cited works.

Several procedures have been proposed for inferring the number of clusters, of which we name a few here. Tibshirani et al. [14] propose the Gap Statistic, which is applicable to Euclidean data only. Given a clustering solution, the total sum of within-cluster dissimilarities is computed. This quantity, computed on the original data, is compared with its average over data sampled uniformly from a hyper-rectangle containing the original data. The number of clusters which maximizes the gap between these two quantities is the estimate. Recently, resampling-based approaches for model order selection have been proposed that perform model assessment in the spirit of cross-validation. These approaches share the idea of prediction strength or replicability as a common trait. The methods exploit the idea that a clustering solution can be used to construct a predictor, in order to compute a solution for a second data set and to compare the computed and predicted class memberships on the second data set. In an early study, Breckenridge [4] investigated the usefulness of this approach (called replication analysis there) for the purpose of cluster validation. Although his work does not lead to a directly applicable procedure, in particular not for model order selection, his study suggests the usefulness of such an approach for the purpose of validation. Our method can be considered a refinement of his approach. Fridlyand and Dudoit [6] propose a model order selection procedure, called Clest, that also builds upon Breckenridge's work. Their method employs the replication analysis idea by repeatedly splitting the available data into two parts. Free parameters of their method are the predictor, the measure of agreement between a computed and a predicted solution, and a baseline distribution similar to that of the Gap Statistic.
Because these three parameters largely influence the assessment, we consider their proposal more as a conceptual framework than as a concrete model order estimation procedure. In particular, the predictor can be chosen independently of the clustering algorithm, which can lead to unreliable results (see section 3). For the experiments in section 4, we used a linear discriminant analysis classifier, the Fowlkes-Mallows index for solution comparison (cf. [9, 6]) and the baseline distribution of the Gap Statistic. Tibshirani et al. [13] formulated a similar method (Prediction Strength) for inferring the number of clusters, which is based on using nearest centroid predictors. Roughly, their measure of agreement quantifies the similarity of two clusters in the computed and in the predicted solution. For inferring the number of clusters, the least similar pair of clusters is taken into consideration. The estimate is the largest k for which this similarity is above some threshold value. Note that the similarity for k = 1 is always above this threshold.

3 The Stability Measure

We begin by introducing a stability measure for supervised learning. Then, the stability measure is generalized to semi-supervised and unsupervised settings. Necessary modifications for model order selection are discussed. Finally, a scheme for practical estimation of the stability is proposed.

Stability and Supervised Learning The supervised learning problem is defined as follows. Let (X_1, Y_1), \dots, (X_n, Y_n) be a sequence of random variables, where the (X_i, Y_i) are drawn i.i.d. from some probability distribution P on \mathcal{X} \times \{1, \dots, k\}. The X_i \in \mathcal{X} are the objects and the Y_i \in \{1, \dots, k\} are the labels. The task is to find a labeling function f : \mathcal{X} \to \{1, \dots, k\} which minimizes the expected risk R(f) = \mathbb{E}[\ell(f(X), Y)], using only a finite sample of data as input. Here, \ell is the so-called loss function. For classification, we take the 0-1-loss defined by \ell(y, y') = 1 if y \neq y' and \ell(y, y') = 0 else.

A measure of the stability of the learned labeling function is derived as follows. Note that the 0-1-loss satisfies the triangle inequality \ell(y, y'') \le \ell(y, y') + \ell(y', y'') for any three labels, since y = y' and y' = y'' imply y = y''. Now let X and X' be two data sets drawn independently from the same source, and denote the predictor trained on X by f_X. Then, the test risk of f_X on X' can be bounded by introducing f_{X'}:

\hat{R}_{X'}(f_X) = \frac{1}{n} \sum_{i=1}^{n} \ell(f_X(X'_i), Y'_i) \le \hat{R}_{X'}(f_{X'}) + \frac{1}{n} \sum_{i=1}^{n} \ell(f_X(X'_i), f_{X'}(X'_i)).   (1)

We call the second term the stability of the predictor f and denote its expectation by S(f):

S(f) := \mathbb{E}\Big[ \frac{1}{n} \sum_{i=1}^{n} \ell(f_X(X'_i), f_{X'}(X'_i)) \Big].   (2)

We call the value of S(f) the stability cost to stress the fact that S(f) = 0 means perfect stability and large values of S(f) mean large instability. Taking expectations with respect to X and X' on both sides of (1) yields

\mathbb{E}[\hat{R}_{X'}(f_X)] \le \mathbb{E}[\hat{R}_{X'}(f_{X'})] + S(f).   (3)

If f is obtained by empirical risk minimization over some hypothesis set \mathcal{H}, then \mathbb{E}[\hat{R}_{X'}(f_{X'})] = \mathbb{E}[\inf_{g \in \mathcal{H}} \hat{R}_{X'}(g)] \le \inf_{g \in \mathcal{H}} R(g), and one obtains \mathbb{E}[\hat{R}_{X'}(f_X)] \le \inf_{g \in \mathcal{H}} R(g) + S(f). By eq. (3), the stability defined in (2) thus yields an upper bound on the generalization error. It can be shown that there exists a converse upper bound if the minimum is unique and well-separated, such that S(f) \to 0 implies \mathbb{E}[\hat{R}_{X'}(f_X)] \to \inf_{g \in \mathcal{H}} R(g).

Note that the stability measures the disagreement between labels on training data and test data, both assigned by f_X. This asymmetry arises naturally and directly measures the generalization performance of f_X. Furthermore, the stability can be interpreted as the expected empirical risk of f with respect to the labels computed by itself (compare (1) and (2)). Therefore, stability measures the self-consistency of f. This interpretation is also valid in the semi-supervised and unsupervised settings. Practical evaluation of the stability amounts to 2-fold cross-validation; no improvement can therefore be expected in the supervised setting. However, unlike cross-validation, stability can also be defined in settings where no label information is available. This property of the method will be discussed in the remainder of this section.

Semi-supervised Learning Semi-supervised learning problems are defined as follows. The label Y_i of an object X_i might not be known. This fact is encoded by setting Y_i = 0, since 0 is not a valid label.
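The 2-fold evaluation of the stability cost (eq. (2)), which carries over to the settings discussed below, can be sketched concretely. The following minimal example uses a nearest centroid classifier as a stand-in for an arbitrary learner; the helper names are illustrative, not part of any library:

```python
import math

def centroids(X, y):
    # train by computing the mean of the points in each class
    groups = {}
    for xi, yi in zip(X, y):
        groups.setdefault(yi, []).append(xi)
    return {c: tuple(sum(v) / len(p) for v in zip(*p)) for c, p in groups.items()}

def predict(cents, x):
    # nearest centroid rule
    return min(cents, key=lambda c: math.dist(cents[c], x))

def disagreement(cents_a, cents_b, X):
    # fraction of points on which the two predictors disagree (0-1 loss)
    return sum(predict(cents_a, x) != predict(cents_b, x) for x in X) / len(X)

# two well-separated classes, split into halves X and X'
X1 = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]; y1 = [0, 0, 1, 1]
X2 = [(0.1, -0.1), (-0.2, 0.2), (4.9, 5.2), (5.2, 5.1)]; y2 = [0, 0, 1, 1]

f_X, f_X2 = centroids(X1, y1), centroids(X2, y2)
S = disagreement(f_X, f_X2, X2)   # stability cost on X': 0 = perfectly stable
```

On this well-separated toy sample, both halves yield the same decision rule, so the estimated stability cost is zero.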
At least one labeled point must be given for every class. Furthermore, for the present discussion, we assume that we do not have a fully labeled data set for testing purposes.

There exist two alternatives for defining the solution of a semi-supervised learning problem. In the first alternative, the solution is a labeling function f defined on the whole object space \mathcal{X}, as in supervised learning. Then, the stability (eq. (2)) can be readily computed, and it measures the confidence in the (unknown) training error.

The second alternative is that the solution is not given by a labeling function on the whole object space, but only by a labeling function on the training set. Labeling functions which are defined on the training set only will be denoted by c to stress the difference. The labeling computed on X' will be denoted by c', which is only defined on X'. As mentioned above, the stability compares labels given to the training data with predicted labels. In the current setting, there are no predicted labels, because c is defined on the training set only. One possibility to obtain predicted labels is to introduce a predictor g, which is trained using X and the labels c(X). Leaving g as a free parameter, we define the stability for semi-supervised learning as

S_semi(c, g) := \mathbb{E}\Big[ \frac{1}{n} \sum_{i=1}^{n} \ell(g_{X, c(X)}(X'_i), c'(X'_i)) \Big].   (4)

Of course, the choice of g influences the value of the stability. We need a condition on the prediction step in order to select g. First note that (4) is the expected empirical risk of g with respect to the data source (X', c'(X')). Analogously to supervised learning, the minimal attainable stability \min_g S_semi(c, g) measures the extent to which classes overlap, or how consistent the labels are. Therefore, g should be chosen to minimize S_semi(c, g). Unfortunately, the construction of non-asymptotically Bayes optimal learning algorithms is extremely difficult and, therefore, we should not expect that there exists a universally applicable constructive procedure for automatically building g given a c.

In practice, some g has to be chosen. This choice will yield larger stability costs, i.e. worse stability, and can therefore not fake stability. Furthermore, it is often possible to construct good predictors in practice. Note that (4) measures the mismatch between the label generator c and the predictor g. Intuitively, g can lead to good stability only if the strategies of c and g are similar. For unsupervised learning, as discussed in the next paragraph, the choices for various standard techniques are natural. For example, k-means clustering suggests to use nearest centroid classification.
Minimum spanning tree type clustering algorithms suggest nearest neighbor classifiers, and finally, clustering algorithms which fit a parametric density model should use the class posteriors computed by the Bayes rule for prediction.

Unsupervised Learning The unsupervised learning setting is given as the problem of labeling a finite data set X. The solution c is again a function only defined on X. From this definition, it becomes clear that we again need a predictor as in the second alternative of semi-supervised learning.

For unsupervised learning, another problem arises. Since no specific label values are prescribed for the classes, label indices might be permuted from one instance to another, even when the partitioning is identical. For example, keeping the same classes, exchanging the class labels 1 and 2 leads to a new labeling which is not structurally different. In other words, label values are only known up to a permutation. In view of this non-uniqueness of the representation of a partitioning, we define the permutation relating the indices on the first set to those on the second set as the one which maximizes the agreement between the classes. The stability then reads

S_un(c, g) := \mathbb{E}\Big[ \min_{\pi} \frac{1}{n} \sum_{i=1}^{n} \ell(\pi(g_{X, c(X)}(X'_i)), c'(X'_i)) \Big],   (5)

where \pi ranges over the permutations of the label set \{1, \dots, k\}. Note that the minimization has to take place inside the expectation, because the optimal permutation depends on the data X, X'. In practice, it is not necessary to compute all k! permutations, because the problem is solvable by the Hungarian method in O(k^3) [11].

Model Order Selection The problem of model order selection consists in determining the number of clusters k. It exists only in unsupervised learning, since for supervised and semi-supervised problems this number is part of the input.

The range of the stability S depends on k; therefore, stability values cannot be compared directly for different values of k. For unsupervised learning, the stability, which is minimized over the label permutations, is bounded from above by 1 - 1/k, since for any larger disagreement there exists a relabeling which has smaller stability costs. This stability value is asymptotically achieved by the random predictor r_k, which assigns uniformly drawn labels to the objects. Normalizing S by the stability of the random predictor yields values independent of k. We thus define the re-normalized stability as

\bar{S}_un(c, g) := S_un(c, g) / S_un(r_k).   (6)

Resampling Estimate of the Stability In practice, a finite data set X_1, \dots, X_n is given, and the best model should be estimated. The stability is defined in terms of an expectation, which has to be estimated for practical applications. Estimation of S over a hypothesis set \mathcal{H} is feasible if \mathcal{H} has finite VC-dimension, since the VC-dimension for estimating S is the same as for the empirical risk, a fact which is not proved here.
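The permutation matching in eq. (5) and the normalization in eq. (6) can be sketched as follows. For small k, a brute-force search over all k! permutations suffices (the Hungarian method mentioned above scales better), and the asymptotic value 1 - 1/k is used here as a stand-in for the random-predictor baseline; all names are illustrative:

```python
from itertools import permutations

def matched_disagreement(pred, comp, k):
    # empirical version of eq. (5): minimize the 0-1 disagreement between the
    # predicted labels and the computed labels over all label permutations
    n = len(pred)
    return min(sum(p[a] != b for a, b in zip(pred, comp)) / n
               for p in permutations(range(k)))

def normalized_stability(cost, k):
    # eq. (6), with the asymptotic random-predictor cost 1 - 1/k as baseline
    return cost / (1.0 - 1.0 / k)

# the same partition up to a label swap has zero stability cost after matching
pred = [0, 0, 1, 1, 2, 2]
comp = [1, 1, 0, 0, 2, 2]
cost = matched_disagreement(pred, comp, 3)   # 0.0
```

This reflects the point made above: the minimization over permutations happens per comparison, so structurally identical partitions with permuted label indices incur no cost.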
In order to estimate the stability, we propose the following resampling scheme: iteratively split the data set into disjoint halves, and compare the solutions on these sets as defined above for the respective cases. After the model with the smallest value of S has been determined, train this model again on the whole data set to obtain the final result.

Note that it is necessary to split into disjoint subsets, because common points potentially increase the stability artificially. Furthermore, unlike in cross-validation, both sets must have the same size, because both are used as inputs to training algorithms. For semi-supervised and unsupervised learning, the comparison might entail predicting labels on a new set, and for the latter also minimizing over permutations of the labels.

4 Stability for Model Order Selection in Clustering: Experimental Results

We now provide experimental evidence for the usefulness of our approach to model order selection, which is one of the hardest model assessment problems. First, the algorithms are compared on toy data, in order to study the performance of the stability measure under well-controlled conditions. However, for real-world applications, it does not suffice to be better than competitors; one has to provide solutions which are reasonable within the framework of the application. Therefore, in a second experiment, the stability measure is compared to the other methods on the problem of clustering gene expression data.

Experiments are conducted using a deterministic annealing variant of k-means [12] and Path-Based Clustering [5] optimized via an agglomerative heuristic. For all data sets, the stability is averaged over the resampled splits for each candidate k; for the Gap Statistic and Clest, random samples are drawn from the baseline distribution.
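The resampling scheme of section 3 can be sketched end-to-end: repeatedly split the data into disjoint halves, cluster each half, predict labels for the second half from the first half's solution, match label permutations, and average the normalized cost. This is a minimal illustration with a plain k-means (farthest-point initialization) and nearest centroid prediction; all helper names and parameter values are ours, not prescriptions from the paper:

```python
import math, random
from itertools import permutations

def assign(centers, X):
    # nearest-center label for every point
    return [min(range(len(centers)), key=lambda c: math.dist(x, centers[c]))
            for x in X]

def kmeans(X, k, rng, iters=30):
    # plain Lloyd iterations; farthest-point init keeps the sketch reliable
    centers = [rng.choice(X)]
    while len(centers) < k:
        centers.append(max(X, key=lambda x: min(math.dist(x, c) for c in centers)))
    for _ in range(iters):
        labels = assign(centers, X)
        new_centers = []
        for c in range(k):
            pts = [x for x, l in zip(X, labels) if l == c]
            new_centers.append(tuple(sum(v) / len(pts) for v in zip(*pts))
                               if pts else centers[c])
        centers = new_centers
    return assign(centers, X), centers

def stability_cost(X, k, splits=10, seed=0):
    # average normalized disagreement over disjoint half-splits
    rng = random.Random(seed)
    costs = []
    for _ in range(splits):
        idx = list(range(len(X)))
        rng.shuffle(idx)
        half = len(X) // 2
        A = [X[i] for i in idx[:half]]
        B = [X[i] for i in idx[half:2 * half]]
        _, centers_A = kmeans(A, k, rng)     # solution on the first half
        labels_B, _ = kmeans(B, k, rng)      # solution on the second half
        pred_B = assign(centers_A, B)        # labels on B predicted from A
        d = min(sum(p[a] != b for a, b in zip(pred_B, labels_B)) / len(B)
                for p in permutations(range(k)))
        costs.append(d / (1 - 1 / k))        # normalize by the 1 - 1/k baseline
    return sum(costs) / len(costs)

# three well-separated point clouds; the smallest cost indicates the model order
rng = random.Random(1)
X = [(rng.gauss(cx, 0.3), rng.gauss(cy, 0.3))
     for cx, cy in [(0, 0), (6, 0), (3, 6)] for _ in range(30)]
scores = {k: stability_cost(X, k) for k in (2, 3, 4)}
```

With three well-separated clouds, the cost for k = 3 is close to zero, while k = 2 and k = 4 force merges or splits whose placement varies across resamples.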
For Clest and Prediction Strength, the number of resamples is chosen to be the same as for our method. For Prediction Strength, a fixed similarity threshold is used. As mentioned above, the nearest centroid classifier is employed for the purpose of prediction when using k-means, and a variant of the nearest neighbor classifier is used for Path-Based Clustering, which can be regarded as a combination of Minimum Spanning Tree clustering and Pairwise Clustering [5, 8].

We compare the proposed stability index of section 3 with the Gap Statistic, Clest and Tibshirani's Prediction Strength method1 using two toy data sets and a microarray data set taken from [7]. Table 1 summarizes the estimated number of clusters for each method.

1See section 2 for a brief overview of these techniques.

[Table 1: The estimated model orders for the two toy data sets and the microarray data set. Rows: 3 Gaussians; 3 Rings with k-means; 3 Rings with Path-Based Clustering; Golub et al. data. Columns: Stability Method, Gap Statistic, Clest, Prediction Strength, \u201ctrue\u201d number of clusters. The individual estimates are discussed in the text.]

Toy Data Sets The first data set consists of three fairly well separated point clouds generated from three Gaussian distributions. Note that for some k (see figure 1(a)), the variance in the stability over different resamples is quite high. This effect is due to model mismatch: for such k, the clustering of the three classes depends highly on the subset selected in the resampling. This means that, besides the absolute value of the stability costs, additional information about the fit can be obtained from the distribution of the stability costs over the resampled subsets. For this data set, all methods under comparison are able to infer the \u201ctrue\u201d number of clusters k = 3. Figures 1(d) and 1(a) show the clustered data set and the proposed stability index. For k = 2, the stability is relatively high, which is due to the hierarchical structure of the data set, which enables stable merging of the two smaller sub-clusters.

In the ring data set (depicted in figures 1(e) and 1(f)), one can naturally distinguish three ring-shaped clusters that violate the modeling assumptions of k-means, since the clusters are not spherically distributed. Here, k-means identifies the inner circle as one cluster, and the stability for the corresponding number of clusters is highest (figure 1(b)). All other methods except Clest infer a different model order for this data set with k-means. Applying the proposed stability estimator with Path-Based Clustering to the same data set yields the highest stability for k = 3, the \u201ccorrect\u201d number of clusters (figures 1(f) and 1(c)). Here, all other methods fail. The Gap Statistic fails because it directly incorporates the assumption of spherically distributed data.
Similarly, the Prediction Strength measure and Clest (in the form used here) employ classifiers that only support linear decision boundaries, which obviously cannot discriminate between the three ring-shaped clusters. In all these cases, a basic requirement for a validation scheme is violated, namely that it must not incorporate additional assumptions about the group structure of a data set beyond those of the clustering principle employed. Apart from that, it is noteworthy that the stability with k-means is significantly worse than the one achieved with Path-Based Clustering, which indicates that the latter is the better choice for this data set.

Application to Microarray Data Recently, several authors have investigated the possibility of identifying novel tumor classes based solely on gene expression data [7, 2, 1]. Golub et al. [7] studied the problem of classifying and clustering acute leukemias. The important question of inferring an appropriate model order remains unaddressed in their article, and prior knowledge is used instead. In practice, however, such knowledge is often not available.

Acute leukemias can be roughly divided into two groups, acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL), where the latter can be further subdivided into B-cell ALL and T-cell ALL. Golub et al. used a data set of 72 leukemia samples (25 AML, 47 ALL, of which 38 are B-cell ALL samples).2 For each sample, gene expression was monitored using Affymetrix expression arrays.

We apply the same preprocessing steps as Golub et al., resulting in a data set consisting of 3571 genes and 72 samples. For the purpose of cluster analysis, the feature set was additionally reduced by retaining only the 100 genes with the highest variance across samples. This step is adopted from [6]. The final data set consists of 100 genes and 72 samples.
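The variance-based gene filtering step just described can be sketched in a few lines; the function name and the tiny example matrix are ours, purely for illustration:

```python
def top_variance_features(rows, m):
    # rows: samples x genes; keep the m genes with the highest variance
    # across samples, preserving the original gene order
    cols = list(zip(*rows))
    def var(v):
        mu = sum(v) / len(v)
        return sum((x - mu) ** 2 for x in v) / len(v)
    keep = sorted(sorted(range(len(cols)), key=lambda j: var(cols[j]),
                         reverse=True)[:m])
    return [[r[j] for j in keep] for r in rows]

# toy expression matrix: 3 samples, 3 genes; only the middle gene varies much
data = [[1.0, 5.0, 2.0], [1.1, -4.0, 2.2], [0.9, 6.0, 1.8]]
filtered = top_variance_features(data, 1)
```

On the toy matrix, only the high-variance middle gene survives the filter, mirroring the reduction from 3571 to 100 genes described above.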
We have performed cluster analysis using k-means and the nearest centroid rule. Figure 2 shows the corresponding stability curve. For k = 3, we estimate the highest stability.

Figure 1: Results of the stability index on the toy data (see section 4). (a) The stability index for the Gaussians data set with k-means. (b) The stability index for the three-ring data set with k-means clustering. (c) The stability index for the three-ring data set with Path-Based Clustering. (d) Clustering solution on the full Gaussians data set. (e) Clustering solution on the full three-ring data set with k-means. (f) Clustering solution on the full three-ring data set with Path-Based Clustering.

2Available at http://www-genome.wi.mit.edu/cancer/
We expect that clustering with k = 3 separates the AML, B-cell ALL and T-cell ALL samples from each other. With respect to the known ground-truth labels, about 92% of the samples (66 of 72) are correctly classified (the Hungarian method is used to map the clusters to the ground-truth classes). Of the competitors, only Clest is able to infer the \u201ccorrect\u201d number of clusters k = 3, while the Gap Statistic largely overestimates the number of clusters. The Prediction Strength method does not provide a reasonable result either. Note that for k = 2 a similar stability is achieved. We cluster the data set again for k = 2 and compare the result with the ALL \u2013 AML labeling of the data. Here, about 86% of the samples (62 of 72) are correctly identified. We conclude that our method is able to infer biologically relevant model orders. At the same time, a k is suggested that leads to high accuracy w.r.t. the ground-truth. Hence, our re-analysis demonstrates that we could have recovered a biologically meaningful grouping in a completely unsupervised manner.

Figure 2: Resampled stability for the leukemia dataset vs. number of classes (see sec. 4).

5 Conclusion

The problem of model assessment was addressed in this paper. The goal was to derive a common framework for practical assessment of learning models. Starting from the definition of a stability measure in the context of supervised learning, this measure was generalized to semi-supervised and unsupervised learning. The experiments concentrated on model order selection for unsupervised learning, because this is the area where the need for widely applicable model assessment strategies is highest.
On toy data, the stability measure outperforms other techniques when their respective modeling assumptions are violated. On real-world data, the stability measure compares favorably to the best of the competitors.

Acknowledgments. This work has been supported by the German Research Foundation (DFG), grants #Buh 914/4, #Buh 914/5.

References

[1] A. A. Alizadeh et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403:503 \u2013 511, 2000.

[2] M. Bittner et al. Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature, 406(3):536 \u2013 540, 2000.

[3] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499\u2013526, 2002.

[4] J. Breckenridge. Replicating cluster analysis: Method, consistency and validity. Multivariate Behavioral Research, 1989.

[5] B. Fischer, T. Z\u00f6ller, and J. M. Buhmann. Path based pairwise data clustering with application to texture segmentation. In Energy Minimization Methods in Computer Vision and Pattern Recognition, LNCS. Springer Verlag, 2001.

[6] J. Fridlyand and S. Dudoit. Applications of resampling methods to estimate the number of clusters and to improve the accuracy of a clustering method. Technical Report 600, Statistics Department, UC Berkeley, September 2001.

[7] T. R. Golub et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, pages 531 \u2013 537, October 1999.

[8] T. Hofmann and J. M. Buhmann. Pairwise data clustering by deterministic annealing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1), January 1997.

[9] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice-Hall, Inc., 1988.

[10] M. J. Kearns and D. Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation.
In Computational Learning Theory, pages 152\u2013162, 1997.

[11] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83\u201397, 1955.

[12] K. Rose, E. Gurewitz, and G. C. Fox. A deterministic annealing approach to clustering. Pattern Recognition Letters, 11(9):589 \u2013 594, 1990.

[13] R. Tibshirani, G. Walther, D. Botstein, and P. Brown. Cluster validation by prediction strength. Technical report, Statistics Department, Stanford University, September 2001.

[14] R. Tibshirani, G. Walther, and T. Hastie. Estimating the number of clusters via the gap statistic. Technical report, Statistics Department, Stanford University, March 2000.
", "award": [], "sourceid": 2139, "authors": [{"given_name": "Tilman", "family_name": "Lange", "institution": null}, {"given_name": "Mikio", "family_name": "Braun", "institution": null}, {"given_name": "Volker", "family_name": "Roth", "institution": null}, {"given_name": "Joachim", "family_name": "Buhmann", "institution": null}]}