{"title": "Feature Selection in Clustering Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 473, "page_last": 480, "abstract": "", "full_text": "Feature Selection in Clustering Problems

Volker Roth and Tilman Lange

ETH Zurich, Institut f. Computational Science
Hirschengraben 84, CH-8092 Zurich
Tel: +41 1 6323179
{vroth, tilman.lange}@inf.ethz.ch

Abstract

A novel approach to combining clustering and feature selection is presented. It implements a wrapper strategy for feature selection, in the sense that the features are directly selected by optimizing the discriminative power of the used partitioning algorithm. On the technical side, we present an efficient optimization algorithm with a guaranteed local convergence property. The only free parameter of this method is selected by a resampling-based stability analysis. Experiments with real-world datasets demonstrate that our method is able to infer both meaningful partitions and meaningful subsets of features.

1 Introduction

The task of selecting relevant features in classification problems can be viewed as one of the most fundamental problems in the field of machine learning. A major motivation for selecting a subset of features from which a learning rule is constructed is the interest in sparse and interpretable rules, emphasizing only a few relevant variables. In supervised learning scenarios, feature selection has been studied widely in the literature. The methods used can be subdivided into filter methods and wrapper methods. The main difference is that a wrapper method makes use of the classifier, while a filter method does not.
From a conceptual viewpoint, wrapper approaches are clearly advantageous, since the features are selected by optimizing the discriminative power of the finally used classifier.

Selecting features in unsupervised learning scenarios is a much harder problem, due to the absence of class labels that would guide the search for relevant information. Problems of this kind have rarely been studied in the literature; for exceptions see e.g. [1, 9, 15]. The common strategy of most approaches is the use of an iterated stepwise procedure: in the first step a set of hypothetical partitions is extracted (the clustering step), and in the second step features are scored for relevance (the relevance determination step). A possible shortcoming is the "ad hoc" manner in which these two steps are combined: usually the relevance determination mechanism implements a filter approach and does not take into account the properties of the clustering method used. Usual scoring methods make an implicit independence assumption and ignore feature correlations. It is thus of particular interest to combine wrapper selection strategies and clustering methods. The approach presented in this paper can be viewed as a method of this kind. It combines a Gaussian mixture model with a Bayesian feature selection principle. The usual combinatorial problems involved with wrapper approaches are overcome by using a Bayesian marginalization mechanism. We present an efficient optimization algorithm for our model with guaranteed convergence to a local optimum.

The only free model parameter is selected by a resampling-based stability analysis. The problem of many ambiguous and equally high-scoring splitting hypotheses, which seems to be an inherent shortcoming of many other approaches, is successfully overcome.
A comparison with ground-truth labels in control experiments indicates that the selected models induce sample clusters and feature subsets which both provide a clear interpretation.

Our approach to combining clustering and feature selection is based on a Gaussian mixture model, which is optimized by way of the classical expectation-maximization (EM) algorithm. In order to incorporate the feature selection mechanism, the M-step is first reformulated as a linear discriminant analysis (LDA) problem, which makes use of the "fuzzy labels" estimated in the preceding E-step. We then use the well-known identity of LDA and linear regression to restate the M-step in a form which easily allows us to regularize the estimation problem by specifying a prior distribution over the regression coefficients. This distribution has the functional form of an Automatic Relevance Determination (ARD) prior. For each regression coefficient, the ARD prior contains a free hyperparameter, which encodes the "relevance" of the corresponding variable in the linear regression. In a Bayesian marginalization step, these hyperparameters are then integrated out. We finally arrive at an M-step with an integrated feature selection mechanism.

2 Clustering and Bayesian relevance determination

Gaussian mixtures and LDA. The dataset is given as a collection of N samples x_i ∈ R^d. For the purpose of finding clusters, consider now a Gaussian mixture model with 2 mixture components which share an identical covariance matrix Σ. Under this model, the log-likelihood for the dataset reads

l_mix = sum_{i=1}^{N} log( sum_{ν=1}^{2} π_ν φ(x_i; μ_ν, Σ) ),    (1)

where the mixing proportions π_ν sum to one, and φ denotes a Gaussian density.
The classical EM algorithm [2] provides a convenient method for maximizing l_mix:

E-step: set p_{ηi} = Prob(x_i ∈ class η) = π_η φ(x_i; μ_η, Σ) / sum_{ν=1}^{2} π_ν φ(x_i; μ_ν, Σ).

M-step: set μ_ν = ( sum_{i=1}^{N} p_{νi} x_i ) / ( sum_{i=1}^{N} p_{νi} ); Σ = (1/N) sum_{ν=1}^{2} sum_{i=1}^{N} p_{νi} (x_i − μ_ν)(x_i − μ_ν)^T.

The likelihood equations in the M-step can be viewed as weighted mean and covariance maximum likelihood estimates in a weighted and augmented problem: one replicates the N observations 2 times, with the ν-th such replication having observation weights p_{νi}. In [5] it is proven that the M-step can be carried out via a weighted and augmented linear discriminant analysis (LDA). Following [6], any LDA problem can be restated as an optimal scoring problem. Let the class memberships of the N data vectors be coded as a matrix Z, the (i, ν)-th entry of which equals one if the i-th observation belongs to class ν. The point of optimal scoring is to turn categorical variables into quantitative ones: the score vector θ assigns the real number θ_ν to the entries in the ν-th column of Z. The simultaneous estimation of scores and regression coefficients β constitutes the optimal scoring problem: minimize

M(θ, β) = ||Zθ − Xβ||₂²    (2)

under the constraint (1/N)||Zθ||₂² = 1. The notation ||·||₂² stands for the squared ℓ2-norm, and X denotes the (centered) data matrix of dimension N × d.
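As an illustration, the E- and M-steps displayed above can be sketched for the 2-component, shared-covariance case as follows. This is a minimal numpy sketch with an arbitrary deterministic initialization, not the authors' implementation; the function name is illustrative:

```python
import numpy as np

def em_shared_covariance(X, n_iter=50):
    """EM for a 2-component Gaussian mixture with one shared covariance Sigma."""
    N, d = X.shape
    mu = np.stack([X.min(axis=0), X.max(axis=0)])   # crude deterministic init
    Sigma = np.cov(X.T) + 1e-6 * np.eye(d)          # shared covariance estimate
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: posterior membership probabilities p_{nu,i}
        inv = np.linalg.inv(Sigma)
        logdet = np.linalg.slogdet(Sigma)[1]
        P = np.empty((2, N))
        for nu in range(2):
            diff = X - mu[nu]
            maha = np.einsum('ij,jk,ik->i', diff, inv, diff)
            P[nu] = pi[nu] * np.exp(-0.5 * (maha + logdet + d * np.log(2 * np.pi)))
        P /= P.sum(axis=0, keepdims=True)
        # M-step: weighted means, mixing proportions, pooled weighted covariance
        w = P.sum(axis=1)
        pi = w / N
        mu = (P @ X) / w[:, None]
        Sigma = sum((P[nu] * (X - mu[nu]).T) @ (X - mu[nu]) for nu in range(2)) / N
    return P.argmax(axis=0), mu, Sigma
```

The final hard assignment follows the paper's rule of picking the class ν with highest membership probability p_{νi}.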
In [6] an algorithm for carrying out this optimization has been proposed, whose main ingredient is a linear regression of the data matrix X against the scored indicator matrix Zθ.

Returning from a standard LDA problem to the above weighted and augmented problem, it turns out that it is not necessary to explicitly replicate the observations: the optimal scoring version of LDA allows an implicit solution of the augmented problem that still uses only N observations. Instead of using a response indicator matrix Z, a blurred response matrix Z̃ is employed, whose rows consist of the current class probabilities for each observation. At each M-step this Z̃ enters in the linear regression, see [5]. After iterated application of the E- and M-steps, an observation x_i is finally assigned to the class ν with highest probability of membership p_{νi}. Note that the EM iterations converge to a local maximum.

LDA and Automatic Relevance Determination. We now focus on incorporating the automatic feature selection mechanism into the EM algorithm. According to [6], the 2-class LDA problem in the M-step can be solved by the following algorithm:

1. Choose an initial vector of scores θ₀ which satisfies N⁻¹ θ₀ᵀ Z̃ᵀ Z̃ θ₀ = 1 and is orthogonal to a k-vector of ones, 1_k. Set θ* = Z̃θ₀;

2. Run a linear regression of X on θ*: θ̂* = X(XᵀX)⁻¹Xᵀθ* ≡ Xβ.

The feature selection mechanism can now be incorporated in the M-step by imposing a certain constraint on the linear regression. In [6, 4] it has been proposed to use a ridge-type penalized regression.
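The two-step scoring-and-regression procedure above rests on the identity between LDA and linear regression. The following sketch illustrates it for the 2-class case: it regresses centered data X on a scored (possibly blurred) indicator matrix and recovers a discriminant direction. The function name and the particular choice of θ₀ are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def scoring_direction(X, Z):
    """2-class optimal scoring: regress centered data X on scored responses Z @ theta.

    Z may be a hard 0/1 indicator matrix or a 'blurred' matrix of class probabilities."""
    N = X.shape[0]
    n = Z.sum(axis=0)
    theta = np.array([1.0 / n[0], -1.0 / n[1]])      # orthogonal to the vector of ones
    scale = np.sqrt(N / (theta @ Z.T @ Z @ theta))   # enforce N^{-1} theta' Z'Z theta = 1
    t = Z @ (scale * theta)                          # scored responses theta^*
    beta, *_ = np.linalg.lstsq(X, t, rcond=None)     # the regression step
    return beta
```

For two classes, the resulting β is collinear with the classical Fisher direction Σ_W⁻¹(μ₁ − μ₂), which is exactly why the M-step can be phrased as a regression.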
Taking a Bayesian perspective, such a ridge-type penalty can be interpreted as introducing a spherical Gaussian prior over the coefficients: p(β) = N(0, λ⁻¹I). The main idea of incorporating an automatic feature selection mechanism consists of replacing the Gaussian prior with an automatic relevance determination (ARD) prior¹ of the form

p(β | ϑ) = Π_i N(0, ϑ_i⁻¹) ∝ exp[−(1/2) sum_i ϑ_i β_i²].    (3)

In this case, each coefficient β_i has its own prior variance ϑ_i⁻¹. Note that in the above ARD framework only the functional form of the prior (3) is fixed, whereas the parameters ϑ_i, which encode the "relevance" of each variable, are estimated from the data. In [3] the following Bayesian inference procedure for the prior parameters has been introduced: given exponential hyperpriors over the variances τ_i = ϑ_i⁻¹ (which must be nonnegative), p(τ_i) = (γ/2) exp{−γτ_i/2}, one can analytically integrate out the hyperparameters from the prior distribution over the coefficients β_i:

p(β_i) = ∫₀^∞ p(β_i | τ_i) p(τ_i) dτ_i = (√γ/2) exp{−√γ |β_i|}.    (4)

Switching to the maximum a posteriori (MAP) solution in log-space, this marginalization directly leads us to the following penalized functional:

M(θ, β) = ||Z̃θ − Xβ||₂² + λ̃||β||₁,    (5)

where λ̃ ≡ √γ has the role of a Lagrange parameter in the ℓ1-constrained problem: minimize ||Z̃θ − Xβ||₂² subject to ||β||₁ < κ.
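For completeness, the integral in (4) is the classical Gaussian scale-mixture identity. Assuming the exponential hyperprior is placed on the variance τ_i = ϑ_i⁻¹, it evaluates as:

```latex
p(\beta_i)
  = \int_0^\infty \frac{1}{\sqrt{2\pi\tau_i}}
    \exp\!\Bigl(-\frac{\beta_i^2}{2\tau_i}\Bigr)\,
    \frac{\gamma}{2}\exp\!\Bigl(-\frac{\gamma\tau_i}{2}\Bigr)\, d\tau_i
  = \frac{\sqrt{\gamma}}{2}\,
    \exp\!\bigl(-\sqrt{\gamma}\,\lvert\beta_i\rvert\bigr),
```

using the standard formula ∫₀^∞ τ^{−1/2} exp(−a/τ − bτ) dτ = √(π/b) e^{−2√(ab)} with a = β_i²/2 and b = γ/2. Taking −log p(β) then yields the ℓ1 penalty √γ ||β||₁ up to an additive constant, which is the functional (5).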
In the statistical literature, this model is known as the Least Absolute Shrinkage and Selection Operator (LASSO) model [14].

Returning to equation (3), we are now able to interpret the LASSO estimate as a Bayesian feature selection principle: for the purpose of feature selection, we would like to estimate the value of a binary selection variable S for each feature: S_i equals one if the i-th feature is considered relevant for the given task, and zero otherwise. Taking into account feature correlations, estimation of S_i necessarily involves searching the space of all possible subsets of features containing the i-th one. In the Bayesian ARD formalism, this combinatorial explosion of the search space is overcome by relaxing the binary selection variable to a positive real-valued variance of a Gaussian prior over each component of the coefficient vector. Following the Bayesian inference principle, we introduce hyperpriors and integrate out these variances, and we finally arrive at the ℓ1-constrained LASSO problem.

Optimizing the final model. Since space here precludes a detailed discussion of ℓ1-constrained regression problems, the reader is referred to [12], where a highly efficient algorithm with guaranteed global convergence has been proposed. Given this global convergence in the M-step, for the EM model we can guarantee convergence to a local maximum of the constrained likelihood. Consider two cases: (i) the unconstrained solution is feasible. In this case our algorithm simply reduces to the standard EM procedure, for which it is known that in every iteration the likelihood monotonically increases; (ii) the ℓ1-constraint is active. Then, in every iteration the LASSO algorithm maximizes the likelihood within the feasible region of β-values defined by ||β||₁ < κ.
The likelihood cannot decrease in further stages of the iteration, since any solution β found in a preceding iteration is also a valid solution for the actual problem (note that κ is fixed!). In this case, the algorithm has converged to a local maximum of the likelihood within the constraint region.

¹For an introduction to the ARD principle the reader is referred to [10].

3 Model selection

Our model has only one free parameter, namely the value of the ℓ1-constraint κ. In the following we describe a method for selecting κ by observing the stability of data partitions. For each of the partitions which we have identified as "stable", we then examine the fluctuations involved in the feature selection process. It should be noticed that the concept of measuring the stability of solutions as a means of model selection has been successfully applied to several unsupervised learning problems, see e.g. [8, 11].

We will usually find many potential splits of a dataset, depending on how many features are selected: if we select only one feature, we are likely to find many competing hypotheses for splits, since most of the feature vectors usually vote for a different partition. If, on the other hand, we select too many features, we face the usual problems of finding structure in high-dimensional datasets: the functional which we want to optimize will have many local optima, and with high probability the EM algorithm will find suboptimal solutions.
Between these two extremes, we can hope to find relatively stable splits, which are robust against noise and also against inherent instabilities of the optimization method.

To obtain a quantitative measure of stability, we propose the following procedure: run the class discovery method once, corrupt the data vectors by a small amount of noise, repeat the grouping procedure, and calculate the Hamming distance between the two partitions as a measure of (in-)stability. For computing Hamming distances, the partitions are viewed as vectors containing the cluster labels. Simply taking the average stability over many such two-sample comparisons, however, would not allow an adequate handling of situations where there are two equally likely stable solutions, of which the clustering algorithm randomly selects one. In such situations, the averaged stability will be very low, despite the fact that there exist two stable splitting hypotheses. This problem can be overcome by looking for compact clusters of highly similar partitions, leading to the following algorithm:

Algorithm for identifying stable partitions: for different values of the ℓ1-constraint κ do
(i) compute m noisy replications of the data
(ii) run the class discovery algorithm for each of these datasets
(iii) compute the m × m matrix of pairwise Hamming distances between all partitions
(iv) cluster the partitions into compact groups and score the groups by their frequency
(v) select dominant groups of partitions and choose representative partitions

In step (i) a "suitable" noise level must be chosen a priori. In our experiments we make use of the fact that we have normalized the input data to have zero mean and unit variance. Given this normalization, we then add Gaussian noise with 5% of the total variance in the dataset, i.e. σ² = 0.05. In step (iii) we use Hamming distances as a dissimilarity measure between partitions.
In order to make Hamming distances suitable for this purpose, we have to consider the inherent permutation symmetry of the clustering process: a cluster called "1" in the first partition can be called "2" in the second one. When computing the pairwise Hamming distances, we thus have to minimize over the two possible permutations of cluster labels. Steps (iv) and (v) need some further explanation: the problem of identifying compact groups in datasets which are represented by pairwise distances can be solved by optimizing the pairwise clustering cost function [7]. We iteratively increase the number of clusters (which is a free parameter in the pairwise clustering functional) until the average dissimilarity in each group does not exceed a predefined threshold. Reasonable problem-specific thresholds can be defined by considering the following null model: given N samples, the average Hamming distance between two randomly drawn 2-partitions P1 and P2 is roughly d_Hamming(P1, P2) ≈ N/2. It may thus be reasonable to consider only clusters which are several times more homogeneous than the expected null-model homogeneity (in the experiments we have set this threshold to 10 times the null-model homogeneity).

For the clusters which are considered homogeneous, we observe their populations, and out of all models investigated we choose the one leading to the partition cluster of largest size. For this dominating cluster, we then select a prototypical partition. For selecting such prototypical partitions in pairwise clustering problems, we refer the reader to [13], where it is shown that the pairwise clustering problem can be equivalently restated as a k-means problem in a suitably chosen embedding space. Each partition is represented as a vector in this space. This property allows us to select as representatives those partitions which are closest to the partition cluster centroids.
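The permutation-minimized Hamming distance between two label vectors described above can be sketched as follows (plain Python; the function name is illustrative, and the general k-class minimization is our generalization of the paper's 2-class case):

```python
from itertools import permutations

def partition_hamming(a, b, k=2):
    """Hamming distance between label vectors a and b, minimized over
    the k! possible permutations of cluster labels."""
    assert len(a) == len(b)
    return min(
        sum(1 for x, y in zip(a, b) if perm[x] != y)
        for perm in permutations(range(k))
    )
```

For two random 2-partitions of N samples this distance is roughly N/2 on average, which is the null-model value used to set the homogeneity threshold.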
The whole work-flow of model selection is summarized schematically in figure 1.

Figure 1: Model selection: schematic work-flow for one fixed value of the ℓ1-constraint κ. (The schematic shows noisy resamples 1-100 of samples 1-n, the resulting partitions, their dissimilarity matrix of pairwise Hamming distances, the embedding and clustering of the partitions, and a histogram of the cluster populations over the cluster index.)

4 Experiments

Clustering USPS digits. In a first experiment we test our method on the task of clustering digits from the USPS handwritten digits database. Sample images are shown in figure 2.

Figure 2: Sample images of digits '6' and '7' from the USPS database.

The 16 × 16 gray-value images of the digits are treated as 256-dimensional vectors. For this experiment, we extracted a subset of 200 images, consisting of randomly selected digits '6' and '7'. Based on this dataset, we first selected the most stable model according to the model selection procedure described in section 3.
We observed the stability of the solutions for different constraint values κ on the interval [0.7, 1.8] with a step-size of 0.1.

Figure 3: Model selection: three different choices of the ℓ1-constraint κ (left: κ = 0.7, on average 2.3 features; middle: κ = 1.0, 18.7 features; right: κ = 1.8, 34.6 features). The histograms show the relative population of partition clusters. The solid line indicates the average pairwise Hamming distance between partitions (divided by 100).

Figure 3 shows exemplary outcomes of the stability analysis: in the left panel, the solution is so highly constrained that on average only 2.3 features (pixels) are selected. One can see that the solutions are rather unstable. Subsets of only two features seem to be too small for building a consistent splitting hypothesis. Even the most populated partition cluster (index 3) contains only 30% of all partitions. If, on the other hand, the constraint is relaxed too far, we also arrive at the very unstable situation depicted in the right panel: for κ = 1.8, on average 34.6 pixels are selected. Optimizing the model in this 35-dimensional feature space seems to be difficult, probably because the EM algorithm is often trapped in suboptimal local optima. In between these models, however, we find a highly stable solution for κ = 1.0 in moderate dimensions (on average 18.7 features), see the middle panel. In this case, the dominating partition cluster (cluster no.
1 in the histogram) contains almost 75% of all partitions.

Having selected the optimal model parameter κ = 1.0, in a next step we select the representative partition (the one nearest to the centroid) of the dominating partition cluster (no. 1 in the middle panel of figure 3). This partition splits the dataset into two clusters, which highly agree with the true labeling. In the upper part of figure 4, both the inferred labels and the true labels are depicted by horizontal bar diagrams. Only three samples out of 200 are mislabeled (the rightmost three samples). The lower panel of this figure shows several rows, each of which represents one automatically selected feature. Each of the 200 grey-value coded pixel blocks in a row indicates the feature value for one sample. For a better visualization, the features (rows) are permuted according to either high values (black) or low values (white) for one of the two clusters.

Figure 4: Optimal model: representative partition. Upper horizontal bar: true labels of the 200 samples (black = '6', grey = '7'). Lower bar: inferred labels. Lower panel: each row consists of grey-value coded values of the selected features for all samples (1 pixel block = 1 sample).

We are not only interested in the stability of splittings of the dataset, but also in the stability of the feature selection process. In order to quantify this latter stability, we return to the dominating partition cluster no. 1 in the middle panel of figure 3, and for each of the 73 partitions in this cluster, we count how often a particular feature has been selected. The 22 features (pixels) which are selected in at least one half of the partitions are plotted in the second panel of figure 5. The selection stability is grey-value coded (black = 100% stable). To the left and to the right we have again plotted two typical sample images of both classes from the database.
A comparison with the selected features leads us to the conclusion that we were not only able to find reasonable clusters, but we also selected exactly those discriminative features which we would have expected in this control experiment. In the rightmost panel, we have also plotted one of the three mislabeled '7's which has been assigned to the '6' cluster.

Figure 5: From left to right: First: a typical '6'. Second: automatically extracted features. Third: a typical '7'. Fourth: one of the three mislabeled '7's.

Clustering faces. In a second experiment we applied our method to the problem of clustering face images. From the Stirling Faces database (http://pics.psych.stir.ac.uk/cgi-bin/PICS/New/pics.cgi) we selected all 68 grey-valued front views of faces and all 105 profile views. The images are rather inhomogeneous, since they show different persons with different facial expressions. Some sample images are depicted in figure 6. For a complete overview of the whole image set, we refer the reader to our supplementary web page http://www.cs.uni-bonn.de/~roth/FACES/split.html, where all images can be viewed in higher quality.

Figure 6: Example images from the Stirling Faces database.

Since it appears to be infeasible to work directly on the set of pixels of the high-resolution images, in a first step we extracted the 10 leading eigenfaces of the total dataset (eigenfaces are simply the eigenvectors v_i of the images treated as pixel-wise vectorial objects). These eigenfaces are depicted in figure 7. We then applied our method to these image objects, which are represented as 10-dimensional vectors.
Note that the original images I_j can be (partially) reconstructed from this truncated eigenvector expansion as I'_j = sum_{i=1}^{10} v_i v_iᵀ I_j (assuming the image vectors I_j to be centered).

Figure 7: First 10 leading eigenfaces and their relative eigenvalues (1.0, 0.46, 0.28, 0.17, 0.16, 0.15, 0.13, 0.1, 0.09, 0.08).

We again start our analysis by selecting an optimal model. Figure 8 depicts the outcome of the model selection procedure. The left panel shows both the number of extracted features and the relative population of the largest partition cluster for different values of κ. The most stable model is obtained for κ = 1.0. On average, 3.04 features (eigenfaces) have been selected. A detailed analysis of the selected features within the dominating partition cluster (no. 5 in the right panel) shows that the eigenfaces no. 2, 3 and 7 are all selected with a stability of more than 98%. It is interesting to notice that the leading eigenface no. 1, with the distinctly largest eigenvalue, has not been selected.

Figure 8: Model selection. Left: average number of selected features and relative population of the dominating partition cluster (×10) vs. κ. Right: partition clusters for the optimal model with κ = 1.

In every M-step of our algorithm, a linear discriminant analysis is performed, in which a weight vector β for all features is computed (due to the incorporated feature selection mechanism, most weights will be exactly zero). For a given partition of the objects, the linear combination of the eigenface features induced by this weight vector is known as the Fisherface.
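The truncated eigenvector reconstruction I'_j = sum_{i=1}^{k} v_i v_iᵀ I_j used above can be sketched as follows (a numpy sketch; computing the leading eigenvectors via an SVD of the centered data matrix is our implementation choice, not taken from the paper):

```python
import numpy as np

def truncated_reconstruction(I, k):
    """Reconstruct vectorized images (rows of I) from the k leading eigenvectors
    of their covariance: I'_j = sum_{i<=k} v_i v_i^T I_j, with I centered."""
    Ic = I - I.mean(axis=0)                        # center the image vectors
    _, _, Vt = np.linalg.svd(Ic, full_matrices=False)
    V = Vt[:k].T                                   # d x k matrix of leading eigenvectors
    return Ic @ V @ V.T                            # project onto the span and map back
```

With k equal to the intrinsic rank of the (centered) data, the reconstruction is exact; smaller k gives the partial reconstruction used for the eigenface features.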
Our method can thus be interpreted as a clustering method that finds a partition and simultaneously produces a "sparse" Fisherface which consists of a linear combination of the most discriminative eigenfaces. Figure 9 shows the derived Fisherface, reconstructed from the weight vector of the representative partition (no. 5 in the right panel of figure 8). Note that there are only 3 nonzero weights: β₂ = 0.8, β₃ = 0.05 and β₇ = 0.2.

Figure 9: The inferred Fisherface as a linear combination of 3 eigenfaces: 0.8 × eigenface 2 + 0.05 × eigenface 3 + 0.2 × eigenface 7 = Fisherface.

The representative partition of the dominating cluster (no. 5 in the right panel of figure 8) splits the images into two groups, which again highly coincide with the original groups of frontal and profile faces. Only 7 out of all 173 images are mislabeled w.r.t. this "ground-truth" labeling. The success of the clustering method can be understood by reconstructing the original images from the inferred Fisherface (which is nothing but a weighted and truncated eigenvector reconstruction of the original images). Figure 10 shows the same images as in figure 6, this time, however, reconstructed from the Fisherface. For better visualization, all images are rescaled to the full range of 255 grey values. One can see the clear distinction between frontal and profile faces, which mainly results from different signs of the projections of the images on the Fisherface. Again, the whole set of reconstructed images can be viewed on our supplementary material web page in higher quality.

Figure 10: Images from figure 6, reconstructed from the Fisherface.

5 Conclusions

The problem tackled in this paper consists of simultaneously clustering objects and automatically extracting subsets of features which are most discriminative for this object partition.
Some approaches have been proposed in the literature, most of which, however, bear several inherent shortcomings, such as an unclear probabilistic model, the simplifying assumption of features as being uncorrelated, or the absence of a plausible model selection strategy. The latter issue is of particular importance, since many approaches seem to suffer from ambiguities caused by contradictory splitting hypotheses. In this work we have presented a new approach which has the potential to overcome these shortcomings. It has a clear interpretation in terms of a constrained Gaussian mixture model, which combines a clustering method with a Bayesian inference mechanism for automatically selecting relevant features. We further present an optimization algorithm with guaranteed convergence to a local optimum. The model has only one free parameter, κ, for which we propose a stability-based model selection procedure. Experiments demonstrate that this method is able to correctly infer partitions and meaningful feature sets.

Our method currently only implements partitions of the object set into two clusters. For finding multiple clusters, we propose to iteratively split the dataset. Such iterative splits have been successfully applied to the problem of simultaneously clustering gene expression datasets and selecting relevant genes. Details on these biological applications of our method will appear elsewhere.

Acknowledgments. The authors would like to thank Joachim M. Buhmann for helpful discussions and suggestions.

References

[1] A. Ben-Dor, N. Friedman, and Z. Yakhini. Class discovery in gene expression data. In Procs. RECOMB, pages 31-38, 2001.

[2] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. B, 39:1-38, 1977.

[3] M. Figueiredo and A. K. Jain. Bayesian learning of sparse classifiers. In CVPR 2001, pages 35-41, 2001.

[4] T.
Hastie, A. Buja, and R. Tibshirani. Penalized discriminant analysis. Ann. Stat., 23:73-102, 1995.

[5] T. Hastie and R. Tibshirani. Discriminant analysis by Gaussian mixtures. J. R. Stat. Soc. B, 58:158-176, 1996.

[6] T. Hastie, R. Tibshirani, and A. Buja. Flexible discriminant analysis by optimal scoring. J. Am. Stat. Assoc., 89:1255-1270, 1994.

[7] T. Hofmann and J. Buhmann. Pairwise data clustering by deterministic annealing. IEEE Trans. Pattern Anal. Mach. Intell., 19(1):1-14, 1997.

[8] T. Lange, M. Braun, V. Roth, and J.M. Buhmann. Stability-based model selection. In Advances in Neural Information Processing Systems, volume 15, 2003. To appear.

[9] M.H. Law, A.K. Jain, and M.A.T. Figueiredo. Feature selection in mixture-based clustering. In Advances in Neural Information Processing Systems, volume 15, 2003. To appear.

[10] D.J.C. MacKay. Bayesian non-linear modelling for the prediction competition. In ASHRAE Transactions Pt. 2, volume 100, pages 1053-1062, Atlanta, Georgia, 1994.

[11] F. Meinecke, A. Ziehe, M. Kawanabe, and K.-R. Müller. Estimating the reliability of ICA projections. In Advances in Neural Information Processing Systems, volume 14, 2002.

[12] M. Osborne, B. Presnell, and B. Turlach. On the lasso and its dual. J. Comput. Graph. Stat., 9:319-337, 2000.

[13] V. Roth, J. Laub, J. M. Buhmann, and K.-R. Müller. Going metric: Denoising pairwise data. In Advances in Neural Information Processing Systems, volume 15, 2003. To appear.

[14] R.J. Tibshirani. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B, 58(1):267-288, 1996.

[15] A. v. Heydebreck, W. Huber, A. Poustka, and M. Vingron. Identifying splits with clear separation: a new class discovery method for gene expression data.
Bioinformatics, 17, 2001.", "award": [], "sourceid": 2486, "authors": [{"given_name": "Volker", "family_name": "Roth", "institution": null}, {"given_name": "Tilman", "family_name": "Lange", "institution": null}]}