{"title": "Hierarchical Clustering of a Mixture Model", "book": "Advances in Neural Information Processing Systems", "page_first": 505, "page_last": 512, "abstract": null, "full_text": " Hierarchical Clustering of a Mixture Model\n\n\n\n Jacob Goldberger Sam Roweis\n Department of Computer Science, University of Toronto\n {jacob,roweis}@cs.toronto.edu\n\n\n\n Abstract\n\n In this paper we propose an efficient algorithm for reducing a large\n mixture of Gaussians into a smaller mixture while still preserv-\n ing the component structure of the original model; this is achieved\n by clustering (grouping) the components. The method minimizes\n a new, easily computed distance measure between two Gaussian\n mixtures that can be motivated from a suitable stochastic model\n and the iterations of the algorithm use only the model parameters,\n avoiding the need for explicit resampling of datapoints. We demon-\n strate the method by performing hierarchical clustering of scenery\n images and handwritten digits.\n\n\n1 Introduction\n\nThe Gaussian mixture model (MoG) is a flexible and powerful parametric frame-\nwork for unsupervised data grouping. Mixture models, however, are often involved\nin other learning processes whose goals extend beyond simple density estimation to\nhierarchical clustering, grouping of discrete categories or model simplification. In\nmany such situations we need to group the Gaussians components and re-represent\neach group by a new single Gaussian density. This grouping results in a compact\nrepresentation of the original mixture of many Gaussians that respects the original\ncomponent structure in the sense that no original component is split in the reduced\nrepresentation. We can view the problem of Gaussian component clustering as gen-\neral data-point clustering with side information that points belonging to the same\noriginal Gaussian component should end up in the same final cluster. 
Several algorithms that perform clustering of data points given such constraints have recently been proposed [11, 5, 12]. In this study we extend these approaches to model-based rather than datapoint-based settings. Of course, one could always generate data by sampling from the model, enforcing the constraint that any two samples generated by the same mixture component must end up in the same final cluster. We show that if we already have a parametric representation of the constraint via the MoG density, there is no need for an explicit sampling phase to generate representative datapoints and their associated constraints.

In other situations we want to collapse a MoG into a mixture of fewer components in order to reduce computational complexity. One example is statistical inference in switching dynamic linear models, where performing exact inference with a MoG prior causes the number of Gaussian components representing the current belief to grow exponentially in time. One common solution to this problem is grouping the Gaussians according to their common history in recent timesteps and collapsing Gaussians grouped together into a single Gaussian [1]. Such a reduction, however, is not based on the parameters of the Gaussians. Other instances in which collapsing MoGs is relevant are variants of particle filtering [10], non-parametric belief propagation [7] and fault detection in dynamical systems [3]. 
A straightforward solution in these situations is first to draw samples from the original MoG and then to apply the EM algorithm to learn a reduced model; however, this is computationally inefficient and does not preserve the component structure of the original mixture.

2 The Clustering Algorithm

We assume that we are given a mixture density f composed of k d-dimensional Gaussian components:

    f(y) = ∑_{i=1}^k α_i N(y; μ_i, Σ_i) = ∑_{i=1}^k α_i f_i(y)    (1)

We want to cluster the components of f into a reduced mixture of m < k components. If we denote the set of all (d-dimensional) Gaussian mixture models with at most m components by MoG(m), one way to formalize the goal of clustering is to say that we wish to find the element g of MoG(m) \"closest\" to f under some distance measure. A common proximity criterion is the cross-entropy from f to g, i.e. ĝ = arg min_g KL(f‖g) = arg max_g ∫ f log g, where KL() is the Kullback-Leibler divergence and the minimization is performed over all g in MoG(m). This criterion leads to an intractable optimization problem; there is not even a closed-form expression for the KL-divergence between two MoGs, let alone an analytic minimizer over its second argument. Furthermore, minimizing a KL-based criterion does not preserve the original component structure of f. Instead, we introduce the following new distance measure between f = ∑_{i=1}^k α_i f_i and g = ∑_{j=1}^m β_j g_j:

    d(f, g) = ∑_{i=1}^k α_i min_{j=1,...,m} KL(f_i ‖ g_j)    (2)

which can be intuitively thought of as the cost of coding data generated by f under the model g, if all points generated by component i of f must be coded under a single component of g. Unlike the KL-divergence between two MoGs, this distance can be analytically computed. 
In particular, each term is a KL-divergence between two Gaussian distributions N(μ_1, Σ_1) and N(μ_2, Σ_2), which is given by:

    KL(N(μ_1, Σ_1) ‖ N(μ_2, Σ_2)) = (1/2) ( log(|Σ_2|/|Σ_1|) + Tr(Σ_2^{-1} Σ_1) + (μ_1 − μ_2)^T Σ_2^{-1} (μ_1 − μ_2) − d )

Under this distance, the optimal reduced MoG representation ĝ is the solution to the minimization of (2) over MoG(m): ĝ = arg min_g d(f, g). Although the minimization ranges over all of MoG(m), we prove that the optimal density ĝ is a MoG obtained by grouping the components of f into clusters and collapsing all Gaussians within a cluster into a single Gaussian. There is no closed-form solution for the minimization; rather, we propose an iterative algorithm that obtains a locally optimal solution. Denote the set of all m^k mappings from {1, ..., k} to {1, ..., m} by S. For each π ∈ S and g ∈ MoG(m) define:

    d(f, g, π) = ∑_{i=1}^k α_i KL(f_i ‖ g_{π(i)})    (3)

For a given g ∈ MoG(m), we associate a matching function π_g ∈ S:

    π_g(i) = arg min_{j=1,...,m} KL(f_i ‖ g_j) ,  i = 1, ..., k    (4)

It can easily be verified that:

    d(f, g) = d(f, g, π_g) = min_{π ∈ S} d(f, g, π)    (5)

i.e. π_g is the optimal mapping between the components of f and g. Using (5) to define our main optimization, we obtain the optimal reduced model as the solution of the following double minimization problem:

    ĝ = arg min_g min_{π ∈ S} d(f, g, π)    (6)

For m > 1 the double minimization (6) cannot be solved analytically. Instead, we can use alternating minimization to obtain a local minimum. Given a matching function π ∈ S, we define g^π ∈ MoG(m) as follows. For each j such that π^{-1}(j) is non-empty, define the following MoG density:

    f̄_j = ∑_{i ∈ π^{-1}(j)} α_i f_i / ∑_{i ∈ π^{-1}(j)} α_i    (7)

The mean and covariance of f̄_j, denoted μ̄_j and Σ̄_j, are:

    μ̄_j = (1/β̄_j) ∑_{i ∈ π^{-1}(j)} α_i μ_i ,  Σ̄_j = (1/β̄_j) ∑_{i ∈ π^{-1}(j)} α_i ( Σ_i + (μ_i − μ̄_j)(μ_i − μ̄_j)^T )

where β̄_j = ∑_{i ∈ π^{-1}(j)} α_i. Let g^π_j = N(μ̄_j, Σ̄_j) be the Gaussian distribution obtained by collapsing f̄_j into a single Gaussian. 
It satisfies:

    g^π_j = N(μ̄_j, Σ̄_j) = arg min_g KL(f̄_j ‖ g) = arg min_g d(f̄_j, g)

where the minimization is performed over all d-dimensional Gaussian densities. Denote the collapsed version of f according to π by g^π, i.e.:

    g^π = ∑_{j=1}^m β̄_j g^π_j    (8)

Lemma 1: Given a MoG f and a matching function π ∈ S, g^π is the unique minimum point of d(f, ·, π). More precisely, d(f, g^π, π) ≤ d(f, g, π) for all g ∈ MoG(m), and if d(f, g^π, π) = d(f, g, π) then g_j = g^π_j for all j = 1, ..., m, where g_j and g^π_j are the Gaussian components of g and g^π respectively.

Proof: Denote c = ∑_{i=1}^k α_i ∫ f_i log f_i (a constant independent of g).

    c − d(f, g, π) = ∑_{i=1}^k α_i ∫ f_i log g_{π(i)} = ∑_{j=1}^m ∑_{i ∈ π^{-1}(j)} α_i ∫ f_i log g_j
                  = ∑_{j=1}^m β̄_j ∫ f̄_j log g_j = ∑_{j=1}^m β̄_j ∫ g^π_j log g_j

Jensen's inequality yields:

    ∑_{j=1}^m β̄_j ∫ g^π_j log g_j ≤ ∑_{j=1}^m β̄_j ∫ g^π_j log g^π_j = ∑_{j=1}^m β̄_j ∫ f̄_j log g^π_j = ∑_{i=1}^k α_i ∫ f_i log g^π_{π(i)} = c − d(f, g^π, π)

The equality ∫ f̄_j log g_j = ∫ g^π_j log g_j is due to the fact that log g_j is a quadratic expression and the first two moments of f̄_j and its collapsed version g^π_j are equal. Jensen's inequality is saturated if and only if, for all j = 1, ..., m (such that π^{-1}(j) is not empty), the Gaussian densities g_j and g^π_j are equal. □

Using Lemma 1 we obtain a closed-form description of a single iteration of the alternating minimization algorithm, which can be viewed as a type of K-means operating at the meta-level of model parameters:

    π_g = arg min_{π ∈ S} d(f, g, π)    (REGROUP)
    g^π = arg min_{g ∈ MoG(m)} d(f, g, π)    (REFIT)

Above, π_g(i) = arg min_j KL(f_i ‖ g_j) and g^π is computed using (8). The iterative algorithm monotonically decreases the distance measure d(f, g). Hence, since S is finite, the algorithm converges to a local minimum point after a finite number of iterations. 
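The REGROUP/REFIT iteration can be implemented directly on the model parameters. Below is a minimal sketch for one-dimensional components (the paper treats d-dimensional, full-covariance Gaussians; the scalar case keeps the closed-form KL and the moment-matching collapse of (7) readable). The function names, the round-robin initialization, and the empty-cluster fallback are our own illustrative choices, not from the paper.

```python
import math

def kl_gauss(m1, v1, m2, v2):
    # KL(N(m1,v1) || N(m2,v2)); scalar case of the closed-form expression
    return 0.5 * (math.log(v2 / v1) + v1 / v2 + (m1 - m2) ** 2 / v2 - 1.0)

def collapse(components):
    # Moment-match a weighted set of (alpha, mu, var) Gaussians
    # to a single Gaussian, as in (7)
    w = sum(a for a, _, _ in components)
    mu = sum(a * m for a, m, _ in components) / w
    var = sum(a * (v + (m - mu) ** 2) for a, m, v in components) / w
    return w, mu, var

def cluster_mog(f, m, iters=100):
    # f: list of (alpha, mu, var); returns the matching pi and the
    # reduced mixture g as a list of (beta, mu, var)
    pi = [i % m for i in range(len(f))]   # arbitrary initial matching
    g = []
    for _ in range(iters):
        # REFIT: collapse each cluster into a single Gaussian
        # (crude fallback keeps an empty cluster alive)
        g = [collapse([fi for fi, j in zip(f, pi) if j == c] or [f[0]])
             for c in range(m)]
        # REGROUP: send each f_i to its KL-nearest reduced component
        new_pi = [min(range(m),
                      key=lambda c: kl_gauss(mu, v, g[c][1], g[c][2]))
                  for _, mu, v in f]
        if new_pi == pi:      # converged: g is a collapsed version of f
            break
        pi = new_pi
    return pi, g

# Four components forming two well-separated groups
f = [(0.3, 0.0, 1.0), (0.2, 0.5, 1.0), (0.3, 10.0, 1.0), (0.2, 10.5, 1.0)]
pi, g = cluster_mog(f, 2)
print(pi)  # -> [0, 0, 1, 1]
```

On this toy mixture the iteration groups the two left and the two right components, and the reduced weights come out automatically as the sums of the grouped α_i (0.5 each here), as described at the end of Section 2.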
The next theorem ensures that once the iterative algorithm converges we obtain a clustering of the MoG components.

Definition 1: A MoG g ∈ MoG(m) is an m-mixture collapsed version of f if there exists a matching function π ∈ S such that g is obtained by collapsing f according to π, i.e. g = g^π.

Theorem 1: If applying a single iteration (expressions (REGROUP) and (REFIT)) to a function g ∈ MoG(m) does not decrease the distance function (2), then necessarily g is a collapsed version of f.

Proof: Let g ∈ MoG(m) and let π be a matching function such that d(f, g) = d(f, g, π). Let g^π be the collapsed version of f according to π; g^π is the result of applying a single iteration to g. Let g be composed of the Gaussians {g_1, ..., g_m} and similarly let g^π = {g^π_1, ..., g^π_m}. According to Lemma 1, d(f, g^π) ≤ d(f, g^π, π) ≤ d(f, g, π) = d(f, g). Assume that a single iteration does not decrease the distance, i.e. d(f, g^π) = d(f, g). Hence d(f, g^π, π) = d(f, g, π). According to Lemma 1, this implies that g_j = g^π_j for all j = 1, ..., m. Therefore g is a collapsed version of f. □

Theorem 1 implies that each local minimum of the proposed iterative algorithm is a collapsed version of f.

Given the optimal matching function π, the last step of the algorithm is to set the weights of the reduced representation: β_j = ∑_{i | π(i)=j} α_i. These weights are automatically obtained via the collapsing process.

3 Experimental Results

In this section we evaluate the performance of our semi-supervised clustering algorithm and compare it to the standard \"flat\" clustering approach that does not respect the original component structure. We have applied both methods to clustering handwritten digits and natural scene images. In each case, a set of objects is organized into predefined categories. For each category c we learn from a labeled training set a Gaussian distribution f(x|c). 
A prior distribution over the categories p(c) can also be extracted from the labeled training set. The goal is to cluster the objects into a small number of clusters (fewer than the number of class labels). The standard (flat) approach is to apply unsupervised clustering to the entire collection of original objects, ignoring their class labels. Alternatively, we can utilize the given categorization as side information in order to obtain an improved reduced clustering which also respects the original labels, thus inducing a hierarchical structure.

Figure 1: (top) Means of the 10 digit-class models. (bottom) Means of the two clusters (Class A and Class B) after our algorithm has grouped the digits {0,2,3,5,6,8} and {1,4,7,9}.

 method           cls       0    1    2    3    4    5    6    7    8    9
 this paper       Class A   100  4    99   99   3    99   99   0    94   1
 this paper       Class B   0    96   1    1    98   2    1    100  6    99
 unsupervised EM  Class 1   93   16   93   87   22   66   96   16   23   25
 unsupervised EM  Class 2   7    85   7    14   78   34   4    84   77   76

Table 1: Clustering results showing the purity of a 2-cluster reduced model learned from a training set of handwritten digits in 10 original classes. For each true label, the percentage of cases (from an unseen test set) falling into each of the two reduced classes is shown. The top two lines show the purity of assignments provided by our clustering algorithm; the bottom two lines show assignments from a flat unsupervised fitting of a two-component mixture.

Our first experiment used a database of handwritten digits. Each example is represented by an 8×8 grayscale pixel image; 700 cases are used to learn a 64-dimensional full-covariance Gaussian distribution for each class. In the next step we want to divide the digits into two natural clusters, while taking into account their original 10-way structure. We applied our semi-supervised algorithm to reduce the mixture of 10 Gaussians into a mixture of two Gaussians. 
The minimal distance (2) is obtained when the ten digits are divided into the two groups {0, 2, 3, 5, 6, 8} and {1, 4, 7, 9}. The means of the two resulting clusters are shown in Figure 1.

To evaluate the purity of this clustering, the reduced MoG was used to label a test set consisting of 4000 previously unseen examples. The binary labels on the test set are obtained by comparing the likelihoods of the two components in the reduced mixture. Table 1 (top) presents, for each digit, the percentage of images that were affiliated with each of the two clusters. Alternatively, we can apply a standard EM algorithm to learn by maximum likelihood a flat mixture of 2 Gaussians directly from the 7000 training examples, without utilizing their class labels. Table 1 (bottom) shows the results of such an unsupervised clustering, evaluated on the same test set. Although the likelihood of the unsupervised mixture model was significantly better than that of the semi-supervised model on both the training and test sets, the purity of the clusters it learns is clearly much worse, since it does not preserve the hierarchical class structure. Comparing the top and bottom of Table 1, we can see that by using the side information we obtain a clustering of the digit database which is much more correlated with the categorization of the set into ten digits than the unsupervised procedure.

In a second experiment, we evaluated the performance of our proposed algorithm on image category models. The database used consists of 1460 images selectively hand-picked from the COREL database to create 16 categories. 
The images within each category have similar color and spatial layout, and are labeled with a high-level semantic description (e.g. fields, sunset). For each pixel we extract a five-dimensional feature vector (3 color features and the x, y position). From all the pixels belonging to the same category we learn a single Gaussian. We clustered the image categories into k = 2, ..., 6 sets using our algorithm and compared the results to the unsupervised clustering obtained from an EM procedure that learned a mixture of k Gaussians. In order to evaluate the quality of the clustering in terms of correlation with the category information, we computed the mutual information (MI) between the clustering result (into k clusters) and the category affiliation of the images in a test set. A high value of mutual information indicates a strong resemblance between the content of the learned clusters and the hand-picked image categories. As can be verified from the results summarized in Figure 2, and as expected, the MI in the case of semi-supervised clustering is consistently larger than the MI in the case of completely unsupervised clustering. A semi-supervised clustering of the image database yields clusters that are based on both low-level features and the available high-level categorization. 

Figure 2: Hierarchical clustering of natural image categories. (left) Mutual information between reduced cluster index and original class, as a function of the number of clusters, for semi-supervised and unsupervised clustering. (right) Sample images from the sets A, B, C, D learned by hierarchical clustering.
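The MI score used in this comparison can be computed directly from joint counts of (cluster, category) pairs on the test set. A minimal sketch, in nats; the toy labelings below are our own stand-ins for cluster indices and category labels, not the paper's data:

```python
import math
from collections import Counter

def mutual_information(clusters, categories):
    # MI (in nats) between two discrete labelings of the same items
    n = len(clusters)
    pc = Counter(clusters)                 # marginal counts of cluster labels
    pk = Counter(categories)               # marginal counts of category labels
    joint = Counter(zip(clusters, categories))
    return sum((nij / n) * math.log(nij * n / (pc[c] * pk[k]))
               for (c, k), nij in joint.items())

# perfectly aligned labelings: MI equals the label entropy, log 2
print(round(mutual_information([0, 0, 1, 1], ['a', 'a', 'b', 'b']), 4))  # -> 0.6931
```

An MI near the entropy of the category labels indicates clusters that closely track the hand-picked categories; independent labelings give MI near zero.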
Sample images from the clustering into 4 sets are presented in Figure 2.

4 A Stochastic Model for the Proposed Distance

In this section we describe a stochastic process that induces a likelihood function which coincides with the distance measure d(f, g) presented in Section 2. Suppose we are given two MoGs:

    f(y) = ∑_{i=1}^k α_i f_i(y) = ∑_{i=1}^k α_i N(y; μ_i, Σ_i) ,  g(y) = ∑_{j=1}^m β_j g_j(y) = ∑_{j=1}^m β_j N(y; μ'_j, Σ'_j)

Consider an iid sample set of size n drawn from f(y). The samples can be arranged in k blocks according to the Gaussian component that was selected to produce each sample. Assume that n_i samples were drawn from the i-th component f_i and denote these samples by y_i = {y_{i1}, ..., y_{in_i}}. Next, we compute the likelihood of the sample set according to the model g, but under the constraint that samples within the same block must be assigned to the same mixture component of g. In other words, instead of having a hidden variable for each sample point we have one for each sample block. The likelihood of the sample set y^n according to the MoG g under this constraint is:

    L_n(g) = g(y_1, ..., y_k) = ∏_{i=1}^k ∑_{j=1}^m β_j ∏_{t=1}^{n_i} N(y_{it}; μ'_j, Σ'_j)

The main result is that, as the number of points sampled grows large, the expected negative log-likelihood becomes equal to the distance d(f, g) under the measure proposed above:

Theorem 2: For each g ∈ MoG(m)

    lim_{n→∞} (1/n) log L_n(g) = c − d(f, g)    (9)

where c = ∑_i α_i ∫ f_i log f_i does not depend on g.

Surprisingly, as noted earlier, the mixture weights β_j do not appear in the asymptotic likelihood function of the generative model presented in this section.

Proof: To prove the theorem we shall use the following lemma:

Lemma 2: Let {x_{jn}}, j = 1, ..., m, be a set of m sequences of positive real numbers such that x_{jn} → x_j, and let {β_j} be a set of positive numbers. 
Then (1/n) log ∑_j β_j (x_{jn})^n → max_j log x_j. [This can be shown as follows: let a = arg max_j x_j. For n sufficiently large, β_a (x_{an})^n ≤ ∑_j β_j (x_{jn})^n ≤ m max_j β_j (x_{jn})^n, hence log x_a ≤ lim_n (1/n) log ∑_j β_j (x_{jn})^n ≤ log x_a.]

The points {y_{i1}, ..., y_{in_i}} are independently sampled from the Gaussian distribution f_i. Therefore, the law of large numbers implies: (1/n_i) ∑_{t=1}^{n_i} log N(y_{it}; μ'_j, Σ'_j) → ∫ f_i log g_j. Hence, substituting x_{jn} = (∏_{t=1}^{n_i} N(y_{it}; μ'_j, Σ'_j))^{1/n_i} → exp(∫ f_i log g_j) = x_j in Lemma 2, we obtain: (1/n_i) log ∑_{j=1}^m β_j ∏_{t=1}^{n_i} N(y_{it}; μ'_j, Σ'_j) → max_{j=1,...,m} ∫ f_i log g_j. In a similar manner, the law of large numbers, applied to the discrete distribution (α_1, ..., α_k), yields n_i/n → α_i. Hence:

    (1/n) log L_n(g) = (1/n) log g(y_1, ..., y_k) = ∑_{i=1}^k (n_i/n) (1/n_i) log ∑_{j=1}^m β_j ∏_{t=1}^{n_i} N(y_{it}; μ'_j, Σ'_j)
    → ∑_{i=1}^k α_i max_{j=1,...,m} ∫ f_i log g_j = c − ∑_{i=1}^k α_i min_{j=1,...,m} KL(f_i ‖ g_j) = c − d(f, g)  □

5 Relations to Previous Approaches and Conclusions

Other authors have recently investigated the learning of Gaussian mixture models using various kinds of side information or constraints. Shental et al. [5] utilized the generative model described in the previous section, and the EM algorithm derived from it, to learn a MoG from a data set endowed with equivalence constraints that force equivalent points to be assigned to the same cluster. Vasconcelos and Lippman [9] proposed a similar EM-based clustering algorithm for constructing mixture hierarchies using a finite set of virtual samples.

Given the generative model presented above, we can apply the EM algorithm to learn the (locally) maximum-likelihood parameters of the reduced MoG model g(y). This EM-based approach, however, is not precisely suitable for our component clustering problem. 
The EM update rule for the weights of the reduced mixture density is based only on the number of original components that are clustered into a single component, without taking into account their relative weights [9].

The problem discussed in this study is also related to the Information-Bottleneck (IB) principle [8]. In the case of a mixture of histograms f = ∑_{i=1}^k α_i f_i, the IB principle yields the following iterative algorithm for finding a clustering into a mixture of histograms g = ∑_{j=1}^m β_j g_j(y):

    w_{ij} = β_j e^{−KL(f_i‖g_j)} / ∑_l β_l e^{−KL(f_i‖g_l)} ,  β_j = ∑_i w_{ij} α_i ,  g_j = ∑_i w_{ij} α_i f_i / ∑_i w_{ij} α_i    (10)

Assuming that the number of (virtual) samples tends to infinity, we can derive, in a manner similar to the Gaussian case, a grouping algorithm for a mixture of histograms. Slonim and Weiss [6] showed that the clustering algorithm in this case can be motivated either from the EM algorithm applied to a suitable generative model [4] or from the hard-decision version of the IB principle [8]. However, when we want to represent the clustering result as a mixture density, there is a difference in the resulting mixture coefficients between the EM- and IB-based algorithms. Unlike the IB update equation (10) for the coefficients w_{ij}, the EM update equation is based only on the number of components that are collapsed into a single Gaussian. In the case of a mixture of Gaussians, applying the IB principle results only in a partitioning of the original components; it does not deliver a reduced representation in the form of a smaller mixture [2]. If we modify g_j in equation (10) by collapsing the mixture g_j into a single Gaussian, we obtain a soft version of our algorithm; letting the Lagrange multiplier tend to infinity, we recover exactly the algorithm described in Section 2.

To conclude, we have presented an efficient Gaussian component clustering algorithm that can be used for object category clustering and for MoG collapsing. 
We have shown that our method optimizes the proposed distance measure between two MoGs. In this study we have assumed that the desired number of clusters is given as part of the problem setup, but if this is not the case, standard methods for model selection can be applied.

References

 [1] Y. Bar-Shalom and X. Li. Estimation and Tracking: Principles, Techniques and Software. Artech House, 1993.

 [2] S. Gordon, H. Greenspan, and J. Goldberger. Applying the information bottleneck principle to unsupervised clustering of discrete and continuous image representations. In ICCV, 2003.

 [3] U. Lerner, R. Parr, D. Koller, and G. Biswas. Bayesian fault detection and diagnosis in dynamic systems. In AAAI/IAAI, pages 531-537, 2000.

 [4] J. Puzicha, T. Hofmann, and J. Buhmann. Histogram clustering for unsupervised segmentation and image retrieval. Pattern Recognition Letters, 20(9):899-909, 1999.

 [5] N. Shental, A. Bar-Hillel, T. Hertz, and D. Weinshall. Computing Gaussian mixture models with EM using equivalence constraints. In Proc. of Neural Information Processing Systems, 2003.

 [6] N. Slonim and Y. Weiss. Maximum likelihood and the information bottleneck. In Proc. of Neural Information Processing Systems, 2003.

 [7] E. Sudderth, A. Ihler, W. Freeman, and A. Willsky. Non-parametric belief propagation. In CVPR, 2003.

 [8] N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method. In Proc. of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368-377, 1999.

 [9] N. Vasconcelos and A. Lippman. Learning mixture hierarchies. In Proc. of Neural Information Processing Systems, 1998.

[10] J. Vermaak, A. Doucet, and P. Perez. Maintaining multi-modality through mixture tracking. In Int. Conf. on Computer Vision, 2003.

[11] K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl. Constrained k-means clustering with background knowledge. In Proc. Int. Conf.
 on Machine Learning, 2001.

[12] E.P. Xing, A.Y. Ng, M.I. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In Proc. of Neural Information Processing Systems, 2003.
", "award": [], "sourceid": 2585, "authors": [{"given_name": "Jacob", "family_name": "Goldberger", "institution": null}, {"given_name": "Sam", "family_name": "Roweis", "institution": null}]}