{"title": "SMEM Algorithm for Mixture Models", "book": "Advances in Neural Information Processing Systems", "page_first": 599, "page_last": 605, "abstract": null, "full_text": "SMEM Algorithm for Mixture Models \n\nN aonori U eda Ryohei Nakano \n{ueda, nakano }@cslab.kecl.ntt.co.jp \n\nNTT Communication Science Laboratories \n\nHikaridai, Seika-cho, Soraku-gun, Kyoto 619-0237 Japan \n\nZoubin Ghahramani Geoffrey E. Hinton \n\nzoubin@gatsby.uc1.ac.uk g.hinton@ucl.ac.uk \n\nGatsby Computational Neuroscience Unit, University College London \n\n17 Queen Square, London WC1N 3AR, UK \n\nAbstract \n\nWe present a split and merge EM (SMEM) algorithm to overcome the local \nmaximum problem in parameter estimation of finite mixture models. In the \ncase of mixture models, non-global maxima often involve having too many \ncomponents of a mixture model in one part of the space and too few in an(cid:173)\nother, widely separated part of the space. To escape from such configurations \nwe repeatedly perform simultaneous split and merge operations using a new \ncriterion for efficiently selecting the split and merge candidates. We apply \nthe proposed algorithm to the training of Gaussian mixtures and mixtures of \nfactor analyzers using synthetic and real data and show the effectiveness of \nusing the split and merge operations to improve the likelihood of both the \ntraining data and of held-out test data. \n\n1 \n\nINTRODUCTION \n\nMixture density models, in particular normal mixtures, have been extensively used \nin the field of statistical pattern recognition [1]. Recently, more sophisticated mix(cid:173)\nture density models such as mixtures of latent variable models (e.g., probabilistic \nPCA or factor analysis) have been proposed to approximate the underlying data \nmanifold [2]-[4]. The parameter of these mixture models can be estimated using the \nEM algorithm [5] based on the maximum likelihood framework [3] [4]. 
A common and serious problem associated with the EM algorithm is the local maxima problem. Although this problem has been pointed out by many researchers, the best way to solve it in practice is still an open question. \n\nTwo of the authors have proposed the deterministic annealing EM (DAEM) algorithm [6], where a modified posterior probability parameterized by temperature is derived to avoid local maxima. However, in the case of mixture density models, local maxima arise when there are too many components of a mixture model in one part of the space and too few in another. It is not possible to move a component from the overpopulated region to the underpopulated region without passing through positions that give lower likelihood. We therefore introduce a discrete move that simultaneously merges two components in an overpopulated region and splits a component in an underpopulated region. \n\nThe idea of split and merge operations has been successfully applied to clustering or vector quantization (e.g., [7]). To our knowledge, this is the first time that simultaneous split and merge operations have been applied to improve mixture density estimation. The new criteria presented in this paper can efficiently select the split and merge candidates. Although the proposed method, unlike the DAEM algorithm, is limited to mixture models, we have experimentally confirmed that our split and merge EM algorithm obtains better solutions than the DAEM algorithm. \n\n2 Split and Merge EM (SMEM) Algorithm \n\nThe probability density function (pdf) of a mixture of M density models is given by \n\np(x; Θ) = Σ_{m=1}^{M} α_m p(x|ω_m; θ_m), where α_m ≥ 0 and Σ_{m=1}^{M} α_m = 1.   (1) \n\nHere p(x|ω_m; θ_m) is a d-dimensional density model corresponding to the component ω_m. The EM algorithm, as is well known, iteratively estimates the parameters Θ = {(α_m, θ_m), m = 1, ... 
, M} using two steps. The E-step computes the expectation of the complete data log-likelihood: \n\nQ(Θ|Θ^(t)) = Σ_x Σ_m P(ω_m|x; Θ^(t)) log α_m p(x|ω_m; θ_m),   (2) \n\nwhere P(ω_m|x; Θ^(t)) is the posterior probability, which can be computed by \n\nP(ω_m|x; Θ^(t)) = α_m^(t) p(x|ω_m; θ_m^(t)) / Σ_{m'=1}^{M} α_{m'}^(t) p(x|ω_{m'}; θ_{m'}^(t)).   (3) \n\nNext, the M-step maximizes this Q function with respect to Θ to estimate the new parameter values Θ^(t+1). \n\nLooking at (2) carefully, one can see that the Q function can be represented in the form of a direct sum; i.e., Q(Θ|Θ^(t)) = Σ_{m=1}^{M} q_m(Θ|Θ^(t)), where q_m(Θ|Θ^(t)) = Σ_{x∈X} P(ω_m|x; Θ^(t)) log α_m p(x|ω_m; θ_m) depends only on α_m and θ_m. Let Θ* denote the parameter values estimated by the usual EM algorithm. Then, after the EM algorithm has converged, the Q function can be rewritten as \n\nQ* = q_i* + q_j* + q_k* + Σ_{m, m≠i,j,k} q_m*.   (4) \n\nWe then try to increase the first three terms of the right-hand side of (4) by merging two components ω_i and ω_j to produce a component ω_{i'}, and splitting the component ω_k into two components ω_{j'} and ω_{k'}. To reestimate the parameters of these new components, we have to initialize the parameters corresponding to them using Θ*. \n\nThe initial parameter values for the merged component ω_{i'} can be set as a linear combination of the original ones before the merge: \n\nα_{i'} = α_i* + α_j* and θ_{i'} = (θ_i* Σ_x P(ω_i|x; Θ*) + θ_j* Σ_x P(ω_j|x; Θ*)) / (Σ_x P(ω_i|x; Θ*) + Σ_x P(ω_j|x; Θ*)).   (5) \n\nOn the other hand, for the two components ω_{j'} and ω_{k'}, we set \n\nα_{j'} = α_{k'} = α_k*/2, θ_{j'} = θ_k* + ε, θ_{k'} = θ_k* + ε',   (6) \n\nwhere ε and ε' are small random perturbation vectors or matrices (i.e., ‖ε‖ ≪ ‖θ_k*‖)¹. 
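For Gaussian components parameterized by their means, initializations (5) and (6) can be sketched as follows (illustrative only, not the paper's code: numpy is assumed, the helper names are hypothetical, and this split perturbs the two new means in opposite directions, a variant of adding an independent small perturbation to each):

```python
import numpy as np

def merge_init(alpha, mu, resp, i, j):
    """Eq. (5): the merged weight is the sum, and the merged mean is a
    combination weighted by each component's total posterior mass."""
    ri, rj = resp[:, i].sum(), resp[:, j].sum()
    alpha_new = alpha[i] + alpha[j]
    mu_new = (mu[i] * ri + mu[j] * rj) / (ri + rj)
    return alpha_new, mu_new

def split_init(alpha, mu, k, eps_scale=0.01, rng=None):
    """Eq. (6): halve the weight and perturb the mean by a small random
    vector whose norm is much smaller than that of the parameters."""
    if rng is None:
        rng = np.random.default_rng(0)
    eps = eps_scale * rng.standard_normal(np.shape(mu[k]))
    return alpha[k] / 2.0, mu[k] + eps, mu[k] - eps
```

For example, merging two components with equal posterior mass simply averages their means.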
\nThe parameter reestimation for m = i', j' and k' can be done using EM steps, but note that the posterior probability (3) should be replaced with (7) so that this reestimation does not affect the other components: \n\nP(ω_m|x; Θ^(t)) = (α_m^(t) p(x|ω_m; θ_m^(t)) / Σ_{m'=i',j',k'} α_{m'}^(t) p(x|ω_{m'}; θ_{m'}^(t))) × Σ_{m'=i,j,k} P(ω_{m'}|x; Θ*), for m = i', j', k'.   (7) \n\nClearly Σ_{m=i',j',k'} P(ω_m|x; Θ^(t)) = Σ_{m=i,j,k} P(ω_m|x; Θ*) always holds during the reestimation process. For convenience, we call this EM procedure the partial EM procedure. After this partial EM procedure, the usual EM steps, called the full EM procedure, are performed as post-processing. After these procedures, if Q is improved, then we accept the new estimate and repeat the above after setting the new parameters to Θ*. Otherwise, we reject it, go back to Θ*, and try another candidate. We summarize these procedures as follows: \n\n[SMEM Algorithm] \n1. Perform the usual EM updates. Let Θ* and Q* denote the estimated parameters and the corresponding Q function value, respectively. \n2. Sort the split and merge candidates by computing the split and merge criteria (described in the next section) based on Θ*. Let {i, j, k}_c denote the c-th candidate. \n3. For c = 1, ..., C_max, perform the following: after initial parameter settings based on Θ*, perform the partial EM procedure for {i, j, k}_c and then perform the full EM procedure. Let Θ** be the obtained parameters and Q** be the corresponding Q function value. If Q** > Q*, then set Q* <- Q**, Θ* <- Θ** and go to Step 2. \n4. Halt with Θ* as the final parameters. \n\nNote that when a certain split and merge candidate which improves the Q function value is found at Step 3, the other successive candidates are ignored. There is therefore no guarantee that the split and merge candidates that are chosen will give the largest possible improvement in Q. 
This is not a major problem, however, because the split and merge operations are performed repeatedly. Strictly speaking, C_max = M(M-1)(M-2)/2, but experimentally we have confirmed that C_max ≈ 5 may be enough. \n\n¹ In the case of Gaussian mixtures, the covariance matrices Σ_{j'} and Σ_{k'} should be positive definite. In this case, we can initialize them as Σ_{j'} = Σ_{k'} = det(Σ_k)^{1/d} I_d instead of (6). \n\n3 Split and Merge Criteria \n\nEach of the split and merge candidates can be evaluated by its Q function value after Step 3 of the SMEM algorithm mentioned in Sec. 2. However, since there are so many candidates, some reasonable criteria for ordering the split and merge candidates should be utilized to accelerate the SMEM algorithm. \n\nIn general, when there are many data points each of which has almost equal posterior probabilities for any two components, it can be thought that these two components might be merged. To numerically evaluate this, we define the following merge criterion: \n\nJ_merge(i, j; Θ*) = P_i(Θ*)ᵀ P_j(Θ*),   (8) \n\nwhere P_i(Θ*) = (P(ω_i|x_1; Θ*), ..., P(ω_i|x_N; Θ*))ᵀ ∈ R^N is the N-dimensional vector consisting of the posterior probabilities for the component ω_i. Clearly, two components ω_i and ω_j with larger J_merge(i, j; Θ*) should be merged. \n\nAs a split criterion (J_split), we define the local Kullback divergence as \n\nJ_split(k; Θ*) = ∫ p_k(x; Θ*) log [p_k(x; Θ*) / p(x|ω_k; θ_k*)] dx,   (9) \n\nwhich is the distance between two distributions: the local data density p_k(x) around the component ω_k and the density of the component ω_k specified by the current parameter estimates μ_k and Σ_k. The local data density is defined as \n\np_k(x; Θ*) = Σ_{n=1}^{N} δ(x - x_n) P(ω_k|x_n; Θ*) / Σ_{n'=1}^{N} P(ω_k|x_{n'}; Θ*).   (10) \n\nThis is a modified empirical distribution weighted by the posterior probability so that the data around the component ω_k are focused. 
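The two criteria can be sketched numerically as follows (an illustrative sketch, not the paper's code: numpy is assumed, the names are hypothetical, the component is a one-dimensional Gaussian, and the integral in (9) is replaced by a crude histogram approximation):

```python
import numpy as np

def j_merge(resp, i, j):
    """Eq. (8): dot product of the N-dimensional posterior vectors."""
    return float(resp[:, i] @ resp[:, j])

def j_split(x, resp, k, mu_k, var_k, bins=20):
    """Histogram stand-in for eqs. (9)-(10): KL divergence between the
    posterior-weighted empirical density around component k and the 1-D
    Gaussian N(mu_k, var_k) of that component."""
    w = resp[:, k] / resp[:, k].sum()                 # weights of eq. (10)
    hist, edges = np.histogram(x, bins=bins, weights=w, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    model = np.exp(-0.5 * (centers - mu_k) ** 2 / var_k) \
            / np.sqrt(2.0 * np.pi * var_k)
    mask = hist > 0
    width = edges[1] - edges[0]
    return float(np.sum(width * hist[mask] * np.log(hist[mask] / model[mask])))
```

Components whose posterior vectors overlap strongly score high on j_merge; a component whose Gaussian fits its local data poorly scores high on j_split.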
Note that when the weights are equal, i.e., P(ω_k|x; Θ*) = 1/M, (10) is the usual empirical distribution, i.e., p_k(x; Θ*) = (1/N) Σ_{n=1}^{N} δ(x - x_n). Since it can be thought that the component with the largest J_split(k; Θ*) has the worst estimate of the local density, we should try to split it. Using J_merge and J_split, we sort the split and merge candidates as follows. First, merge candidates are sorted based on J_merge. Then, for each sorted merge candidate {i, j}_c, split candidates excluding {i, j}_c are sorted as {k}_c. By combining these results and renumbering them, we obtain {i, j, k}_c. \n\n4 Experiments \n\n4.1 Gaussian mixtures \n\nFirst, we show the results on two-dimensional synthetic data in Fig. 1 to visually demonstrate the usefulness of the split and merge operations. The initial mean vectors and covariance matrices were set to near the mean of all the data and the unit matrix, respectively. The usual EM algorithm converged to the local maximum solution shown in Fig. 1(b), whereas the SMEM algorithm converged to the superior solution shown in Fig. 1(d), very close to the true one. The split of the 1st Gaussian shown in Fig. 1(c) seems to be redundant, but as shown in Fig. 1(d) they are successfully merged and the original two Gaussians were improved. This indicates that the split and merge operations not only appropriately assign the number of Gaussians in a local data space, but can also improve the Gaussian parameters themselves. \n\nNext, we tested the proposed algorithm using 20-dimensional real data (facial images) where the local maxima make the optimization difficult. The data size was 10^3 for training and 10^3 for test. We ran three algorithms (EM, DAEM, and SMEM) for ten different initializations using the K-means algorithm. We set M = 5 and used a diagonal covariance for each Gaussian. 
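The per-data-point log-likelihood reported in this kind of experiment can be computed as in this sketch (illustrative only; numpy is assumed, the function name is hypothetical, and the diagonal-covariance Gaussian case matches the setting just described):

```python
import numpy as np

def loglik_per_point(x, alpha, mu, var):
    """Average log p(x) per data point for a diagonal-covariance Gaussian
    mixture: x is (N, d); mu, var are (M, d); alpha is (M,)."""
    # log N(x_n | mu_m, diag(var_m)) for all n, m  ->  shape (N, M)
    diff = x[:, None, :] - mu[None, :, :]
    log_dens = -0.5 * np.sum(diff ** 2 / var + np.log(2.0 * np.pi * var), axis=2)
    # log-sum-exp over components, for numerical stability
    a = log_dens + np.log(alpha)
    amax = a.max(axis=1, keepdims=True)
    return float(np.mean(amax[:, 0] + np.log(np.exp(a - amax).sum(axis=1))))
```

Evaluating this on held-out data, as done for the test rows of Table 1, guards against preferring models that merely overfit the training set.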
As shown in Table 1, the worst solution found by the SMEM algorithm was better than the best solutions found by the other algorithms on both training and test data. \n\n[Figure 1: Gaussian mixture estimation results. (a) True Gaussians and generated data; (b) result by EM (t=72); (c) example of split and merge (t=141); (d) final result by SMEM (t=212).] \n\nTable 1: Log-likelihood / data point \n\n              | initial value |   EM   |  DAEM  |  SMEM \nTraining mean |    -159.1     | -148.2 | -147.9 | -145.1 \nTraining std  |      1.n      |   0.24 |   0.04 |   0.08 \nTraining max  |    -157.3     | -147.7 | -147.8 | -145.0 \nTraining min  |    -163.2     | -148.6 | -147.9 | -145.2 \nTest mean     |    -168.2     | -159.8 | -159.7 | -155.9 \nTest std      |      2.80     |   1.00 |   0.37 |   0.09 \nTest max      |    -165.5     | -158.0 | -159.6 | -155.9 \nTest min      |    -174.2     | -160.8 | -159.8 | -156.0 \n\nTable 2: No. of iterations \n\n     |  EM | DAEM | SMEM \nmean |  47 |  147 |  155 \nstd  |  16 |   39 |   44 \nmax  |  65 |  189 |  219 \nmin  |  37 |  103 |  109 \n\n[Figure 2: Trajectories of log-likelihood (log-likelihood / data point vs. no. of iterations). Upper (lower) curve corresponds to training (test) data.] \n\nFigure 2 shows the log-likelihood trajectories accepted at Step 3 of the SMEM algorithm during the estimation process². Comparing the convergence points at Step 3, marked by the 'o' symbol in Fig. 
2, one can see that the successive split and merge operations improved the log-likelihood for both the training and test data, as we expected. Table 2 compares the number of iterations executed by the three algorithms. Note that in the SMEM algorithm, the EM steps corresponding to rejected split and merge operations are not counted. The average rank of the accepted split and merge candidates was 1.8 (STD=0.9), which indicates that the proposed split and merge criteria work very well. The SMEM algorithm was therefore about 155 × 1.8 / 47 ≈ 6 times slower than the original EM algorithm. \n\n² Dotted lines in Fig. 2 denote the starting points of Step 2. Note that it is due to the initialization at Step 3 that the log-likelihood decreases just after the split and merge. \n\n4.2 Mixtures of factor analyzers \n\nA mixture of factor analyzers (MFA) can be thought of as a reduced-dimension mixture of Gaussians [4]. That is, it can extract a locally linear, low-dimensional manifold underlying given high-dimensional data. A single FA model assumes that an observed D-dimensional variable x is generated as a linear transformation of some lower K-dimensional latent variable z ~ N(0, I) plus additive Gaussian noise v ~ N(0, Ψ), where Ψ is diagonal. That is, the generative model can be written as x = Az + v + μ, where μ is a mean vector. Then, from simple calculation, we can see that x ~ N(μ, AAᵀ + Ψ). Therefore, in the case of an M-component mixture of FAs, x ~ Σ_{m=1}^{M} α_m N(μ_m, A_m A_mᵀ + Ψ_m). See [4] for the details. \n\n[Figure 3: Extraction of a 1D manifold by using a mixture of factor analyzers. (a) Initial values; (b) result by EM; (c) result by SMEM.] 
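The identity x ~ N(μ, AAᵀ + Ψ) can be checked by simulation (an illustrative sketch, not from the paper; numpy is assumed and the dimensions and parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
D, K, N = 3, 1, 200000
A = rng.standard_normal((D, K))                 # factor loading matrix
psi = np.array([0.1, 0.2, 0.3])                 # diagonal noise variances (Psi)
mu = np.array([1.0, -1.0, 0.5])                 # mean vector

z = rng.standard_normal((N, K))                 # latent z ~ N(0, I)
v = rng.standard_normal((N, D)) * np.sqrt(psi)  # noise v ~ N(0, Psi)
x = z @ A.T + v + mu                            # generative model x = Az + v + mu

model_cov = A @ A.T + np.diag(psi)              # implied covariance A A^T + Psi
sample_cov = np.cov(x, rowvar=False)
print(np.max(np.abs(sample_cov - model_cov)))   # small for large N
```

The sample covariance of the simulated x matches A Aᵀ + Ψ up to Monte Carlo error; this low-rank-plus-diagonal structure is what lets an MFA model high-dimensional data cheaply.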
Then, in this case, the Q function is also decomposable into M components, and therefore the SMEM algorithm is straightforwardly applicable to the parameter estimation of MFA models. \n\nFigure 3 shows the results of extracting a one-dimensional manifold from three-dimensional data (a noisy shrinking spiral) using the EM and the SMEM algorithms³. Although the EM algorithm converged to a poor local maximum, the SMEM algorithm successfully extracted the data manifold. Table 3 compares the average log-likelihood per data point over ten different initializations. The log-likelihood values were drastically improved on both training and test data by the SMEM algorithm. \n\nThe MFA model is applicable to pattern recognition tasks [2][3] since, once an MFA model is fitted to each class, we can compute the posterior probabilities for each data point. We tried a digit recognition task (10 digits (classes))⁴ using the MFA model. The computed log-likelihood averaged over the ten classes and the recognition accuracy for test data are given in Table 4. Clearly, the SMEM algorithm consistently improved on the EM algorithm in both log-likelihood and recognition accuracy. Note that the recognition accuracy of the 3-nearest neighbor (3NN) classifier was 88.3%. It is interesting that the MFA approach with both the EM and SMEM algorithms could outperform the nearest neighbor approach when K = 3 and M = 5. This suggests that the intrinsic dimensionality of the data would be three or so. \n\n³ In this case, each factor loading matrix A_m becomes a three-dimensional column vector corresponding to each thick line in Fig. 3. More precisely, the center position and the direction of each thick line are μ_m and A_m, respectively, and the length of each thick line is 2‖A_m‖. \n\n⁴ The data were created using the degenerate Glucksman's feature (16-dimensional data) by NTT labs. [8]. 
The data size was 200/class for training and 200/class for test. \n\nTable 3: Log-likelihood / data point ((·): STD) \n\n         |      EM       |     SMEM \nTraining | -7.68 (0.151) | -7.26 (0.017) \nTest     | -7.75 (0.171) | -7.33 (0.032) \n\nTable 4: Digit recognition results \n\n          | Log-likelihood / data point | Recognition rate (%) \n          |    EM    |   SMEM           |   EM   |  SMEM \nK=3, M=5  |  -3.18   |  -3.15           |  89.0  |  91.3 \nK=3, M=10 |  -3.09   |  -3.05           |  87.5  |  88.7 \nK=8, M=5  |  -3.14   |  -3.11           |  85.3  |  87.3 \nK=8, M=10 |  -3.04   |  -3.01           |  82.5  |  85.1 \n\n5 Conclusion \n\nWe have shown how simultaneous split and merge operations can be used to move components of a mixture model from regions of the space in which there are too many components to regions in which there are too few. Such moves cannot be accomplished by methods that continuously move components through intermediate locations because the likelihood is lower at these locations. A simultaneous split and merge can be viewed as a way of tunneling through low-likelihood barriers, thereby eliminating many non-global optima. In this respect, it has some similarities with simulated annealing, but the moves that are considered are long-range and are very specific to the particular problems that arise when fitting a mixture model. Note that the SMEM algorithm is applicable to a wide variety of mixture models, as long as the decomposition (4) holds. To make the split and merge method efficient we have introduced criteria for deciding which splits and merges to consider and have shown that these criteria work well for low-dimensional synthetic datasets and for higher-dimensional real datasets. Our SMEM algorithm consistently outperforms standard EM and therefore it should be very useful in practice. \n\nReferences \n\n[1] McLachlan, G. and Basford, K., \"Mixture models: Inference and application to clustering,\" Marcel Dekker, 1988. \n\n[2] Hinton, G. 
E., Dayan, P., and Revow, M., \"Modeling the manifolds of images of handwritten digits,\" IEEE Trans. PAMI, vol. 8, no. 1, pp. 65-74, 1997. \n\n[3] Tipping, M. E. and Bishop, C. M., \"Mixtures of probabilistic principal component analysers,\" Tech. Rep. NCRG-97-3, Aston Univ., Birmingham, UK, 1997. \n\n[4] Ghahramani, Z. and Hinton, G. E., \"The EM algorithm for mixtures of factor analyzers,\" Tech. Report CRG-TR-96-1, Univ. of Toronto, 1997. \n\n[5] Dempster, A. P., Laird, N. M., and Rubin, D. B., \"Maximum likelihood from incomplete data via the EM algorithm,\" Journal of the Royal Statistical Society B, vol. 39, pp. 1-38, 1977. \n\n[6] Ueda, N. and Nakano, R., \"Deterministic annealing EM algorithm,\" Neural Networks, vol. 11, no. 2, pp. 271-282, 1998. \n\n[7] Ueda, N. and Nakano, R., \"A new competitive learning approach based on an equidistortion principle for designing optimal vector quantizers,\" Neural Networks, vol. 7, no. 8, pp. 1211-1227, 1994. \n\n[8] Ishii, K., \"Design of a recognition dictionary using artificially distorted characters,\" Systems and Computers in Japan, vol. 21, no. 9, pp. 669-677, 1989. \n", "award": [], "sourceid": 1521, "authors": [{"given_name": "Naonori", "family_name": "Ueda", "institution": null}, {"given_name": "Ryohei", "family_name": "Nakano", "institution": null}, {"given_name": "Zoubin", "family_name": "Ghahramani", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}