{"title": "Batch and On-Line Parameter Estimation of Gaussian Mixtures Based on the Joint Entropy", "book": "Advances in Neural Information Processing Systems", "page_first": 578, "page_last": 584, "abstract": null, "full_text": "Batch and On-line Parameter Estimation of \nGaussian Mixtures Based on the Joint Entropy \n\nYoram Singer \n\nAT&T Labs \n\nsinger@research.att.com \n\nManfred K. Warmuth \n\nUniversity of California, Santa Cruz \n\nmanfred@cse.ucsc.edu \n\nAbstract \n\nWe describe a new iterative method for parameter estimation of Gaussian \nmixtures. The new method is based on a framework developed by \nKivinen and Warmuth for supervised on-line learning. In contrast to gradient \ndescent and EM, which estimate the mixture's covariance matrices, \nthe proposed method estimates the inverses of the covariance matrices. \nFurthermore, the new parameter estimation procedure can be applied in \nboth on-line and batch settings. We show experimentally that it is typically \nfaster than EM, and usually requires about half as many iterations \nas EM. \n\n1 Introduction \n\nMixture models, in particular mixtures of Gaussians, have been a popular tool for density \nestimation, clustering, and unsupervised learning with a wide range of applications (see \nfor instance [5, 2] and the references therein). Mixture models are one of the most useful \ntools for handling incomplete data, in particular hidden variables. For Gaussian mixtures \nthe hidden variables indicate for each data point the index of the Gaussian that generated it. \nThus, the model is specified by a joint density between the observed and hidden variables. \nThe common technique used for estimating the parameters of a stochastic source with hidden \nvariables is the EM algorithm. In this paper we describe a new technique for estimating \nthe parameters of Gaussian mixtures. 
The new parameter estimation method is based on a \nframework developed by Kivinen and Warmuth [8] for supervised on-line learning. This \nframework was successfully used in a large number of supervised and unsupervised problems \n(see for instance [7, 6, 9, 1]). \n\nOur goal is to find a local minimum of a loss function which, in our case, is the negative \nlog likelihood induced by a mixture of Gaussians. However, rather than minimizing the \nloss directly we add a term measuring the distance of the new parameters to the old ones. \nThis distance is useful for iterative parameter estimation procedures. Its purpose is to keep \nthe new parameters close to the old ones. The method for deriving iterative parameter \nestimation can be used in batch settings as well as on-line settings where the parameters \nare updated after each observation. The distance used for deriving the parameter estimation \nmethod in this paper is the relative entropy between the old and new joint densities of the \nobserved and hidden variables. For brevity we term the new iterative parameter estimation \nmethod the joint-entropy (JE) update. \n\nThe JE update shares a common characteristic with the Expectation Maximization [4, 10] \nalgorithm as it first calculates the same expectations. However, it replaces the maximization \nstep with a different update of the parameters. For instance, it updates the inverse of the \ncovariance matrix of each Gaussian in the mixture, rather than the covariance matrices \nthemselves. We found in our experiments that the JE update often requires half as many \niterations as EM. It is also straightforward to modify the proposed parameter estimation \nmethod for the on-line setting where the parameters are updated after each new observation. 
\nAs we demonstrate in our experiments with digit recognition, the on-line version of the \nJE update is especially useful in situations where the observations are generated by a \nnon-stationary stochastic source. \n\n2 Notation and preliminaries \n\nLet $S$ be a sequence of training examples $(x_1, x_2, \\ldots, x_N)$ where each $x_i$ is a \n$d$-dimensional vector in $\\mathbb{R}^d$. To model the distribution of the examples we use $m$ \n$d$-dimensional Gaussians. The parameters of the $i$-th Gaussian are denoted by $\\Theta_i$ and they \ninclude the mean vector $\\mu_i$ and the covariance matrix $C_i$. \n\nThe density function of the $i$-th Gaussian, denoted $P(x|\\Theta_i)$, is \n\n$$P(x|\\Theta_i) = \\frac{1}{(2\\pi)^{d/2} |C_i|^{1/2}} \\exp\\left( -\\frac{1}{2} (x - \\mu_i)^T C_i^{-1} (x - \\mu_i) \\right).$$ \n\nWe denote the entire set of parameters of a Gaussian mixture by $\\Theta = \\{\\Theta_i\\}_{i=1}^m = \n\\{w_i, \\mu_i, C_i\\}_{i=1}^m$ where $w = (w_1, \\ldots, w_m)$ is a non-negative vector of mixture coefficients \nsuch that $\\sum_{i=1}^m w_i = 1$. We denote by $P(x|\\Theta) = \\sum_{i=1}^m w_i P(x|\\Theta_i)$ the likelihood of an \nobservation $x$ according to a Gaussian mixture with parameters $\\Theta$. Let $\\Theta_i$ and $\\tilde{\\Theta}_i$ be two \nGaussian distributions. For brevity, we denote by $E_i(Z)$ and $\\tilde{E}_i(Z)$ the expectation of a \nrandom variable $Z$ with respect to $\\Theta_i$ and $\\tilde{\\Theta}_i$. Let $f$ be a parametric function whose parameters \nconstitute a matrix $A = (a_{ij})$. We denote by $\\partial f / \\partial A$ the matrix of partial derivatives \nof $f$ with respect to the elements in $A$. That is, the $ij$ element of $\\partial f / \\partial A$ is $\\partial f / \\partial a_{ij}$. \nSimilarly, let $B = (b_{ij}(x))$ be a matrix whose elements are functions of a scalar $x$. Then, we \ndenote by $dB/dx$ the matrix of derivatives of the elements in $B$ with respect to $x$, namely, \nthe $ij$ element of $dB/dx$ is $db_{ij}(x)/dx$. \n\n3 The framework for deriving updates \n\nKivinen and Warmuth [8] introduced a general framework for deriving on-line parameter \nupdates. In this section we describe how to apply their framework to the problem of \nparameter estimation of Gaussian mixtures in a batch setting. 
We later discuss how a \nsimple modification gives the on-line updates. \n\nGiven a set of data points $S$ in $\\mathbb{R}^d$ and a number $m$, the goal is to find a set of $m$ \nGaussians that minimize the loss on the data, denoted $\\mathrm{loss}(S|\\Theta)$. For density estimation \nthe natural loss function is the negative log-likelihood of the data, \n$\\mathrm{loss}(S|\\Theta) = -(1/|S|) \\ln P(S|\\Theta) \\stackrel{\\mathrm{def}}{=} -(1/|S|) \\sum_{x \\in S} \\ln P(x|\\Theta)$. The parameters that minimize \nthe above loss cannot be found analytically. The common approach is to use iterative methods \nsuch as EM [4, 10] to find a local minimizer of the loss. \n\nIn an iterative parameter estimation framework we are given the old set of parameters $\\Theta_t$ \nand we need to find a set of new parameters $\\Theta_{t+1}$ that induce a smaller loss. The framework \nintroduced by Kivinen and Warmuth [8] deviates from the common approaches as it also \nrequires the new parameter vector to stay \"close\" to the old set of parameters, which \nincorporates all that was learned in the previous iterations. The distance of the new parameter \nsetting $\\Theta_{t+1}$ from the old setting $\\Theta_t$ is measured by a non-negative distance function \n$\\Delta(\\Theta_{t+1}, \\Theta_t)$. We now search for a new set of parameters $\\Theta_{t+1}$ that minimizes the distance \nsummed with the loss multiplied by $\\eta$. Here $\\eta$ is a non-negative number measuring the relative \nimportance of the distance versus the loss. This parameter $\\eta$ will become the learning \nrate of the update. More formally, the update is found by setting $\\Theta_{t+1} = \\arg\\min_{\\tilde{\\Theta}} U_t(\\tilde{\\Theta})$ \nwhere $U_t(\\tilde{\\Theta}) = \\Delta(\\tilde{\\Theta}, \\Theta_t) + \\eta \\, \\mathrm{loss}(S|\\tilde{\\Theta}) + \\lambda (\\sum_{i=1}^m \\tilde{w}_i - 1)$. (We use a Lagrange multiplier \n$\\lambda$ to enforce the constraint that the mixture coefficients sum to one.) By choosing the \nappropriate distance function and $\\eta = 1$ one can show that EM becomes the above update. 
\nFor most distance functions and learning rates the minimizer of the function $U_t(\\tilde{\\Theta})$ cannot \nbe found analytically as both the distance function and the log-likelihood are usually \nnon-linear in $\\tilde{\\Theta}$. Instead, we expand the log-likelihood using a first order Taylor expansion \naround the old parameter setting. This approximation degrades the further the new \nparameter values are from the old ones, which further motivates the use of the distance \nfunction $\\Delta(\\tilde{\\Theta}, \\Theta_t)$ (see also the discussion in [7]). We now seek a new set of parameters \n$\\Theta_{t+1} = \\arg\\min_{\\tilde{\\Theta}} V_t(\\tilde{\\Theta})$ where \n\n$$V_t(\\tilde{\\Theta}) = \\Delta(\\tilde{\\Theta}, \\Theta_t) + \\eta \\left( \\mathrm{loss}(S|\\Theta_t) + (\\tilde{\\Theta} - \\Theta_t) \\cdot \\nabla_{\\Theta} \\mathrm{loss}(S|\\Theta_t) \\right) + \\lambda \\left( \\sum_{i=1}^m \\tilde{w}_i - 1 \\right). \\qquad (1)$$ \n\nHere $\\nabla_{\\Theta} \\mathrm{loss}(S|\\Theta_t)$ denotes the gradient of the loss at $\\Theta_t$. We use the above method, \nEq. (1), to derive the updates of this paper. For density estimation, it is natural to use the \nrelative entropy between the new and old density as a distance. In this paper we use the \njoint density between the observed (data points) and hidden variables (the indices of the \nGaussians). This motivates the name joint-entropy update. \n\n4 Entropy based distance functions \n\nWe first consider the relative entropy between the new and old parameters of a \nsingle Gaussian. Using the notation introduced in Sec. 2, the relative entropy between two \nGaussian distributions denoted by $\\tilde{\\Theta}_i$, $\\Theta_i$ is \n\n$$\\Delta(\\tilde{\\Theta}_i, \\Theta_i) \\stackrel{\\mathrm{def}}{=} \\int_{x \\in \\mathbb{R}^d} P(x|\\tilde{\\Theta}_i) \\ln \\frac{P(x|\\tilde{\\Theta}_i)}{P(x|\\Theta_i)} \\, dx = \\frac{1}{2} \\ln \\frac{|C_i|}{|\\tilde{C}_i|} - \\frac{1}{2} \\tilde{E}_i\\left( (x - \\tilde{\\mu}_i)^T \\tilde{C}_i^{-1} (x - \\tilde{\\mu}_i) \\right) + \\frac{1}{2} \\tilde{E}_i\\left( (x - \\mu_i)^T C_i^{-1} (x - \\mu_i) \\right).$$ \n\nUsing standard (though tedious) algebra we can rewrite the expectations as follows: \n\n$$\\Delta(\\tilde{\\Theta}_i, \\Theta_i) = \\frac{1}{2} \\ln \\frac{|C_i|}{|\\tilde{C}_i|} - \\frac{d}{2} + \\frac{1}{2} \\mathrm{tr}(C_i^{-1} \\tilde{C}_i) + \\frac{1}{2} (\\tilde{\\mu}_i - \\mu_i)^T C_i^{-1} (\\tilde{\\mu}_i - \\mu_i). \\qquad (2)$$ 
\n\nThe relative entropy between the new and the old mixture models is the following: \n\n$$\\Delta(\\tilde{\\Theta}, \\Theta) \\stackrel{\\mathrm{def}}{=} \\int_x P(x|\\tilde{\\Theta}) \\ln \\frac{P(x|\\tilde{\\Theta})}{P(x|\\Theta)} \\, dx = \\int_x \\sum_{i=1}^m \\tilde{w}_i P(x|\\tilde{\\Theta}_i) \\ln \\frac{\\sum_{i=1}^m \\tilde{w}_i P(x|\\tilde{\\Theta}_i)}{\\sum_{i=1}^m w_i P(x|\\Theta_i)} \\, dx. \\qquad (3)$$ \n\nIdeally, we would like to use the above distance function in $V_t$ to give us an update of \n$\\tilde{\\Theta}$ in terms of $\\Theta$. However, there is no closed form expression for Eq. (3). Although the \nrelative entropy between two Gaussians is a convex function in their parameters, the relative \nentropy between two Gaussian mixtures is non-convex. Thus, the loss function $V_t(\\tilde{\\Theta})$ may \nhave multiple minima, making the problem of finding $\\arg\\min_{\\tilde{\\Theta}} V_t(\\tilde{\\Theta})$ difficult. \n\nIn order to sidestep this problem we use the log-sum inequality [3] to obtain an upper bound \nfor the distance function $\\Delta(\\tilde{\\Theta}, \\Theta)$. We denote this upper bound by $\\hat{\\Delta}(\\tilde{\\Theta}, \\Theta)$: \n\n$$\\Delta(\\tilde{\\Theta}, \\Theta) \\le \\hat{\\Delta}(\\tilde{\\Theta}, \\Theta) \\stackrel{\\mathrm{def}}{=} \\sum_{i=1}^m \\tilde{w}_i \\ln \\frac{\\tilde{w}_i}{w_i} + \\sum_{i=1}^m \\tilde{w}_i \\int_x P(x|\\tilde{\\Theta}_i) \\ln \\frac{P(x|\\tilde{\\Theta}_i)}{P(x|\\Theta_i)} \\, dx = \\sum_{i=1}^m \\tilde{w}_i \\ln \\frac{\\tilde{w}_i}{w_i} + \\sum_{i=1}^m \\tilde{w}_i \\Delta(\\tilde{\\Theta}_i, \\Theta_i). \\qquad (4)$$ \n\nWe call the new distance function $\\hat{\\Delta}(\\tilde{\\Theta}, \\Theta)$ the joint-entropy distance. Note that in this \ndistance the parameters $\\tilde{w}_i$ and $w_i$ are \"coupled\" in the sense that it is a convex combination \nof the distances $\\Delta(\\tilde{\\Theta}_i, \\Theta_i)$. In particular, $\\hat{\\Delta}(\\tilde{\\Theta}, \\Theta)$ as a function of the parameters \n$\\tilde{w}_i, \\tilde{\\mu}_i, \\tilde{C}_i$ does not remain constant any more when the parameters of the individual Gaussians \nare permuted. Furthermore, $\\hat{\\Delta}(\\tilde{\\Theta}, \\Theta)$ is also sufficiently convex so that finding the \nminimizer of $V_t$ is possible (see below). \n\n5 The updates \n\nWe are now ready to derive the new parameter estimation scheme. This is done by setting \nthe partial derivatives of $V_t$, with respect to $\\tilde{\\Theta}$, to $0$. 
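That is, our problem consists of solving \nthe following equations: \n\n$$\\frac{\\partial \\hat{\\Delta}(\\tilde{\\Theta}, \\Theta)}{\\partial \\tilde{w}_i} - \\frac{\\eta}{|S|} \\frac{\\partial \\ln P(S|\\Theta)}{\\partial w_i} + \\lambda = 0, \\qquad \\frac{\\partial \\hat{\\Delta}(\\tilde{\\Theta}, \\Theta)}{\\partial \\tilde{\\mu}_i} - \\frac{\\eta}{|S|} \\frac{\\partial \\ln P(S|\\Theta)}{\\partial \\mu_i} = 0, \\qquad \\frac{\\partial \\hat{\\Delta}(\\tilde{\\Theta}, \\Theta)}{\\partial \\tilde{C}_i} - \\frac{\\eta}{|S|} \\frac{\\partial \\ln P(S|\\Theta)}{\\partial C_i} = 0.$$ \n\nWe now use the fact that $C_i$ and thus $C_i^{-1}$ is symmetric. The derivatives of $\\hat{\\Delta}(\\tilde{\\Theta}, \\Theta)$, as \ndefined by Eq. (4) and Eq. (2), with respect to $\\tilde{w}_i$, $\\tilde{\\mu}_i$ and $\\tilde{C}_i$, are \n\n$$\\frac{\\partial \\hat{\\Delta}(\\tilde{\\Theta}, \\Theta)}{\\partial \\tilde{w}_i} = \\ln \\frac{\\tilde{w}_i}{w_i} + 1 + \\frac{1}{2} \\ln \\frac{|C_i|}{|\\tilde{C}_i|} - \\frac{d}{2} + \\frac{1}{2} \\mathrm{tr}(C_i^{-1} \\tilde{C}_i) + \\frac{1}{2} (\\tilde{\\mu}_i - \\mu_i)^T C_i^{-1} (\\tilde{\\mu}_i - \\mu_i), \\qquad (5)$$ \n\n$$\\frac{\\partial \\hat{\\Delta}(\\tilde{\\Theta}, \\Theta)}{\\partial \\tilde{\\mu}_i} = \\tilde{w}_i C_i^{-1} (\\tilde{\\mu}_i - \\mu_i), \\qquad (6)$$ \n\n$$\\frac{\\partial \\hat{\\Delta}(\\tilde{\\Theta}, \\Theta)}{\\partial \\tilde{C}_i} = -\\frac{1}{2} \\tilde{w}_i \\left( \\tilde{C}_i^{-1} - C_i^{-1} \\right). \\qquad (7)$$ \n\nTo simplify the notation throughout the rest of the paper we define the following variables: \n\n$$\\beta_i(x) \\stackrel{\\mathrm{def}}{=} \\frac{P(x|\\Theta_i)}{P(x|\\Theta)} \\quad \\mathrm{and} \\quad \\alpha_i(x) \\stackrel{\\mathrm{def}}{=} \\frac{w_i P(x|\\Theta_i)}{P(x|\\Theta)} = P(i|x, \\Theta) = w_i \\beta_i(x).$$ \n\nThe partial derivatives of the log-likelihood are computed similarly: \n\n$$\\frac{\\partial \\ln P(S|\\Theta)}{\\partial w_i} = \\sum_{x \\in S} \\frac{P(x|\\Theta_i)}{P(x|\\Theta)} = \\sum_{x \\in S} \\beta_i(x), \\qquad (8)$$ \n\n$$\\frac{\\partial \\ln P(S|\\Theta)}{\\partial \\mu_i} = \\sum_{x \\in S} \\frac{w_i P(x|\\Theta_i)}{P(x|\\Theta)} C_i^{-1} (x - \\mu_i) = \\sum_{x \\in S} \\alpha_i(x) C_i^{-1} (x - \\mu_i), \\qquad (9)$$ \n\n$$\\frac{\\partial \\ln P(S|\\Theta)}{\\partial C_i} = -\\frac{1}{2} \\sum_{x \\in S} \\frac{w_i P(x|\\Theta_i)}{P(x|\\Theta)} \\left( C_i^{-1} - C_i^{-1} (x - \\mu_i)(x - \\mu_i)^T C_i^{-1} \\right) = -\\frac{1}{2} \\sum_{x \\in S} \\alpha_i(x) \\left( C_i^{-1} - C_i^{-1} (x - \\mu_i)(x - \\mu_i)^T C_i^{-1} \\right). \\qquad (10)$$ \n\nWe now need to decide on an order for updating the parameter classes $w_i$, $\\mu_i$, and $C_i$. We \nuse the same order that EM uses, namely, $w_i$, then $\\mu_i$, and finally $C_i$. (After doing one \npass over all three groups we start again using the same order.) Using this order results in \na simplified set of equations as several terms in Eq. (5) cancel out. Denote the size of the \nsample by $N = |S|$. 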
We now need to sum the derivatives from Eq. (5) and Eq. (8) while \nusing the fact that the Lagrange multiplier $\\lambda$ simply ensures that the new weights $\\tilde{w}_i$ sum to \none. By setting the result to zero, we get that \n\n$$w_i \\leftarrow \\frac{w_i \\exp\\left( \\frac{\\eta}{N} \\sum_{x \\in S} \\beta_i(x) \\right)}{\\sum_{j=1}^m w_j \\exp\\left( \\frac{\\eta}{N} \\sum_{x \\in S} \\beta_j(x) \\right)}. \\qquad (11)$$ \n\nSimilarly, we sum Eq. (6) and Eq. (9), set the result to zero, and get that \n\n$$\\mu_i \\leftarrow \\mu_i + \\frac{\\eta}{N} \\sum_{x \\in S} \\beta_i(x) (x - \\mu_i). \\qquad (12)$$ \n\nFinally, we do the same for $C_i$. We sum Eq. (7) and Eq. (10) using the newly obtained $\\mu_i$, \n\n$$C_i^{-1} \\leftarrow C_i^{-1} + \\frac{\\eta}{N} \\sum_{x \\in S} \\beta_i(x) \\left( C_i^{-1} - C_i^{-1} (x - \\mu_i)(x - \\mu_i)^T C_i^{-1} \\right). \\qquad (13)$$ \n\nWe call the new iterative parameter estimation procedure the joint-entropy (JE) update. \nTo summarize, the JE update is composed of the following alternating steps: We first calculate \nfor each observation $x$ the value $\\beta_i(x) = P(x|\\Theta_i) / P(x|\\Theta)$ and then update the \nparameters as given by Eq. (11), Eq. (12), and Eq. (13). The JE update and EM differ in \nseveral aspects. First, EM uses a simple update for the mixture weights $w$. Second, EM \nuses the expectations (with respect to the current parameters) of the sufficient statistics [4] \nfor $\\mu_i$ and $C_i$ to find new sets of mean vectors and covariance matrices. The JE update uses a \n(slightly different) weighted average of the observations and, in addition, it adds the old \nparameters. The learning rate $\\eta$ determines the proportion to be used in summing the old \nparameters and the newly estimated parameters. Last, EM estimates the covariance matrices \n$C_i$ whereas the new update estimates the inverses, $C_i^{-1}$, of these matrices. Thus, it is \npotentially more stable numerically in cases where the covariance matrices have a small \ncondition number. \n\nTo obtain an on-line procedure we need to update the parameters after each new observation, \none at a time. 
That is, rather than summing over all $x \\in S$, for a new observation $x_t$, we update \nthe parameters and get a new set of parameters $\\Theta_{t+1}$ using the current parameters $\\Theta_t$. The \nnew parameters are then used for inducing the likelihood of the next observation $x_{t+1}$. The \non-line parameter estimation procedure is composed of the following steps: \n\n1. Set: $\\beta_i(x_t) = P(x_t|\\Theta_i) / P(x_t|\\Theta)$. \n2. Parameter updates: \n(a) $w_i \\leftarrow w_i \\exp(\\eta_t \\beta_i(x_t)) / \\sum_{j=1}^m w_j \\exp(\\eta_t \\beta_j(x_t))$ \n(b) $\\mu_i \\leftarrow \\mu_i + \\eta_t \\beta_i(x_t) (x_t - \\mu_i)$ \n(c) $C_i^{-1} \\leftarrow C_i^{-1} + \\eta_t \\beta_i(x_t) \\left( C_i^{-1} - C_i^{-1} (x_t - \\mu_i)(x_t - \\mu_i)^T C_i^{-1} \\right)$. \n\nTo guarantee convergence of the on-line update one should use a diminishing learning rate, \nthat is $\\eta_t \\to 0$ as $t \\to \\infty$ (for further motivation see [11]). \n\nFigure 1: Left: comparison of the convergence rate of EM and the JE update with different \nlearning rates. Right: example of a case where EM initially increases the likelihood faster \nthan the JE update. \n\n6 Experiments \n\nWe conducted numerous experiments with the new update. Due to the lack of space we describe \nhere only two. In the first experiment we compared the JE update and EM in batch \nsettings. 
We generated data from Gaussian mixture distributions with varying numbers of \ncomponents ($m$ = 2 to 100) and dimensions ($d$ = 2 to 20). Due to the lack of space \nwe describe here results obtained from only one setting. In this setting the examples were \ngenerated by a mixture of 5 components with $w = (0.4, 0.3, 0.2, 0.05, 0.05)$. The mean \nvectors were the 5 standard unit vectors in the Euclidean space $\\mathbb{R}^5$ and we set all of the covariance \nmatrices to the identity matrix. We generated 1000 examples. We then ran EM and \nthe JE update with different learning rates ($\\eta$ = 1.9, 1.5, 1.1, 1.05). To make sure that all \nthe runs end in the same local maximum we first performed three EM iterations. The \nresults are shown on the left hand side of Figure 1. In this setting, the JE update with high \nlearning rates achieves much faster convergence than EM. We would like to note that this \nbehavior is by no means esoteric: most of our experiments yielded similar results. \n\nWe found a different behavior in low dimensional settings. On the right hand side of Figure 1 \nwe show convergence rate results for a mixture containing two components, each of \nwhich is a one-dimensional Gaussian. The means of the two components were located \nat 1 and -1 with the same variance of 2. Thus, there is a significant \"overlap\" between \nthe two Gaussians constituting the mixture. The mixture weight vector was (0.5, 0.5). We \ngenerated 50 examples according to this distribution and initialized the parameters as follows: \n$\\mu_1 = 0.01$, $\\mu_2 = -0.01$, $\\sigma_1 = \\sigma_2 = 2$, $w_1 = w_2 = 0.5$. We see that initially \nEM increases the likelihood much faster than the JE update. Eventually, the JE update \nconverges faster than EM when using a small learning rate (in the example appearing in \nFigure 1 we set $\\eta = 1.05$). However, in this setting, the JE update diverges when learning \nrates larger than $\\eta = 1.1$ are used. 
This behavior underscores the advantages of both methods. \nEM uses a fixed learning rate and is guaranteed to converge to a local maximum of the \nlikelihood, under conditions that typically hold for mixtures of Gaussians [4, 12]. The JE update, \non the other hand, encompasses a learning rate and in many settings it converges much \nfaster than EM. However, the superior performance in high dimensional cases demands its \nprice in low dimensional \"dense\" cases. Namely, a very conservative learning rate, which \nis hard to tune, needs to be used. In these cases, EM is a better alternative, offering almost \nthe same convergence rate without the need to tune any parameters. \n\nAcknowledgments Thanks to Duncan Herring for careful proofreading and providing \nus with interesting data sets. \n\nReferences \n\n[1] E. Bauer, D. Koller, and Y. Singer. Update rules for parameter estimation in Bayesian \nnetworks. In Proc. of the 13th Annual Conf. on Uncertainty in AI, pages 3-13, 1997. \n\n[2] C.M. Bishop. Neural Networks and Pattern Recognition. Oxford Univ. Press, 1995. \n\n[3] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley, 1991. \n\n[4] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum-likelihood from incomplete \ndata via the EM algorithm. Journal of the Royal Statistical Society, B39:1-38, 1977. \n\n[5] R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973. \n\n[6] D.P. Helmbold, J. Kivinen, and M.K. Warmuth. Worst-case loss bounds for sigmoided \nneurons. In Advances in Neural Information Processing Systems 7, pages \n309-315, 1995. \n\n[7] D.P. Helmbold, R.E. Schapire, Y. Singer, and M.K. Warmuth. A comparison of new \nand old algorithms for a mixture estimation problem. Machine Learning, Vol. 7, 1997. \n\n[8] J. Kivinen and M.K. Warmuth. Additive versus exponentiated gradient updates for \nlinear prediction. Information and Computation, 132(1):1-64, January 1997. \n\n[9] J. 
Kivinen and M.K. Warmuth. Relative loss bounds for multidimensional regression \nproblems. In Advances in Neural Information Processing Systems 10, 1997. \n\n[10] R.A. Redner and H.F. Walker. Mixture densities, maximum likelihood and the EM \nalgorithm. SIAM Review, 26(2), 1984. \n\n[11] D.M. Titterington, A.F.M. Smith, and U.E. Makov. Statistical Analysis of Finite Mixture \nDistributions. Wiley, 1985. \n\n[12] C.F. Wu. On the convergence properties of the EM algorithm. Annals of Stat., 11:95-103, \n1983. \n", "award": [], "sourceid": 1525, "authors": [{"given_name": "Yoram", "family_name": "Singer", "institution": null}, {"given_name": "Manfred K.", "family_name": "Warmuth", "institution": null}]}