{"title": "Maximum Conditional Likelihood via Bound Maximization and the CEM Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 494, "page_last": 500, "abstract": null, "full_text": "Maximum Conditional Likelihood via \nBound Maximization and the CEM \n\nAlgorithm \n\nTony J ebara and Alex Pentland \n\nVision and Modeling, MIT Media Laboratory, Cambridge MA \n\nhttp://www.rnedia.rnit.edu/ ~ jebara \n\n{ jebara,sandy }~rnedia.rnit.edu \n\nAbstract \n\nWe present the CEM (Conditional Expectation Maximi::ation) al(cid:173)\ngorithm as an extension of the EM (Expectation M aximi::ation) \nalgorithm to conditional density estimation under missing data. A \nbounding and maximization process is given to specifically optimize \nconditional likelihood instead of the usual joint likelihood. We ap(cid:173)\nply the method to conditioned mixture models and use bounding \ntechniques to derive the model's update rules . Monotonic conver(cid:173)\ngence, computational efficiency and regression results superior to \nEM are demonstrated. \n\n1 \n\nIntroduction \n\nConditional densities have played an important role in statistics and their merits \nover joint density models have been debated. Advantages in feature selection , ro(cid:173)\nbustness and limited resource allocation have been studied. Ultimately, tasks such \nas regression and classification reduce to the evaluation of a conditional density. \n\nHowever, popularity of maximumjoint likelihood and EM techniques remains strong \nin part due to their elegance and convergence properties . Thus, many conditional \nproblems are solved by first estimating joint models then conditioning them . This \nresults in concise solutions such as the N adarya- Watson estimator [2], Xu's mixture \nof experts [7], and Amari's em-neural networks [1]. However, direct conditional \ndensity approaches [2, 4] can offer solutions with higher conditional likelihood on \ntest data than their joint counter-parts. 
\n\n\fMaximum Conditional Likelihood via Bound Maximization and CEM \n\n495 \n\n01 \n(a) La = -4.2 L; = -2.4 \n\n'\" \n\n1~~1 \n\n15 \n\n20 \n\n!i. \n\n(b) Lb = -5.2 Lb = -1.8 \n\nFigure 1: Average Joint (x, y) vs. Conditional (ylx) Likelihood Visualization \n\nPop at [6] describes a simple visualization example where 4 clusters must be fit with \n2 Gaussian models as in Figure 1. Here, the model in (a) has a superior joint likeli(cid:173)\nhood (La> Lb) and hence a better p(x, y) solution. However, when the models are \nconditioned to estimate p(ylx), model (b) is superior (Lb > L~). Model (a) yields \na poor unimodal conditional density in y and (b) yields a bi-modal conditional \ndensity. It is therefore of interest to directly optimize conditional models using con(cid:173)\nditionallikelihood. We introduce the CEM (Conditional Expectation Maximization) \nalgorithm for this purpose and apply it to the case of Gaussian mixture models. \n\n2 EM and Conditional Likelihood \n\nFor joint densities, the tried and true EM algorithm [3] maximizes joint likelihood \nover data. However, EM is not as useful when applied to conditional density estima(cid:173)\ntion and maximum conditional likelihood problems. Here, one typically resorts to \nother local optimization techniques such as gradient descent or second order Hessian \nmethods [2]. We therefore introduce CEM, a variant of EM , which targets condi(cid:173)\ntional likelihood while maintaining desirable convergence properties. The CEM \nalgorithm operates by directly bounding and decoupling conditional likelihood and \nsimplifies M-step calculations. \n\nIn EM, a complex density optimization is broken down into a two-step iteration \nusing the notion of missing data. The unknown data components are estimated via \nthe E-step and a simplified maximization over complete data is done in the M-step. 
\nIn more practical terms, EM is a bound maximization : the E-step finds a lower \nbound for the likelihood and the M-step maximizes the bound. \n\nP(Xi, Yi18) = L p(m, Xi, Yi1 8 ) \n\nM \n\nm=l \n\n(1) \n\nConsider a complex joint density p(Xi , Yi 18) which is best described by a discrete \n(or continuous) summation of simpler models (Equation 1) . Summation is over the \n'missing components' m . \n\nl:!.l \n\n> \n\nL:~llog(P(Xi' Yi 18t \u00bb) -Iog(p(xi ' YiI 8t - 1 )) \n\",\\,N \nim og p(m,X\"Y.let I) were \nL...i=l L...m=l \n\np(m ,X. ,Y.le t) \n\n\",\\,M h \n\nh \n\nI \n\nh \nim = \",\\,M \n\np(m,x. ,y.le t- I) \n\nL...,,=I p(n ,x\"y .Iet- I) \n\n(2) \n\nBy appealing to Jensen's inequality, EM obtains a lower bound for the incremental \nlog-likelihood over a data set (Equation 2) . Jensen's inequality bounds the log(cid:173)\narithm of the sum and the result is that the logarithm is applied to each simple \n\n\f496 \n\nT. Jebara and A. Pentland \n\nmodel p(m, Xi , yd8) individually. It then becomes straightforward t.o compute the \nderivatives with respect to e and set to zero for maximization (M-step) . \n\nI 8)\n\nI 8) L...m=IP(m,xi,y;j8) \np( Y i Xi, - = L..: p( m , Y i Xi , - = =c\"\"-7M+------'-----'-\nLm=IP(m,XiI8 ) \n\nAJ \n\" \nm=l \n\n,\\\"M \n\n(3) \n\nHowever, the elegance of EM is compromised when we consider a conditioned density \nas in Equation :3. The corresponding incremental conditional log-likelihood, L:l.lc, is \nshown in Equation 4. \n\nL~llog(p(Yilxi' 8 t )) -log(p(ydxi, 8 t - 1 ) \nLM \n,\\\"N \nL.... \n!=1 \n\nm_I P(m ,X. ,Y.10) -I \nH \n\nL;\"=I p(m ,X. ,Y.10 t -\n\nLM \n\nI \nog \n\nog \n\nM \n\nt \n\nl ) \n\nn_IP(n ,X, 0 \n\nI t) \n\nLn=1 p(n ,X.10 t - l ) \n\n(4) \n\nThe above is a difference between a ratio of joints and a ratio of marginals. If \nJensen's inequality is applied to the second term in Equation 4 it yields an upper \nbound since the term is subtracted (this would compromise convergence). 
Thus, \nonly the first ratio can be lower bounded with Jensen (Equation 5). \n\nL:l.jC>~~h' I \n\n- L..: L..: \ni=1 m=1 \n\n2m og ( \n\np(m,xi,YiI 8t ) -I \np m, Xi, Yi -\n\n18t - 1) \n\nog \n\nL~~lp(n,xiI8t) \nLn=1 p(n, XiI8 t - 1) \n\nM \n\n(5) \n\nNote the lingering logarithm of a sum which prevents a simple M-Step. At this point, \none would resort to a Generalized EM (GEM) approach which requires gradient or \nsecond-order ascent techniques for the M-step. For example, Jordan et al. overcome \nthe difficult M-step caused by EM with an Iteratively Re-Weighted Least Squares \nalgorithm in the mixtures of experts architecture [4]. \n\n3 Conditional Expectation Maximization \n\nThe EM algorithm can be extended by substituting Jensen's inequality for a dif(cid:173)\nferent bound. Consider the upper variational bound of a logarithm x-I 2: log(x) \n(which becomes a lower bound on the negative log). The proposed logarithm's \nbound satisfies a number of desiderata: (1) it makes contact at the current op(cid:173)\nerating point1, (2) it is tangential to the logarithm, (3) it is a tight bound, (4) \nit is simple and (5) it is the variational dual of the logarithm. Substituting this \nlinear bound into the incremental conditional log-likelihood maintains a true lower \nbounding function Q (Equation 6). \n\nThe Mixture of Experts formalism [4J offers a graceful representation of a conditional \ndensity using experts (conditional sub-models) and gates (marginal sub-models). \nThe Q function adopts this form in Equation 7. \n\n1 The current operating point is 1 since the e t model in the ratio is held fixed at the \n\nprevious iteration's value e t - 1 . 
\n\n\fMaximum Conditional Likelihood via Bound Maximization and CEM \n\n497 \n\nL~l L~=1 {him(logp(Yilm,Xi,e t ) +logp(m,xile t ) - Zim) -\n\nriP(m , xile) + ir} \n\nwhere Zim = log(p(m,xi,Yile t -- 1 )) \n\nand ri = (L~=1p(n'Xilet-1) )-1 \n\n(7) \n\nComputing this Q function forms the CE-step in the Conditional Expectation Max(cid:173)\nimization algorithm and it results in a simplified M-step. Note the absence of the \nlogarithm of a sum and the decoupled models. The form here allows a more straight(cid:173)\nforward computation of derivatives with respect to e t and a more tractable M-Step. \nFor continuous missing data, a similar derivation holds. \n\nAt this point , without loss of generality, we specifically attend to the case of a condi(cid:173)\ntioned Gaussian mixture model and derive the corresponding M-Step calculations. \nThis serves as an implementation example for comparison purposes. \n\n4 CEM and Bound Maxinlization for Gaussian Mixtures \n\nIn deriving an efficient M-step for the mixture of Gaussians, we call upon more \nbounding techniques that follow the CE-step and provide a monotonically conver(cid:173)\ngent learning algori thm . The form ofthe condi tional model we will train is obtained \nby conditioning a joint mixture of Gaussians. We write the conditional density \nin a experts-gates form as in Equation 8. We use unnormalized Gaussian gates \nN(x; p,~) = exp( - ~(x - p)T~-1 (x - p\u00bb since conditional models do not require \ntrue marginal densities over x (i .e. that necessarily integrate to 1). Also, note that \nthe parameters of the gates (0:' , px , :Exx ) are independent of the parameters of the \nexperts (vm,rm,om). \n\nBoth gates and experts are optimized independently and have no variables in com(cid:173)\nmon. An update is performed over the experts and then over the gates. 
If each \nof those causes an increase, we converge to a local maximum of conditional log(cid:173)\nlikelihood (as in Expectation Conditional Maximization [5]). \n\np(Ylx,8) \n\n(8) \n\nTo update the experts, we hold the gates fixed and merely take derivatives of the Q \nfunction with respect to the expert parameters (m = {v m , rm, am} ) and set them \nto O. Each expert is effectively decoupled from other terms (gates, other experts, \netc.). The solution reduces to maximizing the log of a single conditioned Gaussian \nand is analytically straightforward. \n\n8Q(e t ,e(t-l\u00bb) \n\n8<1>'\" \n\n(9) \n\nSimilarly, to update the gate mixing proportions, derivatives of the Q function are \ntaken with respect to O:'m and set to O. By holding the other parameters fixed , the \nupdate equation for the mixing proportions is numerically evaluated (Equation 10). \n\nN \n\nN \n\nO:'m \n\n:= LriN(xi;P~,:E~x) le(l-I) {Lhim}-l \n\n(10) \n\ni=l \n\ni=l \n\n\f498 \n\nc, \n,1 \\ \n\nO~ \n\n01 \n~ \n\n01 \n\nI \nr \nI \n\n\\ \n\ni \n\\ \n\nat ,f \n\n00 \n\n\\ \n',_ \n\n, \n//---~ \n\n',,---, \n\n~, ,>/~ ..... : \n\n-\n\n' ., \n\ni \n\n\\, \n\n':li:~ \n\n'--... \n\nc ' \n01' \n\n_1 \n\n-2 \n\n_I \n\n..!..: \n\nI \n\n00 \n\n' - ,- - 10 \n\nIi \n\n- - - - - - - -\n\nT. Jebara and A. Pentland \n\nj---\n\n''''\\ ~ \nI ~ jr-' --...-.;==i-J \nI \nI\u00b7Ji \n\n( a) f Function \n\n(b) Bound on f..L \n\n( c) g Function \n\n\" \n\n(d) Bound on I: xx \n\nFigure 2: Bound Width Computation and Example Bounds \n\n4.1 Bounding Gate Means \nTaking derivatives of Q and setting to a is not as straightforward for the case of \nthe gate means (even though they are decoupled). What is desired is a simple \nupdate rule (i.e. computing an empirical mean). Therefore, we further bound the \nQ function for the M-step. The Q function is actually a summation of sub-elements \nQim and we bound it instead by a summation of quadratic functions on the means \n(Equation 11). 
\n\nQ(et , e(t-1)) = L L Q(et , e(t-1))im > L L kim - Wimllf..L~ - ciml1 2 (11) \n\nN M \n\nN M \n\ni=l m=l \n\ni=l m=l \n\nEach quadratic bound has a location parameter cim (a centroid), a scale parameter \nWim (narrowness), and a peak value at kim. The sum of quadratic bounds makes \ncontact with the Q function at the old values of the model e t - 1 where the gate \nmean was originally f..L':* and the covariance is I:':x*' To facilitate the derivation, \none may assume that the previous mean was zero and the covariance was identity \nif the data is appropriately whitened with respect to a given gate. \n\nThe parameters of each quadratic bound are solved by ensuring that it contacts the \ncorresponding Qim function at et - 1 and they have equal derivatives at contact (i .e. \ntangential contact) . Sol ving these constraints yields quadratic parameters for each \ngate m and data point i in Equation 12 (kim is omitted for brevity) . \n\n> \n\n(12) \n\nThe tightest quadratic bound occurs when Wim is minimal (without violating the \ninequality). The expression for Wim reduces to finding the minimal value, wim, as in \nEquation 13 (here p2 = xT xd. The f function is computed numerically only once \nand stored as a lookup table (see Figure 2(a)). We thus immediately compute the \noptimal wim and the rest of the quadratic bound's parameters obtaining bounds as \nin Figure 2(b) where a Qim is lower bounded. \n\n* _ \nWim -\n\n. \n\nrlCl'm C \n\nmax \n\n1 2 \n\n1 2 e- 2 C eCP - cp - 1 \n2 \n\n{ - -p \ne \n\n2 \nc \n\n' \nh\u00b7 \n} + 1m _\n- - -\n2 \n\n. \n\nr l Cl'm e \n\n, \nh\u00b7 \n- -p f( ) + 1m \np - -\n2 \n\n1 2 \n2 \n\n(13) \n\nThe gate means f..L~ are solved by maximizing the sum of the M x N parabolas which \nbound Q. The update is f..L': = (2: wimCim) (2: wim)-l. This mean is subsequently \nunwhitened to undo earlier data transformations. \n\n\fMaximum Conditional Likelihood via Bound Maximization and CEM \n\n499 \n\n:l \n\n-10'-, \n\n' . \n..... : \nJ ... . 
\n\nFigure 3: Conditional Density Estimation for CEM and EM. (a) Data; (b) CEM p(y|x); (c) CEM l^c; (d) EM fit; (e) EM p(y|x); (f) EM l^c. \n\n4.2 Bounding Gate Covariances \n\nHaving derived the update equation for the gate means, we now turn our attention to the gate covariances. We bound the Q function with logarithms of Gaussians. Maximizing this bound (a sum of log-Gaussians) reduces to the maximum-likelihood estimation of a covariance matrix. The bound for a Q_im sub-component is shown in Equation 14. Once again, we assume the data has been appropriately whitened with respect to the gate's previous parameters (the gate's previous mean is 0 and previous covariance is identity). Equation 15 solves for the log-Gaussian parameters (again p² = x_i^T x_i). \n\nQ(Θ^t, Θ^{t-1})_im ≥ k_im - w_im c_im^T (Σ_xx^m)^{-1} c_im - w_im log |Σ_xx^m|    (14) \n\nThe computation for the minimal w_im simplifies to w*_im = r_i α_m g(p). The g function is derived and plotted in Figure 2(c). An example of a log-Gaussian bound on a sub-component of the Q function is shown in Figure 2(d). Each sub-component corresponds to a single data point as we vary one gate's covariance. All M x N log-Gaussian bounds are computed (one for each data point and gate combination) and are summed to bound the Q function in its entirety. \n\nTo obtain a final answer for the update of the gate covariances Σ_xx^m we simply maximize the sum of log-Gaussians (parametrized by w*_im, k_im, c_im). The update is Σ_xx^m = ( Σ_i w*_im c_im c_im^T ) ( Σ_i w*_im )^{-1}. This covariance is subsequently unwhitened, inverting the whitening transform applied to the data. 
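Both closed-form gate updates above are weighted means over the bound parameters; a minimal sketch for one gate, where the widths w*_im and centroids c_im are synthetic placeholders rather than values computed from the f and g lookup tables:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 2

# synthetic bound parameters for one gate (placeholders for w*_im, c_im)
w = rng.uniform(0.5, 2.0, size=N)
c = rng.normal(size=(N, d))

# gate mean update: mu = (sum_i w_i c_i) / (sum_i w_i)
mu = (w[:, None] * c).sum(axis=0) / w.sum()

# gate covariance update: Sigma = (sum_i w_i c_i c_i^T) / (sum_i w_i)
sigma = (w[:, None, None] * np.einsum('ni,nj->nij', c, c)).sum(axis=0) / w.sum()
```

Because each term w_i c_i c_iᵀ is positive semidefinite and the weights are positive, the resulting covariance estimate is automatically symmetric positive semidefinite.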
\n\n5 Results \n\nThe CEM algorithm updates the conditioned mixture of Gaussians by computing \nhim and rim in the CE steps and interlaces these with updates on the experts, \nmixing proportions, gate means and gate covariances. For the mixture of Gaussians , \neach CEM update has a computation time that is comparable with that of an EM \nupdate (even for high dimensions). However, conditional likelihood (not joint) is \nmonotonically increased . \n\nConsider the 4-cluster (x , y) data in Figure 3(a). The data is modeled with a con(cid:173)\nditional density p(ylx) using only 2 Gaussian models . Estimating the density with \nCEM yields the p(ylx) shown in Figure 3(b). CEM exhibits monotonic conditional \nlikelihood growth (Figure 3(c)) and obtains a more conditionally likely model. In \n\n\f500 \n\nAlgorithm \nAbalone \n\nT. Jebara and A. Pentland \n\nTable 1: Test Results. Class label regression accuracy data. \ncorrelation, a hidden units, CCN5=5 hidden LD=linear discriminant). \n\n(CNNO=cascade(cid:173)\n\nthe EM case, a joint p(x, y) clusters the data as in Figure 3(d) . Conditioning it \nyields the p(ylx) in Figure 3(e) . Figure 3(f) depicts EM's non-monotonic evolution \nof conditional log-likelihood. EM produces a superior joint likelihood but an infe(cid:173)\nrior conditional likelihood. Note how the CEM algorithm utilized limited resources \nto capture the multimodal nature of the distribution in y and ignored spurious bi(cid:173)\nmodal clustering in the x feature space. These properties are critical for a good \nconditional density p(ylx). \n\nFor comparison , standard databases were used from DCI 2. Mixture models were \ntrained with EM and CEM , maximizingjoint and conditional likelihood respectively. \nRegression results are shown in Table 1. CEM exhibited , monotonic conditional log(cid:173)\nlikelihood growth and out-performed other methods including EM with the same \n2-Gaussian model (EM2 and CEM2). 
\n\n6 Discussion \n\nWe have demonstrated a variant of EM called CEM which optimizes conditional \nlikelihood efficiently and monotonically. The application of CEM and bound maxi(cid:173)\nmization to a mixture of Gaussians exhibited promising results and better regression \nthan EM . In other work , a MAP framework with various priors and a deterministic \nannealing approach have been formulated. Applications of the CEM algorithm to \nnon-linear regressor experts and hidden Markov models are currently being investi(cid:173)\ngated . Nevertheless, many applications CEM remain to be explored and hopefully \nothers will be motivated to extend the initial results. \n\nAcknowledgements \n\nMany thanks to Michael Jordan and Kris Popat for insightful discussions. \n\nReferences \n\n[1] S. Amari. Information geometry of em and em algorithms for neural networks. Neural \n\nNetworks, 8(9), 1995 . \n\n[23] C. Bishop. Neural Networks Jor Pattern Recognition. Oxford Press, 1996. \n[ ] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via \n\nthe em algorithm. Journal oj the Royal Statistical Society, B39, 1977. \n\n[4] M. Jordan and R. Jacobs. Hierarchical mixtures of experts and the em algorithm . \n\nNeural Computation, 6:181- 214, 1994. \n\n[5] X. Meng and D. Rubin. Maximum likelihood estimation via the ecm algorithm : A \n\ngeneral framework. Biometrika , 80(2), 1993. \n\n[6] A. Popat. Conjoint probabilistic subband modeling (phd. thesis). Technical Report \n\n461, M.LT. Media Laboratory, 1997. \n\n[7] 1. Xu, M. Jordan , and G. Hinton . An alternative model for mixtures of experts . In \n\nNeural InJormation Processing Systems 7, 1995. \n\n2http://www.ics.uci.edu/'''-'mlearn/MLRepository.html \n\n\f", "award": [], "sourceid": 1537, "authors": [{"given_name": "Tony", "family_name": "Jebara", "institution": null}, {"given_name": "Alex", "family_name": "Pentland", "institution": null}]}