{"title": "An Alternative Model for Mixtures of Experts", "book": "Advances in Neural Information Processing Systems", "page_first": 633, "page_last": 640, "abstract": null, "full_text": "An Alternative Model for Mixtures of \n\nExperts \n\nDept. of Computer Science, The Chinese University of Hong Kong \n\nShatin, Hong Kong, Emaillxu@cs.cuhk.hk \n\nLei Xu \n\nMichael I. Jordan \n\nDept. of Brain and Cognitive Sciences \n\nMIT \n\nCambridge, MA 02139 \n\nToronto, M5S lA4, Canada \n\nGeoffrey E. Hinton \n\nDept. of Computer Science \n\nUniversity of Toronto \n\nAbstract \n\nWe propose an alternative model for mixtures of experts which uses \na different parametric form for the gating network. The modified \nmodel is trained by the EM algorithm. In comparison with earlier \nmodels-trained by either EM or gradient ascent-there is no need \nto select a learning stepsize. We report simulation experiments \nwhich show that the new architecture yields faster convergence. \nWe also apply the new model to two problem domains: piecewise \nnonlinear function approximation and the combination of multiple \npreviously trained classifiers. \n\n1 \n\nINTRODUCTION \n\nFor the mixtures of experts architecture (Jacobs, Jordan, Nowlan & Hinton, 1991), \nthe EM algorithm decouples the learning process in a manner that fits well with the \nmodular structure and yields a considerably improved rate of convergence (Jordan & \nJacobs, 1994). The favorable properties of EM have also been shown by theoretical \nanalyses (Jordan & Xu, in press; Xu & Jordan, 1994). \nIt is difficult to apply EM to some parts of the mixtures of experts architecture \nbecause of the nonlinearity of softmax gating network. This makes the maximiza-\n\n\f634 \n\nLei Xu, Michael!. Jordan, Geoffrey E. Hinton \n\ntion with respect to the parameters in gating network nonlinear and analytically \nunsolvable even for the simplest generalized linear case. 
Jordan and Jacobs (1994) suggested a double-loop approach in which an inner loop of iteratively-reweighted least squares (IRLS) is used to perform the nonlinear optimization. However, this requires extra computation, and the stepsize must be chosen carefully to guarantee the convergence of the inner loop.

We propose an alternative model for mixtures of experts which uses a different parametric form for the gating network. This form is chosen so that the maximization with respect to the parameters of the gating network can be handled analytically. Thus, a single-loop EM can be used, and no learning stepsize is required to guarantee convergence. We report simulation experiments which show that the new architecture yields faster convergence. We also apply the model to two problem domains. One is a piecewise nonlinear function approximation problem with smooth blending of pieces specified by polynomial, trigonometric, or other prespecified basis functions. The other is to combine classifiers developed previously, a general problem with a variety of applications (Xu et al., 1991, 1992). Xu and Jordan (1993) proposed to solve the problem by using the mixtures of experts architecture and suggested an algorithm for bypassing the difficulty caused by the softmax gating networks. Here, we show that the algorithm of Xu and Jordan (1993) can be regarded as a special case of the single-loop EM given in this paper and that the single-loop EM also provides a further improvement.

2 MIXTURES OF EXPERTS AND EM LEARNING

The mixtures of experts model is based on the following conditional mixture:

P(y|x, \Theta) = \sum_{j=1}^K g_j(x, \nu) P(y|x, \theta_j),
P(y|x, \theta_j) = (2\pi)^{-m/2} |\Gamma_j|^{-1/2} \exp\{-\tfrac{1}{2} [y - f_j(x, w_j)]^T \Gamma_j^{-1} [y - f_j(x, w_j)]\},   (1)

where x \in R^n, \Theta consists of \nu and \{\theta_j\}_1^K, and \theta_j consists of \{w_j, \Gamma_j\}. The vector f_j(x, w_j) is the output of the j-th expert net. The scalar g_j(x, \nu), j = 1, \ldots
, K, is given by the softmax function:

g_j(x, \nu) = e^{\beta_j(x, \nu)} / \sum_i e^{\beta_i(x, \nu)}.   (2)

In this equation, the \beta_j(x, \nu), j = 1, \ldots, K are the outputs of the gating network.

The parameter \Theta is estimated by Maximum Likelihood (ML), where the log likelihood is given by L = \sum_t \ln P(y^{(t)}|x^{(t)}, \Theta). The ML estimate can be found iteratively using the EM algorithm as follows. Given the current estimate \Theta^{(k)}, each iteration consists of two steps.

(1) E-step. For each pair \{x^{(t)}, y^{(t)}\}, compute h_j^{(k)}(y^{(t)}|x^{(t)}) = P(j|x^{(t)}, y^{(t)}), and then form a set of objective functions:

Q_j^e(\theta_j) = \sum_t h_j^{(k)}(y^{(t)}|x^{(t)}) \ln P(y^{(t)}|x^{(t)}, \theta_j), j = 1, \ldots, K;
Q^g(\nu) = \sum_t \sum_j h_j^{(k)}(y^{(t)}|x^{(t)}) \ln g_j(x^{(t)}, \nu).   (3)

(2) M-step. Find a new estimate \Theta^{(k+1)} = \{\{\theta_j^{(k+1)}\}_{j=1}^K, \nu^{(k+1)}\} with:

\theta_j^{(k+1)} = \arg\max_{\theta_j} Q_j^e(\theta_j), j = 1, \ldots, K; \quad \nu^{(k+1)} = \arg\max_{\nu} Q^g(\nu).   (4)

In certain cases, for example when f_j(x, w_j) is linear in the parameters w_j, \max_{\theta_j} Q_j^e(\theta_j) can be solved by solving \partial Q_j^e / \partial \theta_j = 0. When f_j(x, w_j) is nonlinear with respect to w_j, however, the maximization cannot be performed analytically. Moreover, due to the nonlinearity of softmax, \max_\nu Q^g(\nu) cannot be solved analytically in any case. There are two possibilities for attacking these nonlinear optimization problems. One is to use a conventional iterative optimization technique (e.g., gradient ascent) to perform one or more inner-loop iterations. The other is to simply find a new estimate such that Q_j^e(\theta_j^{(k+1)}) \geq Q_j^e(\theta_j^{(k)}) and Q^g(\nu^{(k+1)}) \geq Q^g(\nu^{(k)}). Usually, the algorithms that perform a full maximization during the M-step are referred to as "EM" algorithms, and algorithms that simply increase the Q function during the M-step as "GEM" algorithms.
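As a concrete illustration of the E-step above, the posteriors h_j^{(k)} can be computed directly from the gate and expert densities. The following sketch assumes scalar Gaussian experts with a shared noise variance and a softmax gate; the function names and array shapes are our own assumptions, not part of the original model description:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gaussian_lik(y, mean, var):
    # Scalar Gaussian density N(y; mean, var)
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def e_step(X, y, V, W, var):
    """Posterior responsibilities h_j = P(j | x, y) for every sample.

    X: (N, d) inputs, y: (N,) scalar targets,
    V: (d+1, K) gate weights, W: (d+1, K) expert weights (hypothetical shapes).
    """
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])     # append a bias term
    g = softmax(X1 @ V)                               # softmax gate g_j(x, nu), eq. (2): N x K
    means = X1 @ W                                    # expert predictions f_j(x, w_j): N x K
    joint = g * gaussian_lik(y[:, None], means, var)  # g_j(x, nu) * P(y|x, theta_j)
    return joint / joint.sum(axis=1, keepdims=True)   # normalize over the K experts
```

The returned h feeds the weighted objectives Q_j^e and Q^g of eq. (3); under the softmax gate, maximizing Q^g over \nu is the step that has no analytic solution.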
In this paper we will further distinguish between EM algorithms requiring and not requiring an iterative inner loop by designating them as double-loop EM and single-loop EM, respectively.

Jordan and Jacobs (1994) considered the case of linear \beta_j(x, \nu) = \nu_j^T [x, 1] with \nu = [\nu_1, \cdots, \nu_K] and semi-linear f_j(w_j^T [x, 1]) with nonlinear f_j(\cdot). They proposed a double-loop EM algorithm by using the IRLS method to implement the inner-loop iteration. For more general nonlinear \beta_j(x, \nu) and f_j(x, \theta_j), Jordan and Xu (in press) showed that an extended IRLS can be used for this inner loop. It can be shown that IRLS and the extension are equivalent to solving eq. (3) by the so-called Fisher scoring method.

3 A NEW GATING NET AND A SINGLE-LOOP EM

To sidestep the need for a nonlinear optimization routine in the inner loop of the EM algorithm, we propose the following modified gating network:

g_j(x, \nu) = \alpha_j P(x|\nu_j) / \sum_i \alpha_i P(x|\nu_i), \quad \sum_j \alpha_j = 1, \alpha_j \geq 0,
P(x|\nu_j) = a_j(\nu_j)^{-1} b_j(x) \exp\{c_j(\nu_j)^T t_j(x)\},   (5)

where \nu = \{\alpha_j, \nu_j, j = 1, \cdots, K\}, t_j(x) is a vector of sufficient statistics, and the P(x|\nu_j)'s are density functions from the exponential family. The most common example is the Gaussian:

P(x|\nu_j) = (2\pi)^{-n/2} |\Sigma_j|^{-1/2} \exp\{-\tfrac{1}{2} (x - m_j)^T \Sigma_j^{-1} (x - m_j)\}.   (6)

In eq. (5), g_j(x, \nu) is actually the posterior probability P(j|x) that x is assigned to the partition corresponding to the j-th expert net, obtained from Bayes' rule:

g_j(x, \nu) = P(j|x) = \alpha_j P(x|\nu_j) / P(x, \nu), \quad P(x, \nu) = \sum_i \alpha_i P(x|\nu_i).   (7)

Inserting this g_j(x, \nu) into the model eq. (1), we get

P(y|x, \Theta) = \sum_j \frac{\alpha_j P(x|\nu_j)}{P(x, \nu)} P(y|x, \theta_j).   (8)

If we do ML estimation directly on this P(y|x, \Theta) and derive an EM algorithm, we again find that the maximization \max_\nu Q^g(\nu) cannot be solved analytically. To avoid this difficulty, we rewrite eq. (8) as:

P(y, x) = P(y|x, \Theta) P(x, \nu) = \sum_j \alpha_j P(x|\nu_j) P(y|x, \theta_j).   (9)
\n\n(9) \n\nj \n\nThis suggests an asymmetrical representation for the joint density. We accordingly \nperform ML estimation based on L' = 2:t In P(y(t), x(t\u00bb) to determine the param(cid:173)\neters a j , Vj, OJ of the gating net and the expert nets. This can be done by the \nfollowing EM algorithm: \n(1) E-step. Compute \n\nh(k)(y(t) Ix(t\u00bb) _ \nj \n\na\\k) P( x(t) Iv~k \u00bb)P(y(t) Ix(t) Oi,j(X) are prespecified basis functions, maX8 j Qf!(Oj),j = 1\"\", K in eq. (3) \nis still a weighted least squares problem that can be soived analytically. One useful \nspecial case is when 4>i,j(X) are canonical polynomial terms X~l .. 'X~d, rj ~ 0. In \nthis case, the mixture of experts model implements piecewise polynomial approxi(cid:173)\nmations. Another case is that 4>i,j(X) is TIi sini (jll'xt) cosi(jll'xt}, ri ~ 0, in which \ncase the mixture of experts implements piecewise trigonometric approximations. \n\n5 COMBINING MULTIPLE CLASSIFIERS \n\nGiven pattern classes Ci, i = 1, ... , M, we consider classifiers ej that for each input \nx produce an output Pj(ylx): \n\nPj(ylx) = [Pj(ll x ), ... ,pj(Mlx)), pj(ilx) ~ 0, LPj(ilx) = 1. \n\n(16) \n\nThe problem of Combining Multiple Classifiers (CMC) is to combine these Pj(ylx)'s \nto give a combined estimate of P(ylx) . Xu and Jordan (1993) proposed to solve \nCMC problems by regarding the problem as a special example of the mixture den(cid:173)\nsity problem eq. (1) with the Pj(ylx)'s known and only the gating net 9j(X, v) to \nbe learned. In Xu and Jordan (1993), one problem encountered was also the non(cid:173)\nlinearity of softmax gating networks, and an algorithm was proposed to avoid the \ndifficulty. \nActually, the single-loop EM given by eq. (10) and eq. (13) can be directly used \nto solve the CMC problem. In particular, when P(xlvj) is Gaussian, eq. \n(13) \nbecomes eq. (14). Assuming that al = . . . = aK in eq. (7), eq. (10) becomes \n\n\f638 \n\nLei Xu, Michaell. Jordan, Geoffrey E. 
h_j^{(k)}(y^{(t)}|x^{(t)}) = P(x^{(t)}|\nu_j^{(k)}) P_j(y^{(t)}|x^{(t)}) / \sum_i P(x^{(t)}|\nu_i^{(k)}) P_i(y^{(t)}|x^{(t)}). If we divide both the numerator and denominator by \sum_i P(x^{(t)}|\nu_i^{(k)}), we get h_j^{(k)}(y^{(t)}|x^{(t)}) = g_j(x, \nu) P_j(y^{(t)}|x^{(t)}) / \sum_i g_i(x, \nu) P_i(y^{(t)}|x^{(t)}). Comparing this equation with eq. (7a) in Xu and Jordan (1993), we can see that the two equations are actually the same. Despite the different notation, the \alpha_j(x) and P_j(y^{(t)}|x^{(t)}) in Xu and Jordan (1993) are the same as the g_j(x, \nu) and P_j(y^{(t)}|x^{(t)}) in Section 3. So the algorithm of Xu and Jordan (1993) is a special case of the single-loop EM given in Section 3.

6 SIMULATION RESULTS

We compare the performance of the EM algorithm presented earlier with the model of mixtures of experts presented by Jordan and Jacobs (1994). As shown in Fig. 1(a), we consider a mixture of experts model with K = 2. For the expert nets, each P(y|x, \theta_j) is Gaussian given by eq. (1) with linear f_j(x, w_j) = w_j^T [x, 1]. For the new gating net, each P(x|\nu_j) in eq. (5) is Gaussian given by eq. (6). For the old gating net eq. (2), \beta_1(x, \nu) = 0 and \beta_2(x, \nu) = \nu^T [x, 1]. The learning speeds of the two are significantly different. The new algorithm takes k = 15 iterations for the log-likelihood to converge to the value of -1271.8. These iterations require about 1,351,383 MATLAB flops. For the old algorithm, we use the IRLS algorithm given in Jordan and Jacobs (1994) for the inner-loop iteration. In experiments, we found that it usually took a large number of iterations for the inner loop to converge. To save computation, we limit the maximum number of iterations to T_max = 10. We found that this saved computation without obviously influencing the overall performance. From Fig. 1(b), we see that the outer loop converges in about 16 iterations. Each inner loop takes 290,498 flops, and the entire process requires 5,312,695 flops. So, we see that the new algorithm yields a speedup of about 5,312,695/1,351,383 ≈ 3.9.
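To make the single-loop character concrete, the full algorithm of Section 3 can be sketched on synthetic data of the kind shown in Fig. 1(a). This is a minimal sketch under our own assumptions: one-dimensional input, Gaussian gating densities, linear experts, and standard analytic mixture-style M-step updates (the paper's own M-step equations are not reproduced in this text). The data constants mirror the figure caption, while sample size, initialization, and stopping tolerance are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data in the spirit of Fig. 1(a): two noisy line segments
# (slope 0.8, intercepts 0.4 and 2.4, priors 0.25 and 0.75).
n = 500
z = rng.random(n) < 0.25
x = np.where(z, rng.uniform(-1.0, 1.5, n), rng.uniform(1.0, 4.0, n))
y = np.where(z, 0.8 * x + 0.4, 0.8 * x + 2.4) + rng.normal(0.0, 0.3, n)

K = 2
alpha = np.full(K, 1.0 / K)          # gating priors alpha_j
m = np.array([0.0, 2.5])             # gate means (initial guess)
s2 = np.ones(K)                      # gate variances
W = np.zeros((2, K))                 # expert [slope, intercept] columns
v2 = np.ones(K)                      # expert output-noise variances

X1 = np.stack([x, np.ones(n)], axis=1)   # inputs with appended bias

def norm_pdf(u, mean, var):
    return np.exp(-0.5 * (u - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

lls = []
for _ in range(200):
    # E-step, eq. (10): h_j proportional to alpha_j P(x|nu_j) P(y|x, theta_j).
    joint = alpha * norm_pdf(x[:, None], m, s2) * norm_pdf(y[:, None], X1 @ W, v2)
    lls.append(np.log(joint.sum(axis=1)).sum())          # joint log-likelihood L'
    h = joint / joint.sum(axis=1, keepdims=True)

    # M-step: every update is analytic, so no inner loop or stepsize is needed.
    Nk = h.sum(axis=0)
    alpha = Nk / n                                       # gate priors
    m = (h * x[:, None]).sum(axis=0) / Nk                # gate means
    s2 = (h * (x[:, None] - m) ** 2).sum(axis=0) / Nk    # gate variances
    for j in range(K):
        A = X1 * h[:, j:j + 1]                           # responsibility-weighted rows
        W[:, j] = np.linalg.solve(X1.T @ A, A.T @ y)     # weighted least squares
        v2[j] = (h[:, j] * (y - X1 @ W[:, j]) ** 2).sum() / Nk[j]
    if len(lls) > 1 and lls[-1] - lls[-2] < 1e-8:
        break
```

Every M-step quantity here is a closed-form weighted mean, variance, or least-squares solve, which is exactly what distinguishes this single-loop EM from the double-loop IRLS variant.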
Moreover, no external adjustment is needed to ensure the convergence of the new algorithm. But for the old one, the direct use of IRLS can make the inner loop diverge, and we need to appropriately rescale the updating stepsize of IRLS.

Figs. 2(a) and (b) show the results of a simulation of a piecewise polynomial approximation problem utilizing the approach described in Section 4. We consider a mixture of experts model with K = 2. For the expert nets, each P(y|x, \theta_j) is Gaussian given by eq. (1) with f_j(x, w_j) = w_{3,j} x^3 + w_{2,j} x^2 + w_{1,j} x + w_{0,j}. In the new gating net eq. (5), each P(x|\nu_j) is again Gaussian given by eq. (6). We see that the higher-order nonlinear regression has been fit quite well.

For multiple classifier combination, the problem and data are the same as in Xu and Jordan (1993). Table 1 shows the classification results. Com-old and Com-new denote the method given in Xu and Jordan (1993) and in Section 5, respectively. We see that both improve the classification rate of each individual network considerably and that Com-new improves on Com-old.

                Classifier e1   Classifier e2   Com-old   Com-new
Training set    89.9%           93.3%           98.6%     99.4%
Testing set     89.2%           92.7%           98.0%     99.0%

Table 1: A comparison of the correct classification rates.

Figure 1: (a) 1000 samples from y = a_1 x + a_2 + \epsilon, a_1 = 0.8, a_2 = 0.4, x \in [-1, 1.5] with prior \alpha_1 = 0.25, and y = a_1' x + a_2' + \epsilon, a_1' = 0.8, a_2' = 2.4, x \in [1, 4] with prior \alpha_2 = 0.75, where x is a uniform random variable and \epsilon is from the Gaussian N(0, 0.3). The two lines through the clouds are the estimated models of the two expert nets. The fits obtained by the two learning algorithms are almost the same. (b) The evolution of the log-likelihood. The solid line is for the modified learning algorithm. The
The \ndotted line is for the original learning algorithm (the outer loop iteration). \n\n7 REMARKS \n\nRecently, Ghahramani and Jordan (1994) proposed solving function approximation \nproblems by using a mixture of Gaussians to estimate the joint density of the input \nand output (see also Specht, 1991; Tresp, et al., 1994). In the special case of linear \nI;(x, Wj) = wnx,I] and Gaussian P(xlvj) with equal priors, the method given \nin Section 3 provides the same result as Ghahramani and Jordan (1994) although \nthe parameterizations of the two methods are different. However, the method of \nthis paper also applies to nonlinear l;(x,Wj) = Wn