{"title": "Bayesian Methods for Mixtures of Experts", "book": "Advances in Neural Information Processing Systems", "page_first": 351, "page_last": 357, "abstract": null, "full_text": "Bayesian Methods for Mixtures of Experts \n\nSteve Waterhouse \n\nCambridge University \n\nEngineering Department \n\nCambridge CB2 1PZ \n\nDavid MacKay \n\nCavendish Laboratory \n\nMadingley Rd. \n\nCambridge CB3 OHE \n\nEngland \n\nEngland \n\nTel: [+44] 1223 332754 \nsrw1001@eng.cam.ac.uk \n\nTel: [+44] 1223 337238 \nmackay@mrao.cam.ac.uk \n\nTony Robinson \n\nCambridge University \n\nEngineering Department \n\nCambridge CB2 1PZ \n\nEngland. \n\nTel: [+44] 1223 332815 \n\najr@eng.cam.ac.uk \n\nABSTRACT \n\nWe present a Bayesian framework for inferring the parameters of \na mixture of experts model based on ensemble learning by varia(cid:173)\ntional free energy minimisation. The Bayesian approach avoids the \nover-fitting and noise level under-estimation problems of traditional \nmaximum likelihood inference. We demonstrate these methods on \nartificial problems and sunspot time series prediction. \n\nINTRODUCTION \n\nThe task of estimating the parameters of adaptive models such as artificial neural \nnetworks using Maximum Likelihood (ML) is well documented ego Geman, Bienen(cid:173)\nstock & Doursat (1992). ML estimates typically lead to models with high variance, \na process known as \"over-fitting\". ML also yields over-confident predictions; in \nregression problems for example, ML underestimates the noise level. This problem \nis particularly dominant in models where the ratio of the number of data points in \nthe training set to the number of parameters in the model is low. In this paper we \nconsider inference of the parameters of the hierarchical mixture of experts (HME) \narchitecture (Jordan & Jacobs 1994). This model consists of a series of \"experts,\" \neach modelling different processes assumed to be underlying causes of the data. 
\nSince each expert may focus on a different subset of the data, which may be arbitrarily small, the possibility of over-fitting of each process is increased. We use Bayesian methods (MacKay 1992a) to avoid over-fitting by specifying prior belief in various aspects of the model and marginalising over parameter uncertainty. \n\nThe use of regularisation or \"weight decay\" corresponds to the prior assumption that the model should have smooth outputs. This is equivalent to a prior P(Θ|α) on the parameters Θ of the model, where α are the hyperparameters of the prior. Given a set of priors we may specify a posterior distribution of the parameters given data D, \n\nP(Θ|D, α, H) ∝ P(D|Θ, H) P(Θ|α, H),    (1) \n\nwhere the variable H encompasses the assumptions of model architecture, type of regularisation used and assumed noise model. Maximising the posterior gives us the most probable parameters Θ_MP. We may then set the hyperparameters either by cross-validation, or by finding the maximum of the posterior distribution of the hyperparameters P(α|D), also known as the \"evidence\" (Gull 1989). In this paper we describe a method, motivated by the Expectation Maximisation (EM) algorithm of Dempster, Laird & Rubin (1977) and the principle of ensemble learning by variational free energy minimisation (Hinton & van Camp 1993, Neal & Hinton 1993), which achieves simultaneous optimisation of the parameters and hyperparameters of the HME. We then demonstrate this algorithm on two simulated examples and a time series prediction task. In each task the use of the Bayesian methods prevents over-fitting of the data and gives better prediction performance. Before we describe this algorithm, we will specify the model and its associated priors. 
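For a linear-Gaussian model, maximising the posterior in equation (1) reduces to penalised least squares. The following sketch is not from the paper; the function name and toy data are illustrative. It computes the most probable weights for a single linear model under a zero-mean Gaussian prior with precision alpha and Gaussian noise with precision beta:

```python
import numpy as np

def map_weights(X, y, alpha, beta):
    """Most probable weights under a Gaussian likelihood (precision beta)
    and a zero-mean Gaussian prior (precision alpha): ridge regression."""
    k = X.shape[1]
    A = beta * X.T @ X + alpha * np.eye(k)   # posterior precision matrix
    return np.linalg.solve(A, beta * X.T @ y)

# Toy noise-free data y = 2x, with a constant (bias) column appended.
X = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])
y = np.array([2.0, 4.0, 6.0])
w_ml = map_weights(X, y, alpha=0.0, beta=1.0)    # alpha = 0 recovers ML
w_map = map_weights(X, y, alpha=10.0, beta=1.0)  # prior shrinks w toward 0
```

The same quadratic structure reappears later in the optimisation of Q_w(w) for each expert.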
\n\nMIXTURES OF EXPERTS \n\nThe mixture of experts architecture (Jordan & Jacobs 1994) consists of a set of \"experts\" which perform local function approximation. The expert outputs are combined by a \"gate\" to form the overall output. In the hierarchical case, the experts are themselves mixtures of further experts, thus extending the network in a tree structured fashion. The model is a generative one in which we assume that data are generated in the domain by a series of J independent processes which are selected in a stochastic manner. We specify a set of indicator variables Z = {z_j^(n) : j = 1...J, n = 1...N}, where z_j^(n) is 1 if the output y^(n) was generated by expert j and zero otherwise. Consider the case of regression over a data set D = {x^(n) ∈ R^k, y^(n) ∈ R^p, n = 1...N} with p = 1. We specify that the conditional probability of the scalar output y^(n) given the input vector x^(n) at exemplar (n) is \n\np(y^(n)|x^(n), Θ) = Σ_{j=1}^J P(z_j^(n)=1|x^(n), ξ_j) p(y^(n)|x^(n), w_j, β_j),    (2) \n\nwhere {ξ_j ∈ R^k} is the set of gate parameters, and {w_j ∈ R^k, β_j} the set of expert parameters. In this case, p(y^(n)|x^(n), w_j, β_j) is a Gaussian: \n\np(y^(n)|x^(n), w_j, β_j) = sqrt(β_j/2π) exp( -(β_j/2)(y^(n) - ŷ_j^(n))^2 ),    (3) \n\nwhere 1/β_j is the variance of expert j,(1) and ŷ_j^(n) = f(x^(n), w_j) is the output of expert j, giving a probabilistic mixture model. In this paper we restrict the expert output to be a linear function of the input, f(x^(n), w_j) = w_j^T x^(n). We model the action of selecting process j with the gate, the outputs of which are given by the softmax function of the inner products of the input vector(2) and the gate parameter vectors. The conditional probability of selecting expert j given input x^(n) is thus: \n\ng_j^(n) = P(z_j^(n)=1|x^(n), ξ_j) = exp(ξ_j^T x^(n)) / Σ_{i=1}^J exp(ξ_i^T x^(n)).    (4) \n\nA straightforward extension of this model also gives us the conditional probability h_j^(n) of expert j having been selected given input x^(n) and output y^(n), \n\nh_j^(n) = P(z_j^(n)=1|x^(n), y^(n), Θ) = g_j^(n) P_j^(n) / Σ_{i=1}^J g_i^(n) P_i^(n),    (5) \n\nwhere P_j^(n) denotes p(y^(n)|x^(n), w_j, β_j). \n\n(1) Although β_j is a parameter of expert j, in common with MacKay (1992a) we consider it as a hyperparameter on the Gaussian noise prior. \n(2) In all notation, we assume that the input vector is augmented by a constant term, which avoids the need to specify a \"bias\" term in the parameter vectors. \n\nPRIORS \n\nWe assume a separable prior on the parameters Θ of the model: \n\np(Θ|α) = Π_{j=1}^J P(ξ_j|μ) P(w_j|α_j),    (6) \n\nwhere {α_j} and {μ} are the hyperparameters for the parameter vectors of the experts and the gate respectively. We assume Gaussian priors on the parameters of the experts {w_j} and the gate {ξ_j}, for example: \n\nP(w_j|α_j) = (α_j/2π)^{k/2} exp( -(α_j/2) w_j^T w_j ).    (7) \n\nFor simplicity of notation, we shall refer to the set of all smoothness hyperparameters as α = {μ, α_j} and the set of all noise level hyperparameters as β = {β_j}. Finally, we assume Gamma priors on the hyperparameters {μ, α_j, β_j} of the priors, for example: \n\nP(log β_j | p_β, u_β) = (1/Γ(p_β)) (β_j/u_β)^{p_β} exp( -β_j/u_β ),    (8) \n\nwhere u_β, p_β are the hyper-hyperparameters which specify the range in which we expect the noise levels β_j to lie. \n\nINFERRING PARAMETERS USING ENSEMBLE LEARNING \n\nThe EM algorithm was used by Jordan & Jacobs (1994) to train the HME in a maximum likelihood framework. In the EM algorithm we specify a complete data set {D, Z} which includes the observed data D and the set of indicator variables Z. Given Θ^(m-1), the E step of the EM algorithm computes a distribution P(Z|D, Θ^(m-1)) over Z. The M step then maximises the expected value of the complete data likelihood P(D, Z|Θ) over this distribution. In the case of the HME, the indicator variables Z = {z_j^(n)} specify which expert was responsible for generating the data at each time. 
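A minimal sketch of equations (2)-(5) together with an EM half-iteration for the experts may help fix the notation. This is not the paper's implementation: it covers a single-level (non-hierarchical) model, takes the gate probabilities as given, and omits the gate's own IRLS update from Jordan & Jacobs (1994); all function names are illustrative.

```python
import numpy as np

def e_step(X, y, G, W, beta):
    """Responsibilities h[n, j] of expert j for point n (eqs 3 and 5).
    X: (N,k) inputs (bias column included); G: (N,J) gate probabilities;
    W: (J,k) linear expert weights; beta: (J,) noise precisions."""
    Y_hat = X @ W.T                                       # expert means
    P = np.sqrt(beta / (2 * np.pi)) * np.exp(-0.5 * beta * (y[:, None] - Y_hat) ** 2)
    H = G * P                                             # numerator of eq 5
    return H / H.sum(axis=1, keepdims=True)

def m_step_experts(X, y, H):
    """Weighted least squares refit of each linear expert, plus the ML
    noise precision implied by the weighted residuals."""
    J, k = H.shape[1], X.shape[1]
    W, beta = np.zeros((J, k)), np.zeros(J)
    for j in range(J):
        h = H[:, j]
        W[j] = np.linalg.solve(X.T @ (h[:, None] * X), X.T @ (h * y))
        beta[j] = h.sum() / (h * (y - X @ W[j]) ** 2).sum()
    return W, beta
```

With two experts initialised near slopes +1 and -1, the responsibilities split the data and each weighted fit sharpens its own expert, which is exactly the per-expert over-fitting risk the Bayesian treatment below addresses.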
\n\nWe now outline an algorithm for the simultaneous optimisation of the parameters Θ and hyperparameters α and β, using the framework of ensemble learning by variational free energy minimisation (Hinton & van Camp 1993). Rather than optimising a point estimate of Θ, α and β, we optimise a distribution over these parameters. This builds on Neal & Hinton's (1993) description of the EM algorithm in terms of variational free energy minimisation. \n\nWe first specify an approximating ensemble Q(w, ξ, α, β, Z) which we optimise so that it approximates the posterior distribution P(w, ξ, α, β, Z|D, H) well. The objective function chosen to measure the quality of the approximation is the variational free energy, \n\nF(Q) = ∫ dw dξ dα dβ dZ Q(w, ξ, α, β, Z) log[ Q(w, ξ, α, β, Z) / P(w, ξ, α, β, Z, D|H) ],    (9) \n\nwhere the joint probability of parameters {w, ξ}, hyperparameters {α, β}, missing data Z and observed data D is given by \n\nP(w, ξ, α, β, Z, D|H) = P(μ) Π_{j=1}^J P(ξ_j|μ) P(α_j) P(w_j|α_j) P(β_j|p_β, u_β) Π_{n=1}^N ( P(z_j^(n)=1|x^(n), ξ_j) p(y^(n)|x^(n), w_j, β_j) )^{z_j^(n)}.    (10) \n\nThe free energy can be viewed as the sum of the negative log evidence -log P(D|H) and the Kullback-Leibler divergence between Q and P(w, ξ, α, β, Z|D, H). F is bounded below by -log P(D|H), with equality when Q = P(w, ξ, α, β, Z|D, H). We constrain the approximating ensemble Q to be separable in the form Q(w, ξ, α, β, Z) = Q(w)Q(ξ)Q(α)Q(β)Q(Z). We find the optimal separable distribution Q by considering separately the optimisation of F over each separate ensemble component Q(·) with all other components fixed. \n\nOptimising Q_w(w) and Q_ξ(ξ). \n\nAs a functional of Q_w(w), F is \n\nF = ∫ dw Q_w(w) [ Σ_j ( (ᾱ_j/2) w_j^T w_j + Σ_{n=1}^N z̄_j^(n) (β̄_j/2) (y^(n) - ŷ_j^(n))^2 ) + log Q_w(w) ] + const,    (11) \n\nwhere for any variable a, ā denotes ∫ da Q(a) a. 
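The bound F ≥ -log P(D|H), with equality at the true posterior, can be checked directly for a toy discrete latent variable. This is a sketch, not the paper's model; the joint probabilities are made up for illustration:

```python
import numpy as np

def free_energy(Q, joint):
    """F(Q) = sum_z Q(z) log(Q(z) / P(z, D)) for a discrete latent z (eq 9)."""
    return float(np.sum(Q * np.log(Q / joint)))

# Toy joint P(z, D) over two latent states for fixed observed data D;
# marginalising z gives the evidence P(D).
joint = np.array([0.3, 0.1])
posterior = joint / joint.sum()        # P(z | D)
F_opt = free_energy(posterior, joint)  # attains -log P(D)
F_other = free_energy(np.array([0.5, 0.5]), joint)
```

Any other Q pays a Kullback-Leibler penalty on top of -log P(D), which is what makes F a sensible objective to minimise over each separable factor in turn.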
Noting that the w-dependent terms are the log of a posterior distribution, and that a divergence ∫ Q log(Q/P) is minimised by setting Q = P, we can write down the distribution Q_w(w) that minimises this expression. For given data and Q_α, Q_β, Q_Z, Q_ξ, the optimising distribution Q_w^opt(w) is \n\nQ_w^opt(w) ∝ exp( -Σ_j [ (ᾱ_j/2) w_j^T w_j + Σ_{n=1}^N z̄_j^(n) (β̄_j/2) (y^(n) - w_j^T x^(n))^2 ] ).    (12) \n\nThis is a set of J Gaussian distributions with means {w̄_j}, which can be found exactly by quadratic optimisation. We denote the variance-covariance matrices of Q_w^opt(w_j) by {Σ_wj}. The analogous expression for the gates, Q_ξ^opt(ξ), is obtained in a similar fashion and is given by \n\nQ_ξ^opt(ξ) ∝ exp( -(μ̄/2) Σ_j ξ_j^T ξ_j + Σ_{n=1}^N Σ_j z̄_j^(n) log g_j^(n) ).    (13) \n\nWe approximate each Q_ξ^opt(ξ_j) by a Gaussian distribution fitted at its maximum ξ̂_j with variance-covariance matrix Σ_ξj. \n\nOptimising Q_Z(Z) \n\nBy a similar procedure, the optimal distribution Q_Z^opt(Z) is given by \n\nQ_Z^opt(Z) = Π_{n=1}^N Π_{j=1}^J ( z̄_j^(n) )^{z_j^(n)},    (14) \n\nwhere, up to normalisation over j, \n\nz̄_j^(n) ∝ ĝ_j^(n) exp( -(β̄_j/2) [ (y^(n) - w̄_j^T x^(n))^2 + x^(n)T Σ_wj x^(n) ] ),    (15) \n\nand ĝ_j^(n) is the value of g_j^(n) computed at the ξ̂_j found above. The standard E-step gives us a distribution of Z given a fixed value of parameters and the data, as shown in equation (5). In this case, by finding the optimal Q_Z(Z) we obtain the alternative expression (15), with dependencies on the uncertainty of the experts' predictions. Ideally (if we did not make the assumption of a separable distribution Q) Q_Z might be expected to contain an additional effect of the uncertainty in the gate parameters. We can introduce this by the method of MacKay (1992b) for marginalising classifiers, in the case of binary gates. \n\nOptimising Q_α(α) and Q_β(β) \n\nFinally, for the hyperparameter distributions, the optimal values of the ensemble functions give values for ᾱ_j and β̄_j as \n\n1/ᾱ_j = ( w̄_j^T w̄_j + 2/u_α + Trace Σ_wj ) / ( k + 2 p_α );  1/β̄_j = ( Σ_n z̄_j^(n) [ (y^(n) - w̄_j^T x^(n))^2 + x^(n)T Σ_wj x^(n) ] + 2/u_β ) / ( Σ_n z̄_j^(n) + 2 p_β ).    (16) \n\nAn analogous procedure is used to set the hyperparameters {μ} of the gate. \n\nMAKING PREDICTIONS \n\nIn order to make predictions using the model, we must marginalise over the parameters and hyperparameters to get the predictive distribution. 
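The ᾱ_j re-estimation in equation (16) is a one-liner in practice. The sketch below is illustrative, not the paper's code; it assumes the update 1/ᾱ_j = (w̄_j^T w̄_j + 2/u_α + Trace Σ_wj)/(k + 2 p_α), with p_a and u_a standing for the Gamma hyper-hyperparameters p_α and u_α:

```python
import numpy as np

def update_alpha(w_bar, Sigma_w, p_a, u_a):
    """Posterior-mean re-estimate of a weight-decay hyperparameter:
    alpha = (k + 2 p_a) / (w_bar.w_bar + 2/u_a + trace(Sigma_w))."""
    k = w_bar.size
    return (k + 2.0 * p_a) / (w_bar @ w_bar + 2.0 / u_a + np.trace(Sigma_w))
```

With a flat hyperprior (p_a near 0, u_a large) this reduces to the familiar evidence-framework form alpha ≈ k / <w^T w>, and extra parameter uncertainty (a larger trace term) pushes alpha down, weakening the decay.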
We use the optimal distributions Q^opt(·) to approximate the posterior distribution. \n\nFor the experts, the marginalised outputs are given by ŷ_j^(N+1) = f(x^(N+1), w̄_j), with variance σ²_{y|j} = x^(N+1)T Σ_wj x^(N+1) + σ_ν², where σ_ν² = 1/β̄_j. We may also marginalise over the gate parameters (MacKay 1992b) to give marginalised outputs for the gates. The predictive distribution is then a mixture of Gaussians, with mean and variance given by its first and second moments, \n\nŷ^(N+1) = Σ_{j=1}^J g_j^(N+1) ŷ_j^(N+1);  σ_y² = Σ_{j=1}^J g_j^(N+1) ( σ²_{y|j} + (ŷ_j^(N+1))² ) - (ŷ^(N+1))².    (17) \n\nSIMULATIONS \n\nArtificial Data \n\nIn order to test the performance of the Bayesian method, we constructed two artificial data sets. Both data sets consist of a known function corrupted by additive zero mean Gaussian noise. The first data set, shown in Figure (1a), consists of 100 points from a piecewise linear function in which the leftmost portion is corrupted with noise of variance 3 times greater than the rightmost portion. The second data set, shown in Figure (1b), consists of 100 points from the function g(t) = 4.26(e^{-t} - 4e^{-2t} + 3e^{-3t}), corrupted by Gaussian noise of constant variance 0.44. We trained a number of models on these data sets, and they provide a typical set of results for the maximum likelihood and Bayesian methods, together with the error bars on the Bayesian solutions. The model architecture used was a 6 deep binary hierarchy of linear experts. In both cases, the ML solutions tend to overfit the noise in the data set. The Bayesian solutions, on the other hand, are both smooth functions which are better approximations to the underlying functions. \n\nTime Series Prediction \n\nThe Bayesian method was also evaluated on a time series prediction problem. This 
This \nconsists of yearly readings of sunspot activity from 1700 to 1979, and was first \n\n\f356 \n\n-- -\n\nS. WATERHOUSE. D. MACKAY. T. ROBINSON \n\nI\n\n\u2022 \n\n.,' ' . \n\n--I \n\n\"' ' . \n-~. - ~ \n- ~ ~i -\\~~ ~ ' \" \n\"1, \n; .. \n..... , , \n! ~ ,l \n\\, ' .. / . \n~,-\n...\u2022 \n\n. \" \n. \n. ,1 ~ \n. \ni \"~ ii \" \n~i. 'I \n\n\"~.. \n\nj~.. \n\nj i \n\n. \u2022 \u2022 \u2022 \n\ni / \u2022\u2022 \n\n..... \n... \n\n\" \n\n: ... .. . .. . ..... . ... \n\n-Original function \n\u2022 Original + Noise \n. -ML solution \n- - Bayesian solution \n... Error bars \n\n(a) \n\n(b) \n\nFigure 1: The effect of regularisation on fitting known functions corrupted with noise. \n\nconsidered in the connectionist community by Weigend, Huberman & Rumelhart \n(1990), who used an MLP with 8 hidden tanh units, to predict the coming year's \nactivity based on the activities of the previous 12 years. This data set was chosen \nsince it consists of a relatively small number of examples and thus the probability \nof over-fitting sizeable models is large. In previous work, we considered the use of a \nmixture of 7 experts on this problem. Due to the problems of over-fitting inherent \nin ML however, we were constrained to using cross validation to stop the training \nearly. This also constrained the selection of the model order, since the branches of \ndeep networks tend to become \"pinched off\" during ML training , resulting in local \nminima during training. The Bayesian method avoids this over-fitting of the gates \nand allows us to use very large models. \n\nTable 1: Single step prediction on the Sunspots data set using a lag vector of 12 years. \nNMSE is the mean squared prediction error normalised by the variance of the entire \nrecord from 1700 to 1979. 
The models used were; WHR: Weigend et al's MLP result; \n1HME_7_CV: mixture of 7 experts trained via maximum likelihood and using a 10 % \ncross validation scheme; 8HME2-ML & 8HME2J3ayes: 8 deep binary HME,trained via \nmaximum likelihood (ML) and Bayesian method (Bayes). \n\nMODEL \n\nTrain NMSE \n\n1700-1920 \n\nTest NMSE \n\n1921-1955 1956-1979 \n\nWHR \n\n1HME7_CV \n8HME2_ML \n8HME2..Bayes \n\n0.082 \n0.061 \n0.052 \n0.079 \n\n0.086 \n0.089 \n0.162 \n0.089 \n\n0.35 \n0.27 \n0.41 \n0.26 \n\nTable 1 shows the results obtained using a variety of methods on the sunspots \ntask. The Bayesian method performs significantly better on the test sets than the \nmaximum likelihood method (8HME2_ML), and is competitive with the MLP of \nWeigend et al (WHR). It should be noted that even though the number of param(cid:173)\neters in the 8 deep binary HME (4992) used is much larger than the number of \ntraining examples (209), the Bayesian method still avoids over-fitting of the data. \nThis allows us to specify large models and avoids the need for prior architecture \nselection, although in some cases such selection may be advantageous, for example \nif the number of processes inherent in the data is known a-priori. \n\n\fBayesian Methods for Mixtures of Experts \n\n357 \n\nIn our experience with linear experts, the smoothness prior on the output function \nof the expert does not have an important effect; the prior on the gates and the \nBayesian inference of the noise level are the important factors. We expect that the \nsmoothness prior would become more important if the experts used more complex \nbasis functions . \n\nDISCUSSION \n\nThe EM algorithm is a special case of the ensemble learning algorithm presented \nhere: the EM algorithm is obtained if we constrain Qe(8) and Qf3(f3) to be delta \nfunctions and fix a = O. 
The Bayesian ensemble works better because it includes regularisation and because the uncertainty of the parameters is taken into account when predictions are made. It could be of interest in future work to investigate how other models trained by EM, such as hidden Markov models, could benefit from the ensemble learning approach. \n\nThe Bayesian method of avoiding over-fitting has been shown to lend itself naturally to the mixture of experts architecture. The Bayesian approach can be implemented practically with only a small computational overhead and gives significantly better performance than the ML model. \n\nReferences \n\nDempster, A. P., Laird, N. M. & Rubin, D. B. (1977), 'Maximum likelihood from incomplete data via the EM algorithm', Journal of the Royal Statistical Society, Series B 39, 1-38. \n\nGeman, S., Bienenstock, E. & Doursat, R. (1992), 'Neural networks and the bias/variance dilemma', Neural Computation 4, 1-58. \n\nGull, S. F. (1989), Developments in maximum entropy data analysis, in J. Skilling, ed., 'Maximum Entropy and Bayesian Methods, Cambridge 1988', Kluwer, Dordrecht, pp. 53-71. \n\nHinton, G. E. & van Camp, D. (1993), Keeping neural networks simple by minimizing the description length of the weights, to appear in: Proceedings of COLT-93. \n\nJordan, M. I. & Jacobs, R. A. (1994), 'Hierarchical Mixtures of Experts and the EM algorithm', Neural Computation 6, 181-214. \n\nMacKay, D. J. C. (1992a), 'Bayesian interpolation', Neural Computation 4(3), 415-447. \n\nMacKay, D. J. C. (1992b), 'The evidence framework applied to classification networks', Neural Computation 4(5), 698-714. \n\nNeal, R. M. & Hinton, G. E. (1993), 'A new view of the EM algorithm that justifies incremental and other variants'. Submitted to Biometrika. Available at ftp://ftp.cs.toronto.edu/pub/radford/www. \n\nWeigend, A. S., Huberman, B. A. & Rumelhart, D. E. 
(1990), 'Predicting the future: a connectionist approach', International Journal of Neural Systems 1, 193-209. \n", "award": [], "sourceid": 1167, "authors": [{"given_name": "Steve", "family_name": "Waterhouse", "institution": null}, {"given_name": "David", "family_name": "MacKay", "institution": null}, {"given_name": "Anthony", "family_name": "Robinson", "institution": null}]}