{"title": "Improved Gaussian Mixture Density Estimates Using Bayesian Penalty Terms and Network Averaging", "book": "Advances in Neural Information Processing Systems", "page_first": 542, "page_last": 548, "abstract": null, "full_text": "Improved Gaussian Mixture Density Estimates Using Bayesian Penalty Terms and Network Averaging\n\nDirk Ormoneit\nInstitut für Informatik (H2)\nTechnische Universität München\n80290 München, Germany\normoneit@informatik.tu-muenchen.de\n\nVolker Tresp\nSiemens AG\nCentral Research\n81730 München, Germany\nVolker.Tresp@zfe.siemens.de\n\nAbstract\n\nWe compare two regularization methods which can be used to improve the generalization capabilities of Gaussian mixture density estimates. The first method uses a Bayesian prior on the parameter space. We derive EM (Expectation Maximization) update rules which maximize the a posteriori parameter probability. In the second approach we apply ensemble averaging to density estimation. This includes Breiman's \"bagging\", which recently has been found to produce impressive results for classification networks.\n\n1 Introduction\n\nGaussian mixture models have recently attracted wide attention in the neural network community. Important examples of their application include the training of radial basis function classifiers, learning from patterns with missing features, and active learning. The appeal of Gaussian mixtures is based to a high degree on the applicability of the EM (Expectation Maximization) learning algorithm, which may be implemented as a fast neural network learning rule ([Now91], [Orm93]). Severe problems arise, however, due to singularities and local maxima in the log-likelihood function. 
Particularly in high-dimensional spaces these problems frequently cause the computed density estimates to possess only relatively limited generalization capabilities in terms of predicting the densities of new data points. As shown in this paper, considerably better generalization can be achieved using regularization.\n\nWe will compare two regularization methods. The first one uses a Bayesian prior on the parameters. By using conjugate priors we can derive EM learning rules for finding the MAP (maximum a posteriori probability) parameter estimate. The second approach consists of averaging the outputs of ensembles of Gaussian mixture density estimators trained on identical or resampled data sets. The latter is a form of \"bagging\", which was introduced by Breiman ([Bre94]) and which has recently been found to produce impressive results for classification networks. By using the regularized density estimators in a Bayes classifier ([THA93], [HT94], [KL95]), we demonstrate that both methods lead to density estimates which are superior to the unregularized Gaussian mixture estimate.\n\n2 Gaussian Mixtures and the EM Algorithm\n\nConsider the problem of estimating the probability density of a continuous random vector x ∈ R^d based on a set x* = {x^k | 1 ≤ k ≤ m} of i.i.d. realizations of x. As a density model we choose the class of Gaussian mixtures p(x|Θ) = Σ_{i=1}^n κ_i p(x|i, μ_i, Σ_i), where the restrictions κ_i ≥ 0 and Σ_{i=1}^n κ_i = 1 apply. Θ denotes the parameter vector (κ_i, μ_i, Σ_i)_{i=1}^n. The p(x|i, μ_i, Σ_i) are multivariate normal densities:\n\np(x|i, μ_i, Σ_i) = (2π)^{-d/2} |Σ_i|^{-1/2} exp[-1/2 (x - μ_i)^t Σ_i^{-1} (x - μ_i)].\n\nThe Gaussian mixture model is well suited to approximate a wide class of continuous probability densities. 
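The mixture density defined above can be written down directly; the following sketch (helper names are ours, plain NumPy) evaluates p(x|Θ) for given parameters:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Multivariate normal density p(x | i, mu_i, Sigma_i)."""
    d = mu.shape[0]
    diff = x - mu
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(sigma) ** -0.5
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)

def mixture_pdf(x, kappa, mus, sigmas):
    """Gaussian mixture density p(x | Theta) = sum_i kappa_i p(x | i, mu_i, Sigma_i)."""
    return sum(k * gaussian_pdf(x, m, s)
               for k, m, s in zip(kappa, mus, sigmas))
```

For a single standard normal component in one dimension, `mixture_pdf` at the mean returns (2π)^{-1/2} ≈ 0.3989, as expected.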
Based on the model and given the data x*, we may formulate the log-likelihood as\n\nl(Θ) = log [Π_{k=1}^m p(x^k|Θ)] = Σ_{k=1}^m log Σ_{i=1}^n κ_i p(x^k|i, μ_i, Σ_i).\n\nMaximum likelihood parameter estimates may efficiently be computed with the EM (Expectation Maximization) algorithm ([DLR77]). It consists of the iterative application of the following two steps:\n\n1. In the E-step, based on the current parameter estimates, the posterior probability that unit i is responsible for the generation of pattern x^k is estimated as\n\nh_i^k = κ_i p(x^k|i, μ_i, Σ_i) / Σ_{j=1}^n κ_j p(x^k|j, μ_j, Σ_j).   (1)\n\n2. In the M-step, we obtain new parameter estimates (denoted by the prime):\n\nκ_i' = (1/m) Σ_{k=1}^m h_i^k   (2)\n\nμ_i' = Σ_{k=1}^m h_i^k x^k / Σ_{l=1}^m h_i^l   (3)\n\nΣ_i' = Σ_{k=1}^m h_i^k (x^k - μ_i')(x^k - μ_i')^t / Σ_{l=1}^m h_i^l   (4)\n\nNote that κ_i' is a scalar, whereas μ_i' denotes a d-dimensional vector and Σ_i' is a d × d matrix.\n\nIt is well known that training neural networks as predictors using the maximum likelihood parameter estimate leads to overfitting. The problem of overfitting is even more severe in density estimation due to singularities in the log-likelihood function. Obviously, the model likelihood becomes infinite in a trivial way if we concentrate all the probability mass on one or several samples of the training set.\n\nIn a Gaussian mixture this is just the case if the center of a unit coincides with one of the data points and Σ approaches the zero matrix. Figure 1 compares the true and the estimated probability density in a toy problem. As may be seen, the contraction of the Gaussians results in (possibly infinitely) high peaks in the Gaussian mixture density estimate. A simple way to achieve numerical stability is to artificially enforce a lower bound on the diagonal elements of Σ. 
This is a very crude way of regularization, however, and usually results in low generalization capabilities. The problem becomes even more severe in high-dimensional spaces. To yield reasonable approximations, we will apply two methods of regularization, which will be discussed in the following two sections.\n\nFigure 1: True density (left) and unregularized density estimate (right).\n\n3 Bayesian Regularization\n\nIn this section we propose a Bayesian prior distribution on the Gaussian mixture parameters, which leads to a numerically stable version of the EM algorithm. We first select a family of prior distributions on the parameters which is conjugate*. Selecting a conjugate prior has a number of advantages. In particular, we obtain analytic solutions for the posterior density and the predictive density. In our case, the posterior density is a complex mixture of densities†. It is possible, however, to derive EM update rules to obtain the MAP parameter estimates.\n\nA conjugate prior of a single multivariate normal density is a product of a normal density N(μ_i|μ̄, η^{-1}Σ_i) and a Wishart density Wi(Σ_i^{-1}|α, β) ([Bun94]). A proper conjugate prior for the mixture weightings κ = (κ_1, ..., κ_n) is a Dirichlet density D(κ|γ). Consequently, the prior of the overall Gaussian mixture is the product D(κ|γ) Π_{i=1}^n N(μ_i|μ̄, η^{-1}Σ_i) Wi(Σ_i^{-1}|α, β). Our goal is to find the MAP parameter estimate, that is, parameters which assume the maximum of the log-posterior\n\nl_p(Θ) = Σ_{k=1}^m log Σ_{i=1}^n κ_i p(x^k|i, μ_i, Σ_i) + log D(κ|γ) + Σ_{i=1}^n [log N(μ_i|μ̄, η^{-1}Σ_i) + log Wi(Σ_i^{-1}|α, β)].\n\nAs in the unregularized case, we may use the EM algorithm to find a local maximum\n\n* A family F of probability distributions on Θ is said to be conjugate if, for every π ∈ F, the posterior π(Θ|x) also belongs to F ([Rob94]).\n† The posterior distribution can be written as a sum of n^m simple terms. 
\n‡ Those densities are defined as follows (b and c are normalizing constants):\n\nD(κ|γ) = b Π_{i=1}^n κ_i^{γ_i - 1}, with κ_i ≥ 0 and Σ_{i=1}^n κ_i = 1\n\nN(μ_i|μ̄, η^{-1}Σ_i) = (2π)^{-d/2} |η^{-1}Σ_i|^{-1/2} exp[-η/2 (μ_i - μ̄)^t Σ_i^{-1} (μ_i - μ̄)]\n\nWi(Σ_i^{-1}|α, β) = c |Σ_i^{-1}|^{α - (d+1)/2} exp[-tr(β Σ_i^{-1})].\n\nof l_p(Θ). The E-step is identical to (1). The M-step becomes\n\nκ_i' = (Σ_{k=1}^m h_i^k + γ_i - 1) / (m + Σ_{j=1}^n γ_j - n)   (5)\n\nμ_i' = (Σ_{k=1}^m h_i^k x^k + η μ̄) / (Σ_{l=1}^m h_i^l + η)   (6)\n\nΣ_i' = (Σ_{k=1}^m h_i^k (x^k - μ_i')(x^k - μ_i')^t + η(μ_i' - μ̄)(μ_i' - μ̄)^t + 2β) / (Σ_{l=1}^m h_i^l + 2α - d)   (7)\n\nAs is typical for conjugate priors, prior knowledge corresponds to a set of artificial training data, which is also reflected in the EM update equations. In our experiments, we focus on a prior on the variances which is implemented by β ≠ 0, where 0 denotes the d × d zero matrix. All other parameters we set to \"neutral\" values:\n\nγ_i = 1 ∀i: 1 ≤ i ≤ n,  α = (d+1)/2,  η = 0,  β = β̄ I_d.\n\nI_d is the d × d unity matrix. The choice of α introduces a bias which favors large variances§. The effect of various values of the scalar β̄ on the density estimate is illustrated in figure 2. Note that if β̄ is chosen too small, overfitting still occurs. If it is chosen too large, on the other hand, the model is too constrained to recognize the underlying structure.\n\nFigure 2: Regularized density estimates (left: β̄ = 0.05, right: β̄ = 0.1).\n\nTypically, the optimal value for β̄ is not known a priori. The simplest procedure consists of using the value of β̄ which leads to the best performance on a validation set, analogous to the determination of the optimal weight decay parameter in neural network training. Alternatively, β̄ might be determined according to appropriate Bayesian methods ([Mac91]). 
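One full penalized EM iteration, equations (1) and (5)–(7), can be sketched as follows under the \"neutral\" settings γ_i = 1, η = 0, α = (d+1)/2 used in our experiments; function and variable names are illustrative, not from the paper:

```python
import numpy as np

def map_em_step(X, kappa, mus, sigmas, beta_scalar=0.1):
    """One MAP-EM iteration with gamma_i = 1, eta = 0, alpha = (d+1)/2.

    Under these settings eqs. (5) and (6) reduce to the unregularized
    updates (2) and (3); only the covariance update (7) keeps the
    penalty term 2*beta = 2*beta_scalar*I_d in the numerator.
    """
    m, d = X.shape
    n = len(kappa)
    alpha = (d + 1) / 2.0

    # E-step (eq. 1): responsibilities h[k, i].
    h = np.empty((m, n))
    for i in range(n):
        inv = np.linalg.inv(sigmas[i])
        diff = X - mus[i]
        norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(sigmas[i]) ** -0.5
        h[:, i] = kappa[i] * norm * np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1))
    h /= h.sum(axis=1, keepdims=True)

    # M-step (eqs. 5-7).
    Nk = h.sum(axis=0)                      # sum_k h_i^k
    kappa_new = Nk / m                      # eq. 5 with gamma_i = 1
    mus_new, sigmas_new = [], []
    for i in range(n):
        mu = (h[:, i] @ X) / Nk[i]          # eq. 6 with eta = 0
        diff = X - mu
        S = (diff * h[:, i, None]).T @ diff + 2 * beta_scalar * np.eye(d)
        sigmas_new.append(S / (Nk[i] + 2 * alpha - d))   # eq. 7
        mus_new.append(mu)
    return kappa_new, mus_new, sigmas_new
```

Because of the 2β term in the numerator of (7), each updated covariance stays bounded away from the zero matrix even if a center collapses onto a single data point.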
Either way, only few additional computations are required for this method compared with standard EM.\n\n4 Averaging Gaussian Mixtures\n\nIn this section we discuss the averaging of several Gaussian mixtures to yield improved probability density estimation. The averaging over neural network ensembles has been applied previously to regression and classification tasks ([PC93]).\n\nThere are several different variants on the simple averaging idea. First, one may train all networks on the complete set of training data. The only source of disagreement between the individual predictions consists in different local solutions found by the likelihood maximization procedure due to different starting points. Disagreement is essential to yield an improvement by averaging, however, so that this procedure only seems advantageous in cases where the relation between training data and weights is extremely non-deterministic in the sense that in training, different solutions are found from different random starting points. A straightforward way to increase the disagreement is to train each network on a resampled version of the original data set. If we resample the data without replacement, the size of each training set is reduced, in our experiments to 70% of the original. The averaging of neural network predictions based on resampling with replacement has recently been proposed under the name \"bagging\" by Breiman ([Bre94]), who has achieved dramatically improved results in several classification tasks. He also notes, however, that an actual improvement of the prediction can only result if the estimation procedure is relatively unstable.\n\n§ If A is distributed according to Wi(A|α, β), then E[A^{-1}] = (α - (d+1)/2)^{-1} β. In our case A is Σ_i^{-1}, so that E[Σ_i] → ∞ · β for α → (d+1)/2. 
As discussed, this is particularly the case for Gaussian mixture training. We therefore expect bagging to be well suited for our task.\n\n5 Experiments and Results\n\nTo assess the practical advantage resulting from regularization, we used the density estimates to construct classifiers and compared the resulting prediction accuracies on a toy problem and a real-world problem. The reason is that the generalization error of density estimates in terms of the likelihood based on the test data is rather unintuitive, whereas performance on a classification problem provides a good impression of the degree of improvement. Assume we have a set of N labeled data z* = {(x^k, l^k) | k = 1, ..., N}, where l^k ∈ Y = {1, ..., C} denotes the class label of each input x^k. New inputs x are classified by choosing the class l with the maximum posterior class probability p(l|x). The posterior probabilities may be derived from the class-conditional data likelihood p(x|l) via Bayes' theorem: p(l|x) = p(x|l)p(l)/p(x) ∝ p(x|l)p(l). The resulting partitions of the input space are optimal for the true p(l|x). A viable way to approximate the posterior p(l|x) is to estimate p(x|l) and p(l) from the sample data.\n\n5.1 Toy Problem\n\nIn the toy classification problem the task is to discriminate the two classes of circularly arranged data shown in figure 3. We generated 200 data points for each class and subdivided them into two sets of 100 data points. The first was used for training, the second to test the generalization performance. As a network architecture we chose a Gaussian mixture with 20 units. Table 1 summarizes the results, beginning with the unregularized Gaussian mixture, which is followed by the averaging and the Bayesian penalty approaches. 
The three rows for averaging correspond to the results yielded without applying resampling (local max.), with resampling without replacement (70% subsets), and with resampling with replacement (bagging).\n\nFigure 3: Toy Classification Task.\n\nThe performances on training and test set are measured in terms of the model log-likelihood. Larger values indicate a better performance. We report separate results for class A and class B, since the densities of both were estimated separately. The final column shows the prediction accuracy in terms of the percentage of correctly classified data in the test set. We report the average results from 20 experiments. The numbers in brackets denote the standard deviations σ of the results. Multiplying σ with t_{19;95%}/√20 = 0.4680 yields 95% confidence intervals. The best result in each category is underlined.\n\nAlgorithm | Training A | Training B | Test A | Test B | Accuracy\nunregularized | -120.8 (13.3) | -120.4 (10.8) | -224.9 (32.6) | -241.9 (34.1) | 80.6% (2.8)\nAveraging: local max. | -115.6 (6.0) | -112.6 (6.6) | -200.9 (13.9) | -209.1 (16.3) | 81.8% (3.1)\nAveraging: 70% subset | -106.8 (5.8) | -105.1 (6.7) | -188.8 (9.5) | -196.4 (11.3) | 83.2% (2.9)\nAveraging: bagging | -83.8 (4.9) | -83.1 (7.1) | -194.2 (7.3) | -200.1 (11.3) | 82.6% (3.4)\nPenalty: β̄ = 0.01 | -149.3 (18.5) | -146.5 (5.9) | -186.2 (13.9) | -182.9 (11.6) | 83.1% (2.9)\nPenalty: β̄ = 0.02 | -156.0 (16.5) | -153.0 (4.8) | -177.1 (11.8) | -174.9 (7.0) | 84.4% (6.3)\nPenalty: β̄ = 0.05 | -173.9 (24.3) | -167.0 (15.8) | -182.0 (20.1) | -173.9 (14.3) | 81.5% (5.9)\nPenalty: β̄ = 0.1 | -183.0 (21.9) | -181.9 (21.1) | -184.6 (21.0) | -182.5 (21.1) | 78.5% (5.1)\n\nTable 1: Performances in the toy classification problem. 
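The confidence-interval factor quoted above can be reproduced from the tabulated Student-t quantile; a small sketch (the quantile value 2.093 is taken from standard t-tables, not computed here):

```python
from math import sqrt

# Two-sided 95% Student-t quantile with 20 - 1 = 19 degrees of freedom.
T_19_975 = 2.093

def ci_halfwidth(sigma, n_runs=20):
    """Half-width of the 95% confidence interval for a mean over n_runs runs."""
    return sigma * T_19_975 / sqrt(n_runs)
```

For example, `ci_halfwidth(1.0)` gives the paper's factor of roughly 0.468, so the unregularized accuracy 80.6% (2.8) carries a confidence interval of about ±1.3 percentage points.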
\n\nAs expected, all regularization methods outperform the maximum likelihood ap(cid:173)\nproach in terms of correct classification. The performance of the Bayesian regu(cid:173)\nlarization is hereby very sensitive to the appropriate choice of the regularization \nparameter (3. Optimality of (3 with respect to the density prediction and oytimality \nwith respect to prediction accuracy on the test set roughly coincide (for (3 = 0.02). \nA veraging is inferior to the Bayesian approach if an optimal {3 is chosen. \n\n5.2 BUPA Liver Disorder Classification \n\nAs a second task we applied our methods to a real-world decision problem from \nthe medical environment. The problem is to detect liver disorders which might \narise from excessive alcohol consumption. Available information consists of five \nblood tests as well as a measure of the patients' daily alcohol consumption. We \nsubdivided the 345 available samples into a training set of 200 and a test set of 145 \nsamples. Due to the relatively few data we did not try to determine the optimal \nregularization parameter using a validation process and will report results on the \ntest set for different parameter values. \n\nAlgorithm \nunregularized \nBayesian penalty ({3 = 0.05) \nBayesian penalty \u00ab(3 = 0.10) \n(3 = 0.20 \nBayesian penal ty \naveraging local maxima \naveraging (70 % subset) \naveraging (bagging) \n\nAccuracy \n\n64.8 % \n65.5 % \n66.9 % \n61.4 % \n65 .5 0 \n72.4 % \n71.0 % \n\nTable 2: Performances in the liver disorder classification problem. \n\n\f548 \n\nD. ORMONEIT. V. TRESP \n\nThe results of our experiments are shown in table 2. Again, both regularization \nmethods led to an improvement in prediction accuracy. In contrast to the toy prob(cid:173)\nlem, the averaged predictor was superior to the Bayesian approach here. Note that \nthe resampling led to an improvement of more than five percent points compared \nto unresampled averaging. 
\n\n6 Conclusion \n\nWe proposed a Bayesian and an averaging approach to regularize Gaussian mixture \ndensity estimates. In comparison with the maximum likelihood solution both ap(cid:173)\nproaches led to considerably improved results as demonstrated using a toy problem \nand a real-world classification task. Interestingly, none of the methods outperformed \nthe other in both tasks. This might be explained with the fact that Gaussian mix(cid:173)\nture density estimates are particularly unstable in high-dimensional spaces with \nrelatively few data. The benefit of averaging might thus be greater in this case. \nA veraging proved to be particularly effective if applied in connection with resam(cid:173)\npIing of the training data, which agrees with results in regression and classification \ntasks. If compared to Bayesian regularization, averaging is computationally expen(cid:173)\nsive. On the other hand, Baysian approaches typically require the determination of \nhyper parameters (in our case 13), which is not the case for averaging approaches. \n\nReferences \n\n[Bre94] L. Breiman. Bagging predictors. Technical report , UC Berkeley, 1994. \n[Bun94] W . Buntine. Operations for learning with graphical models. Journal of Artificial \n\nIntelligence Research, 2:159-225, 1994. \n\n[DLR77] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from \n\nincomplete data via the EM algorithm. J. Royal Statistical Society B, 1977. \n\n[HT94] T. Hastie and R. Tibshirani. Discriminant analysis by gaussian mixtures. Tech(cid:173)\n\nnical report, AT&T Bell Labs and University of Toronto, 1994. \n\n[KL95] N. Kambhatla and T. K. Leen. Classifying with gaussian mixtures and clusters. \nIn Advances in Neural Information Processing Systems 7. Morgan Kaufman, \n1995. \n\n[Mac91] D. MacKay. Bayesian Modelling and Neural Networks. PhD thesis, California \n\nInstitute of Technology, Pasadena, 1991. \n\n[Now91] S. J. Nowlan. 
Soft Competitive Adaptation: Neural Network Learning Algorithms based on Fitting Statistical Mixtures. PhD thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, 1991.\n[Orm93] D. Ormoneit. Estimation of probability densities using neural networks. Master's thesis, Technische Universität München, 1993.\n[PC93] M. P. Perrone and L. N. Cooper. When networks disagree: Ensemble methods for hybrid neural networks. In Neural Networks for Speech and Image Processing. Chapman and Hall, 1993.\n[Rob94] C. P. Robert. The Bayesian Choice. Springer-Verlag, 1994.\n[THA93] V. Tresp, J. Hollatz, and S. Ahmad. Network structuring and training using rule-based knowledge. In Advances in Neural Information Processing Systems 5. Morgan Kaufmann, 1993.\n", "award": [], "sourceid": 1036, "authors": [{"given_name": "Dirk", "family_name": "Ormoneit", "institution": null}, {"given_name": "Volker", "family_name": "Tresp", "institution": null}]}