{"title": "Improved Gaussian Mixture Density Estimates Using Bayesian Penalty Terms and Network Averaging", "book": "Advances in Neural Information Processing Systems", "page_first": 542, "page_last": 548, "abstract": null, "full_text": "Improved Gaussian Mixture Density Estimates Using Bayesian Penalty Terms and Network Averaging\n\nDirk Ormoneit\nInstitut f\u00fcr Informatik (H2)\nTechnische Universit\u00e4t M\u00fcnchen\n80290 M\u00fcnchen, Germany\normoneit@informatik.tu-muenchen.de\n\nVolker Tresp\nSiemens AG\nCentral Research\n81730 M\u00fcnchen, Germany\nVolker.Tresp@zfe.siemens.de\n\nAbstract\n\nWe compare two regularization methods which can be used to improve the generalization capabilities of Gaussian mixture density estimates. The first method uses a Bayesian prior on the parameter space. We derive EM (Expectation Maximization) update rules which maximize the a posteriori parameter probability. In the second approach we apply ensemble averaging to density estimation. This includes Breiman's \"bagging\", which recently has been found to produce impressive results for classification networks.\n\n1 Introduction\n\nGaussian mixture models have recently attracted wide attention in the neural network community. Important examples of their application include the training of radial basis function classifiers, learning from patterns with missing features, and active learning. The appeal of Gaussian mixtures is based to a high degree on the applicability of the EM (Expectation Maximization) learning algorithm, which may be implemented as a fast neural network learning rule ([Now91], [Orm93]). Severe problems arise, however, due to singularities and local maxima in the log-likelihood function. 
Particularly in high-dimensional spaces these problems frequently cause the computed density estimates to possess only relatively limited generalization capabilities in terms of predicting the densities of new data points. As shown in this paper, considerably better generalization can be achieved using regularization.\n\nWe will compare two regularization methods. The first one uses a Bayesian prior on the parameters. By using conjugate priors we can derive EM learning rules for finding the MAP (maximum a posteriori probability) parameter estimate. The second approach consists of averaging the outputs of ensembles of Gaussian mixture density estimators trained on identical or resampled data sets. The latter is a form of \"bagging\" which was introduced by Breiman ([Bre94]) and which has recently been found to produce impressive results for classification networks. By using the regularized density estimators in a Bayes classifier ([THA93], [HT94], [KL95]), we demonstrate that both methods lead to density estimates which are superior to the unregularized Gaussian mixture estimate.\n\n2 Gaussian Mixtures and the EM Algorithm\n\nConsider the problem of estimating the probability density of a continuous random vector $x \\in \\mathbb{R}^d$ based on a set $x^* = \\{x^k \\mid 1 \\le k \\le m\\}$ of i.i.d. realizations of $x$. As a density model we choose the class of Gaussian mixtures $p(x|\\Theta) = \\sum_{i=1}^{n} \\kappa_i \\, p(x|i, \\mu_i, \\Sigma_i)$, where the restrictions $\\kappa_i \\ge 0$ and $\\sum_{i=1}^{n} \\kappa_i = 1$ apply. $\\Theta$ denotes the parameter vector $(\\kappa_i, \\mu_i, \\Sigma_i)_{i=1}^{n}$. The $p(x|i, \\mu_i, \\Sigma_i)$ are multivariate normal densities:\n\n$$p(x|i, \\mu_i, \\Sigma_i) = (2\\pi)^{-d/2} |\\Sigma_i|^{-1/2} \\exp\\left[-\\tfrac{1}{2}(x - \\mu_i)^t \\Sigma_i^{-1} (x - \\mu_i)\\right].$$
\n\nThe Gaussian mixture model is well suited to approximate a wide class of continuous probability densities. Based on the model and given the data $x^*$, we may formulate the log-likelihood as\n\n$$l(\\Theta) = \\log\\left[\\prod_{k=1}^{m} p(x^k|\\Theta)\\right] = \\sum_{k=1}^{m} \\log \\sum_{i=1}^{n} \\kappa_i \\, p(x^k|i, \\mu_i, \\Sigma_i).$$\n\nMaximum likelihood parameter estimates $\\Theta$ may efficiently be computed with the EM (Expectation Maximization) algorithm ([DLR77]). It consists of the iterative application of the following two steps:\n\n1. In the E-step, based on the current parameter estimates, the posterior probability that unit $i$ is responsible for the generation of pattern $x^k$ is estimated as\n\n$$h_i^k = \\frac{\\kappa_i \\, p(x^k|i, \\mu_i, \\Sigma_i)}{\\sum_{j=1}^{n} \\kappa_j \\, p(x^k|j, \\mu_j, \\Sigma_j)}. \\qquad (1)$$\n\n2. In the M-step, we obtain new parameter estimates (denoted by the prime):\n\n$$\\kappa_i' = \\frac{1}{m} \\sum_{k=1}^{m} h_i^k \\qquad (2)$$\n\n$$\\mu_i' = \\frac{\\sum_{k=1}^{m} h_i^k x^k}{\\sum_{l=1}^{m} h_i^l} \\qquad (3)$$\n\n$$\\Sigma_i' = \\frac{\\sum_{k=1}^{m} h_i^k (x^k - \\mu_i')(x^k - \\mu_i')^t}{\\sum_{l=1}^{m} h_i^l} \\qquad (4)$$\n\nNote that $\\kappa_i'$ is a scalar, whereas $\\mu_i'$ denotes a $d$-dimensional vector and $\\Sigma_i'$ is a $d \\times d$ matrix.\n\nIt is well known that training neural networks as predictors using the maximum likelihood parameter estimate leads to overfitting. The problem of overfitting is even more severe in density estimation due to singularities in the log-likelihood function. Obviously, the model likelihood becomes infinite in a trivial way if we concentrate all the probability mass on one or several samples of the training set. In a Gaussian mixture this is just the case if the center of a unit coincides with one of the data points and $\\Sigma$ approaches the zero matrix. Figure 1 compares the true and the estimated probability density in a toy problem. 
As may be seen, the contraction of the Gaussians results in (possibly infinitely) high peaks in the Gaussian mixture density estimate. A simple way to achieve numerical stability is to artificially enforce a lower bound on the diagonal elements of $\\Sigma$. This is a very crude way of regularization, however, and usually results in low generalization capabilities. The problem becomes even more severe in high-dimensional spaces. To yield reasonable approximations, we will apply two methods of regularization, which will be discussed in the following two sections.\n\nFigure 1: True density (left) and unregularized density estimation (right).\n\n3 Bayesian Regularization\n\nIn this section we propose a Bayesian prior distribution on the Gaussian mixture parameters, which leads to a numerically stable version of the EM algorithm. We first select a family of prior distributions on the parameters which is conjugate*. Selecting a conjugate prior has a number of advantages. In particular, we obtain analytic solutions for the posterior density and the predictive density. In our case, the posterior density is a complex mixture of densities\u2020. It is possible, however, to derive EM-update rules to obtain the MAP parameter estimates.\n\nA conjugate prior of a single multivariate normal density is a product of a normal density $N(\\mu_i|\\hat{\\mu}, \\eta^{-1}\\Sigma_i)$ and a Wishart density $Wi(\\Sigma_i^{-1}|\\alpha, \\beta)$ ([Bun94]). A proper conjugate prior for the mixture weightings $\\kappa = (\\kappa_1, \\ldots, \\kappa_n)$ is a Dirichlet density $D(\\kappa|\\gamma)$\u2021. Consequently, the prior of the overall Gaussian mixture is the product $D(\\kappa|\\gamma) \\prod_{i=1}^{n} N(\\mu_i|\\hat{\\mu}, \\eta^{-1}\\Sigma_i) \\, Wi(\\Sigma_i^{-1}|\\alpha, \\beta)$. 
Our goal is to find the MAP parameter estimate, that is, the parameters which attain the maximum of the log-posterior\n\n$$l_p(\\Theta) = \\sum_{k=1}^{m} \\log \\sum_{i=1}^{n} \\kappa_i \\, p(x^k|i, \\mu_i, \\Sigma_i) + \\log D(\\kappa|\\gamma) + \\sum_{i=1}^{n} \\left[\\log N(\\mu_i|\\hat{\\mu}, \\eta^{-1}\\Sigma_i) + \\log Wi(\\Sigma_i^{-1}|\\alpha, \\beta)\\right].$$\n\n* A family $F$ of probability distributions on $\\Theta$ is said to be conjugate if, for every $\\pi \\in F$, the posterior $\\pi(\\theta|x)$ also belongs to $F$ ([Rob94]).\n\n\u2020 The posterior distribution can be written as a sum of $n^m$ simple terms.\n\n\u2021 Those densities are defined as follows ($b$ and $c$ are normalizing constants):\n\n$$D(\\kappa|\\gamma) = b \\prod_{i=1}^{n} \\kappa_i^{\\gamma_i - 1}, \\quad \\text{with } \\kappa_i \\ge 0 \\text{ and } \\sum_{i=1}^{n} \\kappa_i = 1$$\n\n$$N(\\mu_i|\\hat{\\mu}, \\eta^{-1}\\Sigma_i) = (2\\pi)^{-d/2} |\\eta^{-1}\\Sigma_i|^{-1/2} \\exp\\left[-\\tfrac{\\eta}{2}(\\mu_i - \\hat{\\mu})^t \\Sigma_i^{-1} (\\mu_i - \\hat{\\mu})\\right]$$\n\n$$Wi(\\Sigma_i^{-1}|\\alpha, \\beta) = c \\, |\\Sigma_i^{-1}|^{\\alpha - (d+1)/2} \\exp\\left[-\\mathrm{tr}(\\beta \\Sigma_i^{-1})\\right]$$\n\nAs in the unregularized case, we may use the EM-algorithm to find a local maximum of $l_p(\\Theta)$. The E-step is identical to (1). The M-step becomes\n\n$$\\kappa_i' = \\frac{\\sum_{k=1}^{m} h_i^k + \\gamma_i - 1}{m + \\sum_{j=1}^{n} \\gamma_j - n} \\qquad (5)$$\n\n$$\\mu_i' = \\frac{\\sum_{k=1}^{m} h_i^k x^k + \\eta \\hat{\\mu}}{\\sum_{l=1}^{m} h_i^l + \\eta} \\qquad (6)$$\n\n$$\\Sigma_i' = \\frac{\\sum_{k=1}^{m} h_i^k (x^k - \\mu_i')(x^k - \\mu_i')^t + \\eta(\\mu_i - \\hat{\\mu})(\\mu_i - \\hat{\\mu})^t + 2\\beta}{\\sum_{l=1}^{m} h_i^l + 2\\alpha - d} \\qquad (7)$$\n\nAs typical for conjugate priors, prior knowledge corresponds to a set of artificial training data which is also reflected in the EM-update equations. In our experiments, we focus on a prior on the variances which is implemented by $\\beta \\neq 0$, where $0$ denotes the $d \\times d$ zero matrix. All other parameters we set to \"neutral\" values:\n\n$$\\gamma_i = 1 \\;\\; \\forall i: 1 \\le i \\le n, \\quad \\alpha = (d+1)/2, \\quad \\eta = 0, \\quad \\beta = \\tilde{\\beta} I_d$$\n\n$I_d$ is the $d \\times d$ unity matrix. The choice of $\\alpha$ introduces a bias which favors large variances\u00a7. 
The effect of various values of the scalar $\\tilde{\\beta}$ on the density estimate is illustrated in figure 2. Note that if $\\tilde{\\beta}$ is chosen too small, overfitting still occurs. If it is chosen too large, on the other hand, the model is too constrained to recognize the underlying structure.\n\nFigure 2: Regularized density estimates (left: $\\tilde{\\beta} = 0.05$, right: $\\tilde{\\beta} = 0.1$).\n\nTypically, the optimal value for $\\tilde{\\beta}$ is not known a priori. The simplest procedure consists of using that $\\tilde{\\beta}$ which leads to the best performance on a validation set, analogous to the determination of the optimal weight decay parameter in neural network training. Alternatively, $\\tilde{\\beta}$ might be determined according to appropriate Bayesian methods ([Mac91]). Either way, only few additional computations are required for this method if compared with standard EM.\n\n\u00a7 If $A$ is distributed according to $Wi(A|\\alpha, \\beta)$, then $E[A^{-1}] = (\\alpha - (d+1)/2)^{-1} \\beta$. In our case $A$ is $\\Sigma_i^{-1}$, so that $E[\\Sigma_i] \\to \\infty \\cdot \\beta$ for $\\alpha \\to (d+1)/2$.
\n\n4 Averaging Gaussian Mixtures\n\nIn this section we discuss the averaging of several Gaussian mixtures to yield improved probability density estimation. The averaging over neural network ensembles has been applied previously to regression and classification tasks ([PC93]).\n\nThere are several different variants on the simple averaging idea. First, one may train all networks on the complete set of training data. The only source of disagreement between the individual predictions consists in different local solutions found by the likelihood maximization procedure due to different starting points. Disagreement is essential to yield an improvement by averaging, however, so that this procedure only seems advantageous in cases where the relation between training data and weights is extremely non-deterministic in the sense that in training, different solutions are found from different random starting points. A straightforward way to increase the disagreement is to train each network on a resampled version of the original data set. If we resample the data without replacement, the size of each training set is reduced, in our experiments to 70% of the original. The averaging of neural network predictions based on resampling with replacement has recently been proposed under the name \"bagging\" by Breiman ([Bre94]), who has achieved dramatically improved results in several classification tasks. He also notes, however, that an actual improvement of the prediction can only result if the estimation procedure is relatively unstable. As discussed, this is particularly the case for Gaussian mixture training. We therefore expect bagging to be well suited for our task.\n\n5 Experiments and Results\n\nTo assess the practical advantage resulting from regularization, we used the density estimates to construct classifiers and compared the resulting prediction accuracies using a toy problem and a real-world problem. The reason is that the generalization error of density estimates in terms of the likelihood based on the test data is rather unintuitive whereas performance on a classification problem provides a good impression of the degree of improvement. Assume we have a set of $N$ labeled data $z^* = \\{(x^k, l^k) \\mid k = 1, \\ldots, N\\}$, where $l^k \\in Y = \\{1, \\ldots, C\\}$ denotes the class label of each input $x^k$. A classifier of new inputs $x$ is yielded by choosing the class $l$ with the maximum posterior class-probability $p(l|x)$. 
The posterior probabilities may be derived from the class-conditional data likelihood $p(x|l)$ via Bayes' theorem: $p(l|x) = p(x|l)p(l)/p(x) \\propto p(x|l)p(l)$. The resulting partitions of the input space are optimal for the true $p(l|x)$. A viable way to approximate the posterior $p(l|x)$ is to estimate $p(x|l)$ and $p(l)$ from the sample data.\n\n5.1 Toy Problem\n\nIn the toy classification problem the task is to discriminate the two classes of circularly arranged data shown in figure 3. We generated 200 data points for each class and subdivided them into two sets of 100 data points. The first was used for training, the second to test the generalization performance. As a network architecture we chose a Gaussian mixture with 20 units.\n\nFigure 3: Toy Classification Task.\n\nTable 1 summarizes the results, beginning with the unregularized Gaussian mixture which is followed by the averaging and the Bayesian penalty approaches. The three rows for averaging correspond to the results yielded without applying resampling (local max.), with resampling without replacement (70% subsets), and with resampling with replacement (bagging). The performances on training and test set are measured in terms of the model log-likelihood. Larger values indicate a better performance. We report separate results for class A and B, since the densities of both were estimated separately. The final column shows the prediction accuracy in terms of the percentage of correctly classified data in the test set. We report the average results from 20 experiments. The numbers in brackets denote the standard deviations $\\sigma$ of the results. Multiplying $\\sigma$ with $t_{19;95\\%}/\\sqrt{20} = 0.4680$ yields 95% confidence intervals. 
The best result in each category is underlined.\n\nAlgorithm | Training log-likelihood, class A | Training log-likelihood, class B | Test log-likelihood, class A | Test log-likelihood, class B | Test accuracy\nunreg. | -120.8 (13.3) | -120.4 (10.8) | -224.9 (32.6) | -241.9 (34.1) | 80.6% (2.8)\nAveraging: local max. | -115.6 (6.0) | -112.6 (6.6) | -200.9 (13.9) | -209.1 (16.3) | 81.8% (3.1)\nAveraging: 70% subset | -106.8 (5.8) | -105.1 (6.7) | -188.8 (9.5) | -196.4 (11.3) | 83.2% (2.9)\nAveraging: bagging | -83.8 (4.9) | -83.1 (7.1) | -194.2 (7.3) | -200.1 (11.3) | 82.6% (3.4)\nPenalty: $\\beta = 0.01$ | -149.3 (18.5) | -146.5 (5.9) | -186.2 (13.9) | -182.9 (11.6) | 83.1% (2.9)\nPenalty: $\\beta = 0.02$ | -156.0 (16.5) | -153.0 (4.8) | -177.1 (11.8) | -174.9 (7.0) | 84.4% (6.3)\nPenalty: $\\beta = 0.05$ | -173.9 (24.3) | -167.0 (15.8) | -182.0 (20.1) | -173.9 (14.3) | 81.5% (5.9)\nPenalty: $\\beta = 0.1$ | -183.0 (21.9) | -181.9 (21.1) | -184.6 (21.0) | -182.5 (21.1) | 78.5% (5.1)\n\nTable 1: Performances in the toy classification problem.\n\nAs expected, all regularization methods outperform the maximum likelihood approach in terms of correct classification. The performance of the Bayesian regularization is hereby very sensitive to the appropriate choice of the regularization parameter $\\beta$. Optimality of $\\beta$ with respect to the density prediction and optimality with respect to prediction accuracy on the test set roughly coincide (for $\\beta = 0.02$). Averaging is inferior to the Bayesian approach if an optimal $\\beta$ is chosen.\n\n5.2 BUPA Liver Disorder Classification\n\nAs a second task we applied our methods to a real-world decision problem from the medical environment. The problem is to detect liver disorders which might arise from excessive alcohol consumption. 
Available information consists of five blood tests as well as a measure of the patients' daily alcohol consumption. We subdivided the 345 available samples into a training set of 200 and a test set of 145 samples. Because relatively few data were available, we did not try to determine the optimal regularization parameter using a validation process and will report results on the test set for different parameter values.\n\nAlgorithm | Accuracy\nunregularized | 64.8%\nBayesian penalty ($\\beta = 0.05$) | 65.5%\nBayesian penalty ($\\beta = 0.10$) | 66.9%\nBayesian penalty ($\\beta = 0.20$) | 61.4%\naveraging (local maxima) | 65.5%\naveraging (70% subset) | 72.4%\naveraging (bagging) | 71.0%\n\nTable 2: Performances in the liver disorder classification problem.\n\nThe results of our experiments are shown in table 2. Again, both regularization methods led to an improvement in prediction accuracy. In contrast to the toy problem, the averaged predictor was superior to the Bayesian approach here. Note that the resampling led to an improvement of more than five percentage points compared to unresampled averaging.\n\n6 Conclusion\n\nWe proposed a Bayesian and an averaging approach to regularize Gaussian mixture density estimates. In comparison with the maximum likelihood solution both approaches led to considerably improved results as demonstrated using a toy problem and a real-world classification task. Interestingly, none of the methods outperformed the other in both tasks. This might be explained with the fact that Gaussian mixture density estimates are particularly unstable in high-dimensional spaces with relatively few data. The benefit of averaging might thus be greater in this case. 
Averaging proved to be particularly effective if applied in connection with resampling of the training data, which agrees with results in regression and classification tasks. If compared to Bayesian regularization, averaging is computationally expensive. On the other hand, Bayesian approaches typically require the determination of hyperparameters (in our case $\\beta$), which is not the case for averaging approaches.\n\nReferences\n\n[Bre94] L. Breiman. Bagging predictors. Technical report, UC Berkeley, 1994.\n[Bun94] W. Buntine. Operations for learning with graphical models. Journal of Artificial Intelligence Research, 2:159-225, 1994.\n[DLR77] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Society B, 1977.\n[HT94] T. Hastie and R. Tibshirani. Discriminant analysis by gaussian mixtures. Technical report, AT&T Bell Labs and University of Toronto, 1994.\n[KL95] N. Kambhatla and T. K. Leen. Classifying with gaussian mixtures and clusters. In Advances in Neural Information Processing Systems 7. Morgan Kaufman, 1995.\n[Mac91] D. MacKay. Bayesian Modelling and Neural Networks. PhD thesis, California Institute of Technology, Pasadena, 1991.\n[Now91] S. J. Nowlan. Soft Competitive Adaption: Neural Network Learning Algorithms based on Fitting Statistical Mixtures. PhD thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, 1991.\n[Orm93] D. Ormoneit. Estimation of probability densities using neural networks. Master's thesis, Technische Universit\u00e4t M\u00fcnchen, 1993.\n[PC93] M. P. Perrone and L. N. Cooper. When networks disagree: Ensemble methods for hybrid neural networks. In Neural Networks for Speech and Image Processing. Chapman Hall, 1993.
\n[Rob94] C. P. Robert. The Bayesian Choice. Springer-Verlag, 1994.\n[THA93] V. Tresp, J. Hollatz, and S. Ahmad. Network structuring and training using rule-based knowledge. In Advances in Neural Information Processing Systems 5. Morgan Kaufman, 1993.\n", "award": [], "sourceid": 1036, "authors": [{"given_name": "Dirk", "family_name": "Ormoneit", "institution": null}, {"given_name": "Volker", "family_name": "Tresp", "institution": null}]}