{"title": "Mixture Density Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 279, "page_last": 285, "abstract": null, "full_text": "Mixture Density Estimation \n\nJonathan Q. Li\nDepartment of Statistics\nYale University\nP.O. Box 208290\nNew Haven, CT 06520\nQiang.Li@aya.yale.edu \n\nAndrew R. Barron\nDepartment of Statistics\nYale University\nP.O. Box 208290\nNew Haven, CT 06520\nAndrew.Barron@yale.edu \n\nAbstract \n\nGaussian mixtures (or so-called radial basis function networks) for density estimation provide a natural counterpart to sigmoidal neural networks for function fitting and approximation. In both cases, it is possible to give simple expressions for the iterative improvement of performance as components of the network are introduced one at a time. In particular, for mixture density estimation we show that a k-component mixture estimated by maximum likelihood (or by an iterative likelihood improvement that we introduce) achieves log-likelihood within order 1/k of the log-likelihood achievable by any convex combination. Consequences for approximation and estimation using Kullback-Leibler risk are also given. A Minimum Description Length principle selects the optimal number of components k that minimizes the risk bound. \n\n1  Introduction \n\nIn density estimation, Gaussian mixtures provide flexible-basis representations for densities that can be used to model heterogeneous data in high dimensions. We introduce an index of regularity c_f of density functions f with respect to mixtures of densities from a given family. Mixture models with k components are shown to achieve Kullback-Leibler approximation error bounded by c_f²/k for every k. Thus, in a manner analogous to the treatment of sinusoidal and sigmoidal networks in Barron [1],[2], we find classes of density functions f such that reasonable-size networks (not exponentially large as a function of the input dimension) achieve suitable approximation and estimation error. \n\nConsider a parametric family G = {φ_θ(x), x ∈ X ⊂ R^d' : θ ∈ Θ ⊂ R^d} of probability density functions parameterized by θ ∈ Θ. Then consider the class C = CONV(G) of density functions for which there is a mixture representation of the form \n\nf_P(x) = ∫_Θ φ_θ(x) P(dθ),   (1) \n\nwhere the φ_θ(x) are density functions from G and P is a probability measure on Θ. \n\nThe main theme of the paper is to give approximation and estimation bounds for arbitrary densities by finite mixture densities. We focus our attention on densities inside C first and give an approximation error bound by finite mixtures for arbitrary f ∈ C. The approximation error is measured by the Kullback-Leibler divergence between two densities, defined as \n\nD(f||g) = ∫ f(x) log[f(x)/g(x)] dx. \n\nIn density estimation, D is more natural to use than the L2 distance often seen in the function-fitting literature. Indeed, D is invariant under scale transformations (and other one-to-one transformations of the variables) and it has an intrinsic connection with maximum likelihood, one of the most useful methods in mixture density estimation.
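\n\nAs a concrete illustration of these objects, the following minimal sketch evaluates a finite mixture of the form (1) with a discrete mixing measure P and approximates D(f||g) by integration on a grid. The one-dimensional Gaussian location family, the fixed scale, and the particular components and grid are illustrative assumptions only.\n\nimport numpy as np\n\ndef gaussian(x, theta, sigma=1.0):\n    # component density φ_θ(x): Gaussian with location theta and scale sigma\n    return np.exp(-0.5 * ((x - theta) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))\n\ndef mixture(x, thetas, weights, sigma=1.0):\n    # finite mixture f_P(x) = Σ_i w_i φ_θi(x), i.e. a discrete mixing measure P\n    return sum(w * gaussian(x, t, sigma) for w, t in zip(weights, thetas))\n\ndef kl_divergence(f_vals, g_vals, dx):\n    # D(f||g) = ∫ f log(f/g), approximated by a Riemann sum with grid spacing dx\n    mask = f_vals > 0\n    return np.sum(f_vals[mask] * np.log(f_vals[mask] / g_vals[mask])) * dx\n\nx = np.linspace(-10.0, 10.0, 4001)\ndx = x[1] - x[0]\nf = mixture(x, thetas=[-3.0, 0.0, 3.0], weights=[0.3, 0.4, 0.3])   # target density\ng = mixture(x, thetas=[-2.0, 2.5], weights=[0.5, 0.5])             # two-component approximation\nprint('D(f||g) =', kl_divergence(f, g, dx))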
\n\nThe following result quantifies the approximation error.\n\nTHEOREM 1  Let G = {φ_θ(x) : θ ∈ Θ} and C = CONV(G). Let f(x) = ∫ φ_θ(x) P(dθ) ∈ C. There exists f_k, a k-component mixture of the φ_θ, such that\n\nD(f||f_k) ≤ c_f² γ / k.   (2)\n\nIn the bound, we have\n\nc_f² = ∫ [∫ φ_θ²(x) P(dθ)] / [∫ φ_θ(x) P(dθ)] dx   (3)\n\nand\n\nγ = 4[log(3√e) + a],   (4)\n\nwhere\n\na = sup_{θ1,θ2,x} log[φ_θ1(x) / φ_θ2(x)].   (5)\n\nHere, a characterizes an upper bound on the log ratio of the densities in G when the parameters are restricted to Θ and the variable to X.\n\nNote that the rate of convergence, 1/k, is not related to the dimensions of Θ or X. The behavior of the constants, though, depends on the choices of G and the target f.\n\nFor example, we may take G to be the Gaussian location family, which we restrict to a set X which is a cube of side-length A. Likewise we restrict the parameters to be in the same cube. Then\n\na ≤ dA²/σ².   (6)\n\nIn this case, a is linear in the dimension.\n\nThe value of c_f² depends on the target density f. Suppose f is a finite mixture with M components; then\n\nc_f² ≤ M,   (7)\n\nwith equality if and only if those M components are disjoint. Indeed, suppose f(x) = Σ_{i=1}^M p_i φ_θi(x); then p_i φ_θi(x) / Σ_{j=1}^M p_j φ_θj(x) ≤ 1 and hence\n\nc_f² = ∫ Σ_{i=1}^M [p_i φ_θi(x) / Σ_{j=1}^M p_j φ_θj(x)] φ_θi(x) dx ≤ ∫ Σ_{i=1}^M φ_θi(x) dx = M.   (8)\n\nGenovese and Wasserman [3] deal with a similar setting; they give a Kullback-Leibler approximation bound of order 1/√k for one-dimensional mixtures of Gaussians.\n\nIn the more general case that f is not necessarily in C, we have a competitive optimality result: our density approximation is nearly at least as good as any g_P in C.\n\nTHEOREM 2  For every g_P(x) = ∫ φ_θ(x) P(dθ),\n\nD(f||f_k) ≤ D(f||g_P) + c_{f,P}² γ / k.   (9)\n\nHere,\n\nc_{f,P}² = ∫ [∫ φ_θ²(x) P(dθ)] / [∫ φ_θ(x) P(dθ)]² f(x) dx.   (10)\n\nIn particular, we can take the infimum over all g_P ∈ C and still obtain a bound. Let D(f||C) = inf_{g∈C} D(f||g). A theory of information projection shows that if there exists a sequence of f_k such that D(f||f_k) → D(f||C), then f_k converges to a function f*, which achieves D(f||C). Note that f* is not necessarily an element of C. This is developed in Li [4], building on the work of Bell and Cover [5]. As a consequence of Theorem 2 we have\n\nD(f||f_k) ≤ D(f||C) + c_{f,*}² γ / k,   (11)\n\nwhere c_{f,*}² is the smallest limit of c_{f,P}² over sequences of P for which D(f||g_P) approaches the infimum D(f||C).\n\nWe prove Theorem 1 by induction in the following section. An appealing feature of such an approach is that it provides an iterative estimation procedure which allows us to estimate one component at a time. This greedy procedure is shown to perform almost as well as the full-mixture procedures, while the computational task of estimating one component is considerably easier than estimating the full mixture.\n\nSection 2 gives the iterative construction of a suitable approximation, Section 3 shows how such mixtures may be estimated from data, and risk bounds are stated in Section 4.
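\n\nAs a small numerical illustration of the constant c_f² discussed above, note that for a discrete mixing measure P it reduces to ∫ [Σ_i w_i φ_θi²(x)] / [Σ_j w_j φ_θj(x)] dx. The sketch below (one-dimensional Gaussian components evaluated on a grid; the components, weights and grid are illustrative choices only) exhibits the behavior described around (7): well-separated components give a value close to M, while heavily overlapping components give a smaller value.\n\nimport numpy as np\n\ndef gaussian(x, theta, sigma=1.0):\n    return np.exp(-0.5 * ((x - theta) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))\n\ndef c_f_squared(thetas, weights, sigma=1.0, lo=-25.0, hi=25.0, m=20001):\n    # c_f² = ∫ [Σ_i w_i φ_θi(x)²] / [Σ_j w_j φ_θj(x)] dx for f = Σ_i w_i φ_θi,\n    # approximated by a Riemann sum on a grid covering the effective support\n    x = np.linspace(lo, hi, m)\n    dx = x[1] - x[0]\n    phis = np.array([gaussian(x, t, sigma) for t in thetas])\n    w = np.array(weights)[:, None]\n    num = np.sum(w * phis ** 2, axis=0)\n    den = np.sum(w * phis, axis=0)\n    return np.sum(num / den) * dx\n\nprint(c_f_squared([-8.0, 0.0, 8.0], [1/3, 1/3, 1/3]))   # well separated: close to M = 3\nprint(c_f_squared([-0.5, 0.0, 0.5], [1/3, 1/3, 1/3]))   # heavily overlapping: much smaller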
\n\n2  An iterative construction of the approximation \n\nWe provide an iterative construction of the f_k's in the following fashion. Suppose during our discussion of approximation that f is given. We seek a k-component mixture f_k close to f. Initialize f_1 by choosing a single component from G to minimize D(f||f_1) = D(f||φ_θ). Now suppose we have f_{k-1}(x). Then let f_k(x) = (1-α) f_{k-1}(x) + α φ_θ(x), where α and θ are chosen to minimize D(f||f_k). More generally, let f_k be any sequence of k-component mixtures, for k = 1, 2, ..., such that D(f||f_k) ≤ min_{α,θ} D(f||(1-α) f_{k-1} + α φ_θ). We prove that such sequences f_k achieve the error bounds in Theorem 1 and Theorem 2.\n\nThose familiar with the iterative Hilbert-space approximation results of Jones [6], Barron [1], and Lee, Bartlett and Williamson [7] will see that we follow a similar strategy. The use of L2 distance measures for density approximation involves L2 norms of component densities that are exponentially large with dimension. A naive Taylor expansion of the Kullback-Leibler divergence leads to an L2-norm approximation (weighted by the reciprocal of the density) for which the difficulty remains (Zeevi and Meir [8], Li [9]). The challenge for us was to adapt iterative approximation to the use of the Kullback-Leibler divergence in a manner that permits the constant a in the bound to involve the logarithm of the density ratio (rather than the ratio itself), allowing more manageable constants.\n\nThe proof establishes the inductive relationship\n\nD_k ≤ (1-α) D_{k-1} + α² B,   (12)\n\nwhere B is bounded and D_k = D(f||f_k). By choosing α_1 = 1, α_2 = 1/2 and thereafter α_k = 2/k, it is easy to see by induction that D_k ≤ 4B/k.\n\nTo get (12), we establish a quadratic upper bound for -log[(1-α) f_{k-1} + α φ_θ]. Three key analytic inequalities regarding the logarithm will be handy for us:\n\n-log(r) ≤ -(r-1) + [(-log(r_0) + r_0 - 1)/(r_0 - 1)²] (r-1)²   for r ≥ r_0 > 0,   (13)\n\n2[-log(r) + r - 1]/(r-1) ≤ log(r),   (14)\n\nand\n\n[-log(r) + r - 1]/(r-1)² ≤ 1/2 + log⁻(r),   (15)\n\nwhere log⁻(·) is the negative part of the logarithm. The proof of inequality (13) is done by verifying that [-log(r) + r - 1]/(r-1)² is monotone decreasing in r. Inequalities (14) and (15) are shown by separately considering the cases r < 1 and r > 1 (as well as the limit as r → 1). To get the inequalities one multiplies through by (r-1) or (r-1)², respectively, and then takes derivatives to obtain suitable monotonicity in r as one moves away from r = 1.
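\n\nThese inequalities are easy to misstate, so a quick numerical spot-check can be reassuring; it is not a substitute for the monotonicity arguments just described. In the sketch below, the grid of r values and the choice r_0 = 1/4 are arbitrary; (13) is checked for r ≥ r_0 and (14)-(15) for r on either side of 1.\n\nimport numpy as np\n\ndef log_minus(r):\n    # negative part of the logarithm, log⁻(r) = max(-log r, 0)\n    return np.maximum(-np.log(r), 0.0)\n\nr0 = 0.25\nr = np.linspace(r0, 10.0, 2000)\nrhs13 = -(r - 1) + (-np.log(r0) + r0 - 1) / (r0 - 1) ** 2 * (r - 1) ** 2\nprint('(13) holds:', np.all(-np.log(r) <= rhs13 + 1e-12))\n\nr = np.concatenate([np.linspace(0.01, 0.999, 1000), np.linspace(1.001, 10.0, 1000)])\nprint('(14) holds:', np.all(2 * (-np.log(r) + r - 1) / (r - 1) <= np.log(r) + 1e-12))\nprint('(15) holds:', np.all((-np.log(r) + r - 1) / (r - 1) ** 2 <= 0.5 + log_minus(r) + 1e-12))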
\n\nNow apply inequality (13) with r = [(1-α) f_{k-1} + α φ_θ]/g and r_0 = (1-α) f_{k-1}/g, where g is an arbitrary density in C with g = ∫ φ_θ P(dθ). Note that r ≥ r_0 in this case because α φ_θ/g ≥ 0. Plug r = r_0 + α φ_θ/g into the right side of (13) and expand the square. Then we get\n\n-log(r) ≤ -log(r_0) - α φ_θ/g + α² (φ_θ²/g²) [-log(r_0) + r_0 - 1]/(r_0 - 1)² + α (φ_θ/g) 2[-log(r_0) + r_0 - 1]/(r_0 - 1).\n\nNow apply (14) and (15), respectively. We get\n\n-log(r) ≤ -log(r_0) - α φ_θ/g + α² (φ_θ²/g²)(1/2 + log⁻(r_0)) + α (φ_θ/g) log(r_0).   (16)\n\nNote that in our application r_0 is (1-α) times a ratio of densities in C. Thus we obtain an upper bound for log⁻(r_0) involving a. Indeed we find that 1/2 + log⁻(r_0) ≤ γ/4, where γ is as defined in the theorem.\n\nIn the case that f is in C, we take g = f. Then, taking the expectation with respect to f of both sides of (16), we acquire a quadratic upper bound for D_k, noting that r = f_k/f. Also note that D_k is a function of θ. The greedy algorithm chooses θ to minimize D_k(θ). Therefore\n\nD_k ≤ min_θ D_k(θ) ≤ ∫ D_k(θ) P(dθ).   (17)\n\nPlugging the upper bound (16) for D_k(θ) into (17), we have\n\nD_k ≤ ∫_Θ ∫_X [-log(r_0) - α φ_θ/g + α² (φ_θ²/g²)(γ/4) + α (φ_θ/g) log(r_0)] f(x) dx P(dθ),   (18)\n\nwhere r_0 = (1-α) f_{k-1}(x)/g(x) and P is chosen to satisfy ∫_Θ φ_θ(x) P(dθ) = g(x). Thus\n\nD_k ≤ (1-α) D_{k-1} + α² ∫ [∫ φ_θ²(x) P(dθ)/(g(x))²] f(x) dx (γ/4) + α log(1-α) - α - log(1-α).   (19)\n\nIt can be shown that α log(1-α) - α - log(1-α) ≤ 0. Thus we have the desired inductive relationship\n\nD_k ≤ (1-α) D_{k-1} + α² c_{f,P}² (γ/4).   (20)\n\nTherefore, D_k ≤ γ c_f²/k.\n\nIn the case that f does not have a mixture representation of the form ∫ φ_θ P(dθ), i.e. f is outside the convex hull C, we take D_k to be ∫ f(x) log[g_P(x)/f_k(x)] dx for any given g_P(x) = ∫ φ_θ(x) P(dθ). The above analysis then yields D_k = D(f||f_k) - D(f||g_P) ≤ γ c_{f,P}²/k, as desired. That completes the proof of Theorems 1 and 2.\n\n3  A greedy estimation procedure \n\nThe connection between the K-L divergence and the MLE helps to motivate the following estimation procedure for f_k when we have data X_1, ..., X_n sampled from f. The iterative construction of f_k can be turned into a sequential maximum likelihood estimation by changing min D(f||f_k) to max Σ_{i=1}^n log f_k(X_i) at each step. A surprising result is that the resulting estimator f̂_k has a log-likelihood almost at least as high as the log-likelihood achieved by any density g_P in C, with a difference of order 1/k. We formally state it as\n\n(1/n) Σ_{i=1}^n log f̂_k(X_i) ≥ (1/n) Σ_{i=1}^n log g_P(X_i) - γ c_{F_n,P}²/k   (21)\n\nfor all g_P ∈ C. Here F_n is the empirical distribution, for which c_{F_n,P}² = (1/n) Σ_{i=1}^n c_{X_i,P}², where\n\nc_{x,P}² = [∫ φ_θ²(x) P(dθ)] / [∫ φ_θ(x) P(dθ)]².   (22)\n\nThe proof of this result (21) follows as in the proof in the last section, except that now we take D_k = E_{F_n} log[g_P(X)/f̂_k(X)] to be the expectation with respect to F_n instead of with respect to the density f.\n\nLet us look at the computation at each step to see the benefits this new greedy procedure can bring. We have f̂_k(x) = (1-α) f̂_{k-1}(x) + α φ_θ(x), with θ and α chosen to maximize\n\nΣ_{i=1}^n log[(1-α) f̂_{k-1}(X_i) + α φ_θ(X_i)],   (23)\n\nwhich is a simple two-component mixture problem, with one of the two components, f̂_{k-1}(x), fixed. To achieve the bound in (21), α can either be chosen by this iterative maximum likelihood or it can be held fixed at each step to equal α_k (which, as before, is α_k = 2/k for k > 2). Thus one may replace the MLE computation of a k-component mixture by successive MLE computations of two-component mixtures. The resulting estimate is guaranteed to have almost at least as high a likelihood as is achieved by any mixture density.
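\n\nA minimal sketch of this greedy procedure is given below for a one-dimensional Gaussian location family with fixed scale σ. The weight α_k is held fixed at the schedule above, and the new location θ is chosen by a simple search over the sample points; the family, the fixed scale, and that candidate set are illustrative assumptions rather than a prescription of the analysis, which only requires that each step do at least as well as the best single added component.\n\nimport numpy as np\n\ndef gaussian(x, theta, sigma):\n    return np.exp(-0.5 * ((x - theta) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))\n\ndef greedy_mixture_mle(data, k_max, sigma=1.0):\n    # Greedy likelihood maximization: at step k fit (1-α_k) f_{k-1} + α_k φ_θ,\n    # with α_1 = 1, α_2 = 1/2 and α_k = 2/k thereafter, and θ chosen to\n    # maximize the likelihood over the candidate locations (here, the data).\n    thetas, weights = [], []\n    fk = np.zeros_like(data)                  # current mixture evaluated at the data\n    for k in range(1, k_max + 1):\n        alpha = 1.0 if k == 1 else (0.5 if k == 2 else 2.0 / k)\n        best_theta, best_ll = None, -np.inf\n        for theta in data:\n            cand = (1 - alpha) * fk + alpha * gaussian(data, theta, sigma)\n            ll = np.sum(np.log(cand))\n            if ll > best_ll:\n                best_theta, best_ll = theta, ll\n        fk = (1 - alpha) * fk + alpha * gaussian(data, best_theta, sigma)\n        weights = [(1 - alpha) * w for w in weights] + [alpha]\n        thetas.append(best_theta)\n    return thetas, weights\n\nrng = np.random.default_rng(0)\ndata = np.concatenate([rng.normal(-3.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])\nthetas, weights = greedy_mixture_mle(data, k_max=5)\nprint(list(zip(np.round(thetas, 2), np.round(weights, 3))))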
\n\nA disadvantage of the greedy procedure is that it may take a number of steps to adequately downweight poor initial choices. Thus it is advisable at each step to re-tune the weights of the convex combination of previous components (and perhaps even to adjust the locations of these components), in which case the result from the previous iteration (with k-1 components) provides a natural initialization for the search at step k. The good news is that as long as, for each k, given f̂_{k-1}, the estimate f̂_k is chosen among k-component mixtures to achieve likelihood at least as large as that of the choice achieving max_θ Σ_{i=1}^n log[(1-α_k) f̂_{k-1}(X_i) + α_k φ_θ(X_i)], that is, as long as we require that\n\nΣ_{i=1}^n log f̂_k(X_i) ≥ max_θ Σ_{i=1}^n log[(1-α_k) f̂_{k-1}(X_i) + α_k φ_θ(X_i)],   (24)\n\nthen the conclusion (21) will follow.\n\nIn particular, our likelihood results and risk bound results apply both to the case that f̂_k is taken to be the global maximizer of the likelihood over k-component mixtures and to the case that f̂_k is the result of the greedy procedure.\n\n4  Risk bounds for the MLE and the iterative MLE \n\nThe metric entropy of the family G is controlled to obtain the risk bound and to determine the precisions with which the coordinates of the parameter space are allowed to be represented. Specifically, the following Lipschitz condition is assumed: for θ ∈ Θ ⊂ R^d and x ∈ X ⊂ R^d,\n\nsup_{x∈X} |log φ_θ(x) - log φ_θ'(x)| ≤ B Σ_{j=1}^d |θ_j - θ'_j|,   (25)\n\nwhere θ_j is the j-th coordinate of the parameter vector. Note that such a condition is satisfied by a Gaussian family with x restricted to a cube with side-length A and a location parameter θ that is also prescribed to be in the same cube. In particular, if we let the variance be σ², we may set B = 2A/σ².\n\nNow we can state the bound on the K-L risk of f̂_k.\n\nTHEOREM 3  Assume condition (25). Also assume Θ to be a cube with side-length A. Let f̂_k(x) be either the maximizer of the likelihood over k-component mixtures or, more generally, any sequence of density estimates f̂_k satisfying (24). We have\n\nE[D(f||f̂_k)] - D(f||C) ≤ γ c_{f,*}²/k + γ (2kd/n) log(nABe).   (26)\n\nFrom the bound on the risk, a best choice of k would be of order roughly √n, leading to a bound on E[D(f||f̂_k)] - D(f||C) of order 1/√n to within logarithmic factors. However, the best such bound occurs with k = c_{f,*} √n / √(2d log(nABe)), which is not available when the value of c_{f,*} is unknown. More importantly, k should not be chosen merely to optimize an upper bound on risk, but rather to balance whatever approximation and estimation sources of error actually occur. Toward this end we optimize a penalized likelihood criterion related to the minimum description length principle, following Barron and Cover [10].
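\n\nTo see how the two terms in (26) trade off, the following sketch evaluates the bound over k for several sample sizes and reports the minimizing k; the values of c_{f,*}², γ, d, A and B are placeholders chosen only for illustration, not quantities computed for any particular family.\n\nimport numpy as np\n\ndef best_k(n, c2=4.0, gamma=10.0, d=1, A=10.0, B=2.0):\n    # evaluate γ c²/k + γ (2 k d / n) log(n A B e) over k = 1..n and minimize\n    ks = np.arange(1, n + 1)\n    bound = gamma * c2 / ks + gamma * (2 * ks * d / n) * np.log(n * A * B * np.e)\n    i = int(np.argmin(bound))\n    return int(ks[i]), float(bound[i])\n\nfor n in [100, 1000, 10000, 100000]:\n    k_star, value = best_k(n)\n    print(n, k_star, round(value, 3))   # the minimizing k grows roughly like √n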
\n\nLet l(k) be a function of k that satisfies Σ_{k=1}^∞ e^{-l(k)} ≤ 1, such as l(k) = 2 log(k+1). A penalized MLE (or MDL) procedure picks k by minimizing\n\n(1/n) Σ_{i=1}^n log[1/f̂_k(X_i)] + (2kd/n) log(nABe) + 2 l(k)/n.   (27)\n\nThen, for the resulting estimate, we have\n\nE[D(f||f̂_k)] - D(f||C) ≤ min_k { γ c_{f,*}²/k + γ (2kd/n) log(nABe) + 2 l(k)/n }.   (28)\n\nA proof of these risk bounds is given in Li [4]. It builds on general results for maximum likelihood and penalized maximum likelihood procedures.\n\nRecently, Dasgupta [11] has established a randomized algorithm for estimating mixtures of Gaussians, in the case that the data are drawn from a finite mixture of sufficiently separated Gaussian components with common covariance, that runs in time linear in the dimension and quadratic in the sample size. However, present forms of his algorithm require impractically large sample sizes to get reasonably accurate estimates of the density, and it is not yet known how his techniques will work for more general mixtures. Here we see that iterative likelihood maximization provides a better relationship between accuracy, sample size and number of components.\n\nReferences \n\n[1] Barron, Andrew (1993) Universal Approximation Bounds for Superpositions of a Sigmoidal Function. IEEE Transactions on Information Theory 39, No. 3: 930-945.\n[2] Barron, Andrew (1994) Approximation and Estimation Bounds for Artificial Neural Networks. Machine Learning 14: 115-133.\n[3] Genovese, Chris and Wasserman, Larry (1998) Rates of Convergence for the Gaussian Mixture Sieve. Manuscript.\n[4] Li, Jonathan Q. (1999) Estimation of Mixture Models. Ph.D. Dissertation, Department of Statistics, Yale University.\n[5] Bell, Robert and Cover, Thomas (1988) Game-theoretic optimal portfolios. Management Science 34: 724-733.\n[6] Jones, Lee (1992) A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. Annals of Statistics 20: 608-613.\n[7] Lee, W.S., Bartlett, P.L. and Williamson, R.C. (1996) Efficient Agnostic Learning of Neural Networks with Bounded Fan-in. IEEE Transactions on Information Theory 42, No. 6: 2118-2132.\n[8] Zeevi, Assaf and Meir, Ronny (1997) Density Estimation Through Convex Combinations of Densities: Approximation and Estimation Bounds. Neural Networks 10, No. 1: 99-109.\n[9] Li, Jonathan Q. (1997) Iterative Estimation of Mixture Models. Ph.D. Prospectus, Department of Statistics, Yale University.\n[10] Barron, Andrew and Cover, Thomas (1991) Minimum Complexity Density Estimation. IEEE Transactions on Information Theory 37: 1034-1054.\n[11] Dasgupta, Sanjoy (1999) Learning Mixtures of Gaussians. Proc. IEEE Conf. on Foundations of Computer Science, 634-644.\n", "award": [], "sourceid": 1673, "authors": [{"given_name": "Jonathan", "family_name": "Li", "institution": null}, {"given_name": "Andrew", "family_name": "Barron", "institution": null}]}