{"title": "Mixture Density Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 279, "page_last": 285, "abstract": null, "full_text": "Mixture Density Estimation \n\nJonathan Q. Li \n\nDepartment of Statistics \n\nYale University \nP.O. Box 208290 \n\nNew Haven, CT 06520 \nQiang.Li@aya.yale. edu \n\nAndrew R. Barron \nDepartment of Statistics \n\nYale University \nP.O. Box 208290 \n\nNew Haven, CT 06520 \nAndrew. Barron@yale. edu \n\nAbstract \n\nGaussian mixtures (or so-called radial basis function networks) for \ndensity estimation provide a natural counterpart to sigmoidal neu(cid:173)\nral networks for function fitting and approximation. In both cases, \nit is possible to give simple expressions for the iterative improve(cid:173)\nment of performance as components of the network are introduced \none at a time. In particular, for mixture density estimation we show \nthat a k-component mixture estimated by maximum likelihood (or \nby an iterative likelihood improvement that we introduce) achieves \nlog-likelihood within order 1/k of the log-likelihood achievable by \nany convex combination. Consequences for approximation and es(cid:173)\ntimation using Kullback-Leibler risk are also given. A Minimum \nDescription Length principle selects the optimal number of compo(cid:173)\nnents k that minimizes the risk bound. \n\n1 \n\nIntroduction \n\nIn density estimation, Gaussian mixtures provide flexible-basis representations for \ndensities that can be used to model heterogeneous data in high dimensions. We \nintroduce an index of regularity C f of density functions f with respect to mixtures \nof densities from a given family. Mixture models with k components are shown to \nachieve Kullback-Leibler approximation error bounded by c}/k for every k. 
Thus, in a manner analogous to the treatment of sinusoidal and sigmoidal networks in Barron [1],[2], we find classes of density functions $f$ such that reasonable-size networks (not exponentially large as a function of the input dimension) achieve suitable approximation and estimation error.

Consider a parametric family $\mathcal{G} = \{\phi_\theta(x) : \theta \in \Theta\}$ and let $\mathcal{C} = \mathrm{conv}(\mathcal{G})$ be its convex hull.

THEOREM 1 Let $f(x) = \int \phi_\theta(x)\,P(d\theta) \in \mathcal{C}$. There exists $f_k$, a $k$-component mixture of the $\phi_\theta$, such that

$$D(f\,\|\,f_k) \;\le\; \frac{c_f^2\,\gamma}{k}. \qquad (2)$$

In the bound, we have

$$c_f^2 \;=\; \int \frac{\int \phi_\theta^2(x)\,P(d\theta)}{\int \phi_\theta(x)\,P(d\theta)}\,dx \qquad (3)$$

and

$$\gamma \;=\; 4\,[\log(3\sqrt{e}) + a], \qquad (4)$$

where

$$a \;=\; \sup_{\theta_1,\theta_2,x} \log \frac{\phi_{\theta_1}(x)}{\phi_{\theta_2}(x)}. \qquad (5)$$

Here $a$ characterizes an upper bound on the log ratio of the densities in $\mathcal{G}$, when the parameters are restricted to $\Theta$ and the variable to $\mathcal{X}$.

Note that the rate of convergence, $1/k$, is not related to the dimensions of $\Theta$ or $\mathcal{X}$. The behavior of the constants, though, depends on the choices of $\mathcal{G}$ and the target $f$.

For example, we may take $\mathcal{G}$ to be the Gaussian location family, restricted to a set $\mathcal{X}$ which is a cube of side-length $A$; likewise we restrict the parameters to the same cube. Then

$$a \;\le\; \frac{dA^2}{\sigma^2}. \qquad (6)$$

In this case, $a$ is linear in the dimension $d$.

The value of $c_f^2$ depends on the target density $f$. Suppose $f$ is a finite mixture with $M$ components. Then

$$c_f^2 \;\le\; M, \qquad (7)$$

with equality if and only if those $M$ components are disjoint. Indeed, suppose $f(x) = \sum_{i=1}^M p_i\,\phi_{\theta_i}(x)$. Then $p_i\,\phi_{\theta_i}(x) \big/ \sum_{j=1}^M p_j\,\phi_{\theta_j}(x) \le 1$, and hence

$$c_f^2 \;=\; \int \sum_{i=1}^M \frac{p_i\,\phi_{\theta_i}(x)}{\sum_{j=1}^M p_j\,\phi_{\theta_j}(x)}\,\phi_{\theta_i}(x)\,dx \;\le\; \int \sum_{i=1}^M \phi_{\theta_i}(x)\,dx \;=\; M. \qquad (8)$$

Genovese and Wasserman [3] deal with a similar setting; they give a Kullback-Leibler approximation bound of order $1/\sqrt{k}$ for one-dimensional mixtures of Gaussians.

In the more general case that $f$ is not necessarily in $\mathcal{C}$, we have a competitive optimality result.
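As a quick numerical illustration of the bound $c_f^2 \le M$ in (7), the following sketch (NumPy assumed; the grid and the particular component choices are ours, for illustration only) evaluates $c_f^2$ by Riemann sum for a well-separated three-component Gaussian mixture:

```python
import numpy as np

def gauss(x, mu, sigma):
    # Gaussian density phi_theta(x) with theta = (mu, sigma)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# f = sum_i p_i * phi_{theta_i}: three nearly disjoint components, so M = 3
x = np.linspace(-10.0, 30.0, 20001)
dx = x[1] - x[0]
p = np.array([0.2, 0.3, 0.5])
comps = np.array([gauss(x, mu, 1.0) for mu in (0.0, 10.0, 20.0)])

f = p @ comps               # mixture density on the grid: int phi_theta dP
num = p @ comps ** 2        # int phi_theta^2 dP for this discrete P
cf2 = np.sum(num / f) * dx  # c_f^2 = int (int phi^2 dP) / (int phi dP) dx
M = 3
```

Because the components barely overlap, $c_f^2$ comes out just under $M = 3$; pushing the means together drives it down toward 1.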
Our density approximation is then nearly as good as that of any $g_P$ in $\mathcal{C}$.

THEOREM 2 For every $g_P(x) = \int \phi_\theta(x)\,P(d\theta)$,

$$D(f\,\|\,f_k) \;\le\; D(f\,\|\,g_P) + \frac{c_{f,P}^2}{k}\,\gamma. \qquad (9)$$

Here,

$$c_{f,P}^2 \;=\; \int \frac{\int \phi_\theta^2(x)\,P(d\theta)}{\left(\int \phi_\theta(x)\,P(d\theta)\right)^2}\,f(x)\,dx. \qquad (10)$$

In particular, we can take the infimum over all $g_P \in \mathcal{C}$ and still obtain a bound. Let $D(f\,\|\,\mathcal{C}) = \inf_{g \in \mathcal{C}} D(f\,\|\,g)$. A theory of information projection shows that if there exists a sequence $f_k$ such that $D(f\,\|\,f_k) \to D(f\,\|\,\mathcal{C})$, then $f_k$ converges to a function $f^*$ which achieves $D(f\,\|\,\mathcal{C})$. Note that $f^*$ is not necessarily an element of $\mathcal{C}$. This is developed in Li [4], building on the work of Bell and Cover [5]. As a consequence of Theorem 2 we have

$$D(f\,\|\,f_k) \;\le\; D(f\,\|\,\mathcal{C}) + \frac{c_{f,*}^2\,\gamma}{k}, \qquad (11)$$

where $c_{f,*}^2$ is the smallest limit of $c_{f,P}^2$ over sequences of $P$ for which $D(f\,\|\,g_P)$ approaches the infimum $D(f\,\|\,\mathcal{C})$.

We prove Theorem 1 by induction in the following section. An appealing feature of this approach is that it provides an iterative estimation procedure which allows us to estimate one component at a time. This greedy procedure is shown to perform almost as well as the full-mixture procedures, while the computational task of estimating one component is considerably easier than estimating the full mixture.

Section 2 gives the iterative construction of a suitable approximation, Section 3 shows how such mixtures may be estimated from data, and risk bounds are stated in Section 4.

2 An iterative construction of the approximation

We provide an iterative construction of the $f_k$ in the following fashion. Suppose, during this discussion of approximation, that $f$ is given. We seek a $k$-component mixture $f_k$ close to $f$. Initialize $f_1$ by choosing a single component from $\mathcal{G}$ to minimize $D(f\,\|\,f_1) = D(f\,\|\,\phi_\theta)$. Now suppose we have $f_{k-1}(x)$. Then let $f_k(x) = (1-\alpha)f_{k-1}(x) + \alpha\,\phi_\theta(x)$, where $\alpha$ and $\theta$ are chosen to minimize $D(f\,\|\,f_k)$.
More generally, let $f_k$ be any sequence of $k$-component mixtures, $k = 1, 2, \ldots$, such that $D(f\,\|\,f_k) \le \min_{\alpha,\theta} D(f\,\|\,(1-\alpha)f_{k-1} + \alpha\,\phi_\theta)$. We prove that such sequences $f_k$ achieve the error bounds in Theorem 1 and Theorem 2.

Those familiar with the iterative Hilbert space approximation results of Jones [6], Barron [1], and Lee, Bartlett and Williamson [7] will see that we follow a similar strategy. The use of $L_2$ distance measures for density approximation involves $L_2$ norms of component densities that are exponentially large in the dimension. Naive Taylor expansion of the Kullback-Leibler divergence leads to an $L_2$-norm approximation (weighted by the reciprocal of the density) for which the difficulty remains (Zeevi & Meir [8], Li [9]). The challenge for us was to adapt iterative approximation to the use of Kullback-Leibler divergence in a manner that permits the constant $a$ in the bound to involve the logarithm of the density ratio (rather than the ratio itself), to allow more manageable constants.

The proof establishes the inductive relationship

$$D_k \;\le\; (1-\alpha)\,D_{k-1} + \alpha^2 B, \qquad (12)$$

where $B$ is bounded and $D_k = D(f\,\|\,f_k)$. By choosing $\alpha_1 = 1$, $\alpha_2 = 1/2$, and thereafter $\alpha_k = 2/k$, it is easy to see by induction that $D_k \le 4B/k$.

To get (12), we establish a quadratic upper bound for $-\log r$ with $r = \big((1-\alpha)f_{k-1} + \alpha\,\phi_\theta\big)/g$ for a reference density $g \in \mathcal{C}$. Three key analytic inequalities regarding the logarithm will be handy for us:

$$-\log(r) \;\le\; -(r-1) + \frac{-\log(r_0)+r_0-1}{(r_0-1)^2}\,(r-1)^2 \quad \text{for } r \ge r_0 > 0, \qquad (13)$$

$$\frac{-\log(r)+r-1}{(r-1)^2} \;\le\; \frac{1}{2} + \log^-(r), \qquad (14)$$

and

$$\frac{2\,[-\log(r)+r-1]}{r-1} \;\le\; \log r, \qquad (15)$$

where $\log^-(r) = \max(0, -\log r)$, and (14) and (15) are interpreted at $r = 1$ by their limits as $r \to 1$. To get the inequalities, one multiplies through by $(r-1)$ or $(r-1)^2$, respectively, and then takes derivatives to obtain suitable monotonicity in $r$ as one moves away from $r = 1$.
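The logarithm inequalities are easy to spot-check numerically. The sketch below (NumPy assumed; the grids are arbitrary) verifies the quadratic bound $-\log r \le -(r-1) + \frac{-\log r_0 + r_0 - 1}{(r_0-1)^2}(r-1)^2$ for $r \ge r_0 > 0$, together with two companion ratio bounds, $\frac{-\log r + r - 1}{(r-1)^2} \le \frac{1}{2} + \log^-(r)$ and $\frac{2[-\log r + r - 1]}{r-1} \le \log r$, stated here in the form consistent with how they are used in the sequel:

```python
import numpy as np

def gap13(r, r0):
    # RHS minus LHS of the quadratic bound; nonnegative iff the bound holds
    c = (-np.log(r0) + r0 - 1.0) / (r0 - 1.0) ** 2
    return (-(r - 1.0) + c * (r - 1.0) ** 2) + np.log(r)

# equality holds at r = r0 and r = 1, so the minimum gap should be ~0
worst13 = min(
    float(gap13(np.linspace(r0, 50.0, 4000), r0).min())
    for r0 in (0.05, 0.3, 0.7, 0.95, 1.05, 2.0, 10.0)
)

# companion bounds, on a grid avoiding the removable singularity at r = 1
r = np.concatenate([np.linspace(0.01, 0.999, 2000), np.linspace(1.001, 50.0, 2000)])
base = -np.log(r) + r - 1.0
gap14 = (0.5 + np.maximum(0.0, -np.log(r))) - base / (r - 1.0) ** 2
gap15 = np.log(r) - 2.0 * base / (r - 1.0)
```

All three gaps stay nonnegative up to floating-point error, with equality approached at $r = 1$.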
Now apply inequality (13) with $r = \big((1-\alpha)f_{k-1} + \alpha\,\phi_\theta\big)/g$ and $r_0 = (1-\alpha)f_{k-1}/g$, where $g$ is an arbitrary density in $\mathcal{C}$ with $g = \int \phi_\theta\,P(d\theta)$. Note that $r \ge r_0$ in this case because $\alpha\,\phi_\theta/g \ge 0$. Plug $r = r_0 + \alpha\,\phi_\theta/g$ into the right side of (13) and expand the square. Then we get

$$-\log(r) \;\le\; -\Big(r_0 + \alpha\frac{\phi_\theta}{g} - 1\Big) + \frac{-\log(r_0)+r_0-1}{(r_0-1)^2}\Big[(r_0-1) + \alpha\frac{\phi_\theta}{g}\Big]^2$$
$$=\; -\log(r_0) - \alpha\frac{\phi_\theta}{g} + \alpha^2\frac{\phi_\theta^2}{g^2}\cdot\frac{-\log(r_0)+r_0-1}{(r_0-1)^2} + \alpha\frac{\phi_\theta}{g}\cdot\frac{2\,[-\log(r_0)+r_0-1]}{r_0-1}.$$

Now apply (14) and (15), respectively. We get

$$-\log(r) \;\le\; -\log(r_0) - \alpha\frac{\phi_\theta}{g} + \alpha^2\frac{\phi_\theta^2}{g^2}\Big(\frac{1}{2} + \log^-(r_0)\Big) + \alpha\frac{\phi_\theta}{g}\log(r_0). \qquad (16)$$

Note that in our application $r_0$ is a ratio of densities in $\mathcal{C}$. Thus we obtain an upper bound for $\log^-(r_0)$ involving $a$. Indeed we find that $\frac{1}{2} + \log^-(r_0) \le \gamma/4$, where $\gamma$ is as defined in the theorem.

In the case that $f$ is in $\mathcal{C}$, we take $g = f$. Taking the expectation with respect to $f$ of both sides of (16), we acquire a quadratic upper bound for $D_k$, noting that $r = f_k/f$. Note also that $D_k$ is a function of $\theta$. The greedy algorithm chooses $\theta$ to minimize $D_k(\theta)$. Therefore

$$D_k \;\le\; \min_\theta D_k(\theta) \;\le\; \int D_k(\theta)\,P(d\theta). \qquad (17)$$

Plugging the upper bound (16) for $D_k(\theta)$ into (17), we have

$$D_k \;\le\; \int_\Theta \int_{\mathcal{X}} \Big[-\log(r_0) - \alpha\frac{\phi_\theta}{g} + \alpha^2\frac{\phi_\theta^2}{g^2}\,(\gamma/4) + \alpha\frac{\phi_\theta}{g}\log(r_0)\Big]\,f(x)\,dx\,P(d\theta), \qquad (18)$$

where $r_0 = (1-\alpha)f_{k-1}(x)/g(x)$ and $P$ is chosen to satisfy $\int_\Theta \phi_\theta(x)\,P(d\theta) = g(x)$. Thus

$$D_k \;\le\; (1-\alpha)\,D_{k-1} + \alpha^2\left(\int \frac{\int \phi_\theta^2(x)\,P(d\theta)}{(g(x))^2}\,f(x)\,dx\right)(\gamma/4) + \alpha\log(1-\alpha) - \alpha - \log(1-\alpha). \qquad (19)$$

It can be shown that $\alpha\log(1-\alpha) - \alpha - \log(1-\alpha) \le 0$. Thus we have the desired inductive relationship

$$D_k \;\le\; (1-\alpha)\,D_{k-1} + \alpha^2\,\frac{\gamma\,c_{f,P}^2}{4}.$$

Therefore, $D_k \le \gamma\,c_f^2/k$.

In the case that $f$ does not have a mixture representation of the form $\int \phi_\theta\,P(d\theta)$, i.e. $f$ is outside the convex hull $\mathcal{C}$, we take $D_k$ to be $\int f(x)\,\log\big(g_P(x)/f_k(x)\big)$
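The step-size schedule that solves recursion (12) can itself be checked directly: with $\alpha_1 = 1$, $\alpha_2 = 1/2$, and $\alpha_k = 2/k$ thereafter, the worst case of the recursion stays below $4B/k$ at every step. A minimal sketch (plain Python; $B$ normalized to 1):

```python
B = 1.0
D = B                      # D_1 <= alpha_1^2 * B = B, from alpha_1 = 1 in (12)
ok = True
for k in range(2, 10001):
    alpha = 0.5 if k == 2 else 2.0 / k
    # worst case of (12): take the recursive bound with equality at every step
    D = (1.0 - alpha) * D + alpha ** 2 * B
    ok = ok and (D <= 4.0 * B / k)
```

The flag `ok` stays true for all $k$ checked, matching the induction $D_k \le 4B/k$.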
$dx$ for any given $g_P(x) = \int \phi_\theta(x)\,P(d\theta)$. The above analysis then yields

$$D_k \;=\; D(f\,\|\,f_k) - D(f\,\|\,g_P) \;\le\; \frac{\gamma\,c_{f,P}^2}{k}, \qquad (20)$$

as desired. That completes the proof of Theorems 1 and 2.

3 A greedy estimation procedure

The connection between the K-L divergence and the MLE helps to motivate the following estimation procedure for $f_k$ when we have data $X_1, \ldots, X_n$ sampled from $f$. The iterative construction of $f_k$ can be turned into sequential maximum likelihood estimation by changing $\min D(f\,\|\,f_k)$ to $\max \sum_{i=1}^n \log f_k(X_i)$ at each step. A surprising result is that the resulting estimator $\hat{f}_k$ has a log-likelihood almost as high as the log-likelihood achieved by any density $g_P$ in $\mathcal{C}$, with a difference of order $1/k$. We state it formally as

$$\frac{1}{n}\sum_{i=1}^n \log \hat{f}_k(X_i) \;\ge\; \frac{1}{n}\sum_{i=1}^n \log g_P(X_i) - \frac{\gamma\,c_{F_n,P}^2}{k} \qquad (21)$$

for all $g_P \in \mathcal{C}$. Here $F_n$ is the empirical distribution, for which $c_{F_n,P}^2 = (1/n)\sum_{i=1}^n c_{X_i,P}^2$, where

$$c_{x,P}^2 \;=\; \frac{\int \phi_\theta^2(x)\,P(d\theta)}{\left(\int \phi_\theta(x)\,P(d\theta)\right)^2}. \qquad (22)$$

The proof of this result (21) follows as in the proof in the last section, except that now we take $D_k = E_{F_n} \log\big(g_P(X)/\hat{f}_k(X)\big)$, the expectation with respect to $F_n$ instead of with respect to the density $f$.

Let us look at the computation at each step to see the benefits this new greedy procedure can bring. We have $\hat{f}_k(x) = (1-\alpha)\hat{f}_{k-1}(x) + \alpha\,\phi_\theta(x)$, with $\theta$ and $\alpha$ chosen to maximize

$$\sum_{i=1}^n \log\big[(1-\alpha)\hat{f}_{k-1}(X_i) + \alpha\,\phi_\theta(X_i)\big], \qquad (23)$$

which is a simple two-component mixture problem, with one of the two components, $\hat{f}_{k-1}(x)$, fixed. To achieve the bound in (21), $\alpha$ can either be chosen by this iterative maximum likelihood or it can be held fixed at each step to equal $\alpha_k$ (which, as before, is $\alpha_k = 2/k$ for $k > 2$). Thus one may replace the MLE computation of a $k$-component mixture by successive MLE computations of two-component mixtures.
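The two-component subproblem in (23) is cheap in practice. The sketch below (NumPy assumed; the function names are ours, and we use EM as the inner optimizer for $(\alpha, \theta)$, a choice the text leaves open) grows a one-dimensional Gaussian mixture one component at a time, treating the old mixture $\hat{f}_{k-1}$ as a single fixed pseudo-component:

```python
import numpy as np

def phi(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def mix_density(x, comps):
    # comps is a list of (weight, mu, sigma) triples
    return sum(w * phi(x, m, s) for w, m, s in comps)

def add_component(data, comps, n_em=100, sigma_floor=1e-3):
    """One greedy step: fit (alpha, mu, sigma) in (1-alpha)*f_old + alpha*phi
    by EM, with the old mixture f_old held fixed as one of the two components."""
    if not comps:  # k = 1: the single-Gaussian MLE fit
        return [(1.0, data.mean(), max(data.std(), sigma_floor))]
    f_old = mix_density(data, comps)
    mu = data[np.argmin(f_old)]      # start at the least well-explained point
    sigma, alpha = data.std(), 0.5
    for _ in range(n_em):
        new = alpha * phi(data, mu, sigma)
        resp = new / ((1.0 - alpha) * f_old + new)        # E-step
        alpha = resp.mean()                               # M-step for alpha
        mu = (resp * data).sum() / resp.sum()             # M-step for mu
        var = (resp * (data - mu) ** 2).sum() / resp.sum()
        sigma = max(np.sqrt(var), sigma_floor)
    return [((1.0 - alpha) * w, m, s) for w, m, s in comps] + [(alpha, mu, sigma)]

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 300)])

comps, lls = [], []
for k in range(1, 4):
    comps = add_component(data, comps)
    lls.append(float(np.log(mix_density(data, comps)).sum()))
```

Each step only ever solves a two-component problem, yet by (21) the final $\hat{f}_k$ competes with every mixture in $\mathcal{C}$ up to a term of order $1/k$.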
The resulting estimate is guaranteed to have almost as high a likelihood as is achieved by any mixture density.

A disadvantage of the greedy procedure is that it may take a number of steps to adequately downweight poor initial choices. Thus it is advisable at each step to re-tune the weights of the convex combinations of previous components (and perhaps even to adjust the locations of these components), in which case the result from the previous iteration (with $k-1$ components) provides a natural initialization for the search at step $k$. The good news is that the bound continues to hold as long as, for each $k$, given $\hat{f}_{k-1}$, the estimate $\hat{f}_k$ is chosen among $k$-component mixtures to achieve likelihood at least as large as that of the choice achieving $\max_\theta \sum_{i=1}^n \log\big[(1-\alpha_k)\hat{f}_{k-1}(X_i) + \alpha_k\,\phi_\theta(X_i)\big]$.
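The re-tuning suggested above is itself a cheap subproblem: with the component parameters frozen, EM over the mixing weights alone is monotone in likelihood and only touches the convex combination. A minimal sketch (NumPy assumed; `retune_weights` is our illustrative name, not a routine from the paper):

```python
import numpy as np

def phi(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def retune_weights(data, means, sigmas, weights, n_em=200):
    """EM over the mixing weights only; component densities stay fixed."""
    comp = np.array([phi(data, m, s) for m, s in zip(means, sigmas)])  # (k, n)
    w = np.asarray(weights, dtype=float)
    for _ in range(n_em):
        joint = w[:, None] * comp              # w_j * phi_j(X_i)
        resp = joint / joint.sum(axis=0)       # E-step: responsibilities
        w = resp.mean(axis=1)                  # M-step: new mixing weights
    return w

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0, 1, 800), rng.normal(5, 1, 200)])
w = retune_weights(data, means=[0.0, 5.0], sigmas=[1.0, 1.0], weights=[0.5, 0.5])
```

Starting from uniform weights, the re-tuned weight on the first component moves close to the true proportion 0.8 here, since the two components are well separated.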