{"title": "Variational Minimax Estimation of Discrete Distributions under KL Loss", "book": "Advances in Neural Information Processing Systems", "page_first": 1033, "page_last": 1040, "abstract": null, "full_text": " Variational minimax estimation of discrete distributions under KL loss

Liam Paninski
Gatsby Computational Neuroscience Unit
University College London
liam@gatsby.ucl.ac.uk
http://www.gatsby.ucl.ac.uk/liam

Abstract

We develop a family of upper and lower bounds on the worst-case expected KL loss for estimating a discrete distribution on a finite number m of points, given N i.i.d. samples. Our upper bounds are approximation-theoretic, similar to recent bounds for estimating discrete entropy; the lower bounds are Bayesian, based on averages of the KL loss under Dirichlet distributions. The upper bounds are convex in their parameters and thus can be minimized by descent methods to provide estimators with low worst-case error; the lower bounds are indexed by a one-dimensional parameter and are thus easily maximized. Asymptotic analysis of the bounds demonstrates the uniform KL-consistency of a wide class of estimators as c = N/m \to \infty (no matter how slowly), and shows that no estimator is consistent for c bounded (in contrast to entropy estimation). Moreover, the bounds are asymptotically tight as c \to 0 or c \to \infty, and are shown numerically to be tight within a factor of two for all c. Finally, in the sparse-data limit c \to 0, we find that the Dirichlet-Bayes (add-constant) estimator with parameter scaling like -c \log(c) optimizes both the upper and lower bounds, suggesting an optimal choice of the \"add-constant\" parameter in this regime.

Introduction

The estimation of discrete distributions given finite data -- \"histogram smoothing\" -- is a canonical problem in statistics and is of fundamental importance in applications to language modeling, informatics, and safari organization (1-3). 
In particular, estimation of discrete distributions under Kullback-Leibler (KL) loss is of basic interest in the coding community, in the context of two-step universal codes (4, 5). The problem has received significant attention from a variety of statistical viewpoints (see, e.g., (6) and references therein); in this work, we will focus on the \"minimax\" approach, that is, on developing estimators which work well even in the worst case, with the performance of an estimator measured by the average KL loss. The recent work of (7) and (8) has answered many of the important asymptotic questions in the heavily-sampled limit, where the number of data samples, N, is much larger than the number of support points, m, of the unknown distribution; in particular, the optimal (minimax) error rate has been identified in closed form in the case that m is fixed and N \to \infty, and a simple estimator that asymptotically achieves this optimum has been described. Our goal here is to analyze further the opposite case, when N/m is bounded or even small (the sparse data case). It will turn out that the estimators which are asymptotically optimal as N/m \to \infty are far from optimal in this sparse data case, which may be considered more important for applications to modeling of large dictionaries.

Much of our approach is influenced by the similarities to the entropy estimation problem (9-11), where the sparse data regime is also important for applications and of independent mathematical interest: how do we decide how much probability to assign to bins for which no samples, or very few samples, are observed? 
We will emphasize the similarities (and important differences) between these two problems throughout.

Upper bounds

The basic idea is to find a simple upper bound on the worst-case expected loss, and then to minimize this upper bound over some tractable class of possible estimators; the resulting optimized estimator will then be guaranteed to possess good worst-case properties. Clearly we want this upper bound to be as tight as possible, and the space of allowed estimators to be as large as possible, while still allowing easy minimization. The approach taken here is to develop bounds which are convex in the estimator, and to allow the estimators to range over a large convex space; this implies that the minimization problem is tractable by descent methods, since no non-global local minima exist.

We begin by defining the class of estimators we will be minimizing over: \hat{p} of the form

\hat{p}_i = \frac{g(n_i)}{\sum_{k=1}^m g(n_k)},

with n_i defined as the number of samples observed in bin i and the constants g_j \equiv g(j) taking values in the (N+1)-dimensional convex space g_j \geq 0; note that normalization of the estimated distribution is automatically enforced. 
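As a concrete illustration, the estimator class just defined (normalized weights g(n_i)) takes only a few lines to implement. This is our own sketch, not code from the paper; the function name and the Krichevsky-Trofimov example weights are illustrative choices:

```python
from collections import Counter

def histogram_estimator(samples, m, g):
    """Estimate a distribution on bins 0..m-1 from i.i.d. samples,
    given a weight g[j] for each possible count j = 0..N.
    Implements p_hat_i = g(n_i) / sum_k g(n_k)."""
    counts = Counter(samples)
    weights = [g[counts.get(i, 0)] for i in range(m)]
    total = sum(weights)  # normalization is automatic
    return [w / total for w in weights]

# Example: add-1/2 (Krichevsky-Trofimov) weights g_j = j + 1/2.
samples = [0, 0, 1, 2, 2, 2]
N, m = len(samples), 4
g = [j + 0.5 for j in range(N + 1)]
p_hat = histogram_estimator(samples, m, g)
```

Any add-constant estimator is the special case g_j = j + alpha, up to the overall normalization, which cancels in the ratio.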
The \"add-constant\" estimators,

g_j = \frac{j + \alpha}{N + \alpha m}, \quad \alpha > 0,

are an important special case (7).

After some rearrangement, the expected KL loss for these estimators satisfies

E_p(L(p, \hat{p})) = E_p \sum_{i=1}^m p_i \log \frac{p_i}{\hat{p}_i}

= \sum_i \Big( -H(p_i) + \sum_{j=0}^N (-\log g_j)\, p_i B_{N,j}(p_i) \Big) + E_p \log \sum_{k=1}^m g(n_k)

\leq \sum_i \Big( -H(p_i) + \sum_j (-\log g_j)\, p_i B_{N,j}(p_i) \Big) + E_p \Big( -1 + \sum_k g(n_k) \Big)

= \sum_i f(p_i)

(using \log x \leq x - 1 in the middle step); we have abbreviated p the true underlying distribution, the entropy function

H(t) = -t \log t,

the binomial functions

B_{N,j}(t) = \binom{N}{j} t^j (1-t)^{N-j},

and

f(t) = -H(t) - t + \sum_j (g_j - t \log g_j) B_{N,j}(t).

Equality holds iff \sum_k g(n_k) is constant almost surely (as is the case, e.g., for any add-constant estimator).

We have two distinct simple bounds on the above: first, the obvious

\sum_{i=1}^m f(p_i) \leq m \max_{0 \leq t \leq 1} f(t),

which generalizes the bound considered in (7) (where a similar bound was derived asymptotically as N \to \infty for m fixed, and applied only to the add-constant estimators), or

\sum_i f(p_i) \leq m \max_{0 \leq t \leq 1/m} f(t) + \max_{1/m \leq t \leq 1} \frac{f(t)}{t},

which follows easily from \sum_i p_i = 1; see (11) for a proof. The above maxima are always achieved, by the compactness of the intervals and the continuity of the binomial and entropy functions. Again, the key point is that these bounds are uniform over all possible underlying p (that is, they bound the worst-case error).

Why two bounds? The first is nearly tight for N >> m (it is actually asymptotically possible to replace m with m - 1 in this limit, due to the fact that the p_i must sum to one; see (7, 8)), but grows linearly with m and thus cannot be tight for m comparable to or larger than N. In particular, the optimizer doesn't depend on m, only N (and hence the bound can't help but behave linearly in m). 
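For intuition, the first bound m max f(t) is easy to evaluate numerically for any given weight sequence. The sketch below is ours (not from the paper), and uses a crude grid maximum in place of an exact maximization over t; it evaluates f(t) for add-constant weights:

```python
import math

def f_add_constant(t, N, m, alpha):
    """f(t) = -H(t) - t + sum_j (g_j - t*log(g_j)) * B_{N,j}(t),
    with add-constant weights g_j = (j + alpha) / (N + alpha*m)."""
    H = -t * math.log(t) if t > 0 else 0.0
    s = 0.0
    for j in range(N + 1):
        g = (j + alpha) / (N + alpha * m)
        b = math.comb(N, j) * t**j * (1 - t)**(N - j)  # binomial B_{N,j}(t)
        s += (g - t * math.log(g)) * b
    return -H - t + s

# First upper bound on worst-case KL loss: m * max_{0<=t<=1} f(t),
# approximated here on a grid.
N, m, alpha = 20, 5, 0.5
grid = [i / 1000 for i in range(1, 1001)]
bound = m * max(f_add_constant(t, N, m, alpha) for t in grid)
```

Note how the bound scales with m but the maximized function does not depend on m (beyond the normalization of g_j), which is the linear-in-m behavior discussed above.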
The second bound is much more useful (and, as we show below, tight) in the data-sparse regime N << m.

The resulting minimization problems have a polynomial approximation flavor: we are trying to find an optimal set of weights g_j such that the sum in the definition of f(t) (a polynomial in t) will be as close to H(t) + t as possible. In this sense our approach is nearly identical to that recently followed for bounding the bias in the entropy estimation case (11, 12). There are three key differences, however: the term penalizing the variance in the entropy case is missing here, the approximation only has to be good from above, not from below as well (both making the problem easier), and the approximation is nonlinear, instead of linear, in g_j (making the problem harder). Indeed, we will see below that the entropy estimation problem is qualitatively easier than the estimation of the full distribution, despite the entropic form of the KL loss.

Smooth minimization algorithm

In the next subsections, we develop methods for minimizing these bounds as a function of g_j (that is, for choosing estimators with good worst-case properties). The first key point is that the bounds involve maxima over a collection of convex functions of g_j, and hence the bounds are convex in g_j; since the coefficients g_j take values in a convex set, no non-global local minima exist, and the global minimum can be found by simple descent procedures.

One complicating factor is that the bounds are nondifferentiable in g_j: while methods for direct minimization of this type of L_\infty error exist (13), they require that we track the location in t of the maximal error; since this argmax can jump discontinuously as a function of g_j, this interior maximization loop can be time-consuming. A more efficient solution is given by approximating this nondifferentiable objective function by smooth functions which retain the convexity of the original objective. 
We employ a Laplace approximation (albeit in a different direction than usual): use the fact that

\max_{t \in A} h(t) = \lim_{q \to \infty} \frac{1}{q} \log \int_A e^{q h(t)}\,dt

for continuous h(t) and compact A; thus, letting h(t) = f(t), we can minimize

U_q(\{g_j\}) \equiv \int_0^1 e^{q f(t)}\,dt,

or

V_q(\{g_j\}) \equiv \log \int_0^{1/m} e^{q m f(t)}\,dt + \log \int_{1/m}^1 e^{q f(t)/t}\,dt,

for q increasing; these new objective functions are smooth, with easily-computable gradients, and are still convex, since f(t) is convex in g_j, convexity is preserved under convex, increasing maps (i.e., the exponential), and sums of convex functions are convex. (In fact, since U_q is strictly convex in g for any q, the minima are unique, which to our knowledge is not necessarily the case for the original minimax problem.) It is easy to show that any limit point of the sequence of minimizers of the above problems will minimize the original problem; applying conjugate gradient descent for each q, with the previous minimizer as the seed for the minimization at the next largest q, worked well in practice.

Initialization; connection to Laplace estimator

It is now useful to look for suitable starting points for the minimization. For example, for the first bound, approximate the maximum by an integral, that is, find g_j to minimize

m \int_0^1 \Big( -H(t) - t + \sum_j (g_j - t \log g_j) B_{N,j}(t) \Big)\,dt.

(Note that this can be thought of as the limit of the above U_q minimization problem as q \to 0, as can be seen by expanding the exponential.) The g_j that minimizes this approximation to the upper bound is trivially derived as

g_j = \frac{\int_0^1 t B_{N,j}(t)\,dt}{\int_0^1 B_{N,j}(t)\,dt} = \frac{\beta(j+2, N-j+1)}{\beta(j+1, N-j+1)} = \frac{j+1}{N+2},

with \beta(a, b) = \int_0^1 t^{a-1} (1-t)^{b-1}\,dt defined as usual. The resulting estimator \hat{p} agrees exactly with \"Laplace's estimator,\" the add-\alpha estimator with \alpha = 1. 
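The closed form g_j = (j+1)/(N+2) can be checked numerically against the ratio of beta functions; a minimal sketch (helper name ours), computing the beta function via log-gamma for stability:

```python
import math

def beta_fn(a, b):
    # beta(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b), via lgamma
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))

# q -> 0 minimizer: g_j = beta(j+2, N-j+1) / beta(j+1, N-j+1),
# which should equal (j+1)/(N+2) for every j.
N = 10
g = [beta_fn(j + 2, N - j + 1) / beta_fn(j + 1, N - j + 1)
     for j in range(N + 1)]
```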
Note, though, that to derive this g_j, we completely ignore the first two terms (-H(t) - t) in the upper bound, and the resulting estimator can therefore be expected to be suboptimal (in particular, the g_j will be chosen too large, since -H(t) - t is strictly decreasing for t < 1). Indeed, we find that add-\alpha estimators with \alpha < 1 provide a much better starting point for the optimization, as expected given (7, 8). (Of course, for N/m large enough an asymptotically optimal estimator is given by the perturbed add-constant estimator of (8), and none of this numerical optimization is necessary.) In the limit as c = N/m \to 0, we will see below that a better initialization point is the add-\alpha estimator with parameter \alpha = H(c) = -c \log c.

Fixed-point algorithm

On examining the gradient of the above problems with respect to g_j, a fixed-point algorithm may be derived. We have, for example, that

\frac{\partial U_q}{\partial g_j} = q \int_0^1 \Big( 1 - \frac{t}{g_j} \Big) e^{q f(t)} B_{N,j}(t)\,dt;

thus, analogously to the q \to 0 case above, a simple update is given by

g_j^1 = \frac{\int_0^1 t\, e^{q f^0(t)} B_{N,j}(t)\,dt}{\int_0^1 e^{q f^0(t)} B_{N,j}(t)\,dt},

which effectively corresponds to taking the mean of the binomial function B_{N,j}, weighted by the \"importance\" term e^{q f^0(t)}, which in turn is controlled by the proximity of t to the maximum of f^0(t) for q large. While this is an attractive strategy, conjugate gradient descent proved to be a more stable algorithm in our hands.

Lower bounds

Once we have found an estimator with good worst-case error, we want to compare its performance to some well-defined optimum. To do this, we obtain lower bounds on the worst-case performance of any estimator (not just the class of \hat{p} we minimized over in the last section). Once again, we will derive a family of bounds indexed by some parameter \alpha, and then optimize over \alpha.

Our lower bounds are based on the well-known fact that, for any proper prior distribution, the average (Bayesian) loss is less than or equal to the maximum (worst-case) loss. 
The most convenient class of priors to use here are the Dirichlet priors. Thus we will compute the average KL error under any Dirichlet distribution (interesting in its own right), then maximize over the possible Dirichlet priors (that is, find the \"least favorable\" Dirichlet prior) to obtain the tightest lower bound on the worst-case error; importantly, the resulting bounds will be nonasymptotic (that is, valid for all N and m). This approach therefore generalizes the asymptotic lower bound used in (7), who examined the KL loss under the special case of the uniform Dirichlet prior. See also (4) for a direct application of this idea to bound the average code length, and (14), who derived a lower bound on the average KL loss, again in the uniform Dirichlet case.

We compute the Bayes error as follows. First, it is well-known (e.g., (9, 14)) that the KL-Bayes estimate of p given count data n (under any prior, not just the Dirichlet) is the posterior mean (interestingly, the KL loss shares this property with the squared error); for the Dirichlet prior with parameter \alpha, this conditional mean has the particularly simple form

E_{Dir(\alpha|n)} p_i = \frac{\alpha_i + n_i}{\sum_k (\alpha_k + n_k)},

with Dir(\alpha|n) denoting the Dir(\alpha) density conditioned on data n. Second, it is straightforward to show (14) that the conditional average KL error, given this estimate, has an appealing form: the entropy at the conditional mean minus the conditional mean entropy (one can easily check the strict positivity of this average error via the concavity of the vector entropy function H(p) = -\sum_i p_i \log p_i). 
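In code, the Dirichlet posterior mean (the KL-Bayes estimate) is a one-liner; a minimal sketch (function name ours), with a constant alpha vector recovering the familiar add-constant form:

```python
def dirichlet_posterior_mean(alpha, counts):
    """KL-Bayes estimate under a Dirichlet(alpha) prior:
    E[p_i | n] = (alpha_i + n_i) / sum_k (alpha_k + n_k)."""
    total = sum(alpha) + sum(counts)
    return [(a + n) / total for a, n in zip(alpha, counts)]

# Constant alpha = 1/2 on m = 3 bins: the add-1/2 estimator.
p_bayes = dirichlet_posterior_mean([0.5, 0.5, 0.5], [2, 0, 1])
```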
Thus we can write the average loss as

E_{Dir(\alpha)} H\Big( \frac{\alpha + n}{N + \sum_i \alpha_i} \Big) - E_{Dir(\alpha|n)} H(p) = \sum_i \Big[ E\, H\Big( \frac{\alpha_i + n_i}{N + \sum_i \alpha_i} \Big) - E_{Dir(\alpha + n)} H(p_i) \Big],

where the inner averages over p are under the Dirichlet distribution and the outer averages over n and n_i are under the corresponding Dirichlet-multinomial or Dirichlet-binomial mixtures (i.e., multinomials whose parameter p is itself Dirichlet distributed); we have used linearity of the expectation, \sum_i n_i = N, and Dir(\alpha|n) = Dir(\alpha + n). Evaluating the right-hand side of the above, in turn, requires the formula

-E_{Dir(\alpha)} H(p_i) = \frac{\alpha_i}{\sum_i \alpha_i} \Big( \psi(\alpha_i + 1) - \psi\Big(1 + \sum_i \alpha_i\Big) \Big),

with \psi(t) = \frac{d}{dt} \log \Gamma(t); recall that \psi(t+1) = \psi(t) + \frac{1}{t}. All of the above may thus be easily computed numerically for any N, m, and \alpha; to simplify, however, we will restrict \alpha to be constant, \alpha = (\alpha, \alpha, \ldots, \alpha). This symmetrizes the above formulae; we can replace the outer sum with multiplication by m, and substitute \sum_i \alpha_i = \alpha m. Finally, abbreviating K = N + \alpha m, we have that the worst-case error is bounded below by:

m \sum_{j=0}^N p_{\alpha,m,N}(j)\, \frac{j + \alpha}{K} \Big( -\log \frac{j + \alpha}{K} + \psi(j + \alpha) + \frac{1}{j + \alpha} - \psi(K) - \frac{1}{K} \Big), \qquad (1)

with p_{\alpha,m,N}(j) the beta-binomial distribution

p_{\alpha,m,N}(j) = \binom{N}{j} \frac{\Gamma(\alpha m)\, \Gamma(j + \alpha)\, \Gamma(K - (j + \alpha))}{\Gamma(K)\, \Gamma(\alpha)\, \Gamma(\alpha(m-1))}.

This lower bound is valid for all N, m, and \alpha, and can be optimized numerically in the (scalar) parameter \alpha in a straightforward manner.

Asymptotic analysis

In this section, we aim to understand some of the implications of the rather complicated expressions above, by analyzing them in some simplifying limits. Due to space constraints, we can only sketch the proof of each of the following statements.

Proposition 1. Any add-\alpha estimator, \alpha > 0, is uniformly KL-consistent if N/m \to \infty.

This is a simple generalization of a result of (7), who proved consistency for the special case of m fixed and N \to \infty; the main point here is that N/m is allowed to tend to infinity arbitrarily slowly. 
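(As an aside, the lower bound (1) above is simple to evaluate numerically, which is how the least-favorable \alpha is found in practice. The sketch below is ours: it uses log-gamma for the beta-binomial weights, and a standard recurrence-plus-asymptotic-series digamma since Python's math module does not provide one.)

```python
import math

def digamma(x):
    """psi(x) via the recurrence psi(x) = psi(x+1) - 1/x and an
    asymptotic series; adequate accuracy for x > 0."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - inv2 * (
        1/12 - inv2 * (1/120 - inv2 / 252))

def log_beta_binom(j, N, m, alpha):
    """log p_{alpha,m,N}(j), with K = N + alpha*m."""
    K = N + alpha * m
    return (math.lgamma(N + 1) - math.lgamma(j + 1) - math.lgamma(N - j + 1)
            + math.lgamma(alpha * m) + math.lgamma(j + alpha)
            + math.lgamma(K - (j + alpha))
            - math.lgamma(K) - math.lgamma(alpha)
            - math.lgamma(alpha * (m - 1)))

def dirichlet_lower_bound(N, m, alpha):
    """Evaluate the lower bound (1) for constant Dirichlet parameter alpha."""
    K = N + alpha * m
    total = 0.0
    for j in range(N + 1):
        t = (j + alpha) / K
        val = (-math.log(t) + digamma(j + alpha) + 1 / (j + alpha)
               - digamma(K) - 1 / K)
        total += math.exp(log_beta_binom(j, N, m, alpha)) * t * val
    return m * total
```

Maximizing this over the scalar alpha (by any 1-D search) then gives the tightest Dirichlet bound for a given N and m.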
The result follows on utilizing our first upper bound (the main difference between our analysis and that of (7) is that our bound holds for all m, N, whereas (7) focuses on the asymptotic case) and noting that \max_{0 \leq t \leq 1} f(t) = O(1/N) for the f(t) defined by any add-constant estimator; hence our upper bound is uniformly O(m/N). To obtain the O(1/N) bound, we plug in the add-constant g_j = (j + \alpha)/N:

f(t) = \frac{\alpha}{N} + t \log t - t \sum_j \log\Big( \frac{j + \alpha}{N} \Big) B_{N,j}(t).

For t fixed, an application of the delta method implies that the sum looks like \log(t + \frac{\alpha}{N}) - \frac{1-t}{2Nt}; an expansion of the logarithm, in turn, implies that the right-hand side converges to \frac{1}{2N}(1 - t), for any fixed t > 0. On a 1/N scale, on the other hand, we have

N f\Big( \frac{t}{N} \Big) = \alpha + t \log t - t \sum_j \log(j + \alpha)\, B_{N,j}\Big( \frac{t}{N} \Big),

which can be uniformly bounded above. In fact, as demonstrated by (7), the binomial sum on the right-hand side converges to the corresponding Poisson sum; interestingly, a similar Poisson sum plays a key role in the analysis of the entropy estimation case in (12).

A converse follows easily from the lower bounds developed above:

Proposition 2. No estimator is uniformly KL-consistent if lim sup N/m < \infty.

Of course, it is intuitively clear that we need many more than m samples to estimate a distribution on m bins; our contribution here is a quantitative asymptotic lower bound on the error in the data-sparse regime. (A simpler but slightly weaker asymptotic bound may be developed from the lower bound given in (14).) Once again, we contrast with the entropy estimation case, where consistent estimators do exist in this regime (12).

We let N, m \to \infty, N/m \to c, 0 < c < \infty. The beta-binomial distribution has mean N/m and converges to a non-degenerate limit, which we'll denote p_{\alpha,c}, in this regime. 
Using Fatou's lemma and \psi(t) = \log(t) - \frac{1}{2t} + O(t^{-2}), t \to \infty, we obtain the asymptotic lower bound

\sum_{j=0}^\infty p_{\alpha,c}(j)\, \frac{\alpha + j}{c + \alpha} \Big( -\log(\alpha + j) + \psi(\alpha + j) + \frac{1}{\alpha + j} \Big) > 0.

Also interestingly, it is easy to see that our lower bound behaves as \frac{m-1}{2N}(1 + o(1)) as N/m \to \infty for any fixed positive \alpha (since in this case \sum_{j=0}^k p_{\alpha,m,N}(j) \to 0 for any fixed finite k). Thus, comparing to the upper bound on the minimax error in (8), we have the somewhat surprising fact that:

[Figure 1 appears here: three panels plotted against c = N/m on a logarithmic axis from 10^{-4} to 10^{1}; panel contents are described in the caption below.]

Figure 1: Illustration of bounds and asymptotic results. N = 100, m varying. a. Numerically- and theoretically-obtained optimal (least-favorable) \alpha, as a function of c = N/m; note close agreement. b. Numerical lower bounds and theoretical approximations; note the log-linear growth as c \to 0. The j = 0 approximation is obtained by retaining only the j = 0 term of the sum in the lower bound (1); this approximation turns out to be sufficiently accurate in the c \to 0 limit, while the (m-1)/2N approximation is tight as c \to \infty. c. Ratio comparison of upper to lower bounds. The dashed curve is the ratio obtained by plugging the asymptotically optimal estimator due to Braess-Sauer (8) into our upper bound; the solid-dotted curve, the numerically least-favorable Dirichlet estimator; the black solid curve, the optimized estimator. Note that the curves for the optimized and Braess-Sauer estimators are in constant proportion, since the bounds are independent of m for c large enough. Most importantly, note that the optimized bounds are everywhere tight within a factor of 2, and asymptotically tight as c \to \infty or c \to 0.

Proposition 3. 
Any fixed-\alpha Dirichlet prior is asymptotically least-favorable as N/m \to \infty.

This generalizes Theorem 2 of (7) (and in fact, an alternate proof can be constructed on close examination of Krichevskiy's proof of that result).

Finally, we examine the optimizers of the bounds in the data-sparse limit, c = N/m \to 0.

Proposition 4. The least-favorable Dirichlet parameter is given by \alpha = H(c) as c \to 0; the corresponding Bayes estimator also asymptotically minimizes the upper bound (and hence the bounds are asymptotically tight in this limit). The maximal and average errors grow as -\log(c)(1 + o(1)), c \to 0.

This is our most important asymptotic result. It suggests a simple and interesting rule of thumb for estimating distributions in this data-sparse limit: use the add-\alpha estimator with \alpha = H(c). When the data are very sparse (c sufficiently small) this estimator is optimal; see Fig. 1 for an illustration. The proof, which is longer than those of the above results but still fairly straightforward, has been omitted due to space constraints.

Discussion

We have omitted a detailed discussion of the form of the estimators which numerically minimize the upper bounds developed here; these estimators were empirically found to be perturbed add-constant estimators, with g_j growing linearly for large j but perturbed downward in the approximate range j < 10. Interestingly, in the heavily-sampled limit N >> m, the minimizing estimator provided by (8) again turns out to be a perturbed add-constant estimator. Further details will be provided elsewhere.

We note an interesting connection to the results of (9), who find that 1/m scaling of the add-constant parameter is empirically optimal for an entropy estimation application with large m. This 1/m scaling bears some resemblance to the optimal H(c) scaling that we find here, at least on a logarithmic scale (Fig. 1a); however, it is easy to see that the extra -\log(c) term included here is useful. 
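The rule of thumb from Proposition 4 is trivial to implement; a sketch (function name ours), which simply plugs alpha = -c log c into an add-constant estimator:

```python
import math
from collections import Counter

def sparse_add_constant_estimate(samples, m):
    """Add-alpha estimate with the sparse-data choice
    alpha = H(c) = -c*log(c), c = N/m (assumes 0 < c < 1)."""
    N = len(samples)
    c = N / m
    alpha = -c * math.log(c)
    counts = Counter(samples)
    denom = N + alpha * m
    return [(counts.get(i, 0) + alpha) / denom for i in range(m)]

# Sparse example: N = 10 samples on m = 1000 bins (c = 0.01).
samples = [0, 0, 1, 2, 3, 4, 5, 6, 7, 8]
p = sparse_add_constant_estimate(samples, len(samples) * 100)
```

Note that every unseen bin receives mass alpha/(N + alpha*m), so the total mass reserved for unseen bins shrinks only like -log(c) times slower than it would under a 1/m scaling of alpha.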
As argued in (3), it is a good idea, in the data-sparse limit N << m, to assign substantial probability mass to bins which have not seen any data samples. Since the total probability assigned to these bins by any add-\alpha estimator scales in this limit as P(unseen) = \alpha m / (N + \alpha m), it is clear that the choice \alpha \sim 1/m decays too quickly.

Finally, we note an important direction for future research: the upper bounds developed here turn out to be least tight in the range N \approx m, when the optimum in the bound occurs near t = 1/m; in this case, our bounds can be loose by roughly a factor of two (exactly the degree of looseness we found in Fig. 1c). Thus it would be quite worthwhile to explore upper bounds which are tight in this N \approx m range.

Acknowledgements: We thank Z. Ghahramani and D. Mackay for helpful conversations; LP is supported by an International Research Fellowship from the Royal Society.

References

1. D. Mackay, L. Peto, Natural Language Engineering 1, 289 (1995).

2. N. Friedman, Y. Singer, NIPS (1998).

3. A. Orlitsky, N. Santhanam, J. Zhang, Science 302, 427 (2003).

4. T. Cover, IEEE Transactions on Information Theory 18, 216 (1972).

5. R. Krichevsky, V. Trofimov, IEEE Transactions on Information Theory 27, 199 (1981).

6. D. Braess, H. Dette, Sankhya 66, 707 (2004).

7. R. Krichevsky, IEEE Transactions on Information Theory 44, 296 (1998).

8. D. Braess, T. Sauer, Journal of Approximation Theory 128, 187 (2004).

9. T. Schurmann, P. Grassberger, Chaos 6, 414 (1996).

10. I. Nemenman, F. Shafee, W. Bialek, NIPS 14 (2002).

11. L. Paninski, Neural Computation 15, 1191 (2003).

12. L. Paninski, IEEE Transactions on Information Theory 50, 2200 (2004).

13. G. Watson, Approximation theory and numerical methods (Wiley, Boston, 1980).

14. D. Braess, J. Forster, T. Sauer, H. 
Simon, Algorithmic Learning Theory 13, 380 (2002).\n\n\f\n", "award": [], "sourceid": 2586, "authors": [{"given_name": "Liam", "family_name": "Paninski", "institution": null}]}