Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)
We develop a family of upper and lower bounds on the worst-case ex- pected KL loss for estimating a discrete distribution on a finite number m of points, given N i.i.d. samples. Our upper bounds are approximation- theoretic, similar to recent bounds for estimating discrete entropy; the lower bounds are Bayesian, based on averages of the KL loss under Dirichlet distributions. The upper bounds are convex in their parameters and thus can be minimized by descent methods to provide estimators with low worst-case error; the lower bounds are indexed by a one-dimensional parameter and are thus easily maximized. Asymptotic analysis of the bounds demonstrates the uniform KL-consistency of a wide class of es- timators as c = N/m (no matter how slowly), and shows that no estimator is consistent for c bounded (in contrast to entropy estima- tion). Moreover, the bounds are asymptotically tight as c 0 or , and are shown numerically to be tight within a factor of two for all c. Finally, in the sparse-data limit c 0, we find that the Dirichlet-Bayes (add-constant) estimator with parameter scaling like -c log(c) optimizes both the upper and lower bounds, suggesting an optimal choice of the "add-constant" parameter in this regime.