{"title": "Competitive Distribution Estimation: Why is Good-Turing Good", "book": "Advances in Neural Information Processing Systems", "page_first": 2143, "page_last": 2151, "abstract": "Estimating distributions over large alphabets is a fundamental machine-learning tenet. Yet no method is known to estimate all distributions well. For example, add-constant estimators are nearly min-max optimal but often perform poorly in practice, and practical estimators such as absolute discounting, Jelinek-Mercer, and Good-Turing are not known to be near optimal for essentially any distribution.We describe the first universally near-optimal probability estimators. For every discrete distribution, they are provably nearly the best in the following two competitive ways. First they estimate every distribution nearly as well as the best estimator designed with prior knowledge of the distribution up to a permutation. Second, they estimate every distribution nearly as well as the best estimator designed with prior knowledge of the exact distribution, but as all natural estimators, restricted to assign the same probability to all symbols appearing the same number of times.Specifically, for distributions over $k$ symbols and $n$ samples, we show that for both comparisons, a simple variant of Good-Turing estimator is always within KL divergence of $(3+o(1))/n^{1/3}$ from the best estimator, and that a more involved estimator is within $\\tilde{\\mathcal{O}}(\\min(k/n,1/\\sqrt n))$. 
Conversely, we show that any estimator must have a KL divergence $\\ge\\tilde\\Omega(\\min(k/n,1/n^{2/3}))$ over the best estimator for the first comparison, and $\\ge\\tilde\\Omega(\\min(k/n,1/\\sqrt{n}))$ for the second.", "full_text": "Competitive Distribution Estimation:\n\nWhy is Good-Turing Good\n\nAlon Orlitsky\nUC San Diego\nalon@ucsd.edu\n\nAnanda Theertha Suresh\nUC San Diego\nasuresh@ucsd.edu\n\nAbstract\n\nEstimating distributions over large alphabets is a fundamental machine-learning tenet. Yet no method is known to estimate all distributions well. For example, add-constant estimators are nearly min-max optimal but often perform poorly in practice, and practical estimators such as absolute discounting, Jelinek-Mercer, and Good-Turing are not known to be near optimal for essentially any distribution.\nWe describe the first universally near-optimal probability estimators. For every discrete distribution, they are provably nearly the best in the following two competitive ways. First they estimate every distribution nearly as well as the best estimator designed with prior knowledge of the distribution up to a permutation. Second, they estimate every distribution nearly as well as the best estimator designed with prior knowledge of the exact distribution, but as all natural estimators, restricted to assign the same probability to all symbols appearing the same number of times.\nSpecifically, for distributions over k symbols and n samples, we show that for both comparisons, a simple variant of Good-Turing estimator is always within KL divergence of (3 + o_n(1))/n^{1/3} from the best estimator, and that a more involved estimator is within Õ_n(min(k/n, 1/√n)). 
Conversely, we show that any estimator must have a KL divergence at least Ω̃_n(min(k/n, 1/n^{2/3})) over the best estimator for the first comparison, and at least Ω̃_n(min(k/n, 1/√n)) for the second.\n\n1 Introduction\n\n1.1 Background\n\nMany learning applications, ranging from language-processing staples such as speech recognition and machine translation to biological studies in virology and bioinformatics, call for estimating large discrete distributions from their samples. Probability estimation over large alphabets has therefore long been the subject of extensive research, both by practitioners deriving practical estimators [1, 2], and by theorists searching for optimal estimators [3].\nYet even after all this work, provably-optimal estimators remain elusive. The add-constant estimators frequently analyzed by theoreticians are nearly min-max optimal, yet perform poorly for many practical distributions, while common practical estimators, such as absolute discounting [4], Jelinek-Mercer [5], and Good-Turing [6], are not well understood and lack provable performance guarantees.\nTo understand the terminology and approach a solution we need a few definitions. The performance of an estimator q for an underlying distribution p is typically evaluated in terms of the Kullback-Leibler (KL) divergence [7],\n\nD(p||q) def= Σ_x p_x log(p_x/q_x),\n\nreflecting the expected increase in the ambiguity about the outcome of p when it is approximated by q. KL divergence is also the increase in the number of bits over the entropy that q uses to compress the output of p, and is also the log-loss of estimating p by q. It is therefore of interest to construct estimators that approximate a large class of distributions to within small KL divergence. 
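For concreteness, D(p||q) can be computed directly from the two distributions. A minimal sketch (ours, not the paper's; the three-symbol distribution is a made-up example):

```python
import math

def kl_divergence(p, q):
    """D(p||q) = sum over x of p_x * log(p_x / q_x), in nats.
    Requires q_x > 0 wherever p_x > 0; terms with p_x = 0 contribute 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.3, 0.2]         # underlying distribution
q = [1/3, 1/3, 1/3]         # uniform estimate of p
print(kl_divergence(p, p))  # 0.0: exact knowledge incurs no loss
print(kl_divergence(p, q))  # ~0.069 nats: the uniform estimate's loss
```

Note the asymmetry: D(p||q) penalizes an estimate q_x that is much smaller than p_x, which is why estimators must not assign vanishing probability to symbols the source may emit.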
We now describe one of the problem's simplest formulations.\n\n1.2 Min-max loss\n\nA distribution estimator over a support set X associates with any observed sample sequence x∗ ∈ X∗ a distribution q(x∗) over X. Given n samples X^n def= X1, X2, . . . , Xn, generated independently according to a distribution p over X, the expected KL loss of q is\n\nr_n(q, p) = E_{X^n∼p^n}[D(p||q(X^n))].\n\nLet P be a known collection of distributions over a discrete set X. The worst-case loss of an estimator q over all distributions in P is\n\nr_n(q, P) def= max_{p∈P} r_n(q, p),   (1)\n\nand the lowest worst-case loss for P, achieved by the best estimator, is the min-max loss\n\nr_n(P) def= min_q r_n(q, P) = min_q max_{p∈P} r_n(q, p).   (2)\n\nMin-max performance can be viewed as regret relative to an oracle that knows the underlying distribution. Hence from here on we refer to it as regret.\nThe most natural and important collection of distributions, and the one we study here, is the set of all discrete distributions over an alphabet of some size k, which without loss of generality we assume to be [k] = {1, 2, . . . , k}. Hence the set of all distributions is the simplex in k dimensions, Δk def= {(p1, . . . , pk) : p_i ≥ 0 and Σ p_i = 1}. Following [8], researchers have studied r_n(Δk) and related quantities, for example see [9]. We outline some of the results derived.\n\n1.3 Add-constant estimators\n\nThe add-β estimator assigns to a symbol that appeared t times a probability proportional to t + β. For example, if three coin tosses yield one heads and two tails, the add-1/2 estimator assigns probability 1.5/(1.5 + 2.5) = 3/8 to heads, and 2.5/(1.5 + 2.5) = 5/8 to tails. 
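The add-β rule can be sketched as follows, reproducing the coin example from the text. A minimal illustration (the function name is ours):

```python
def add_beta_estimate(counts, beta, k):
    """Add-beta estimator over an alphabet of k symbols: a symbol observed t times
    gets probability (t + beta) / (n + k*beta), where n is the total sample size."""
    n = sum(counts.values())
    denom = n + k * beta
    return {x: (counts.get(x, 0) + beta) / denom for x in range(k)}

# Three coin tosses: symbol 0 ("heads") once, symbol 1 ("tails") twice.
q = add_beta_estimate({0: 1, 1: 2}, beta=0.5, k=2)
print(q[0], q[1])  # 0.375 and 0.625, i.e. the 3/8 and 5/8 of the add-1/2 example
```

Setting beta=1 gives the Laplace estimator and beta=0.5 the Krichevsky-Trofimov estimator discussed in Section 4.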
[10] showed that for every k, as n → ∞, an estimator related to add-3/4 is near optimal and achieves\n\nr_n(Δk) = ((k − 1)/(2n)) · (1 + o(1)).   (3)\n\nThe more challenging, and practical, regime is where the sample size n is not overwhelmingly larger than the alphabet size k. For example in English text processing, we need to estimate the distribution of words following a context. But the number of times a context appears in a corpus may not be much larger than the vocabulary size. Several results are known for other regimes as well. When the sample size n is linear in the alphabet size k, r_n(Δk) can be shown to be a constant, and [3] showed that as k/n → ∞, add-constant estimators achieve the optimal\n\nr_n(Δk) = log(k/n) · (1 + o(1)).   (4)\n\nWhile add-constant estimators are nearly min-max optimal, the distributions attaining the min-max regret are near uniform. In practice, large-alphabet distributions are rarely uniform, and instead, tend to follow a power-law. For these distributions, add-constant estimators under-perform the estimators described in the next subsection.\n\n1.4 Practical estimators\n\nFor real applications, practitioners tend to use more sophisticated estimators, with better empirical performance. These include the Jelinek-Mercer estimator that cross-validates the sample to find the best fit for the observed data. Or the absolute-discounting estimators that rather than add a positive constant to each count, do the opposite, and subtract a positive constant.\nPerhaps the most popular and enduring have been the Good-Turing estimator [6] and some of its variations. Let n_x def= n_x(x^n) be the number of times a symbol x appears in x^n and let φ_t def= φ_t(x^n) be the number of symbols appearing t times in x^n. The basic Good-Turing estimator posits that if n_x = t,\n\nq_x(x^n) = (φ_{t+1}/φ_t) · (t + 1)/n,\n\nsurprisingly relating the probability of an element not just to the number of times it was observed, but also to the number of other elements appearing as many, and one more, times. It is easy to see that this basic version of the estimator may not work well, as for example it assigns any element appearing ≥ n/2 times probability 0. Hence in practice the estimator is modified, for example, using the empirical frequency for elements appearing many times.\nThe Good-Turing estimator was published in 1953, and quickly adapted for language-modeling use, but for half a century no proofs of its performance were known. Following [11], several papers, e.g., [12, 13], showed that Good-Turing variants estimate the combined probability of symbols appearing any given number of times with accuracy that does not depend on the alphabet size, and [14] showed that a different variation of Good-Turing similarly estimates the probabilities of each previously-observed symbol, and all unseen symbols combined.\nHowever, these results do not explain why Good-Turing estimators work well for the actual probability estimation problem, that of estimating the probability of each element, not of the combination of elements appearing a certain number of times. To define and derive uniformly-optimal estimators, we take a different, competitive, approach.\n\n2 Competitive optimality\n\n2.1 Overview\n\nTo evaluate an estimator, we compare its performance to the best possible performance of two estimators designed with some prior knowledge of the underlying distribution. 
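The basic Good-Turing rule of Section 1.4 can be sketched as follows. A minimal illustration (ours, not the authors' implementation), which also exhibits the failure mode noted in the text when φ_{t+1} = 0:

```python
from collections import Counter

def good_turing(sample, x):
    """Basic Good-Turing: a symbol appearing t times gets (phi_{t+1} / phi_t) * (t+1) / n,
    where phi_t is the number of symbols appearing exactly t times in the sample."""
    n = len(sample)
    counts = Counter(sample)
    phi = Counter(counts.values())  # phi[t] defaults to 0 for unseen multiplicities
    t = counts[x]
    return (phi[t + 1] / phi[t]) * (t + 1) / n

sample = ['a', 'b', 'c', 'a', 'b', 'd', 'e']
# 'c' appears once; phi_1 = 3 (c, d, e) and phi_2 = 2 (a, b), so q_c = (2/3) * 2/7
print(good_turing(sample, 'c'))  # 4/21, about 0.1905
# 'a' appears twice and phi_3 = 0, so q_a = 0: the breakdown the text describes
print(good_turing(sample, 'a'))  # 0.0
```

This is why practical variants fall back to the empirical frequency for frequently appearing symbols.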
The first estimator is designed with knowledge of the underlying distribution up to a permutation of the probabilities, namely knowledge of the probability multiset, e.g., {.5, .3, .2}, but not of the association between probabilities and symbols. The second estimator is designed with exact knowledge of the distribution, but like all natural estimators, forced to assign the same probabilities to symbols appearing the same number of times. For example, upon observing the sample a, b, c, a, b, d, e, the estimator must assign the same probability to a and b, and the same probability to c, d, and e.\nThese estimators cannot be implemented in practice as in reality we do not have prior knowledge of the estimated distribution. But the prior information is chosen to allow us to determine the best performance of any estimator designed with that information, which in turn is better than the performance of any data-driven estimator designed without prior information. We then show that certain variations of the Good-Turing estimators, designed without any prior knowledge, approach the performance of both prior-knowledge estimators for every underlying distribution.\n\n2.2 Competing with near full information\n\nWe first define the performance of an oracle-aided estimator, designed with some knowledge of the underlying distribution. Suppose that the estimator is designed with the aid of an oracle that knows the value of f(p) for some given function f over the class Δk of distributions.\nThe function f partitions Δk into subsets, each corresponding to one possible value of f. We denote the subsets by P, and the partition by 𝒫, and as before, denote the individual distributions by p. Then the oracle knows the unique partition part P such that p ∈ P ∈ 𝒫. 
For example, if f(p) is the multiset of p, then each subset P corresponds to the set of distributions with the same probability multiset, and the oracle knows the multiset of probabilities.\nFor every partition part P ∈ 𝒫, an estimator q incurs the worst-case regret in (1),\n\nr_n(q, P) = max_{p∈P} r_n(q, p).\n\nThe oracle, knowing the unique partition part P, incurs the least worst-case regret (2),\n\nr_n(P) = min_q r_n(q, P).\n\nThe competitive regret of q over the oracle, for all distributions in P is\n\nr_n(q, P) − r_n(P),\n\nthe competitive regret over all partition parts and all distributions in each is\n\nr^𝒫_n(q, Δk) def= max_{P∈𝒫} (r_n(q, P) − r_n(P)),\n\nand the best possible competitive regret is\n\nr^𝒫_n(Δk) def= min_q r^𝒫_n(q, Δk).\n\nConsolidating the intermediate definitions,\n\nr^𝒫_n(Δk) = min_q max_{P∈𝒫} (max_{p∈P} r_n(q, p) − r_n(P)).\n\nNamely, an oracle-aided estimator that knows the partition part incurs a worst-case regret r_n(P) over each part P, and the competitive regret r^𝒫_n(Δk) of data-driven estimators is the least overall increase in the part-wise regret due to not knowing P. In Appendix A.1, we give a few examples of such partitions.\nA partition 𝒫′ refines a partition 𝒫 if every part in 𝒫 is partitioned by some parts in 𝒫′. For example {{a, b}, {c}, {d, e}} refines {{a, b, c}, {d, e}}. In Appendix A.2, we show that if 𝒫′ refines 𝒫 then for every q\n\nr^𝒫′_n(q, Δk) ≥ r^𝒫_n(q, Δk).   (5)\n\nConsidering the collection Δk of all distributions over [k], it follows that as we start with the single-part partition {Δk} and keep refining it till the oracle knows p, the competitive regret of estimators will increase from 0 to r_n(q, Δk). 
A natural question is therefore how much information can the oracle have and still keep the competitive regret low? We show that the oracle can know the distribution exactly up to permutation, and still the regret will be very small.\nTwo distributions p and p′ are permutation equivalent if for some permutation σ of [k],\n\np′_{σ(i)} = p_i for all 1 ≤ i ≤ k.\n\nFor example, (0.5, 0.3, 0.2) and (0.3, 0.5, 0.2) are permutation equivalent. Permutation equivalence is clearly an equivalence relation, and hence partitions the collection of distributions over [k] into equivalence classes. Let Pσ be the corresponding partition. We construct estimators q that uniformly bound r^Pσ_n(q, Δk), thus the same estimator uniformly bounds r^𝒫_n(q, Δk) for any coarser partition 𝒫 of Δk, such as partitions into classes of distributions with the same support size, or entropy. Note that the partition Pσ corresponds to knowing the underlying distribution up to permutation, hence r^Pσ_n(Δk) is the additional KL loss compared to an estimator designed with knowledge of the underlying distribution up to permutation.\nThis notion of competitiveness has appeared in several contexts. In data compression it is called twice-redundancy [15, 16, 17, 18], while in statistics it is often called adaptive or local min-max [19, 20, 21, 22, 23], and recently in property testing it is referred to as competitive [24, 25, 26] or instance-by-instance [27]. Subsequent to this work, [28] studied competitive estimation in ℓ1 distance, however their regret is poly(1/log n), compared to our Õ(1/√n).\n\n2.3 Competing with natural estimators\n\nOur second comparison is with an estimator designed with exact knowledge of p, but forced to be natural, namely, to assign the same probability to all symbols appearing the same number of times in the sample. 
For example, for the observed sample a, b, c, a, b, d, e, the same probability must be assigned to a and b, and the same probability to c, d, and e. Since data-driven estimators derive all their knowledge of the distribution from the data, we expect them to be natural.\nWe compare the regret of data-driven estimators to that of natural oracle-aided estimators. Let Qnat be the set of all natural estimators. For a distribution p, the lowest regret of a natural estimator, designed with prior knowledge of p is\n\nr^nat_n(p) def= min_{q∈Qnat} r_n(q, p),\n\nand the regret of an estimator q relative to the least-regret natural-estimator is\n\nr^nat_n(q, p) = r_n(q, p) − r^nat_n(p).\n\nThus the regret of an estimator q over all distributions in Δk is\n\nr^nat_n(q, Δk) = max_{p∈Δk} r^nat_n(q, p),\n\nand the best possible competitive regret is r^nat_n(Δk) = min_q r^nat_n(q, Δk).\nIn the next section we state the results, showing in particular that r^nat_n(Δk) is uniformly bounded. In Section 5, we outline the proofs, and in Section 4 we describe experiments comparing the performance of competitive estimators to that of min-max motivated estimators.\n\n3 Results\n\nGood-Turing estimators are often used in conjunction with empirical frequency, where Good-Turing estimates low probabilities and empirical frequency estimates large probabilities. We first show that even this simple Good-Turing version, defined in Appendix C and denoted q′, is uniformly optimal for all distributions. For simplicity we prove the result when the number of samples is n′ ∼ poi(n), a Poisson random variable with mean n. Let r^Pσ_poi(n)(q′, Δk) and r^nat_poi(n)(q′, Δk) be the regrets in this sampling process. A similar result holds with exactly n samples, but the proof is more involved as the multiplicities are dependent.\nTheorem 1 (Appendix C). For any k and n,\n\nr^Pσ_poi(n)(q′, Δk) ≤ r^nat_poi(n)(q′, Δk) ≤ (3 + o_n(1))/n^{1/3}.\n\nFurthermore, a lower bound in [13] shows that this bound is optimal up to logarithmic factors.\nA more complex variant of Good-Turing, denoted q″, was proposed in [13]. We show that its regret diminishes uniformly in both the partial-information and natural-estimator formulations.\nTheorem 2 (Section 5). For any k and n,\n\nr^Pσ_n(q″, Δk) ≤ r^nat_n(q″, Δk) ≤ Õ_n(min(k/n, 1/√n)).\n\nHere Õ_n, and below also Ω̃_n, hide multiplicative logarithmic factors in n. Lemma 6 in Section 5 and a lower bound in [13] can be combined to prove a matching lower bound on the competitive regret of any estimator for the second formulation,\n\nr^nat_n(Δk) ≥ Ω̃_n(min(k/n, 1/√n)).\n\nHence q″ has near-optimal competitive regret relative to natural estimators.\nFano's inequality usually yields lower bounds on KL loss, not regret. By carefully constructing distribution classes, we lower bound the competitive regret relative to the oracle-aided estimators.\nTheorem 3 (Appendix D). For any k and n,\n\nr^Pσ_n(Δk) ≥ Ω̃_n(min(k/n, 1/n^{2/3})).\n\n3.1 Illustration and implications\n\nFigure 1 demonstrates some of the results. The horizontal axis reflects the set Δk of distributions illustrated on one dimension. The vertical axis indicates the KL loss, or absolute regret; for clarity, it is shown for k ≫ n. The blue line is the previously-known min-max upper bound on the regret, which by (4) is very high for this regime, log(k/n). 
The red line is the regret of the estimator designed with prior knowledge of the probability multiset. Observe that while for some probability multisets the regret approaches the log(k/n) min-max upper bound, for other probability multisets it is much lower, and for some, such as uniform over 1 or over k symbols, where the probability multiset determines the distribution, it is even 0. For many practically relevant distributions, such as power-law distributions and sparse distributions, the regret is small compared to log(k/n). The green line is an upper bound on the absolute regret of the data-driven estimator q″. By Theorem 2, it is always at most 1/√n larger than the red line. It follows that for many distributions, possibly for distributions with more structure, such as those occurring in nature, the regret of q″ is significantly smaller than the pessimistic min-max bound implies.\n\nFigure 1: Qualitative behavior of the KL loss as a function of distributions in different formulations. (The plot shows KL loss against the distributions in Δk, with the flat min-max bound r_n(Δk) = log(k/n), the varying oracle regret that reaches 0 at the uniform distribution, and the data-driven estimator tracking the oracle to within Õ(min(1/√n, k/n)).)\n\nWe observe a few consequences of these results.\n• Theorems 1 and 2 establish two uniformly-optimal estimators q′ and q″. Their relative regrets diminish to zero at least as fast as 1/n^{1/3} and 1/√n respectively, independent of how large the alphabet size k is.\n• Although the results are for relative regret, as shown in Figure 1, they lead to estimators with smaller absolute regret, namely, the expected KL divergence.\n• The same regret upper bounds hold for all coarser partitions of Δk, i.e., where instead of knowing the multiset, the oracle knows some property of the multiset such as entropy.\n\n4 Experiments\n\nRecall that for a sequence x^n, n_x denotes the number of times a symbol x appears and φ_t denotes the number of symbols appearing t times. For small values of n and k, the estimator proposed in [13] simplifies to a combination of Good-Turing and empirical estimators. By [13, Lemmas 10 and 11], for symbols appearing t times, if φ_{t+1} ≥ Ω̃(t), then the Good-Turing estimate is close to the underlying total probability mass, otherwise the empirical estimate is closer. Hence, for a symbol appearing t times, if φ_{t+1} ≥ t we use the Good-Turing estimator, otherwise we use the empirical estimator. If n_x = t,\n\nq_x = t/N if t > φ_{t+1}, and q_x = ((φ_{t+1} + 1)/φ_t) · (t + 1)/N otherwise,\n\nwhere N is a normalization factor. Note that we have replaced φ_{t+1} in the Good-Turing estimator by φ_{t+1} + 1 to ensure that every symbol is assigned a non-zero probability.\n\nFigure 2: Simulation results for support 10000, number of samples ranging from 1000 to 50000, averaged over 200 trials. Panels: (a) Uniform, (b) Step, (c) Zipf with parameter 1, (d) Zipf with parameter 1.5, (e) Uniform prior (Dirichlet 1), (f) Dirichlet 1/2 prior.\n\nWe compare the performance of this estimator to four estimators: three popular add-β estimators and the optimal natural estimator. 
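The combined Good-Turing + empirical rule above can be sketched as follows. A minimal illustration (ours, not the authors' code); in particular, treating φ_0 as the number of unseen symbols is our assumption, and N is obtained by normalizing at the end:

```python
from collections import Counter

def gt_plus_empirical(sample, alphabet):
    """Good-Turing + empirical: a symbol seen t times gets unnormalized score t if
    t > phi_{t+1} (empirical branch), else (phi_{t+1} + 1)/phi_t * (t + 1);
    the scores are then divided by their sum N so the estimate is a distribution."""
    counts = Counter(sample)
    phi = Counter(counts.values())                  # phi[t] = #symbols appearing t times
    phi[0] = max(len(alphabet) - len(counts), 1)    # unseen symbols (our convention)
    scores = {}
    for x in alphabet:
        t = counts[x]
        if t > phi[t + 1]:
            scores[x] = t                                    # empirical branch
        else:
            scores[x] = (phi[t + 1] + 1) / phi[t] * (t + 1)  # smoothed Good-Turing
    N = sum(scores.values())
    return {x: s / N for x, s in scores.items()}

q = gt_plus_empirical(['a', 'b', 'c', 'a', 'b', 'd', 'e'], alphabet='abcdefg')
assert abs(sum(q.values()) - 1) < 1e-12 and min(q.values()) > 0
```

The φ_{t+1} + 1 smoothing mirrors the text: it keeps every symbol, including unseen ones, at strictly positive probability, so the KL loss stays finite.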
An add-beta estimator Ŝ has the form\n\nq^Ŝ_x = (n_x + β^Ŝ_{n_x})/N(Ŝ),\n\nwhere N(Ŝ) is a normalization factor to ensure that the probabilities add up to 1. The Laplace estimator, β^L_t = 1 ∀ t, minimizes the expected loss when the underlying distribution is generated by a uniform prior over Δk. The Krichevsky-Trofimov estimator, β^KT_t = 1/2 ∀ t, is asymptotically min-max optimal for the cumulative regret, and minimizes the expected loss when the underlying distribution is generated according to a Dirichlet-1/2 prior. The Braess-Sauer estimator, β^BS_0 = 1/2, β^BS_1 = 1, β^BS_t = 3/4 ∀ t > 1, is asymptotically min-max optimal for r_n(Δk). Finally, as shown in Lemma 10, the optimal estimator q_x = S_{n_x}/φ_{n_x} achieves the lowest loss of any natural estimator designed with knowledge of the underlying distribution.\nWe compare the performance of the proposed estimator to that of the four estimators above. We consider six distributions: the uniform distribution, a step distribution with half the symbols having probability 1/2k and the other half having probability 3/2k, the Zipf distribution with parameter 1 (p_i ∝ i^{−1}), the Zipf distribution with parameter 1.5 (p_i ∝ i^{−1.5}), a distribution generated by the uniform prior on Δk, and a distribution generated from the Dirichlet-1/2 prior. All distributions have support size k = 10000. n ranges from 1000 to 50000 and the results are averaged over 200 trials.\nFigure 2 shows the results. Observe that the proposed estimator performs similarly to the best natural estimator for all six distributions. It also significantly outperforms the other estimators for Zipf, uniform, and step distributions.\nThe performance of other estimators depends on the underlying distribution. 
For example, since Laplace is the optimal estimator when the underlying distribution is generated from the uniform prior, it performs well in Figure 2(e), however it performs poorly on other distributions.\nFurthermore, even though for distributions generated by Dirichlet priors all the estimators have similar-looking regrets (Figures 2(e), 2(f)), the proposed estimator performs better than estimators which are not designed specifically for that prior.\n\n5 Proof sketch of Theorem 2\n\nThe proof consists of two parts. We first show that for every estimator q, r^Pσ_n(q, Δk) ≤ r^nat_n(q, Δk), and then upper bound r^nat_n(q, Δk) using results on combined probability mass.\nLemma 4 (Appendix B.1). For every estimator q,\n\nr^Pσ_n(q, Δk) ≤ r^nat_n(q, Δk).\n\nThe proof of the above lemma relies on showing that the optimal estimator for every class P ∈ Pσ is natural.\n\n5.1 Relation between r^nat_n(q, Δk) and combined probability estimation\n\nWe now relate the regret in estimating the distribution to that of estimating the combined or total probability mass, defined as follows. Recall that φ_t denotes the number of symbols appearing t times. For a sequence x^n, let S_t def= S_t(x^n) denote the total probability of symbols appearing t times. For notational convenience, we use S_t to denote both S_t(x^n) and S_t(X^n) and the usage becomes clear in the context. Similar to the KL divergence between distributions, we define the KL divergence between S and its estimate Ŝ as\n\nD(S||Ŝ) = Σ_{t=0}^{n} S_t log(S_t/Ŝ_t).\n\nSince a natural estimator assigns the same probability to symbols that appear the same number of times, estimating probabilities is the same as estimating the total probability of symbols appearing a given number of times. We formalize this in the next lemma.\nLemma 5 (Appendix B.2). For a natural estimator q let Ŝ_t(x^n) = Σ_{x:n_x=t} q_x(x^n), then\n\nr^nat_n(q, p) = E[D(S||Ŝ)].\n\nIn Lemma 11 (Appendix B.3), we show that there is a natural estimator that achieves r^nat_n(Δk). Taking the maximum over all distributions p and the minimum over all estimators q results in\nLemma 6. For a natural estimator q let Ŝ_t(x^n) = Σ_{x:n_x=t} q_x(x^n), then\n\nr^nat_n(q, Δk) = max_{p∈Δk} E[D(S||Ŝ)].\n\nFurthermore,\n\nr^nat_n(Δk) = min_Ŝ max_{p∈Δk} E[D(S||Ŝ)].\n\nThus finding the best competitive natural estimator is the same as finding the best estimator for the combined probability mass S. [13] proposed an algorithm for estimating S such that for all k and for all p ∈ Δk, with probability ≥ 1 − 1/n,\n\nD(S||Ŝ) = Õ_n(1/√n).\n\nThe result is stated in Theorem 2 of [13]. One can convert this result to a result on expectation easily using the property that their estimator is bounded below by 1/2n and show that\n\nmax_{p∈Δk} E[D(S||Ŝ)] = Õ_n(1/√n).\n\nA slight modification of their proofs for Lemma 17 and Theorem 2 in their paper, using Σ_{t=1}^n √φ_t = Õ(√n) and Σ_{t=1}^n φ_t ≤ k, shows that their estimator Ŝ for the combined probability mass S satisfies\n\nmax_{p∈Δk} E[D(S||Ŝ)] = Õ_n(min(1/√n, k/n)).\n\nThe above equation together with Lemmas 4 and 6 results in Theorem 2.\n\n6 Acknowledgements\n\nWe thank Jayadev Acharya, Moein Falahatgar, Paul Ginsparg, Ashkan Jafarpour, Mesrob Ohannessian, Venkatadheeraj Pichapati, Yihong Wu, and the anonymous reviewers for helpful comments.\n\nReferences\n[1] William A. Gale and Geoffrey Sampson. Good-Turing frequency estimation without tears. Journal of Quantitative Linguistics, 2(3):217–237, 1995.\n[2] S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. In ACL, 1996.\n[3] Liam Paninski. Variational minimax estimation of discrete distributions under KL loss. In NIPS, 2004.\n[4] Hermann Ney, Ute Essen, and Reinhard Kneser. 
On structuring probabilistic dependences in stochastic language modelling. Computer Speech & Language, 8(1):1–38, 1994.\n[5] Fredrick Jelinek and Robert L. Mercer. Probability distribution estimation from sparse data. IBM Tech. Disclosure Bull., 1984.\n[6] Irving J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40(3-4):237–264, 1953.\n[7] Thomas M. Cover and Joy A. Thomas. Elements of information theory (2. ed.). Wiley, 2006.\n[8] R. Krichevsky. Universal Compression and Retrieval. Dordrecht, The Netherlands: Kluwer, 1994.\n[9] Sudeep Kamath, Alon Orlitsky, Dheeraj Pichapati, and Ananda Theertha Suresh. On learning distributions from their samples. In COLT, 2015.\n[10] Dietrich Braess and Thomas Sauer. Bernstein polynomials and learning theory. Journal of Approximation Theory, 128(2):187–206, 2004.\n[11] David A. McAllester and Robert E. Schapire. On the convergence rate of Good-Turing estimators. In COLT, 2000.\n[12] Evgeny Drukh and Yishay Mansour. Concentration bounds for unigrams language model. In COLT, 2004.\n[13] Jayadev Acharya, Ashkan Jafarpour, Alon Orlitsky, and Ananda Theertha Suresh. Optimal probability estimation with applications to prediction and classification. In COLT, 2013.\n[14] Alon Orlitsky, Narayana P. Santhanam, and Junan Zhang. Always Good Turing: Asymptotically optimal probability estimation. In FOCS, 2003.\n[15] Boris Yakovlevich Ryabko. Twice-universal coding. Problemy Peredachi Informatsii, 1984.\n[16] Boris Yakovlevich Ryabko. Fast adaptive coding algorithm. Problemy Peredachi Informatsii, 26(4):24–37, 1990.\n[17] Dominique Bontemps, Stéphane Boucheron, and Elisabeth Gassiat. About adaptive coding on countable alphabets. IEEE Transactions on Information Theory, 60(2):808–821, 2014.\n[18] Stéphane Boucheron, Elisabeth Gassiat, and Mesrob I. Ohannessian. 
About adaptive coding on countable alphabets: Max-stable envelope classes. CoRR, abs/1402.6305, 2014.\n[19] David L. Donoho and Iain M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425–455, 1994.\n[20] Felix Abramovich, Yoav Benjamini, David L. Donoho, and Iain M. Johnstone. Adapting to unknown sparsity by controlling the false discovery rate. The Annals of Statistics, 2006.\n[21] Peter J. Bickel, Chris A. Klaassen, Ya'acov Ritov, and Jon A. Wellner. Efficient and adaptive estimation for semiparametric models. Johns Hopkins University Press, Baltimore, 1993.\n[22] Andrew Barron, Lucien Birgé, and Pascal Massart. Risk bounds for model selection via penalization. Probability Theory and Related Fields, 113(3):301–413, 1999.\n[23] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2004.\n[24] Jayadev Acharya, Hirakendu Das, Ashkan Jafarpour, Alon Orlitsky, and Shengjun Pan. Competitive closeness testing. In COLT, 2011.\n[25] Jayadev Acharya, Hirakendu Das, Ashkan Jafarpour, Alon Orlitsky, Shengjun Pan, and Ananda Theertha Suresh. Competitive classification and closeness testing. In COLT, 2012.\n[26] Jayadev Acharya, Ashkan Jafarpour, Alon Orlitsky, and Ananda Theertha Suresh. A competitive test for uniformity of monotone distributions. In AISTATS, 2013.\n[27] Gregory Valiant and Paul Valiant. An automatic inequality prover and instance optimal identity testing. In FOCS, 2014.\n[28] Gregory Valiant and Paul Valiant. Instance optimal learning. CoRR, abs/1504.05321, 2015.\n[29] Michael Mitzenmacher and Eli Upfal. Probability and computing: Randomized algorithms and probabilistic analysis. 
Cambridge University Press, 2005.\n", "award": [], "sourceid": 1277, "authors": [{"given_name": "Alon", "family_name": "Orlitsky", "institution": "University of California, San Diego"}, {"given_name": "Ananda Theertha", "family_name": "Suresh", "institution": "UCSD"}]}