{"title": "Estimating the Unseen: Improved Estimators for Entropy and other Properties", "book": "Advances in Neural Information Processing Systems", "page_first": 2157, "page_last": 2165, "abstract": "Recently, [Valiant and Valiant] showed that a class of distributional properties, which includes such practically relevant properties as entropy, the number of distinct elements, and distance metrics between pairs of distributions, can be estimated given a sublinear sized sample. Specifically, given a sample consisting of independent draws from any distribution over at most n distinct elements, these properties can be estimated accurately using a sample of size O(n / log n). We propose a novel modification of this approach and show: 1) theoretically, our estimator is optimal (to constant factors, over worst-case instances), and 2) in practice, it performs exceptionally well for a variety of estimation tasks, on a variety of natural distributions, for a wide range of parameters. Perhaps unsurprisingly, the key step in this approach is to first use the sample to characterize the \"unseen\" portion of the distribution. This goes beyond such tools as the Good-Turing frequency estimation scheme, which estimates the total probability mass of the unobserved portion of the distribution: we seek to estimate the \"shape\" of the unobserved portion of the distribution. This approach is robust, general, and theoretically principled; we expect that it may be fruitfully used as a component within larger machine learning and data analysis systems. 
", "full_text": "Estimating the Unseen: Improved Estimators for Entropy and other Properties\n\nGregory Valiant \u2217\nStanford University\nStanford, CA 94305\nvaliant@stanford.edu\n\nPaul Valiant \u2020\nBrown University\nProvidence, RI 02912\npvaliant@gmail.com\n\nAbstract\n\nRecently, Valiant and Valiant [1, 2] showed that a class of distributional properties,\nwhich includes such practically relevant properties as entropy, the number\nof distinct elements, and distance metrics between pairs of distributions, can be\nestimated given a sublinear sized sample. Specifically, given a sample consisting\nof independent draws from any distribution over at most n distinct elements, these\nproperties can be estimated accurately using a sample of size O(n/log n). We\npropose a novel modification of this approach and show: 1) theoretically, this estimator\nis optimal (to constant factors, over worst-case instances), and 2) in practice,\nit performs exceptionally well for a variety of estimation tasks, on a variety of natural\ndistributions, for a wide range of parameters. Perhaps unsurprisingly, the key\nstep in our approach is to first use the sample to characterize the \u201cunseen\u201d portion\nof the distribution. This goes beyond such tools as the Good-Turing frequency\nestimation scheme, which estimates the total probability mass of the unobserved\nportion of the distribution: we seek to estimate the shape of the unobserved portion\nof the distribution. This approach is robust, general, and theoretically principled;\nwe expect that it may be fruitfully used as a component within larger machine\nlearning and data analysis systems.\n\n1 Introduction\n\nWhat can one infer about an unknown distribution based on a random sample? 
If the distribution\nin question is relatively \u201csimple\u201d in comparison to the sample size\u2014for example if our sample\nconsists of 1000 independent draws from a distribution supported on 100 domain elements\u2014then\nthe empirical distribution given by the sample will likely be an accurate representation of the true\ndistribution. If, on the other hand, we are given a relatively small sample in relation to the size\nand complexity of the distribution\u2014for example a sample of size 100 drawn from a distribution\nsupported on 1000 domain elements\u2014then the empirical distribution may be a poor approximation\nof the true distribution. In this case, can one still extract accurate estimates of various properties of\nthe true distribution?\nMany real\u2013world machine learning and data analysis tasks face this challenge; indeed there are\nmany large datasets where the data only represent a tiny fraction of an underlying distribution we\nhope to understand. This challenge of inferring properties of a distribution given a \u201ctoo small\u201d\nsample is encountered in a variety of settings, including text data (typically, no matter how large the\ncorpus, around 30% of the observed vocabulary only occurs once), customer data (many customers\nor website users are only seen a small number of times), the analysis of neural spike trains [15],\n\n\u2217http://theory.stanford.edu/~valiant/ A portion of this work was done while at Microsoft Research.\n\u2020http://cs.brown.edu/people/pvaliant/\n\n1\n\n\fand the study of genetic mutations across a population1. Additionally, many database management\ntasks employ sampling techniques to optimize query execution; improved estimators would allow\nfor either smaller sample sizes or increased accuracy, leading to improved ef\ufb01ciency of the database\nsystem (see, e.g. [6, 7]).\nWe introduce a general and robust approach for using a sample to characterize the \u201cunseen\u201d portion\nof the distribution. 
Without any a priori assumptions about the distribution, one cannot know what\nthe unseen domain elements are. Nevertheless, one can still hope to estimate the \u201cshape\u201d or histogram\nof the unseen portion of the distribution\u2014essentially, we estimate how many unseen domain\nelements occur in various probability ranges. Given such a reconstruction, one can then use it to\nestimate any property of the distribution which only depends on the shape/histogram; such properties\nare termed symmetric and include entropy and support size. In light of the long history of\nwork on estimating entropy by the neuroscience, statistics, computer science, and information theory\ncommunities, it is compelling that our approach (which is agnostic to the property in question)\noutperforms these entropy-specific estimators.\nAdditionally, we extend this intuition to develop estimators for properties of pairs of distributions,\nthe most important of which are the distance metrics. We demonstrate that our approach can accurately\nestimate the total variation distance (also known as statistical distance or \u21131 distance)\nbetween distributions using small samples. To illustrate the challenge of estimating variation distance\n(between distributions over discrete domains) given small samples, consider drawing two samples,\neach consisting of 1000 draws from a uniform distribution over 10,000 distinct elements. Each\nsample can contain at most 10% of the domain elements, and their intersection will likely contain\nonly 1% of the domain elements; yet from this, one would like to conclude that these two samples\nmust have been drawn from nearly identical distributions.\n\n1.1 Previous work: estimating distributions, and estimating properties\nThere is a long line of work on inferring information about the unseen portion of a distribution,\nbeginning with independent contributions from both R.A. 
Fisher and Alan Turing during the 1940\u2019s.\nFisher was presented with data on butter\ufb02ies collected over a 2 year expedition in Malaysia, and\nsought to estimate the number of new species that would be discovered if a second 2 year expedition\nwere conducted [8]. (His answer was \u201c\u2248 75.\u201d) At nearly the same time, as part of the British WWII\neffort to understand the statistics of the German enigma ciphers, Turing and I.J. Good were working\non the related problem of estimating the total probability mass accounted for by the unseen portion of\na distribution [9]. This resulted in the Good-Turing frequency estimation scheme, which continues\nto be employed, analyzed, and extended by our community (see, e.g. [10, 11]).\nMore recently, in similar spirit to this work, Orlitsky et al. posed the following natural question:\ngiven a sample, what distribution maximizes the likelihood of seeing the observed species frequen-\ncies, that is, the number of species observed once, twice, etc.? [12, 13] (What Orlitsky et al. term\nthe pattern of a sample, we call the \ufb01ngerprint, as in De\ufb01nition 1.) Orlitsky et al. show that such\nlikelihood maximizing distributions can be found in some speci\ufb01c settings, though the problem of\n\ufb01nding or approximating such distributions for typical patterns/\ufb01ngerprints may be dif\ufb01cult. Re-\ncently, Acharya et al. showed that this maximum likelihood approach can be used to yield a near-\noptimal algorithm for deciding whether two samples originated from identical distributions, versus\ndistributions that have large distance [14].\nIn contrast to this approach of trying to estimate the \u201cshape/histogram\u201d of a distribution, there has\nbeen nearly a century of work proposing and analyzing estimators for particular properties of distri-\nbutions. In Section 3 we describe several standard, and some recent estimators for entropy, though\nwe refer the reader to [15] for a thorough treatment. 
There is also a large literature on estimating\nsupport size (also known as the \u201cspecies problem\u201d, and the related \u201cdistinct elements\u201d problem), and\nwe refer the reader to [16] and to [17] for several hundred references.\nOver the past 15 years, the theoretical computer science community has spent signi\ufb01cant effort\ndeveloping estimators and establishing worst-case information theoretic lower bounds on the sample\nsize required for various distribution estimation tasks, including entropy and support size (e.g. [18,\n19, 20, 21]).\n\n1Three recent studies (appearing in Science last year) found that very rare genetic mutations are especially\nabundant in humans, and observed that better statistical tools are needed to characterize this \u201crare events\u201d\nregime, so as to resolve fundamental problems about our evolutionary process and selective pressures [3, 4, 5].\n\n2\n\n\fThe algorithm we present here is based on the intuition of the estimator described in our theoretical\nwork [1]. That estimator is not practically viable, and additionally, requires as input an accurate\nupper bound on the support size of the distribution in question. Both the algorithm proposed in this\ncurrent work and that of [1] employ linear programming, though these programs differ signi\ufb01cantly\n(to the extent that the linear program of [1] does not even have an objective function and simply\nde\ufb01nes a feasible region). Our proof of the theoretical guarantees in this work leverages some of\nthe machinery of [1] (in particular, the \u201cChebyshev bump construction\u201d) and achieves the same\ntheoretical worst-case optimality guarantees. See Appendix A for further theoretical and practical\ncomparisons with the estimator of [1].\n\n1.2 De\ufb01nitions and examples\nWe begin by de\ufb01ning the \ufb01ngerprint of a sample, which essentially removes all the label-information\nfrom the sample. 
For the remainder of this paper, we will work with the fingerprint of a sample,\nrather than with the sample itself.\nDefinition 1. Given a sample X = (x1, . . . , xk), the associated fingerprint, F = (F1, F2, . . .),\nis the \u201chistogram of the histogram\u201d of the sample. Formally, F is the vector whose ith component,\nFi, is the number of elements in the domain that occur exactly i times in sample X.\nFor estimating entropy, or any other property whose value is invariant to relabeling the distribution\nsupport, the fingerprint of a sample contains all the relevant information (see [21] for a formal proof\nof this fact). We note that in some of the literature, the fingerprint is alternatively termed the pattern,\nhistogram, histogram of the histogram or collision statistics of the sample.\nIn analogy with the fingerprint of a sample, we define the histogram of a distribution, a representation\nin which the labels of the domain have been removed.\nDefinition 2. The histogram of a distribution D is a mapping hD : (0, 1] \u2192 N \u222a {0}, where hD(x)\nis equal to the number of domain elements that each occur in distribution D with probability x.\nFormally, hD(x) = |{\u03b1 : D(\u03b1) = x}|, where D(\u03b1) is the probability mass that distribution D\nassigns to domain element \u03b1. We will also allow for \u201cgeneralized histograms\u201d in which hD does\nnot necessarily take integral values.\nSince h(x) denotes the number of elements that have probability x, we have \u2211_{x:h(x)\u22600} x\u00b7h(x) = 1,\nas the total probability mass of a distribution is 1. 
Any symmetric property is a function of only the\nhistogram of the distribution:\n\n\u2022 The Shannon entropy H(D) of a distribution D is defined to be\n\nH(D) := \u2212\u2211_{\u03b1\u2208sup(D)} D(\u03b1) log2 D(\u03b1) = \u2212\u2211_{x:hD(x)\u22600} hD(x) x log2 x.\n\n\u2022 The support size is the number of domain elements that occur with positive probability:\n\n|sup(D)| := |{\u03b1 : D(\u03b1) > 0}| = \u2211_{x:hD(x)\u22600} hD(x).\n\nWe provide an example to illustrate the above definitions:\nExample 3. Consider a sequence of animals, obtained as a sample from the distribution of animals\non a certain island, X = (mouse, mouse, bird, cat, mouse, bird, bird, mouse, dog, mouse). We\nhave F = (2, 0, 1, 0, 1), indicating that two species occurred exactly once (cat and dog), one species\noccurred exactly three times (bird), and one species occurred exactly five times (mouse).\nConsider the following distribution of animals:\nPr(mouse) = 1/2, Pr(bird) = 1/4, Pr(cat) = Pr(dog) = Pr(bear) = Pr(wolf) = 1/16.\nThe associated histogram of this distribution is h : (0, 1] \u2192 Z defined by h(1/16) = 4, h(1/4) = 1,\nh(1/2) = 1, and for all x \u2209 {1/16, 1/4, 1/2}, h(x) = 0.\nAs we will see in Example 5 below, the fingerprint of a sample is intimately related to the Binomial\ndistribution; the theoretical analysis will be greatly simplified by reasoning about the related Poisson\ndistribution, which we now define:\nDefinition 4. We denote the Poisson distribution of expectation \u03bb as Poi(\u03bb), and write poi(\u03bb, j) :=\ne^(\u2212\u03bb) \u03bb^j / j! to denote the probability that a random variable with distribution Poi(\u03bb) takes value j.\nExample 5. Let D be the uniform distribution with support size 1000. Then hD(1/1000) = 1000,\nand for all x \u2260 1/1000, hD(x) = 0. Let X be a sample consisting of 500 independent draws\nfrom D. 
Each element of the domain, in expectation, will occur 1/2 times in X, and thus the\nnumber of occurrences of each domain element in the sample X will be roughly distributed as\nPoi(1/2). (The exact distribution will be Binomial(500, 1/1000), though the Poisson distribution\nis an accurate approximation.) By linearity of expectation, the expected fingerprint satisfies\nE[Fi] \u2248 1000 \u00b7 poi(1/2, i). Thus we expect to see roughly 303 elements once, 76 elements twice, 13\nelements three times, etc., and in expectation 607 domain elements will not be seen at all.\n\n2 Estimating the unseen\nGiven the fingerprint F of a sample of size k, drawn from a distribution with histogram h, our high-level\napproach is to find a histogram h\u2032 that has the property that if one were to take k independent\ndraws from a distribution with histogram h\u2032, the fingerprint of the resulting sample would be similar\nto the observed fingerprint F. The hope is then that h and h\u2032 will be similar, and, in particular, have\nsimilar entropies, support sizes, etc.\nAs an illustration of this approach, suppose we are given a sample of size k = 500, with fingerprint\nF = (301, 78, 13, 1, 0, 0, . . .); recalling Example 5, we recognize that F is very similar to the\nexpected fingerprint that we would obtain if the sample had been drawn from the uniform distribution\nover support 1000. 
Although the sample only contains 393 unique domain elements, we might be\njustified in concluding that the entropy of the true distribution from which the sample was drawn is\nclose to H(Unif(1000)) = log2(1000).\nIn general, how does one obtain a \u201cplausible\u201d histogram from a fingerprint in a principled fashion?\nWe must start by understanding how to obtain a plausible fingerprint from a histogram.\nGiven a distribution D, and some domain element \u03b1 occurring with probability x = D(\u03b1), the probability\nthat it will be drawn exactly i times in k independent draws from D is Pr[Binomial(k, x) = i]\n\u2248 poi(kx, i). By linearity of expectation, the expected ith fingerprint entry will roughly satisfy\n\nE[Fi] \u2248 \u2211_{x:hD(x)\u22600} h(x) poi(kx, i).    (1)\n\nThis mapping between histograms and expected fingerprints is linear in the histogram, with coefficients\ngiven by the Poisson probabilities. Additionally, it is not hard to show that Var[Fi] \u2264 E[Fi],\nand thus the fingerprint is tightly concentrated about its expected value. This motivates a \u201cfirst moment\u201d\napproach. We will, roughly, invert the linear map from histograms to expected fingerprint\nentries, to yield a map from observed fingerprints to plausible histograms h\u2032.\nThere is one additional component of our approach. For many fingerprints, there will be a large space\nof equally plausible histograms. To illustrate, suppose we obtain fingerprint F = (10, 0, 0, 0, . . .),\nand consider the two histograms given by the uniform distributions with respective support sizes\n10,000, and 100,000. Given either distribution, the probability of obtaining the observed fingerprint\nfrom a set of 10 samples is > .99, yet these distributions are quite different and have very different\nentropy values and support sizes. 
They are both very plausible\u2013which distribution should we return?\nTo resolve this issue in a principled fashion, we strengthen our initial goal of \u201creturning a histogram\nthat could have plausibly generated the observed fingerprint\u201d: we instead return the simplest histogram\nthat could have plausibly generated the observed fingerprint. Recall the example above,\nwhere we observed only 10 distinct elements, but to explain the data we could either infer an\nadditional 9,990 unseen elements, or an additional 99,990. In this sense, inferring \u201conly\u201d 9,990\nadditional unseen elements is the simplest explanation that fits the data, in the spirit of Occam\u2019s razor.2\n\n2.1 The algorithm\nWe pose this problem of finding the simplest plausible histogram as a pair of linear programs. The\nfirst linear program will return a histogram h\u2032 that minimizes the distance between its expected\nfingerprint and the observed fingerprint, where we penalize the discrepancy between Fi and E[F_i^{h\u2032}] in\nproportion to the inverse of the standard deviation of Fi, which we estimate as 1/\u221a(1 + Fi), since\nPoisson distributions have variance equal to their expectation. The constraint that h\u2032 corresponds to\na histogram simply means that the total probability mass is 1, and all probability values are nonnegative.\nThe second linear program will then find the histogram h\u2033 of minimal support size, subject to\nthe constraint that the distance between its expected fingerprint and the observed fingerprint is not\nmuch worse than that of the histogram found by the first linear program.\n\n2The practical performance seems virtually unchanged if one returns the \u201cplausible\u201d histogram of minimal\nentropy, instead of minimal support size (see Appendix B).\n\nTo make the linear programs finite, we consider a fine mesh of values x1, . . . , x\u2113 \u2208 (0, 1] that between\nthem discretely approximate the potential support of the histogram. The variables of the linear\nprogram, h\u20321, . . . , h\u2032\u2113, will correspond to the histogram values at these mesh points, with variable h\u2032i\nrepresenting the number of domain elements that occur with probability xi, namely h\u2032(xi).\nA minor complicating issue is that this approach is designed for the challenging \u201crare events\u201d regime,\nwhere there are many domain elements each seen only a handful of times. By contrast if there is\na domain element that occurs very frequently, say with probability 1/2, then the number of times\nit occurs will be concentrated about its expectation of k/2 (and the trivial empirical estimate will\nbe accurate), though fingerprint Fk/2 will not be concentrated about its expectation, as it will take\nan integer value of either 0, 1 or 2. Hence we will split the fingerprint into the \u201ceasy\u201d and \u201chard\u201d\nportions, and use the empirical estimator for the easy portion, and our linear programming approach\nfor the hard portion. The full algorithm is below (see our websites or Appendix D for Matlab code).\n\nAlgorithm 1. ESTIMATE UNSEEN\nInput: Fingerprint F = F1, F2, . . . , Fm, derived from a sample of size k,\nvector x = x1, . . . , x\u2113 with 0 < xi \u2264 1, and error parameter \u03b1 > 0.\nOutput: List of pairs (y1, h\u2032y1), (y2, h\u2032y2), . . . , with yi \u2208 (0, 1], and h\u2032yi \u2265 0.\n\u2022 Initialize the output list of pairs to be empty, and initialize a vector F\u2032 to be equal to F.\n\u2022 For i = 1 to k,\n  \u2013 If \u2211_{j\u2208{i\u2212\u2308\u221ai\u2309,...,i+\u2308\u221ai\u2309}} Fj \u2264 2\u221ai [i.e. if the fingerprint is \u201csparse\u201d at index i]\n    Set F\u2032i = 0, and append the pair (i/k, Fi) to the output list.\n\u2022 Let vopt be the objective function value returned by running Linear Program 1 on input F\u2032, x.\n\u2022 Let h be the histogram returned by running Linear Program 2 on input F\u2032, x, vopt, \u03b1.\n\u2022 For all i s.t. hi > 0, append the pair (xi, hi) to the output list.\n\nLinear Program 1. FIND PLAUSIBLE HISTOGRAM\nInput: Fingerprint F = F1, F2, . . . , Fm, derived from a sample of size k,\nvector x = x1, . . . , x\u2113 consisting of a fine mesh of points in the interval (0, 1].\nOutput: vector h\u2032 = h\u20321, . . . , h\u2032\u2113, and objective value vopt \u2208 R.\nLet h\u20321, . . . , h\u2032\u2113 and vopt be, respectively, the solution assignment, and corresponding objective function\nvalue of the solution of the following linear program, with variables h\u20321, . . . , h\u2032\u2113:\n\nMinimize: \u2211_{i=1}^{m} (1/\u221a(1 + Fi)) \u00b7 |Fi \u2212 \u2211_{j=1}^{\u2113} h\u2032j \u00b7 poi(k xj, i)|\nSubject to: \u2211_{j=1}^{\u2113} xj h\u2032j = \u2211_i Fi/k, and \u2200j, h\u2032j \u2265 0.\n\nLinear Program 2. FIND SIMPLEST PLAUSIBLE HISTOGRAM\nInput: Fingerprint F = F1, F2, . . . , Fm, derived from a sample of size k,\nvector x = x1, . . . , x\u2113 consisting of a fine mesh of points in the interval (0, 1],\noptimal objective function value vopt from Linear Program 1, and error parameter \u03b1 > 0.\nOutput: vector h\u2032 = h\u20321, . . . , h\u2032\u2113.\nLet h\u20321, . . . , h\u2032\u2113 be the solution assignment of the following linear program, with variables h\u20321, . . . , h\u2032\u2113:\n\nMinimize: \u2211_{j=1}^{\u2113} h\u2032j\nSubject to: \u2211_{i=1}^{m} (1/\u221a(1 + Fi)) \u00b7 |Fi \u2212 \u2211_{j=1}^{\u2113} h\u2032j \u00b7 poi(k xj, i)| \u2264 vopt + \u03b1,\n\u2211_{j=1}^{\u2113} xj h\u2032j = \u2211_i Fi/k, and \u2200j, h\u2032j \u2265 0.\n\nTheorem 1. 
There exists a constant C0 > 0 and assignment of parameter \u03b1 := \u03b1(k) of Algorithm 1\nsuch that for any c > 0, for sufficiently large n, given a sample of size k = c \u00b7 n/log n consisting of\nindependent draws from a distribution D over a domain of size at most n, with probability at least\n1 \u2212 e^(\u2212n^(\u2126(1))) over the randomness in the selection of the sample, Algorithm 1 (see footnote 3), when run with a\nsufficiently fine mesh x1, . . . , x\u2113, returns a histogram h\u2032 such that |H(D) \u2212 H(h\u2032)| \u2264 C0/\u221ac.\n\n3For simplicity, we prove this statement for Algorithm 1 with the second bullet step of the algorithm modified\nas follows: there is an explicit cutoff N such that the linear programming approach is applied to fingerprint\nentries Fi for i \u2264 N, and the empirical estimate is applied to fingerprints Fi for i > N.\n\nThe above theorem characterizes the worst-case performance guarantees of the above algorithm in\nterms of entropy estimation. The proof of Theorem 1 is rather technical and we provide the complete\nproof, together with a high-level overview of the key components, in Appendix C. In fact, we prove\na stronger theorem\u2014guaranteeing that the histogram returned by Algorithm 1 is close (in a specific\nmetric) to the histogram of the true distribution; this stronger theorem then implies that Algorithm 1\ncan accurately estimate any statistical property that is sufficiently Lipschitz continuous with respect\nto the specific metric on histograms.\nThe information theoretic lower bounds of [1] show that there is some constant C1 such that for\nsufficiently large k, no algorithm can estimate the entropy of (worst-case) distributions of support\nsize n to within \u00b10.1 with any probability of success greater than 0.6 when given a sample of size at most\nk = C1 \u00b7 n/log n. Together with Theorem 1, this establishes the worst-case optimality of Algorithm 1\n(to constant factors).\n\n3 Empirical results\n\nIn this section we demonstrate that Algorithm 1 performs well in practice. We begin by briefly\ndiscussing the five entropy estimators to which we compare our estimator in Figure 1. The first\nthree are standard, and are, perhaps, the most commonly used estimators [15]. We then describe two\nrecently proposed estimators that have been shown to perform well [22].\nThe \u201cnaive\u201d estimator: the entropy of the empirical distribution, namely, given a fingerprint F\nderived from a set of k samples, H^naive(F) := \u2212\u2211_i Fi (i/k) log2 (i/k).\nThe Miller-Madow corrected estimator [23]: the naive estimator H^naive corrected to try to account\nfor the second derivative of the logarithm function, namely H^MM(F) := H^naive(F) +\n((\u2211_i Fi) \u2212 1)/(2k), though we note that the numerator of the correction term is sometimes replaced by various\nrelated quantities, see [24].\nThe jackknifed naive estimator [25, 26]: H^JK(F) := k \u00b7 H^naive(F) \u2212 ((k \u2212 1)/k) \u2211_{j=1}^{k} H^naive(F^{\u2212j}),\nwhere F^{\u2212j} is the fingerprint given by removing the contribution of the jth sample.\nThe coverage adjusted estimator (CAE) [27]: Chao and Shen proposed the CAE, which is specifically\ndesigned to apply to settings in which there is a significant component of the distribution that\nis unseen, and was shown to perform well in practice in [22].4 Given a fingerprint F derived from\na set of k samples, let Ps := 1 \u2212 F1/k be the Good\u2013Turing estimate of the probability mass of\nthe \u201cseen\u201d portion of the distribution [9]. The CAE adjusts the empirical probabilities according to\nPs, then applies the Horvitz\u2013Thompson estimator for population totals [28] to take into account the\nprobability that the elements were seen. This yields:\n\nH^CAE(F) := \u2212\u2211_i Fi \u00b7 [(i/k) Ps log2 ((i/k) Ps)] / [1 \u2212 (1 \u2212 (i/k) Ps)^k].\n\nThe Best Upper Bound estimator [15]: The final estimator to which we compare ours is the Best\nUpper Bound (BUB) estimator of Paninski. This estimator is obtained by searching for a minimax\nlinear estimator, with respect to a certain error metric. The linear estimators of [2] can be viewed\nas a variant of this estimator with provable performance bounds.5 The BUB estimator requires, as\ninput, an upper bound on the support size of the distribution from which the samples are drawn;\nif the bound provided is inaccurate, the performance degrades considerably, as was also remarked\nin [22]. In our experiments, we used Paninski\u2019s implementation of the BUB estimator (publicly\navailable on his website), with default parameters. For the distributions with finite support, we gave\nthe true support size as input, and thus we are arguably comparing our estimator to the best\u2013case\nperformance of the BUB estimator.\nSee Figure 1 for the comparison of Algorithm 1 with these estimators.\n\n4One curious weakness of the CAE is that its performance is exceptionally poor on some simple large\ninstances. Given a sample of size k from a uniform distribution over k elements, it is not hard to show that\nthe bias of the CAE is \u2126(log k). This error is not even bounded! For comparison, even the naive estimator has\nerror bounded by a constant in the limit as k \u2192 \u221e in this setting. 
This bias of the CAE is easily observed in\nour experiments as the \u201chump\u201d in the top row of Figure 1.\n5We also implemented the linear estimators of [2], though found that the BUB estimator performed better.\n\n[Figure 1 plots: a grid of panels, one row per distribution (Unif[n], MixUnif[n], Zipf[n], Zipf2[n], Geom[n], MixGeomZipf[n]) and one column per value of n (1,000; 10,000; 100,000). x-axis: Sample Size (logarithmic); y-axis: RMSE. Legend: Naive, Miller\u2212Madow, Jackknifed, CAE, BUB, Unseen.]\n\nFigure 1: Plots depicting the square root of the mean squared error (RMSE) of each entropy estimator over\n500 trials, plotted as a function of the sample size; note the logarithmic scaling of the x-axis. The samples are\ndrawn from six classes of distributions: the uniform distribution, Unif[n], that assigns probability pi = 1/n\nfor i = 1, 2, . . . , n; an even mixture of Unif[n/5] and Unif[4n/5], which assigns probability pi = 5/(2n) for\ni = 1, . . . , n/5 and probability pi = 5/(8n) for i = n/5 + 1, . . . , n; the Zipf distribution Zipf[n] that assigns\nprobability pi = (1/i) / \u2211_{j=1}^{n} (1/j) for i = 1, 2, . . . , n and is commonly used to model naturally occurring \u201cpower law\u201d\ndistributions, particularly in natural language processing; a modified Zipf distribution with power\u2013law exponent\n0.6, Zipf2[n], that assigns probability pi = (1/i^0.6) / \u2211_{j=1}^{n} (1/j^0.6) for i = 1, 2, . . . , n; the geometric distribution\nGeom[n], which has infinite support and assigns probability pi = (1/n)(1 \u2212 1/n)^i, for i = 1, 2, . . .; and\nlastly an even mixture of Geom[n/2] and Zipf[n/2]. For each distribution, we considered three settings of\nthe parameter n: n = 1,000 (left column), n = 10,000 (center column), and n = 100,000 (right column). In\neach plot, the sample size ranges over the interval [n^0.6, n^1.25].\n\nAll experiments were run in Matlab. The error parameter \u03b1 in Algorithm 1 was set to be 0.5 for all\ntrials, and the vector x = x1, x2, . . . used as the support of the returned histogram was chosen to be a coarse\ngeometric mesh, with x1 = 1/k^2, and x_i = 1.1 x_{i\u22121}. The experimental results are essentially unchanged\nif the parameter \u03b1 varied within the range [0.25, 1], or if x1 is decreased, or if the mesh is made more fine\n(see Appendix B). 
Appendix D contains our Matlab implementation of Algorithm 1 (also available from our websites).\n\nThe unseen estimator performs far better than the three standard estimators, dominates the CAE estimator for larger sample sizes and on samples from the Zipf distributions, and also dominates the BUB estimator, even for the uniform and Zipf distributions for which the BUB estimator received the true support sizes as input.\n\n[Figure 2: three panels titled Estimating Distance (d=0), Estimating Distance (d=0.5), and Estimating Distance (d=1); x-axis: Sample Size (logarithmic, 10^3 to 10^5); y-axis: Estimated L1 Distance (0 to 1); legend: Naive, Unseen.]\n\nFigure 2: Plots depicting the estimated total variation distance (\u2113_1 distance) between two uniform distributions on n = 10,000 points, in three cases: the two distributions are identical (left plot, d = 0), the supports overlap on half their domain elements (center plot, d = 0.5), and the distributions have disjoint supports (right plot, d = 1). The estimate of the distance is plotted along with error bars at plus and minus one standard deviation; our results are compared with those for the naive estimator (the distance between the empirical distributions). 
The unseen estimator can be seen to reliably distinguish between the d = 0, d = 1/2, and d = 1 cases even for samples as small as several hundred.\n\n3.1 Estimating \u2113_1 distance and number of words in Hamlet\n\nThe other two properties that we consider do not have such widely-accepted estimators as entropy, and thus our evaluation of the unseen estimator will be more qualitative. We include these two examples here because they are of a substantially different flavor from entropy estimation, and highlight the flexibility of our approach.\n\nFigure 2 shows the results of estimating the total variation distance (\u2113_1 distance). Because total variation distance is a property of two distributions instead of one, fingerprints and histograms are two-dimensional objects in this setting (see Section 4.6 of [29]), and Algorithm 1 and the linear programs are extended accordingly, replacing single indices by pairs of indices, and Poisson coefficients by corresponding products of Poisson coefficients.\n\nFinally, in contrast to the synthetic tests above, we also evaluated our estimator on a real-data problem which may be seen as emblematic of the challenges in a wide gamut of natural language processing problems: given a (contiguous) fragment of Shakespeare\u2019s Hamlet, estimate the number of distinct words in the whole play. We use this example to showcase the flexibility of our linear programming approach: our estimator can be customized to particular domains in powerful and principled ways by adding or modifying the constraints of the linear program. To estimate the histogram of word frequencies in Hamlet, we note that the play is of length \u2248 25,000, and thus the minimum probability with which any word can occur is 1/25,000. Thus in contrast to our previous approach of using Linear Program 2 to bound the support of the returned histogram, we instead simply modify the input vector x of Linear Program 1 to contain only probability values \u2265 1/25,000, and forgo running Linear Program 2. The results are plotted in Figure 3. The estimates converge towards the true value of 4268 distinct words extremely rapidly, and are slightly negatively biased, perhaps reflecting the fact that words appearing close together are correlated.\n\nIn contrast to Hamlet\u2019s charge that \u201cthere are more things in heaven and earth...than are dreamt of in your philosophy,\u201d we can say that there are almost exactly as many things in Hamlet as can be dreamt of from 10% of Hamlet.\n\n[Figure 3: a single panel titled Estimating # Distinct Words in Hamlet; x-axis: Length of Passage (0 to 2.5 x 10^4); y-axis: Estimate (0 to 8000); legend: Naive, CAE, Unseen.]\n\nFigure 3: Estimates of the total number of distinct word forms in Shakespeare\u2019s Hamlet (excluding stage directions and proper nouns) as a function of the length of the passage from which the estimate is inferred. The true value, 4268, is shown as the horizontal line.\n\nReferences\n\n[1] G. Valiant and P. Valiant. Estimating the unseen: an n/log(n)\u2013sample estimator for entropy and support size, shown optimal via new CLTs. In Symposium on Theory of Computing (STOC), 2011.\n\n[2] G. Valiant and P. Valiant. The power of linear estimators. In IEEE Symposium on Foundations of Computer Science (FOCS), 2011.\n\n[3] M. R. Nelson et al. An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people. Science, 337(6090):100\u2013104, 2012.\n\n[4] J. A. Tennessen et al. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. 
Science, 337(6090):64\u201369, 2012.\n\n[5] A. Keinan and A. G. Clark. Recent explosive human population growth has resulted in an excess of rare genetic variants. Science, 336(6082):740\u2013743, 2012.\n\n[6] F. Olken and D. Rotem. Random sampling from database files: a survey. In Proceedings of the Fifth International Workshop on Statistical and Scientific Data Management, 1990.\n\n[7] P. J. Haas, J. F. Naughton, S. Seshadri, and A. N. Swami. Selectivity and cost estimation for joins based on random sampling. Journal of Computer and System Sciences, 52(3):550\u2013569, 1996.\n\n[8] R. A. Fisher, A. Corbet, and C. B. Williams. The relation between the number of species and the number of individuals in a random sample of an animal population. Journal of the British Ecological Society, 12(1):42\u201358, 1943.\n\n[9] I. J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40(16):237\u2013264, 1953.\n\n[10] D. A. McAllester and R. E. Schapire. On the convergence rate of Good-Turing estimators. In Conference on Learning Theory (COLT), 2000.\n\n[11] A. Orlitsky, N. P. Santhanam, and J. Zhang. Always Good Turing: Asymptotically optimal probability estimation. Science, 302(5644):427\u2013431, October 2003.\n\n[12] A. Orlitsky, N. Santhanam, K. Viswanathan, and J. Zhang. On modeling profiles instead of values. Uncertainty in Artificial Intelligence, 2004.\n\n[13] J. Acharya, A. Orlitsky, and S. Pan. The maximum likelihood probability of unique-singleton, ternary, and length-7 patterns. In IEEE Symp. on Information Theory, 2009.\n\n[14] J. Acharya, H. Das, A. Orlitsky, and S. Pan. Competitive closeness testing. In COLT, 2011.\n\n[15] L. Paninski. Estimation of entropy and mutual information. Neural Comp., 15(6):1191\u20131253, 2003.\n\n[16] J. Bunge and M. Fitzpatrick. Estimating the number of species: A review. 
Journal of the American Statistical Association, 88(421):364\u2013373, 1993.\n\n[17] J. Bunge. Bibliography of references on the problem of estimating support size, available at http://www.stat.cornell.edu/~bunge/bibliography.html.\n\n[18] Z. Bar-Yossef, R. Kumar, and D. Sivakumar. Sampling algorithms: lower bounds and applications. In STOC, 2001.\n\n[19] T. Batu. Testing Properties of Distributions. Ph.D. thesis, Cornell, 2001.\n\n[20] M. Charikar, S. Chaudhuri, R. Motwani, and V. R. Narasayya. Towards estimation error guarantees for distinct values. In SODA, 2000.\n\n[21] T. Batu, L. Fortnow, R. Rubinfeld, W. D. Smith, and P. White. Testing that distributions are close. In IEEE Symposium on Foundations of Computer Science (FOCS), 2000.\n\n[22] V. Q. Vu, B. Yu, and R. E. Kass. Coverage-adjusted entropy estimation. Statistics in Medicine, 26(21):4039\u20134060, 2007.\n\n[23] G. Miller. Note on the bias of information estimates. Information Theory in Psychology II-B, ed. H. Quastler (Glencoe, IL: Free Press), pp. 95\u2013100, 1955.\n\n[24] S. Panzeri and A. Treves. Analytical estimates of limited sampling biases in different information measures. Network: Computation in Neural Systems, 7:87\u2013107, 1996.\n\n[25] S. Zahl. Jackknifing an index of diversity. Ecology, 58:907\u2013913, 1977.\n\n[26] B. Efron and C. Stein. The jackknife estimate of variance. Annals of Statistics, 9:586\u2013596, 1981.\n\n[27] A. Chao and T. J. Shen. Nonparametric estimation of Shannon\u2019s index of diversity when there are unseen species in sample. Environmental and Ecological Statistics, 10:429\u2013443, 2003.\n\n[28] D. G. Horvitz and D. J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663\u2013685, 1952.\n\n[29] P. Valiant. Testing Symmetric Properties of Distributions. SIAM J. 
Comput., 40(6):1927\u20131968, 2011.", "award": [], "sourceid": 1055, "authors": [{"given_name": "Paul", "family_name": "Valiant", "institution": "Brown University"}, {"given_name": "Gregory", "family_name": "Valiant", "institution": "Stanford University"}]}