{"title": "The Broad Optimality of Profile Maximum Likelihood", "book": "Advances in Neural Information Processing Systems", "page_first": 10991, "page_last": 11003, "abstract": "We study three fundamental statistical-learning problems: distribution estimation, property estimation, and property testing. We establish the profile maximum likelihood (PML) estimator as the first unified sample-optimal approach to a wide range of learning tasks. In particular, for every alphabet size $k$ and desired accuracy $\\varepsilon$: \\textbf{Distribution estimation} Under $\\ell_1$ distance, PML yields optimal $\\Theta(k/(\\varepsilon^2\\log k))$ sample complexity for sorted-distribution estimation, and a PML-based estimator empirically outperforms the Good-Turing estimator on the actual distribution; \\textbf{Additive property estimation} For a broad class of additive properties, the PML plug-in estimator uses just four times the sample size required by the best estimator to achieve roughly twice its error, with exponentially higher confidence; \\textbf{$\\alpha$-R\\'enyi entropy estimation} For an integer $\\alpha>1$, the PML plug-in estimator has optimal $k^{1-1/\\alpha}$ sample complexity; for non-integer $\\alpha>3/4$, the PML plug-in estimator has sample complexity lower than the state of the art; \\textbf{Identity testing} In testing whether an unknown distribution is equal to or at least $\\varepsilon$ far from a given distribution in $\\ell_1$ distance, a PML-based tester achieves the optimal sample complexity up to logarithmic factors of $k$. With minor modifications, most of these results also hold for a near-linear-time computable variant of PML.", "full_text": "The Broad Optimality of Profile Maximum Likelihood

Yi Hao, Dept. of Electrical and Computer Engineering, University of California, San Diego, yih179@ucsd.edu
Alon Orlitsky, Dept. of Electrical and Computer Engineering, University of California, San Diego, alon@ucsd.edu

Abstract

We study three fundamental statistical learning problems: distribution estimation, property estimation, and property testing. We establish the profile maximum likelihood (PML) estimator as the first unified sample-optimal approach to a wide range of learning tasks. In particular, for every alphabet size k and desired accuracy ε:

Distribution estimation: Under ℓ1 distance, PML yields the optimal Θ(k/(ε² log k)) sample complexity for sorted distribution estimation, and a PML-based estimator empirically outperforms the Good-Turing estimator on the actual distribution.
Additive property estimation: For a broad class of additive properties, the PML plug-in estimator uses just four times the sample size required by the best estimator to achieve roughly twice its error, with exponentially higher confidence.
α-Rényi entropy estimation: For an integer α > 1, the PML plug-in estimator has the optimal k^{1−1/α} sample complexity; for non-integer α > 3/4, the PML plug-in estimator has sample complexity lower than the state of the art.
Identity testing: In testing whether an unknown distribution is equal to, or at least ε far from, a given distribution in ℓ1 distance, a PML-based tester achieves the optimal sample complexity up to logarithmic factors of k.

With minor modifications, most of these results also hold for a near-linear-time computable variant of PML.

1 Introduction

A distribution p over a discrete alphabet X of size k corresponds to an element of the simplex

∆X := { p ∈ ℝ_{≥0}^k : ∑_{x∈X} p(x) = 1 }.

A distribution property is a mapping f : ∆X → ℝ
associating a real value with each distribution. A distribution property f is symmetric if it is invariant under permutations of the domain symbols. A symmetric property is additive, i.e., additively separable, if it can be written as f(p) := ∑_x f(p(x)), where for simplicity we use f to denote both the property and the corresponding real function.

Many important symmetric properties are additive. For example,

• Support size S(p) := ∑_x 1_{p(x)>0}, a fundamental quantity arising in the study of vocabulary size [29, 53, 67], population estimation [34, 52], and database studies [37];
• Support coverage Cm(p) := ∑_x (1 − (1 − p(x))^m), where m is a given parameter: the expected number of distinct elements observed in a sample of size m, arising in biological [17, 49] and ecological [17-19, 23] research;
• Shannon entropy H(p) := −∑_x p(x) log p(x), the primary measure of information [24, 66], with numerous applications to machine learning [14, 22, 63] and neuroscience [30, 51];
• Distance to uniformity D(p) := ‖p − pu‖1, where pu is the uniform distribution over X, a property central to the field of distribution property testing [10, 12, 15, 65].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Besides being additive and symmetric, these four properties have yet another attribute in common: under the appropriate interpretation, they are all 1-Lipschitz. Specifically, for two distributions p, q ∈ ∆X, let Γ_{p,q} be the collection of distributions over X × X with marginals p and q on the first and second factors, respectively.
The relative earth-mover distance [70] between p and q is

R(p, q) := inf_{γ ∈ Γ_{p,q}} E_{(X,Y)∼γ} | log( p(X)/q(Y) ) |.

One can verify [70, 71] that H, D, and C̃m := Cm/m are all 1-Lipschitz on the metric space (∆X, R), and that S̃ := S/k is 1-Lipschitz over (∆_{≥1/k}, R), the set of distributions in ∆X whose nonzero probabilities are at least 1/k. We will study all such Lipschitz properties in later sections.

An important symmetric non-additive property is Rényi entropy, a well-known measure of randomness with numerous applications to unsupervised learning [44, 77] and image registration [50, 54]. For a distribution p ∈ ∆X and a non-negative real parameter α ≠ 1, the α-Rényi entropy [64] of p is Hα(p) := (1 − α)^{−1} log(∑_x p(x)^α). In particular, the 1-Rényi entropy, H1(p) := lim_{α→1} Hα(p), is exactly Shannon entropy [64].

1.1 Problems of interest

In this work, we consider three fundamental statistical learning problems concerning the estimation and testing of distributions and their properties.

(Sorted) distribution estimation

A natural learning problem is to estimate an unknown distribution p ∈ ∆X from an i.i.d. sample Xⁿ ∼ p.
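To make the four additive properties introduced above concrete, here is a minimal plain-Python sketch evaluating each of them on a known distribution (the function names are ours, chosen for illustration; entropy is in nats):

```python
import math

def support_size(p):
    # S(p): number of symbols with positive probability
    return sum(1 for px in p if px > 0)

def support_coverage(p, m):
    # C_m(p): expected number of distinct symbols in a sample of size m
    return sum(1 - (1 - px) ** m for px in p if px > 0)

def shannon_entropy(p):
    # H(p) = -sum_x p(x) log p(x), in nats
    return -sum(px * math.log(px) for px in p if px > 0)

def distance_to_uniformity(p):
    # D(p) = ||p - p_u||_1, with p_u uniform over the same alphabet
    k = len(p)
    return sum(abs(px - 1.0 / k) for px in p)
```

Each function depends only on the multiset of probabilities, which is exactly what makes these properties symmetric.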
For any two distributions p, q ∈ ∆X, let ℓ(p, q) be the loss when we approximate p by q. A distribution estimator p̂ : X* → ∆X associates every sequence xⁿ ∈ X* with a distribution p̂(xⁿ). We measure the performance of an estimator by its sample complexity

n(p̂, ε, δ) := min{ n : ∀p ∈ ∆X, Pr_{Xⁿ∼p}( ℓ(p, p̂(Xⁿ)) ≥ ε ) ≤ δ },

the smallest sample size that p̂ requires to estimate all distributions in ∆X to a desired accuracy ε > 0, with error probability δ ∈ (0, 1). The sample complexity of distribution estimation over ∆X is

n(ε, δ) := min{ n(p̂, ε, δ) : p̂ : X* → ∆X },

the lowest sample complexity of any estimator. For simplicity, we will omit δ when δ = 1/3.

For a distribution p ∈ ∆X, we denote by {p} the multiset of its probabilities. The sorted ℓ1 distance between two distributions p, q ∈ ∆X is

ℓ1^<(p, q) := min_{p′ ∈ ∆X : {p′} = {p}} ‖p′ − q‖1,

the smallest ℓ1 distance between q and any sorted version of p. As illustrated in Section 7.1 of the supplementary material, this is essentially the 1-Wasserstein distance between the uniform measures on the probability multisets {p} and {q}. We will consider both the sorted and unsorted ℓ1 distances.

Property estimation

Often we would like to estimate a given property f of an unknown distribution p ∈ ∆X based on a sample Xⁿ ∼ p. A property estimator is a mapping f̂ : X* → ℝ.
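Returning to the sorted ℓ1 distance defined above: for ℓ1 cost, the minimizing permutation of p simply pairs the two probability vectors in sorted order (the monotone matching), so the distance can be computed without searching over permutations. A small sketch, with a helper name of our own:

```python
def sorted_l1_distance(p, q):
    # l1^<(p, q): minimum l1 distance between q and any permutation of p;
    # for l1 cost this is attained by matching both vectors in sorted order.
    assert len(p) == len(q), "pad the shorter vector with zeros first"
    return sum(abs(a - b) for a, b in zip(sorted(p), sorted(q)))
```

For example, two distributions that are permutations of each other are at sorted distance 0 even when their unsorted ℓ1 distance is large.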
Analogously, the sample complexity of f̂ in estimating f over a set P ⊂ ∆X is

n_f(f̂, P, ε, δ) := min{ n : ∀p ∈ P, Pr_{Xⁿ∼p}( |f̂(Xⁿ) − f(p)| ≥ ε ) ≤ δ },

the smallest sample size that f̂ requires to estimate f with accuracy ε and confidence 1 − δ, for all distributions in P. The sample complexity of estimating f over P is

n_f(P, ε, δ) := min{ n_f(f̂, P, ε, δ) : f̂ : X* → ℝ },

the lowest sample complexity of any estimator. For simplicity, we will omit P when P = ∆X, and omit δ when δ = 1/3. The standard "median trick" shows that log(1/δ) · n_f(P, ε) ≥ Ω(n_f(P, ε, δ)). By convention, we say an estimator f̂ is sample-optimal if n_f(f̂, P, ε) = Θ(n_f(P, ε)).

Property testing: Identity testing

A closely related problem is distribution property testing, of which identity testing is the most fundamental and well-studied instance [15, 32]. Given an error parameter ε, a distribution q, and a sample Xⁿ from an unknown distribution p, identity testing aims to distinguish between the null hypothesis

H0 : p = q

and the alternative hypothesis

H1 : ‖p − q‖1 ≥ ε.

A property tester is a mapping t̂ : X* → {0, 1}, indicating whether H0 or H1 is accepted. Analogous to the two formulations above, the sample complexity of t̂ is

n_q(t̂, ε, δ) := min{ n : ∀i ∈ {0, 1} and ∀p ∈ Hi, Pr_{Xⁿ∼p}( t̂(Xⁿ) ≠ i ) ≤ δ },

and the sample complexity of identity testing with respect to q is

n_q(ε, δ) := min{ n_q(t̂, ε, δ) : t̂ : X* → {0, 1} }.

Again, when δ = 1/3, we will omit δ.
For q = pu, the problem is also known as uniformity testing.

1.2 Profile maximum likelihood

The multiplicity of a symbol x ∈ X in a sequence xⁿ := x1, ..., xn ∈ X* is μx(xⁿ) := |{ j : xj = x, 1 ≤ j ≤ n }|, the number of times x appears in xⁿ. These multiplicities induce an empirical distribution pμ(xⁿ) that associates a probability μx(xⁿ)/n with each symbol x ∈ X.

The prevalence of an integer i ≥ 0 in xⁿ is the number ϕi(xⁿ) of symbols appearing i times in xⁿ. For known X, the value of ϕ0 can be deduced from the remaining multiplicities, hence we define the profile of xⁿ to be ϕ(xⁿ) := (ϕ1(xⁿ), ..., ϕn(xⁿ)), the vector of all positive prevalences. For example, ϕ(alfalfa) = (0, 2, 1, 0, 0, 0, 0). Note that the profile of xⁿ also corresponds to the multiset of multiplicities of the distinct symbols in xⁿ.

For a distribution p ∈ ∆X, let

p(xⁿ) := Pr_{Xⁿ∼p}(Xⁿ = xⁿ)

be the probability of observing a sequence xⁿ under i.i.d. sampling from p, and let

p(ϕ) := ∑_{yⁿ : ϕ(yⁿ) = ϕ} p(yⁿ)

be the probability of observing a profile ϕ. While the sequence maximum likelihood estimator maps a sequence to its empirical distribution, which maximizes the sequence probability p(xⁿ), the profile maximum likelihood (PML) estimator [58] over a set P ⊆ ∆X maps each profile ϕ to a distribution

p_ϕ := arg max_p p(ϕ)

that maximizes the profile probability.
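The profile and its probability are easy to compute by brute force on toy instances. The sketch below (our own code, exponential-time and purely illustrative) enumerates all length-n sequences to evaluate p(ϕ), and then maximizes it over a coarse grid of two-symbol distributions:

```python
from collections import Counter
from itertools import product

def profile(xs):
    # phi_i = number of distinct symbols appearing exactly i times in xs
    n = len(xs)
    phi = [0] * n
    for m in Counter(xs).values():
        phi[m - 1] += 1
    return tuple(phi)

def profile_probability(p, phi, n):
    # p(phi) = sum of p(seq) over all length-n sequences with profile phi.
    # Brute force: feasible only for tiny n and alphabet sizes.
    total = 0.0
    for seq in product(range(len(p)), repeat=n):
        if profile(seq) == phi:
            prob = 1.0
            for s in seq:
                prob *= p[s]
            total += prob
    return total

def pml_grid(phi, n, grid=20):
    # Toy "PML": maximize the profile probability over a coarse grid
    # of two-symbol distributions (p1, 1 - p1).
    cands = [(i / grid, 1 - i / grid) for i in range(1, grid)]
    return max(cands, key=lambda p: profile_probability(p, phi, n))
```

On the paper's example, profile("alfalfa") returns (0, 2, 1, 0, 0, 0, 0). Observing two copies of one symbol (profile (0, 1)) pushes the toy maximizer toward a skewed distribution, while two distinct symbols (profile (2, 0)) pull it to the uniform one.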
Relaxing the optimization objective, for any β ∈ (0, 1), a β-approximate PML estimator [4] maps each profile ϕ to a distribution p^β_ϕ such that p^β_ϕ(ϕ) ≥ β · p_ϕ(ϕ).

Originating from the principle of maximum likelihood, PML was shown [2, 4, 6, 7, 25, 58] to possess a number of useful attributes, such as existence over finite discrete domains, majorization by empirical distributions, consistency for distribution estimation under both sorted and unsorted ℓ1 distances, and competitiveness with other profile-based estimators.

Let ε be an error parameter and f be one of the four properties in the introduction. Set n := n_f(ε). Recently, Acharya et al. [4] showed that for some absolute constant c′ > 0, if c < c′ and ε ≥ n^{−c}, then a plug-in estimator for f, using an exp(−n^{1−Θ(c)})-approximate PML, is sample-optimal. Motivated by this result, Charikar et al. [20] constructed an explicit exp(−O(n^{2/3} log³ n))-approximate PML (APML) whose computation time is near-linear in n. Combined, these results provide a unified, sample-optimal, and near-linear-time computable plug-in estimator for the four properties.

2 New results and implications

2.1 New results

Additive property estimation

Recall that for any property f, the expression n_f(ε) denotes the smallest sample size required by any estimator to achieve accuracy ε with confidence 2/3, for all distributions in ∆X. Let f be an additive symmetric property that is 1-Lipschitz on (∆X, R). Let ε > 0 and n ≥ n_f(ε) be error and sampling parameters. For an absolute constant c ∈ (10⁻², 10⁻¹), if ε ≥ n^{−c}:

Theorem 1. The PML plug-in estimator, when given a sample of size 4n from any distribution p ∈ ∆X, estimates f(p) up to an error of (2 + o(1))ε, with probability at least 1 − exp(−4√n).

For a different c > 0, Theorem 1 also holds for APML, which is near-linear-time computable [20].

Rényi entropy estimation

For X of finite size k and any p ∈ ∆X, it is well known that Hα(p) ∈ [0, log k]. The following theorems characterize the performance of the PML plug-in estimator in estimating Rényi entropy. For any distribution p ∈ ∆X, error parameter ε ∈ (0, 1), absolute constant λ ∈ (0, 0.1), and sampling parameter n, draw a sample Xⁿ ∼ p and denote its profile by ϕ. Then for sufficiently large k:

Theorem 2. For α ∈ (3/4, 1), if n = Ωα(k^{1/α}/(ε^{1/α} log k)), then

Pr( |Hα(p_ϕ) − Hα(p)| ≥ ε ) ≤ exp(−√n).

Theorem 3. For non-integer α > 1, if n = Ωα(k/(ε^{1/α} log k)), then

Pr( |Hα(p_ϕ) − Hα(p)| ≥ ε ) ≤ exp(−n^{1−λ}).

Theorem 4. For integer α > 1, if n = Ωα(k^{1−1/α}(ε² log(1/ε))^{−(1+α)}) and Hα(p) ≤ (log n)/4, then

Pr( |Hα(p_ϕ) − Hα(p)| ≥ ε ) ≤ 1/3.

Replacing 3/4 by 5/6, Theorem 2 also holds for APML with a better probability bound of exp(−n^{2/3}). In addition, Theorem 3 holds for APML without any modifications.

Sorted distribution estimation

Let c be the absolute constant defined just prior to Theorem 1. For any distribution p ∈ ∆X, error parameter ε ∈ (0, 1), and sampling parameter n, draw a sample Xⁿ ∼ p and denote its profile by ϕ.

Theorem 5.
If n = Ω(n(ε)) = Ω(k/(ε² log k)) and ε ≥ n^{−c}, then

Pr( ℓ1^<(p_ϕ, p) ≥ ε ) ≤ exp(−Ω(√n)).

For a different c > 0, Theorem 5 also holds for APML with a better probability bound of exp(−n^{2/3}).

Identity testing

The recent works of Diakonikolas and Kane [26] and Goldreich [31] provide a procedure reducing identity testing to uniformity testing, while modifying the desired accuracy and alphabet size by only absolute constant factors. Hence below we consider uniformity testing. The uniformity tester T_PML shown in Figure 1 is purely based on PML and satisfies

Theorem 6. If ε = Ω̃(k^{−1/4}) and n = Ω̃(√k/ε²), then the tester T_PML(Xⁿ) is correct with probability at least 1 − k^{−2}. The tester also distinguishes between p = pu and ‖p − pu‖2 ≥ ε/√k.

The Ω̃(·) notation hides only logarithmic factors of k. The tester T_PML is near-optimal since, for the uniform distribution pu, the results in [28] yield an Ω(√(k log k)/ε²) lower bound on n_{pu}(ε, k^{−2}).

For space considerations, we postpone proofs and additional results to the supplementary material. The rest of the paper is organized as follows. Section 2.2 presents several immediate implications of the above theorems. Sections 3 and 4 illustrate PML's theoretical and practical advantages by comparing it to existing methods for a variety of learning tasks. Section 5 concludes the paper and outlines multiple promising future directions.

Input: parameters k, ε, and a sample Xⁿ ∼ p with profile ϕ.
if max_x μx(Xⁿ) ≥ 3 max{1, n/k} log k then return 1;
elif ‖p_ϕ − pu‖2 ≥ 3ε/(4√k) then return 1;
else return 0.

Figure 1: Uniformity tester T_PML

2.2 Implications

Several immediate implications are in order.

We say that a plug-in estimator is universally sample-optimal for estimating symmetric properties if there exist absolute positive constants c1, c2, and c3 such that for any 1-Lipschitz property on (∆X, R), with probability ≥ 9/10, the plug-in estimator uses just c1 times the sample size n required by the minimax estimator to achieve c2 times its error, whenever this error is at least n^{−c3}. Note that the "1-Lipschitz property" class can be replaced by other general property classes, but not by classes containing only a few specific properties, since "universal" means "applicable to all cases".

Theorem 1 makes PML the first plug-in estimator that is universally sample-optimal for a broad class of distribution properties. In particular, Theorem 1 also covers the four properties considered in [4]. To see this, recall from the introduction that C̃m, H, and D are 1-Lipschitz on (∆X, R); as for S̃, the following result [4] relates it to C̃m for distributions in ∆_{≥1/k}, and proves PML's optimality.

Lemma 1. For any ε > 0, m = k log(1/ε), and p ∈ ∆_{≥1/k},

|S̃(p) − C̃m(p) log(1/ε)| ≤ ε.

The theorem also applies to many other properties. As an example [70], given an integer s > 0, let fs(x) := min{x, |x − 1/s|}. Then, to within a factor of two, fs(p) := ∑_x fs(p_x) approximates the ℓ1 distance between any distribution p and the closest uniform distribution in ∆X of support size s. In Section 3.2 we compare Theorem 1 with existing results and present more of its implications.

Theorems 2 and 3 imply that for all non-integer α > 3/4 (resp. α > 5/6), the PML (resp. APML) plug-in estimator achieves a sample complexity better than the best currently known [5]. This makes both the PML and APML plug-in estimators the state-of-the-art algorithms for estimating non-integer order Rényi entropy. See Section 3.3 for an overview of known results, and Section 3.4 for a detailed comparison between existing methods and ours.

Theorem 4 shows that for all integer α > 1, the sample complexity of the PML plug-in estimator has the optimal k^{1−1/α} dependence [5, 55] on the alphabet size.

Theorem 5 makes APML the first distribution estimator under sorted ℓ1 distance that is both near-linear-time computable and sample-optimal for a range of desired accuracies ε beyond inverse-polylogarithmic in n. In comparison, existing algorithms [2, 38, 72] either run in time polynomial in the sample size, or are known to achieve the optimal sample complexity only for ε = Ω(1/√log n), which is essentially different from the applicable range ε ≥ n^{−Θ(1)} in Theorem 5. We provide a more detailed comparison in Section 3.6.

Theorem 6 provides the first PML-based uniformity tester with near-optimal sample complexity. As stated, the tester also distinguishes between p = pu and ‖p − pu‖2 ≥ ε/√k.
This is a stronger guarantee since, by the Cauchy-Schwarz inequality, ‖p − pu‖1 ≥ ε implies ‖p − pu‖2 ≥ ε/√k. Note that several other uniformity testers in the literature (see Section 3.7) provide the same ℓ2 testing guarantee, since all of them essentially count sample collisions, i.e., the number of location pairs whose sample points are equal.

3 Related work and comparisons

3.1 Additive property estimation

The study of additive property estimation dates back at least half a century [16, 34, 35] and has steadily grown over the years. For any additive symmetric property f and sequence xⁿ, the simplest and most widely used approach is the empirical (plug-in) estimator f̂^E(xⁿ) := f(pμ(xⁿ)), which evaluates f at the empirical distribution. While the empirical estimator performs well in the large-sample regime, modern data-science applications often concern high-dimensional data, for which more involved methods have yielded property estimators that are more sample-efficient. For example, for relatively large k and for f being S̃, C̃m, H, or D, recent research [45, 59, 69, 70, 75, 76] showed that the empirical estimator is optimal up to logarithmic factors, namely n_f(P, ε) = Θε(n_f(f̂^E, P, ε)/log n_f(f̂^E, P, ε)), where P is ∆_{≥1/k} for S̃, and ∆X for the other properties.

Below we classify the methods for deriving the corresponding sample-optimal estimators into two categories, plug-in and approximation, and provide a high-level description of each. For simplicity of illustration, we assume that ε ∈ (0, 1].

The plug-in approach essentially estimates the unknown distribution multiset, which suffices for computing any symmetric property.
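The empirical plug-in estimator mentioned above is essentially a one-liner: evaluate the property at the relative frequencies. A minimal sketch (helper names are ours), using Shannon entropy for the demonstration:

```python
import math
from collections import Counter

def empirical_distribution(xs):
    # p_mu(x^n): relative frequency of each observed symbol
    n = len(xs)
    return [c / n for c in Counter(xs).values()]

def empirical_plugin(f, xs):
    # f_hat^E(x^n) := f(p_mu(x^n)); well-defined for symmetric properties,
    # which depend only on the multiset of probabilities
    return f(empirical_distribution(xs))

def shannon_entropy(p):
    # H(p) in nats
    return -sum(px * math.log(px) for px in p if px > 0)
```

For instance, empirical_plugin(shannon_entropy, "aabb") returns log 2, and empirical_plugin(len, xs) is the plug-in estimate of the support size.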
Besides the empirical and PML estimators, Efron and Thisted [29] proposed a linear-programming approach that finds a multiset estimate consistent with the sample's profile. This approach was then adapted and analyzed by Valiant and Valiant [69, 72], yielding plug-in estimators that achieve near-optimal sample complexities for H and S̃, and the optimal sample complexity for D, when ε is relatively large.

The approximation approach modifies non-smooth segments of the probability function to correct the bias of empirical estimators. A popular modification replaces those non-smooth segments by their low-degree polynomial approximations and then estimates the modified function. For several properties, including the above four and the power sum Pα(p) := ∑_x p(x)^α, where α is a given parameter, this approach yields property-dependent estimators [45, 59, 75, 76] that are sample-optimal for all ε. More recently, Acharya et al. [4] proved the aforementioned results on the PML estimator, making it the first unified, sample-optimal plug-in estimator for S̃, C̃m, H, and D and relatively large ε. Following these advances, Han et al. [38] refined the linear-programming approach and designed a plug-in estimator that implicitly performs polynomial approximation and is sample-optimal for H, S̃, and Pα with α < 1, when ε is relatively large.

3.2 Comparison I: Theorem 1 and related property-estimation work

In terms of the estimator's theoretical guarantee, Theorem 1 is essentially the same as that of Valiant and Valiant [70]. However, for each property, k, and n, [70] solves a different linear program and constructs a new estimator, which takes polynomial time.
On the other hand, both the PML estimator and its near-linear-time computable variant, once computed, can be used to accurately estimate exponentially many properties that are 1-Lipschitz on (∆X, R). A similar comparison holds between the PML method and the approximation approach, and the latter is provably sample-optimal for only a few properties. In addition, Theorem 1 shows that the PML estimator often achieves the optimal sample complexity up to a small constant factor, a desirable estimator attribute shared by some, but not all, approximation-based estimators [45, 59, 75, 76].

In terms of method and proof technique, Theorem 1 is closest to Acharya et al. [4]. On the other hand, [4] establishes the optimality of PML for only four properties, while our result covers a much broader property class. In addition, both the above-mentioned "small constant factor" attribute and the confidence boost from 2/3 to 1 − exp(−4√n) are unique contributions of this work. The PML plug-in approach is also close in flavor to the plug-in estimators in Valiant and Valiant [69, 72] and their refinement in Han et al. [38]. On the other hand, as pointed out previously, these plug-in estimators are provably sample-optimal for only a few properties.
More specifically, for estimating H, S̃, and C̃m, the plug-in estimators in [69, 72] achieve sub-optimal sample complexities with regard to the desired accuracy ε; and the estimation guarantee in [38] is stated in terms of the approximation errors of Õ(√n) polynomials that are not directly related to the optimal sample complexities.

3.3 Rényi entropy estimation

Motivated by the wide applications of Rényi entropy, heuristic estimators were proposed and studied in the physics literature following [36], and asymptotically consistent estimators were presented and analyzed in the statistical-learning literature [46, 78]. For the special case of 1-Rényi (Shannon) entropy, the works [69, 70] determined the sample complexity to be n_f(ε) = Θ(k/(ε log k)). For general α-Rényi entropy, the best known results, due to Acharya et al. [5], state that for integer and non-integer α values, the corresponding sample complexities n_f(ε, δ) are Oα(k^{1−1/α} log(1/δ)/ε²) and Oα(k^{max{1/α,1}} log(1/δ)/(ε^{1/α} log k)), respectively. The upper bounds for integer α are achieved by an estimator that corrects the bias of the empirical plug-in estimator. To achieve the upper bounds for non-integer α values, one needs to compute a best polynomial approximation of z^α, whose degree and domain both depend on n, and construct a more involved estimator using the approximation approach [45, 75] mentioned in Section 3.1.

3.4 Comparison II: Theorems 2 to 4 and related Rényi-entropy-estimation work

Our result shows that a single PML estimate suffices to estimate the Rényi entropy of different orders α. Such adaptiveness to the order parameter is a significant advantage of PML over existing methods.
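This order-adaptiveness is easy to illustrate: once a single distribution estimate (for instance, a PML solution) is in hand, Hα can be read off from the same vector for every order α. A plain-Python sketch (our own helper, not the paper's code; entropy in nats):

```python
import math

def renyi_entropy(p, alpha):
    # H_alpha(p) = log(sum_x p(x)^alpha) / (1 - alpha) for alpha != 1;
    # the alpha -> 1 limit is Shannon entropy
    if alpha == 1:
        return -sum(px * math.log(px) for px in p if px > 0)
    return math.log(sum(px ** alpha for px in p if px > 0)) / (1 - alpha)

# One fixed estimate serves every order simultaneously:
p_hat = [0.4, 0.3, 0.2, 0.1]
estimates = {a: renyi_entropy(p_hat, a) for a in (0.8, 1, 1.5, 2, 3)}
```

For the uniform distribution every order gives log k; for example, renyi_entropy([1/4]*4, 2) equals log 4.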
For example, by Theorem 3 and the union bound, one can use a single APML or PML to accurately approximate exponentially many non-integer-order Rényi entropy values, while still maintaining an overall confidence of 1 − exp(−k^{0.9}). By comparison, the estimation heuristic in [5] requires different polynomial-based estimators for different α values. In particular, to construct each estimator, one needs to compute a best polynomial approximation of z^α, which is not known to admit a closed-form formula for α ∉ ℤ. Furthermore, even for a single α and with a sample size √k times larger, such an estimator is not known to achieve the same level of confidence as PML or APML.

As for the theoretical guarantees, the sample-complexity upper bounds in both Theorems 2 and 3 are better than those mentioned in the previous section. More specifically, for any α ∈ (3/4, 1) and δ ≥ exp(−k^{0.5}), Theorem 2 shows that n_f(ε, δ) = Oα(k^{1/α}/(ε^{1/α} log k)). Analogously, for any non-integer α > 1 and δ ≥ exp(−k^{0.9}), Theorem 3 shows that n_f(ε, δ) = Oα(k/(ε^{1/α} log k)). Both bounds are better than the best currently known by a log(1/δ) factor.

3.5 (Sorted) distribution estimation

Estimating large-alphabet distributions from their samples is a fundamental statistical-learning tenet. Over the past few decades, distribution estimation has found numerous applications, ranging from natural-language modeling [21] to biological research [8], and has been studied extensively. Under the classical ℓ1 and KL losses, existing research [13, 47] showed that the corresponding sample complexities n(ε) are Θ(k/ε²) and Θ(k/ε), respectively.
Several recent works have investigated the analogous formulation under the sorted ℓ1 distance, and revealed a lower sample complexity of n(ε) = Θ(k/(ε² log k)). Specifically, under certain conditions, Valiant and Valiant [72] and Han et al. [38] derived sample-optimal estimators using linear programming, and Acharya et al. [2] and Das [25] showed that PML achieves a sub-optimal O(k/(ε^{2.1} log k)) sample complexity for relatively large ε.

3.6 Comparison III: Theorem 5 and related distribution-estimation work

We compare our results with existing ones from three perspectives.

Applicable parameter ranges: As shown in [38], for ε ≪ n^{−1/3}, the simple empirical estimator is already sample-optimal. Hence we consider the parameter range ε = Ω(n^{−1/3}). For the results in [2, 25] and [72] to hold, we would need ε to be at least Ω(1/√log n). On the other hand, Theorem 5 shows that PML and APML are sample-optimal for ε larger than n^{−Θ(1)}. Here, the gap is exponentially large. The result in [38] applies to the whole range ε = Ω(n^{−1/3}), which is larger than the applicable range of our results.

Time complexity: Both the APML and the estimator in [72] are near-linear-time computable in the sample size, while the estimator in [38] requires polynomial time to compute.

Statistical confidence: PML and APML achieve the desired accuracy with an error probability of at most exp(−Ω(√n)). In contrast, the estimator in [38] is known to achieve an error probability that decreases only as O(n^{−3}). The gap is again exponentially large.
The estimator in [72] admits a better error-probability bound of exp(−n^{0.02}), which is still far from ours.

3.7 Identity testing

Initiated by the work of [33], identity testing is arguably one of the most important and widely studied problems in distribution property testing. Over the past two decades, a sequence of works [3, 11, 26-28, 33, 61, 68] has addressed the sample complexity of this problem and proposed testers with a variety of guarantees. In particular, applying a coincidence-based tester, Paninski [61] determined the sample complexity of uniformity testing up to constant factors; utilizing a variant of the Pearson chi-squared statistic, Valiant and Valiant [68] resolved the general identity-testing problem. For an overview of related results, we refer interested readers to [15] and [32]. The main contribution of this work is showing that PML is a unified sample-optimal approach for several related problems and, as shown in Theorem 6, also provides a near-optimal tester for this important testing problem.

4 Experiments and distribution estimation under ℓ1 distance

A number of different approaches have been taken to computing the PML and its approximations. Among the existing works, Acharya et al. [1] considered exact algebraic computation, Orlitsky et al. [57, 58] designed an EM algorithm with MCMC acceleration, Vontobel [73, 74] proposed a Bethe approximation heuristic, Anevski et al. [7] introduced a sieved PML estimator and a stochastic approximation of the associated EM algorithm, and Pavlichin et al. [62] derived a dynamic-programming approach. Notably and recently, for a sample size n, Charikar et al.
[20] constructed an explicit exp(−O(n^{2/3} log^3 n))-approximate PML whose computation time is near-linear in n.
In Section 4 of the supplementary material, we introduce a variant of the MCMC-EM algorithm in [60] and demonstrate the exceptional efficacy of PML on a variety of learning tasks through experiments. In particular, we derive a new distribution estimator for the (unsorted) ℓ1 distance by combining the proposed PML computation algorithm with the denoising procedure in [71] and a novel missing-mass estimator. As shown below, the proposed distribution estimator achieves state-of-the-art performance.

Figure 2: Distribution estimation under ℓ1 distance

In Figure 2, samples are generated according to six distributions with the same support size k = 5,000. Details about these distributions can be found in Section 4.2 of the supplementary material. The sample size n (horizontal axis) ranges from 10,000 to 100,000, and the vertical axis reflects the (unsorted) ℓ1 distance between the true distribution and the estimates, averaged over 30 independent trials. We compare our estimator with three others: the improved Good-Turing estimator in [56, 41], which is provably instance-by-instance near-optimal [56]; the empirical estimator, serving as a baseline; and the empirical estimator with a larger sample size of n log n (note that log n is roughly 11 here). As shown in [56], the improved Good-Turing estimator substantially outperforms other estimators such as the Laplace (add-1) estimator, the Braess-Sauer estimator [13], and the Krichevsky-Trofimov estimator [48]; hence we do not include those estimators here.
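To make the two evaluation metrics concrete, the sketch below computes both the unsorted ℓ1 distance used in Figure 2 and the sorted ℓ1 distance from the earlier comparison, for the plain empirical estimator. This is an illustration only, not the authors' PML-based estimator; the Dirichlet choice of the true distribution is an assumption made for the example.

```python
import numpy as np

def l1_distance(p, q):
    # Unsorted l1 distance: sum over symbols of |p(x) - q(x)|; labels matter.
    return float(np.abs(p - q).sum())

def sorted_l1_distance(p, q):
    # Sorted l1 distance: l1 after sorting both probability vectors,
    # i.e., only the multisets of probabilities are compared.
    return float(np.abs(np.sort(p) - np.sort(q)).sum())

rng = np.random.default_rng(0)
k, n = 5000, 10000
p = rng.dirichlet(np.ones(k))                # illustrative "true" distribution
sample = rng.choice(k, size=n, p=p)          # n i.i.d. draws from p
emp = np.bincount(sample, minlength=k) / n   # empirical estimator
print(l1_distance(p, emp), sorted_l1_distance(p, emp))
```

Since sorting is a contraction in ℓ1, the sorted distance never exceeds the unsorted one, which is consistent with sorted-distribution estimation admitting the lower Θ(k/(ε^2 log k)) sample complexity.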
The plots show that our proposed estimator further outperforms the improved Good-Turing estimator in all the experiments.

5 Conclusion and future directions

We studied three fundamental problems in statistical learning: distribution estimation, property estimation, and property testing. We established the profile maximum likelihood (PML) estimator as the first universally sample-optimal approach for several important learning tasks: distribution estimation under the sorted ℓ1 distance, additive property estimation, Rényi entropy estimation, and identity testing. Several future directions are promising. We believe that neither the factor of 4 in the sample size in Theorem 1, nor the lower bounds on ε in Theorems 1, 5, and 6, is necessary; in other words, we conjecture that the PML approach is universally sample-optimal for these tasks in all parameter ranges. It is also of interest to extend PML's optimality to estimating symmetric properties not covered by Theorems 1 to 4, such as generalized distance to uniformity [9, 39], the ℓ1 distance between the unknown distribution and the closest uniform distribution over an arbitrary subset of X.
Another important direction is competitive (or instance-optimal) property estimation. It should be noted that all the referenced works, including this paper, are of a worst-case nature, namely, they design estimators with near-optimal worst-case performance. In contrast, practical and natural distributions often possess simple structures, and are rarely the worst possible. To address this discrepancy, the recent works [40, 43] took a competitive approach and constructed estimators whose performance adapts to the simplicity of the underlying distribution.
Specifically, for any property in a broad class and every distribution in ∆X, the expected error of the proposed estimator with a sample of size n/log n is at most that of the empirical estimator with a sample of size n, plus a distribution-free vanishing function of n. These results not only cover S̃, C̃m, H, and D, for which the log n factor is optimal up to constants, but also apply to any non-symmetric additive property ∑_x f_x(p_x) where f_x is 1-Lipschitz for every x ∈ X, such as the ℓ1 distance to a given distribution. It would be of interest to study the optimality of the PML approach under this formulation as well. Readers interested in estimating non-symmetric properties may also find the paper [42] helpful.

Acknowledgments

We are grateful to the National Science Foundation (NSF) for supporting this work through grants CIF-1564355 and CIF-1619448.

References

[1] J. Acharya, H. Das, H. Mohimani, A. Orlitsky, and S. Pan. Exact calculation of pattern probabilities. In Proceedings of the 2010 IEEE International Symposium on Information Theory (ISIT), pages 1498–1502. IEEE, 2010.

[2] J. Acharya, H. Das, A. Jafarpour, A. Orlitsky, and S. Pan. Estimating multiple concurrent processes. In Proceedings of the 2012 IEEE International Symposium on Information Theory, pages 1628–1632. IEEE, 2012.

[3] J. Acharya, C. Daskalakis, and G. Kamath. Optimal testing for properties of distributions. In Advances in Neural Information Processing Systems, pages 3591–3599, 2015.

[4] J. Acharya, H. Das, A. Orlitsky, and A. T. Suresh. A unified maximum likelihood approach for estimating symmetric properties of discrete distributions. In International Conference on Machine Learning, pages 11–21, 2017.

[5] J. Acharya, A. Orlitsky, A. T. Suresh, and H. Tyagi. Estimating Rényi entropy of discrete distributions.
IEEE Transactions on Information Theory, 63(1):38–56, 2017.

[6] J. Acharya, Y. Bao, Y. Kang, and Z. Sun. Improved bounds for minimax risk of estimating missing mass. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 326–330. IEEE, 2018.

[7] D. Anevski, R. D. Gill, and S. Zohren. Estimating a probability mass function with unknown labels. The Annals of Statistics, 45(6):2708–2735, 2017.

[8] R. Armañanzas, I. Inza, R. Santana, Y. Saeys, J. L. Flores, J. A. Lozano, . . . , and P. Larrañaga. A review of estimation of distribution algorithms in bioinformatics. BioData Mining, 1(6), 2008.

[9] T. Batu and C. L. Canonne. Generalized uniformity testing. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science, pages 880–889. IEEE, 2017.

[10] T. Batu, L. Fortnow, R. Rubinfeld, W. D. Smith, and P. White. Testing that distributions are close. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science, pages 259–269. IEEE, 2000.

[11] T. Batu, E. Fischer, L. Fortnow, R. Kumar, R. Rubinfeld, and P. White. Testing random variables for independence and identity. In Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science, pages 442–451. IEEE, 2001.

[12] T. Batu, E. Fischer, L. Fortnow, R. Kumar, R. Rubinfeld, and P. White. Testing random variables for independence and identity. In Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science, pages 442–451. IEEE, 2001.

[13] D. Braess and T. Sauer. Bernstein polynomials and learning theory. Journal of Approximation Theory, 128(2):187–206, 2004.

[14] G. Bresler. Efficiently learning Ising models on arbitrary graphs. In Proceedings of the 47th Annual ACM Symposium on Theory of Computing, pages 771–782. ACM, 2015.

[15] C. L. Canonne. A survey on distribution testing: Your data is big. But is it blue? 2017.

[16] A. Carlton.
On the bias of information estimates. Psychological Bulletin, 71(2):108, 1969.

[17] A. Chao. Nonparametric estimation of the number of classes in a population. Scandinavian Journal of Statistics, pages 265–270, 1984.

[18] A. Chao and C. H. Chiu. Species richness: estimation and comparison. Wiley StatsRef: Statistics Reference Online, pages 1–26, 2014.

[19] A. Chao and S. M. Lee. Estimating the number of classes via sample coverage. Journal of the American Statistical Association, 87(417):210–217, 1992.

[20] M. S. Charikar, K. Shiragur, and A. Sidford. Efficient profile maximum likelihood for universal symmetric property estimation. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 780–791. ACM, 2019.

[21] S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4):359–394, 1999.

[22] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968.

[23] R. K. Colwell, A. Chao, N. J. Gotelli, S. Y. Lin, C. X. Mao, R. L. Chazdon, and J. T. Longino. Models and estimators linking individual-based and sample-based rarefaction, extrapolation and comparison of assemblages. Journal of Plant Ecology, 5(1):3–21, 2012.

[24] T. M. Cover and J. A. Thomas. Elements of information theory. John Wiley & Sons, 2012.

[25] H. Das. Competitive tests and estimators for properties of distributions. Doctoral dissertation, UC San Diego, 2012.

[26] I. Diakonikolas and D. M. Kane. A new approach for testing properties of discrete distributions. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 685–694. IEEE, 2016.

[27] I. Diakonikolas, D. M. Kane, and V. Nikishkin.
Testing identity of structured distributions. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1841–1854. Society for Industrial and Applied Mathematics, 2015.

[28] I. Diakonikolas, T. Gouleakis, J. Peebles, and E. Price. Sample-optimal identity testing with high probability. In 45th International Colloquium on Automata, Languages, and Programming (ICALP 2018). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018.

[29] B. Efron and R. Thisted. Estimating the number of unseen species: How many words did Shakespeare know? Biometrika, 63(3):435–447, 1976.

[30] W. Gerstner and W. M. Kistler. Spiking neuron models: Single neurons, populations, plasticity. Cambridge University Press, 2002.

[31] O. Goldreich. The uniform distribution is complete with respect to testing identity to a fixed distribution. In Electronic Colloquium on Computational Complexity (ECCC), volume 23, page 1, 2016.

[32] O. Goldreich. Introduction to property testing (chapter 11). Cambridge University Press, 2017.

[33] O. Goldreich and D. Ron. On testing expansion in bounded-degree graphs. In Technical Report TR00-020, Electronic Colloquium on Computational Complexity (ECCC), 2000.

[34] I. J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40(3-4):237–264, 1953.

[35] I. J. Good and G. H. Toulmin. The number of new species, and the increase in population coverage, when a sample is increased. Biometrika, 43(1-2):45–63, 1956.

[36] P. Grassberger. Finite sample corrections to entropy and dimension estimates. Physics Letters A, 128(6-7):369–373, 1988.

[37] P. J. Haas, J. F. Naughton, S. Seshadri, and L. Stokes. Sampling-based estimation of the number of distinct values of an attribute. VLDB, 95:311–322, 1995.

[38] Y. Han, J. Jiao, and T. Weissman.
Local moment matching: A unified methodology for symmetric functional estimation and distribution estimation under Wasserstein distance. In Conference on Learning Theory, pages 3189–3221, 2018.

[39] Y. Hao and A. Orlitsky. Adaptive estimation of generalized distance to uniformity. In 2018 IEEE International Symposium on Information Theory, pages 1076–1080. IEEE, 2018.

[40] Y. Hao and A. Orlitsky. Data amplification: Instance-optimal property estimation. arXiv preprint arXiv:1903.01432, 2019.

[41] Y. Hao and A. Orlitsky. Doubly-competitive distribution estimation. In International Conference on Machine Learning, pages 2614–2623, 2019.

[42] Y. Hao and A. Orlitsky. Unified sample-optimal property estimation in near-linear time. In Advances in Neural Information Processing Systems, 2019.

[43] Y. Hao, A. Orlitsky, A. T. Suresh, and Y. Wu. Data amplification: A unified and competitive approach to property estimation. In Advances in Neural Information Processing Systems, pages 8848–8857, 2018.

[44] R. Jenssen, K. E. Hild, D. Erdogmus, J. C. Principe, and T. Eltoft. Clustering using Rényi's entropy. In Proceedings of the International Joint Conference on Neural Networks, volume 1, pages 523–528. IEEE, 2003.

[45] J. Jiao, K. Venkat, Y. Han, and T. Weissman. Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory, 61(5):2835–2885, 2015.

[46] D. Källberg, N. Leonenko, and O. Seleznjev. Statistical inference for Rényi entropy functionals. In Conceptual Modelling and Its Theoretical Foundations, pages 36–51. Springer, Berlin, Heidelberg, 2012.

[47] S. Kamath, A. Orlitsky, D. Pichapati, and A. T. Suresh. On learning distributions from their samples. In Conference on Learning Theory, pages 1066–1100, 2015.

[48] R. Krichevsky and V. Trofimov. The performance of universal encoding.
IEEE Transactions on Information Theory, 27(2):199–207, 1981.

[49] I. Kroes, P. W. Lepp, and D. A. Relman. Bacterial diversity within the human subgingival crevice. Proceedings of the National Academy of Sciences, 96(25):14547–14552, 1999.

[50] B. Ma, A. Hero, J. Gorman, and O. Michel. Image registration with minimum spanning tree algorithm. In Proceedings 2000 International Conference on Image Processing (Cat. No. 00CH37101), volume 1, pages 481–484. IEEE, 2000.

[51] Z. F. Mainen and T. J. Sejnowski. Reliability of spike timing in neocortical neurons. Science, 268(5216):1503–1506, 1995.

[52] C. X. Mao and B. G. Lindsay. Estimating the number of classes. The Annals of Statistics, pages 917–930, 2007.

[53] D. R. McNeil. Estimating an author's vocabulary. Journal of the American Statistical Association, 68(341):92–96, 1973.

[54] H. Neemuchwala, A. Hero, S. Zabuawala, and P. Carson. Image registration methods in high-dimensional space. International Journal of Imaging Systems and Technology, 16(5):130–145, 2006.

[55] M. Obremski and M. Skorski. Rényi entropy estimation revisited. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2017). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.

[56] A. Orlitsky and A. T. Suresh. Competitive distribution estimation: Why is Good-Turing good. In Advances in Neural Information Processing Systems, pages 2143–2151, 2015.

[57] A. Orlitsky, S. Sajama, N. P. Santhanam, K. Viswanathan, and J. Zhang. Algorithms for modeling distributions over large alphabets. In Proceedings of the 2004 IEEE International Symposium on Information Theory (ISIT), pages 304–304. IEEE, 2004.

[58] A. Orlitsky, N. P. Santhanam, K. Viswanathan, and J. Zhang. On modeling profiles instead of values.
In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 426–435. AUAI Press, 2004.

[59] A. Orlitsky, A. T. Suresh, and Y. Wu. Optimal prediction of the number of unseen species. Proceedings of the National Academy of Sciences, 113(47):13283–13288, 2016.

[60] S. Pan. On the theory and application of pattern maximum likelihood. Doctoral dissertation, UC San Diego, 2012.

[61] L. Paninski. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Transactions on Information Theory, 54(10):4750–4755, 2008.

[62] D. S. Pavlichin, J. Jiao, and T. Weissman. Approximate profile maximum likelihood. arXiv preprint arXiv:1712.07177, 2017.

[63] C. J. Quinn, N. Kiyavash, and T. P. Coleman. Efficient methods to compute optimal tree approximations of directed information graphs. IEEE Transactions on Signal Processing, 61(12):3173–3182, 2013.

[64] A. Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. The Regents of the University of California, 1961.

[65] D. Ron. Algorithmic and analysis techniques in property testing. Foundations and Trends in Theoretical Computer Science, 5(2), 2010.

[66] C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423, 1948.

[67] R. Thisted and B. Efron. Did Shakespeare write a newly-discovered poem? Biometrika, 74(3):445–455, 1987.

[68] G. Valiant and P. Valiant. An automatic inequality prover and instance optimal identity testing. SIAM Journal on Computing, 46(1):429–455, 2017.

[69] G. Valiant and P. Valiant. Estimating the unseen: An n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs.
In Proceedings of the 43rd Annual ACM Symposium on Theory of Computing, pages 685–694. ACM, 2011.

[70] G. Valiant and P. Valiant. The power of linear estimators. In IEEE 52nd Annual Symposium on Foundations of Computer Science, pages 403–412. IEEE, 2011.

[71] G. Valiant and P. Valiant. Instance optimal learning of discrete distributions. In Proceedings of the 48th Annual ACM Symposium on Theory of Computing, pages 142–155. ACM, 2016.

[72] G. Valiant and P. Valiant. Estimating the unseen: Improved estimators for entropy and other properties. Journal of the ACM (JACM), 64(6):37, 2017.

[73] P. O. Vontobel. The Bethe approximation of the pattern maximum likelihood distribution. In Proceedings of the 2012 IEEE International Symposium on Information Theory (ISIT). IEEE, 2012.

[74] P. O. Vontobel. The Bethe and Sinkhorn approximations of the pattern maximum likelihood estimate and their connections to the Valiant-Valiant estimate. In 2014 Information Theory and Applications Workshop (ITA), pages 1–10. IEEE, 2014.

[75] Y. Wu and P. Yang. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory, 62(6):3702–3720, 2016.

[76] Y. Wu and P. Yang. Chebyshev polynomials, moment matching, and optimal estimation of the unseen. The Annals of Statistics, 47(2):857–883, 2019.

[77] D. Xu. Energy, entropy and information potential for neural computation. Doctoral dissertation, University of Florida, 1999.

[78] D. Xu and D. Erdogmus. Rényi's entropy, divergence and their nonparametric estimators. In Information Theoretic Learning.
pages 47–102. Springer, New York, NY, 2010.