{"title": "Unified Sample-Optimal Property Estimation in Near-Linear Time", "book": "Advances in Neural Information Processing Systems", "page_first": 11106, "page_last": 11116, "abstract": "We consider the fundamental learning problem of estimating properties of distributions over large domains. Using a novel piecewise-polynomial approximation technique, we derive the first unified methodology for constructing sample- and time-efficient estimators for all sufficiently smooth, symmetric and non-symmetric, additive properties. This technique yields near-linear-time computable estimators whose approximation values are asymptotically optimal and highly-concentrated, resulting in the first: 1) estimators achieving the $\\mathcal{O}(k/(\\varepsilon^2\\log k))$ min-max $\\varepsilon$-error sample complexity for all $k$-symbol Lipschitz properties; 2) unified near-optimal differentially private estimators for a variety of properties; 3) unified estimator achieving optimal bias and near-optimal variance for five important properties; 4) near-optimal sample-complexity estimators for several important symmetric properties over both domain sizes and confidence levels.", "full_text": "Uni\ufb01ed Sample-Optimal Property Estimation\n\nin Near-Linear Time\n\nDept. of Electrical and Computer Engineering\n\nDept. of Electrical and Computer Engineering\n\nYi Hao\n\nAlon Orlitsky\n\nUniversity of California, San Diego\n\nyih179@ucsd.edu\n\nUniversity of California, San Diego\n\nalon@ucsd.edu\n\nAbstract\n\nWe consider the fundamental learning problem of estimating properties of distri-\nbutions over large domains. Using a novel piecewise-polynomial approximation\ntechnique, we derive the \ufb01rst uni\ufb01ed methodology for constructing sample- and\ntime-ef\ufb01cient estimators for all suf\ufb01ciently smooth, symmetric and non-symmetric,\nadditive properties. 
This technique yields near-linear-time computable estimators whose approximation values are asymptotically optimal and highly concentrated, resulting in the first: 1) estimators achieving the O(k/(ε² log k)) min-max ε-error sample complexity for all k-symbol Lipschitz properties; 2) unified near-optimal differentially private estimators for a variety of properties; 3) a unified estimator achieving optimal bias and near-optimal variance for five important properties; 4) near-optimal sample-complexity estimators for several important symmetric properties over both domain sizes and confidence levels.

1 Introduction

Let Δ_k be the collection of distributions over the alphabet [k] := {1, 2, . . . , k}, and [k]* be the set of finite sequences over [k]. In many learning applications, we are given i.i.d. samples X^n := X_1, X_2, . . . , X_n from an unknown distribution p := (p_1, p_2, . . . , p_k) ∈ Δ_k, and using these samples we would like to infer certain distribution properties.

A distribution property is a mapping f : Δ_k → R. Often, these properties are symmetric and additive, namely, f(p) = Σ_{i∈[k]} f(p_i); examples include Shannon entropy, support size, and the three further properties in Table 1. Many other important properties are additive but not necessarily symmetric, namely, f(p) = Σ_{i∈[k]} f_i(p_i); examples include Kullback-Leibler (KL) divergence and ℓ1 distance to a fixed distribution q, as well as distances weighted by the observations x_i, e.g., Σ_{i∈[k]} x_i · |p_i − q_i|.
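To make the additive form concrete, here is a minimal sketch (our own illustration, not part of the paper) that evaluates three symmetric additive properties by summing a per-symbol function over the distribution's entries:

```python
import math

def additive_property(p, f):
    """Evaluate an additive property f(p) = sum_i f(p_i)."""
    return sum(f(pi) for pi in p)

k = 8
uniform = [1.0 / k] * k

# Shannon entropy: f(x) = x log(1/x), continuously extended by f(0) = 0.
entropy = additive_property(uniform, lambda x: -x * math.log(x) if x > 0 else 0.0)

# Distance to uniformity: f(x) = |x - 1/k|.
dist_unif = additive_property(uniform, lambda x: abs(x - 1.0 / k))

# Normalized support size: f(x) = 1[x > 0] / k.
support = additive_property(uniform, lambda x: (1.0 if x > 0 else 0.0) / k)

print(entropy, dist_unif, support)  # log 8 ≈ 2.079, then 0.0, then 1.0
```

Non-symmetric properties follow the same pattern, with a per-symbol function f_i that may depend on i (e.g., f_i(x) = x_i · |x − q_i|).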
A property estimator is a mapping f̂ : [k]* → R, where f̂(X^n) approximates f(p).

Property estimation has attracted significant attention due to its many applications in various disciplines: Shannon entropy is the principal information measure in numerous machine-learning [6] and neuroscience [13] algorithms; support size is essential in population [4] and vocabulary-size [29] estimation; support coverage arises in ecological [7], genomic [23], and database [14] studies; ℓ1 distance is useful in hypothesis testing [24] and classification [10]; KL divergence reflects the performance of investment [8], compression [9], and on-line learning [20].

For data containing sensitive information, we may need to design special property estimators that preserve individual privacy. Perhaps the most notable notion of privacy is differential privacy (DP). In the context of property estimation [11], we say that an estimator f̂ is α-differentially private if for any X^n and Y^n that differ in at most one symbol, Pr(f̂(X^n) ∈ S)/Pr(f̂(Y^n) ∈ S) ≤ e^α for any measurable set S ⊂ R. We consider designing property estimators that achieve small estimation error ε, with probability at least 2/3, while maintaining α-privacy.

Preprint. To appear at NeurIPS 2019.

The next section formalizes our discussion and presents some of the major results in the area.

1.1 Problem formulation and prior results

Property estimation  Let f be a property over Δ_k.
The (ε, δ)-sample complexity of an estimator f̂ for f is the smallest number of samples required to estimate f(p) with accuracy ε and confidence 1 − δ, for all distributions in Δ_k:

  C_f(f̂, ε, δ) := min{ n : Pr_{X^n ∼ p}( |f̂(X^n) − f(p)| > ε ) ≤ δ, ∀ p ∈ Δ_k }.

The (ε, δ)-sample complexity of estimating f is the lowest (ε, δ)-sample complexity of any estimator:

  C_f(ε, δ) := min_{f̂} C_f(f̂, ε, δ).

Ignoring constant factors and assuming k is large, Table 1 summarizes some of the previous results [2, 22, 26, 31, 33–35] for δ = 1/3. Following the formulation in [2, 31, 34], we normalize support size by k and replace Δ_k by the collection of distributions p ∈ Δ_k satisfying p_i ≥ 1/k, ∀ i ∈ [k]. For support coverage [2, 26], the expected number of distinct symbols in m samples, we normalize by the given parameter m and assume that m is sufficiently large.

Table 1: C_f(ε, 1/3) for some properties

  Property                      f_i(p_i)                  C_f(ε, 1/3)
  Shannon entropy               p_i log(1/p_i)            k/(ε log k)
  Power sum of order a < 1      p_i^a                     k^{1/a}/(ε^{1/a} log k)
  Distance to uniformity        |p_i − 1/k|               k/(ε² log k)
  Normalized support size       1{p_i > 0}/k              (k/log k) · log²(1/ε)
  Normalized support coverage   (1 − (1 − p_i)^m)/m       (m/log m) · log(1/ε)

Min-max MSE  A closely related characterization of an estimator's performance is the min-max mean squared error (MSE).
For any unknown distribution p ∈ Δ_k, the MSE of a property estimator f̂ in estimating f(p), using n samples from p, is

  R_n(f̂, f, p) := E_{X^n ∼ p} (f̂(X^n) − f(p))².

Since p is unknown, we consider the minimal possible worst-case MSE, or the min-max MSE, for any property estimator in estimating property f:

  R_n(f, Δ_k) := min_{f̂} max_{p ∈ Δ_k} R_n(f̂, f, p).

The property estimator f̂_m achieving the min-max MSE is the min-max estimator [21, 22, 34, 35]. Letting p_max := argmax_{p ∈ Δ_k} R_n(f̂_m, f, p) be the worst-case distribution for f̂_m, we can express the min-max MSE as the sum of two quantities: the min-max squared bias,

  Bias²_n(f̂_m) := (E_{X^n ∼ p_max}[f̂_m(X^n)] − f(p_max))²,

and the min-max variance,

  Var_n(f̂_m) := Var_{X^n ∼ p_max}(f̂_m(X^n)).

Private property estimation  Analogous to the non-private setting above, for an estimator f̂ of f, let its (ε, δ, α)-private sample complexity C_f(f̂, ε, δ, α) be the smallest number of samples that f̂ requires to estimate f(p) with accuracy ε and confidence 1 − δ, while maintaining α-privacy for all distributions p ∈ Δ_k.
The (ε, δ, α)-private sample complexity of estimating f is then

  C_f(ε, δ, α) := min_{f̂} C_f(f̂, ε, δ, α).

For Shannon entropy, normalized support size, and normalized support coverage, the work of [3] derived tight lower and upper bounds on C_f(ε, 1/3, α).

1.2 Existing methods

Two main types of methods have been introduced to estimate distribution properties: plug-in and approximation-empirical, which we briefly discuss below.

Plug-in  Major existing plug-in estimators work only for symmetric properties, and in general achieve neither the min-max MSEs nor the optimal (ε, δ)-sample complexities. More specifically, the linear-programming-based methods initiated by [12], and analyzed and extended in [31–33], achieve the optimal sample complexities only for distance to uniformity and entropy, and only for relatively large ε. The method essentially learns the moments of the underlying distribution from its samples, and finds a distribution whose (low-order) moments are consistent with these estimates. A locally refined version of the linear-programming estimator [15] achieves optimal sample complexities for entropy, power sum, and normalized support size, but requires polynomial time to compute. This version yields a bias guarantee similar to ours over symmetric properties, yet its variance guarantee is often worse.

Recently, the work of [2] showed that the profile maximum likelihood (PML) estimator [25], an estimator that finds a distribution maximizing the probability of observing the multiset of empirical frequencies, is sample-optimal for estimating entropy, distance to uniformity, and normalized support size and coverage.
After the initial submission of the current work, paper [18] showed that the PML approach and its near-linear-time computable variant [5] are sample-optimal for any property that is symmetric, additive, and appropriately Lipschitz, including the four properties just mentioned. This establishes the PML estimator as the first universally sample-optimal plug-in approach for estimating symmetric properties. In comparison, the current work provides a unified property-dependent approach that is sample-optimal for several symmetric and non-symmetric properties.

Approximation-empirical  The approximation-empirical method [21, 22, 28, 34, 35] identifies a non-smooth part of the underlying function f_i, replaces it by a polynomial f̃_i, and estimates the value of p_i by its empirical frequency p̂_i. Depending on whether p̂_i belongs to the non-smooth part or not, the method estimates f_i(p_i) by either the unbiased estimator of f̃_i(p_i) or the empirical plug-in estimator f_i(p̂_i).
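To see why the non-smooth (small-probability) region calls for polynomial treatment, the following sketch (our own illustration, with illustrative parameters) computes the exact expectation of the empirical plug-in estimator of the entropy summand x log(1/x) under a Binomial(n, p) count; by Jensen's inequality the plug-in systematically under-estimates near the non-smooth point x = 0:

```python
import math

def g(x):
    # Entropy summand x*log(1/x), continuously extended by g(0) = 0.
    return -x * math.log(x) if x > 0 else 0.0

def plug_in_expectation(n, p):
    """Exact E[g(N/n)] for N ~ Binomial(n, p), summed over the pmf."""
    return sum(math.comb(n, j) * p ** j * (1 - p) ** (n - j) * g(j / n)
               for j in range(n + 1))

n, p = 100, 0.02
bias = plug_in_expectation(n, p) - g(p)
# g is concave, so the bias is negative, roughly -(1-p)/(2n) per symbol;
# summed over many small-probability symbols this becomes substantial.
print(bias)
```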
However, due to its strong dependence on both the function's structure and the empirical estimator's performance, the method requires significantly different case-by-case modification and analysis, and may not work optimally for general additive properties. Specifically: 1) the efficacy of this method relies on the accuracy of the empirical plug-in estimator over the smooth segments, which must be verified individually for each property; 2) different functions often have non-smooth segments differing in number, location, and size; 3) combining the estimators for the non-smooth and smooth segments requires additional care: it sometimes needs knowledge of k, and sometimes even a third estimator to ensure a smooth transition. In addition, the method has not been shown to achieve optimal results for general Lipschitz properties, or for many of the other properties covered by the new method in this paper.

2 New methodology

The preceding discussion showed that no existing generic method efficiently estimates general additive properties. Motivated by recent advances in the field [2, 15, 16, 19], we derive the first generic method for constructing sample-efficient estimators for all sufficiently smooth additive properties. We start by approximating functions of an unknown Bernoulli success probability from its i.i.d. samples. For a wide class of real functions, we propose a piecewise-polynomial approximation technique, and show that it yields small-bias estimators that are exponentially concentrated around their expected estimates. This provides a different view of property estimation that allows us to simplify the proofs and broaden the range of the results. For details please see Section 4.

High-level idea  The idea behind this methodology is natural.
By the Chernoff bound for binomial random variables, the empirical count of a symbol in a given sample sequence will not differ from its mean by too much. Hence, based on the empirical frequency, we can roughly infer which "tiny piece" of [0, 1] the corresponding probability lies in. However, due to randomness, a symbol's empirical frequency may still differ from the true probability by a small quantity, and plugging it into the function incurs a certain amount of bias.

To correct this bias, we first replace the function by its low-degree polynomial approximation over that "tiny piece", and then compute an unbiased estimator of this polynomial. In other words, we use this polynomial as a proxy for the estimation task. We want the degree of the polynomial to be small, since this generally reduces the unbiased estimator's variance; we focus on approximating only a tiny piece of the function, because this reduces the polynomial's approximation error (bias). Given any additive property f(p) = Σ_{i∈[k]} f_i(p_i), we apply this technique to each real function f_i and use the corresponding sum to estimate f(p). Henceforth we use f̂* to denote this explicit estimator.

3 Implications and new results

Because of its conceptual simplicity, the methodology described in the last section has strong implications for estimating all sufficiently smooth additive properties, which we present as theorems. Theorem 5 in Section 5 is the root of all the following results, but it is more abstract and illustrating it requires much more effort.
Hence, for clarity, we begin by presenting several more concrete results.

Correct asymptotics  For most of the properties considered in this paper, even the naive empirical-frequency estimator is sample-optimal in the large-sample regime (termed the "simple regime" in [34]), where the number of samples n far exceeds the alphabet size k. The interesting regime, addressed in numerous recent publications [15, 16, 18, 19, 21, 31, 33, 35], is where n and k are comparable, e.g., differing by at most a logarithmic factor. In this range, n is sufficiently small that sophisticated techniques can help, yet not so small that nothing can be estimated. Since n and k are given, one can decide whether the naive estimator suffices or sophisticated estimators are needed. For most of the results presented here, the technical significance stems from their nontriviality in this large-alphabet regime. For this reason, we will also assume that log k ≍ log n throughout the paper.

Implication 1: Lipschitz property estimation

An additive property f(p) = Σ_i f_i(p_i) is L-Lipschitz if all functions f_i have Lipschitz constants uniformly bounded by L. Many important properties are Lipschitz, but except for a few isolated examples, it was not known until very recently [16, 19] that general Lipschitz properties can be estimated with sub-linearly many samples. In particular, the result in [16] implies a sample-complexity upper bound of O(L³k/(ε³ log k)). We improve this bound to C_f(ε, 1/3) ≲ L²k/(ε² log k):

Theorem 1.
If f is an L-Lipschitz property, then for any p ∈ Δ_k and X^n ∼ p,

  |E[f̂*(X^n)] − f(p)| ≲ Σ_{i∈[k]} L · √(p_i/(n log n)) ≤ L · √(k/(n log n)),

and

  Var(f̂*(X^n)) ≤ O(L²/n^{0.99}).

This theorem is essentially optimal: even for relatively simple Lipschitz properties, e.g., distance to uniformity (see Table 1 and [22]), the bias bound is optimal up to constant factors, and the variance bound is near-optimal and cannot be smaller than Ω(L²/n).

Implication 2: High-confidence property estimation

Surprisingly, the (ε, δ)-sample complexity has not been fully characterized even for some important properties. A commonly used approach to constructing an estimator with an (ε, δ)-guarantee is to choose an (ε, 1/3)-estimator and boost the learning confidence by taking the median of O(log(1/δ)) independent estimates. This well-known median trick yields the upper bound

  C_f(ε, δ) ≲ log(1/δ) · C_f(ε, 1/3).

For example, for Shannon entropy,

  C_f(ε, δ) ≲ log(1/δ) · ( k/(ε log k) + (log² k)/ε² ).

By contrast, we show that our estimator satisfies

  C_f(f̂*, ε, δ) ≲ k/(ε log k) + ( log(1/δ) · (1/ε²) )^{1.01}.

To see the optimality, Theorem 2 below shows that this upper bound is nearly tight. In the high-probability regime, namely when δ is small, the new upper bound obtained using our method can be significantly smaller than the one obtained from the median trick. Theorem 2 shows that this phenomenon also holds for other properties such as normalized support size and power sum.

Theorem 2.
Ignoring constant factors, Table 2 summarizes relatively tight lower and upper bounds on C_f(ε, δ, k) for different properties. In addition, all the upper bounds are achieved by the estimator f̂*.

Table 2: Bounds on C_f(ε, δ, k) for several properties

  Property                  Lower bound                                       Upper bound
  Shannon entropy           k/(ε log k) + log(1/δ) · (log² k)/ε²              k/(ε log k) + ( log(1/δ) · (1/ε²) )^{1+β}
  Power sum of order a      k^{1/a}/(ε^{1/a} log k) + log(1/δ) · k^{2−2a}/ε²  k^{1/a}/(ε^{1/a} log k) + [ ( log(1/δ) · (1/ε²) )^{1/(2a−1)} ]^{1+β}
  Normalized support size   (k/log k) · log²(1/ε)                             (k/log k) · log²(1/ε)

Remarks on Table 2: The parameter β can be any fixed absolute constant in (0, 1). The lower and upper bounds for power sum hold for a ∈ (1/2, 1). For normalized support size, we require δ > exp(−k^{1/3}) and ε ∈ (k^{−0.33}, k^{−0.01}). Note that the restriction on ε for support-size estimation is imposed only to yield a simple sample-complexity expression; it is not required by our algorithm, which is also sample-optimal for ε ≥ k^{−0.01}. It is possible that other algorithms achieve similar upper bounds; our main point is to demonstrate that a single, unified method has many desired attributes.

Implication 3: Optimal bias and near-optimal variance

The min-max MSEs of several important properties have been determined up to constant factors, yet there was no explicit and executable scheme for designing estimators achieving them. We show that f̂* achieves optimal squared bias and near-optimal variance in estimating a variety of properties.

Theorem 3.
Up to constant factors, the estimator f̂* achieves the optimal (min-max) squared bias and near-optimal variance for estimating Shannon entropy, normalized support size, distance to uniformity, and power sum, as well as ℓ1 distance to a fixed distribution.

Remarks on Theorem 3: For power sum, we consider the case where the order a is less than 1. For normalized support size, we again assume that the minimum nonzero probability of the underlying distribution is at least 1/k. As noted previously, we consider the parameter regime where n and k are comparable and k is large. In particular, besides the general assumption log k ≍ log n, we assume that n ≳ k^{1/a}/log k for power sum; n ≳ k/log k for entropy; and k log k ≳ n ≳ k/log k for normalized support size. The proof of the theorem follows naturally from Theorem 5.

Implication 4: Private property estimation

Privacy is of increasing concern in modern data science. We show that our estimates are exponentially concentrated around the underlying value. Using this attribute, we derive a near-optimal differentially private estimator f̂*_DP for several important properties by adding Laplace noise to f̂*.

As an example, for Shannon entropy and properly chosen algorithm hyper-parameters,

  C_f(f̂*_DP, ε, 1/3, α) ≲ k/(ε log k) + 1/ε^{2.01} + 1/(αε)^{1.01}.

This essentially recovers the sample-complexity upper bound in [3], which is nearly tight [3] for all parameters. Hence for large domains, one can achieve strong differential-privacy guarantees with only a marginal increase in sample size compared to the k/(ε log k) required for non-private estimation. An analogous argument shows that f̂*_DP is also near-optimal for the private estimation of support size and many others.
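The noise-addition step can be sketched with the standard Laplace mechanism; the sketch below is ours, with a hypothetical sensitivity value for illustration (the paper calibrates the noise scale to the concentration of f̂*, which we do not reproduce here):

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) via the inverse-CDF transform."""
    u = rng.random() - 0.5                  # uniform on [-0.5, 0.5)
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def privatize(estimate, sensitivity, alpha, rng):
    """alpha-differentially private release of a real-valued estimate:
    add Laplace noise with scale = sensitivity / alpha."""
    return estimate + laplace_noise(sensitivity / alpha, rng)

rng = random.Random(0)
# Illustrative numbers: an entropy estimate with hypothetical sensitivity 0.01.
private_estimate = privatize(2.079, sensitivity=0.01, alpha=1.0, rng=rng)
print(private_estimate)
```

Because f̂* is exponentially concentrated, its effective sensitivity is small, which is what keeps the added noise, and hence the privacy cost in samples, marginal.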
Section 2.3 of the supplementary material provides more detail, as well as unified bounds on the differentially-private sample complexities of general additive properties.

Outline  The rest of the paper is organized as follows. In Section 4 we construct an estimator that approximates the function value of an unknown Bernoulli success probability, and characterize its performance in Theorem 4. In Section 5 we apply this function estimator to estimating properties of distributions and provide analogous guarantees. Section 6 concludes the paper and presents possible future directions. We postpone all proof details to the supplementary material.

4 Estimating functions of Bernoulli probabilities

4.1 Problem formulation

We begin with a simple problem that involves just a single unknown parameter. Let g be a continuous real function over the unit interval whose absolute value is uniformly bounded by an absolute constant. Given i.i.d. samples X^n := X_1, . . . , X_n from a Bernoulli distribution with unknown success probability p, we would like to estimate the function value g(p). A function estimator is a mapping ĝ : {0, 1}* → R. We characterize the performance of the estimator ĝ(X^n) in estimating g(p) by its absolute bias

  Bias(ĝ) := |E[ĝ(X^n)] − g(p)|,

and its deviation probability

  P(ε) := Pr(|ĝ(X^n) − E[ĝ(X^n)]| > ε),

which controls the variance and provides additional information useful for property estimation. Our objective is to find an estimator that has near-optimal small bias and Gaussian-type deviation probability exp(−n^{Θ(1)}ε²) for all possible values of p ∈ [0, 1]. As could be expected, our results are closely related to the smoothness of the function g.

4.2 Smoothness of real functions

Effective derivative  Let I be an interval and h ∈ (0, |I|) a step size, where |I| denotes the interval's length.
The effective derivative of g over I is the Lipschitz-type ratio

  L_g(h, I) := sup_{x,y ∈ I, |y−x| ≥ h} |g(y) − g(x)| / |y − x|.

This simple smoothness measure does not fully capture the smoothness of g. For example, g could be a zigzag function that has a high effective derivative locally but overall fluctuates within only a very small range, and hence is close to a smooth function in maximum deviation. We therefore define a second smoothness measure as the maximum deviation between g and a fixed-degree polynomial. Besides being smooth and having derivatives of all orders, polynomials, by the Weierstrass approximation theorem, can uniformly approximate any continuous g.

Min-max deviation  Let P_d be the collection of polynomials of degree at most d. The min-max deviation in approximating g over an interval I by a polynomial in P_d is

  D_g(d, I) := min_{q ∈ P_d} max_{x ∈ I} |g(x) − q(x)|.

The minimizing polynomial is the degree-d min-max polynomial approximation of g over I. For simplicity we abbreviate L_g(h) := L_g(h, [0, 1]) and D_g(d) := D_g(d, [0, 1]).

4.3 Estimator construction

For simplicity, assume that the sampling parameter is an even number 2n. Given i.i.d. samples X^{2n} ∼ Bern(p), let N_i denote the number of times symbol i ∈ {0, 1} appears in X^{2n}. We first describe a simplified version of our estimator and provide a non-rigorous analysis relating its performance to the smoothness quantities just defined. The actual, more involved estimator and a rigorous performance analysis are presented in Section 1 of the supplementary material.

High-level description  At a high level, the empirical estimator estimates g(p) by g(N_1/(2n)) and often incurs a large bias. To address this, we first partition the unit interval into roughly √n sub-intervals.
Then, we split X^{2n} into two halves of equal length n, and use the empirical probability of symbol 1 in the first half to identify a sub-interval I and its two neighbors in the partition, such that p is contained in one of them with high confidence. Finally, we replace g by a low-degree min-max polynomial g̃ over I and its four neighbors, and estimate g(p) from the second half of the sample sequence by applying a near-unbiased estimator of g̃(p).

Step 1: Partitioning the unit interval

Let α[a, b] denote the interval [αa, αb]. For an absolute positive constant c, define c_n := c (log n)/n and a sequence of increasing-length intervals

  I_j := c_n [(j − 1)², j²],   j ≥ 1.

Observe that the first M_n := 1/√c_n intervals partition the unit interval [0, 1]. For any x ≥ 0, we let j_x denote the index j such that x ∈ I_j. This unit-interval partition is directly motivated by the Chernoff bounds. A very similar construction appears in [1], and the exact one appears in [15, 17].

Step 2: Splitting the sample sequence and locating the probability value

Split the sample sequence X^{2n} into two equal halves, and let p̂_1 and p̂_2 denote the empirical probabilities of symbol 1 in the first and second half, respectively. By the Chernoff bound for binomial random variables, for sufficiently large c, the intervals I_1, . . .
, I_{M_n} form essentially the finest partition of [0, 1] with the following property: letting I*_j := ∪_{j'=j−1}^{j+1} I_{j'} and I**_j := ∪_{j'=j−2}^{j+2} I_{j'}, for all underlying p ∉ I*_j,

  Pr(p̂_1 ∈ I_j) ≤ n^{−3},

and for all underlying p and all j,

  Pr(p̂_1 ∈ I_j and p̂_2 ∉ I**_j) ≤ n^{−3}.

It follows that if p̂_1 ∈ I_j, then with high confidence we may assume that p ∈ I*_j.

Step 3: Min-max polynomial estimation

Let λ be a universal constant in (0, 1/4) that balances the bias and variance of our estimator. Given the sampling parameter n, define

  d_n := max{ d ∈ N : d · 2^{4.5d+2} ≤ n^λ }.

For each j, let g̃_j be the degree-d_n min-max polynomial of g over I**_j, i.e., the degree-d_n polynomial minimizing the largest absolute deviation from g over I**_j. For each interval I_j we create a piecewise polynomial g̃*_j that approximates g over the entire unit interval: it consists of g̃_j over I**_j, and of g̃_{j'} over I_{j'} for j' ∉ [j − 2, j + 2].

Finally, to estimate g(p), let j be the index such that p̂_1 ∈ I_j, and approximate g̃*_j(p) by plugging in unbiased estimators of p^t constructed from p̂_2 for all powers t ≤ d_n. Note that a standard unbiased estimator for p^t is ∏_{i=0}^{t−1} [(p̂_2 − i/n)/(1 − i/n)], and the rest follows from the linearity of expectation.

Computational complexity  A well-known approximation-theory result states that the degree-d truncated Chebyshev series (or polynomial) of a function g often closely approximates the degree-d min-max polynomial of g. The Remez algorithm [27, 30] is a popular method for finding such Chebyshev-type approximations of degree d, and is often very efficient in practice.
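The monomial estimator in Step 3 is exactly unbiased: with N = n·p̂_2, the product ∏_{i=0}^{t−1}(p̂_2 − i/n)/(1 − i/n) equals the falling-factorial ratio N(N−1)···(N−t+1)/(n(n−1)···(n−t+1)), whose expectation under N ∼ Binomial(n, p) is p^t. A quick exact check (our own sketch):

```python
import math

def monomial_estimate(N, n, t):
    """Unbiased estimate of p**t from a Binomial(n, p) count N (t <= n)."""
    est = 1.0
    for i in range(t):
        est *= (N - i) / (n - i)
    return est

def exact_expectation(n, p, t):
    """E[monomial_estimate(N, n, t)] computed exactly over the Binomial pmf."""
    return sum(math.comb(n, j) * p ** j * (1 - p) ** (n - j) * monomial_estimate(j, n, t)
               for j in range(n + 1))

n, p, t = 30, 0.3, 3
print(exact_expectation(n, p, t), p ** t)  # both equal 0.027 up to float error
```

Applying this estimator to each monomial of g̃*_j and summing, by linearity of expectation, gives a near-unbiased estimate of g̃*_j(p).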
Under certain conditions on the function to approximate, running the algorithm for log t iterations leads to an error of O(exp(−Θ(t))). Indeed, many state-of-the-art property estimators, e.g., [16, 21, 22, 34, 35], use the Remez algorithm to approximate the min-max polynomials, and have implementations that are near-linear-time computable.

4.4 Final estimator and its characterization

The estimator  Consolidating the above results, we estimate g(p) by the estimator

  ĝ(p̂_1, p̂_2) := Σ_j ĝ_j(p̂_2) · 1{p̂_1 ∈ I_j}.

The exact form and construction of this estimator appear in Section 1.2 of the supplementary material.

Characterization  The theorem below characterizes the bias, variance, and mean-deviation probability of the estimator. We sketch its proof here and leave the details to the supplementary material. By the reasoning in the last section, for all p ∈ I*_j, the absolute bias of the resulting estimator ĝ_j(p̂_2) in estimating g(p) is essentially upper bounded by D_g(d_n, I**_j). Normalizing by the input's precision 1/n, we define the (normalized) local min-max deviation and the global min-max deviation, respectively, as

  D*_g(2n, x) := n · max_{j' ∈ [j_x−1, j_x+1]} D_g(d_n, I**_{j'}),
  D*_g(2n) := max_{x' ∈ [0,1]} D*_g(2n, x').

Hence the bias of ĝ(p̂_1, p̂_2) in estimating g(p) is essentially upper bounded by D*_g(2n, p)/n ≤ D*_g(2n)/n.
A similar argument yields a variance bound on ĝ(p̂_1, p̂_2), in which D*_g(2n, p) is replaced by the local effective derivative

  L*_g(2n, p) := max_{j' ∈ [j_p−1, j_p+1]} L_g(1/n, I**_{j'}).

Analogously, define L*_g(2n) := max_{x ∈ [0,1]} L*_g(2n, x) as the global effective derivative. The mean-deviation probability of this estimator is characterized by

  S*_g(2n) := L*_g(2n) + D*_g(2n).

Specifically, changing one sample in X^{2n} changes the value of ĝ(p̂_1, p̂_2) by at most Θ(S*_g(n) · n^{λ−1}). Therefore, by McDiarmid's inequality, for any error parameter ε,

  Pr(|ĝ(p̂_1, p̂_2) − E[ĝ(p̂_1, p̂_2)]| > ε) ≲ exp(−Θ( ε²n^{1−2λ} / S*_g(2n)² )).

Theorem 4. For any bounded function g over [0, 1], X^n ∼ Bern(p), and error parameter ε > 0,

  |E[ĝ(p̂_1, p̂_2)] − g(p)| ≲ p/n³ + D*_g(n, p)/n,

  Var(ĝ(p̂_1, p̂_2)) ≲ p/n⁵ + (L*_g(n, p))² · p / n^{1−4λ},

and

  Pr(|ĝ(p̂_1, p̂_2) − E[ĝ(p̂_1, p̂_2)]| > ε) ≲ exp(−Θ( ε²n^{1−2λ} / S*_g(n)² )).

Next we use this theorem to derive tight bounds for estimating general additive properties.

5 A unified piecewise-polynomial approach to property estimation

Let f be an arbitrary additive property over Δ_k such that |f_i(x)| is uniformly bounded by an absolute constant for all i ∈ [k], and let L*_·(·), D*_·(·), and S*_·(·) be the smoothness quantities defined in Sections 4.3 and 4.4. Let X^n be an i.i.d.
sample sequence from an unknown distribution p⃗ ∈ Δk. Splitting X^n into two sub-sample sequences of equal length, we denote by p̂_{i,1} and p̂_{i,2} the empirical probabilities of symbol i ∈ [k] in the first and second sub-sample sequences, respectively.

Applying the technique presented in Section 4, we can estimate the additive property f(p⃗) = Σ_{i∈[k]} f_i(p_i) by the estimator

\[ \hat f^*(X^n) := \sum_{i \in [k]} \hat f_i(\hat p_{i,1}, \hat p_{i,2}). \]

Theorem 4 can then be used to show that f̂* performs well for all sufficiently smooth additive properties:

Theorem 5. For any ε > 0, p⃗ ∈ Δk, and X^n ∼ p⃗,

\[ \left| \mathbb{E}\!\left[ \hat f^*(X^n) \right] - f(\vec p\,) \right| \lesssim \frac{1}{n^3} + \frac{1}{n} \sum_{i \in [k]} D^*_{f_i}(n, p_i), \]

\[ \mathrm{Var}\!\left( \hat f^*(X^n) \right) \lesssim \frac{1}{n^5} + \frac{1}{n^{1 - 4\lambda}} \sum_{i \in [k]} \left( L^*_{f_i}(n, p_i) \right)^2 \cdot p_i, \]

and

\[ \Pr\!\left( \left| \hat f^*(X^n) - \mathbb{E}[\hat f^*(X^n)] \right| > \varepsilon \right) \lesssim \exp\!\left( -\Theta\!\left( \frac{\varepsilon^2 n^{1 - 2\lambda}}{\max_{i \in [k]} \left( S^*_{f_i}(n) \right)^2} \right) \right). \]

Discussions  While the significance of the theorem may not be immediately apparent, note that the three equations characterize the estimator's bias, variance, and higher-order moments in terms of the local min-max deviation D*_{f_i}(n, p_i), the local effective derivative L*_{f_i}(n, p_i), and the sum of the maximum possible values of the two, S*_{f_i}(n), respectively. The smoother the function f_i is, the smaller D*_{f_i}(·) and L*_{f_i}(·) will be.
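The split-sample structure of f̂* — the first-half empirical probability p̂_{i,1} selects which local estimator applies to symbol i, while the second-half estimate p̂_{i,2} is plugged into it — can be sketched as follows. The threshold and the two stand-in local estimators below are hypothetical placeholders chosen for illustration; the actual construction plugs p̂_{i,2} into the piecewise min-max polynomials of Section 4:

```python
import numpy as np

def split_sample_estimate(samples, k, f_small, f_large, threshold):
    """Structural sketch of the additive estimator f̂*: split the sample
    into halves, form per-symbol empirical probabilities from each half,
    and let the first-half estimate pick the local estimator that the
    second-half estimate is plugged into."""
    n = len(samples) // 2
    first, second = samples[:n], samples[n:2 * n]
    p1 = np.bincount(first, minlength=k) / n
    p2 = np.bincount(second, minlength=k) / n
    return sum(f_small(p2[i], n) if p1[i] < threshold else f_large(p2[i], n)
               for i in range(k))

# Illustration with the power sum f_i(x) = x^2 (collision probability).
# Below the threshold we use a bias-corrected plug-in, which is exactly
# unbiased for p^2 when q is an empirical frequency over n i.i.d. draws;
# above it, the plain plug-in. Both are placeholders, not the paper's
# min-max polynomial estimators.
corrected = lambda q, n: q * q - q * (1 - q) / (n - 1)
plug_in = lambda q, n: q * q

rng = np.random.default_rng(0)
k, n = 200, 20000
samples = rng.integers(0, k, size=2 * n)  # uniform distribution over [k]
est = split_sample_estimate(samples, k, corrected, plug_in, threshold=0.02)
# est is close to the true collision probability sum_i p_i^2 = 1/k
```

The point of the sketch is the regime selection: the two halves are independent, so conditioning on the interval chosen by p̂_{i,1} does not bias the evaluation at p̂_{i,2}.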
In particular, for simple smooth functions, the values of D*, L*, and S* can easily be shown to be small, implying that f̂* is nearly optimal under all three criteria. For example, for Shannon entropy, where f_i(p_i) = −p_i log p_i for all i, we can show that D*_{f_i}(n, p_i) and L*_{f_i}(n, p_i) are at most O(1/log n) and O(log n), respectively. Hence, the bias and variance bounds in Theorem 5 become k/(n log n) and (log n)/n^{1−4λ}, and the tail bound simplifies to exp(−Θ(ε^2 n^{1−2λ}/log^2 n)), where λ is an arbitrary absolute constant in (0, 1/4), e.g., λ = 0.01. By Theorem 2 and the results in [21, 35], all these bounds are optimal.

Computational complexity  We briefly illustrate how our estimator can be computed in time near-linear in the sample size n. As stated in Section 4.3, over each of the O(√(n/log n)) intervals we constructed, we find the min-max polynomial of the underlying function over that particular interval; for many properties an approximation suffices, and the computation takes only poly(log n) time using the Remez algorithm, as previously noted. Though our construction uses O(√(n/log n)) such polynomials, for each symbol i appearing in the sample sequence X^n we need to compute just one such polynomial to estimate f_i(p_i). The number of distinct symbols appearing in X^n is trivially at most n, hence the total time complexity is O(n · poly(log n)), which is near-linear in n. In fact, the computation of all O(k√(n/log n)) possible polynomials can even be performed off-line (without samples), and does not affect our estimator's time complexity.

6 Conclusion and future directions

We introduced a piecewise min-max polynomial methodology for approximating additive distribution properties.
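The entropy bias that these bounds control can be seen numerically. As a simple point of comparison (a classical stand-in, not the paper's estimator), the Miller–Madow correction removes only the leading (k − 1)/(2n) bias of the empirical plug-in:

```python
import numpy as np

def plug_in_entropy(counts, n):
    """Empirical (plug-in) entropy in nats; biased downward by roughly
    (k - 1) / (2n) over a k-symbol support."""
    q = counts[counts > 0] / n
    return -np.sum(q * np.log(q))

def miller_madow_entropy(counts, n):
    """Plug-in entropy plus the classical first-order bias correction
    (observed support size - 1) / (2n)."""
    return plug_in_entropy(counts, n) + (np.count_nonzero(counts) - 1) / (2 * n)

rng = np.random.default_rng(1)
k, n = 500, 5000
counts = np.bincount(rng.integers(0, k, size=n), minlength=k)
true_h = np.log(k)  # entropy of the uniform distribution over [k]
# The plug-in estimate falls short of log(k) by roughly (k - 1) / (2n);
# the corrected estimate recovers most of that gap.
```

The correction only cancels the first-order bias term; the piecewise-polynomial estimator goes further, which is what yields the k/(n log n) bias bound stated above.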
This method yields the first generic approach to constructing sample- and time-efficient estimators for all sufficiently smooth properties. This approach provides the first: 1) sublinear-sample estimators for general Lipschitz properties; 2) general near-optimal private estimators; 3) unified min-max-MSE-achieving estimators for six important properties; 4) near-optimal high-confidence estimators. Unlike previous works, our method covers both symmetric and non-symmetric, differentiable and non-differentiable properties, under both private and non-private settings.

Two natural extensions are of interest: 1) generalizing the results to properties involving multiple unknown distributions, such as distributional divergences; 2) extending the techniques to derive a similarly unified approach for the closely related field of distribution property testing.

Besides the results we established for piecewise-polynomial estimators under the min-max estimation framework, the works of [16, 19] recently proposed and studied a different formulation, competitive property estimation, which aims to emulate the instance-by-instance performance of the widely used empirical plug-in estimator using a logarithmically smaller sample size. It would also be meaningful to investigate the performance of our technique under this new formulation.

Acknowledgments

We are grateful to the National Science Foundation (NSF) for supporting this work through grants CIF-1564355 and CIF-1619448.

References

[1] J. Acharya, A. Jafarpour, A. Orlitsky, and A. T. Suresh. Optimal probability estimation with applications to prediction and classification. In Conference on Learning Theory, pages 764–796, 2013.

[2] J. Acharya, H. Das, A. Orlitsky, and A. T. Suresh. A unified maximum likelihood approach for estimating symmetric properties of discrete distributions.
International Conference on Machine Learning, pages 11–21, 2017.

[3] J. Acharya, G. Kamath, Z. Sun, and H. Zhang. INSPECTRE: Privately estimating the unseen. arXiv preprint arXiv:1803.00008, 2018.

[4] A. Chao. Nonparametric estimation of the number of classes in a population. Scandinavian Journal of Statistics, pages 265–270, 1984.

[5] M. S. Charikar, K. Shiragur, and A. Sidford. Efficient profile maximum likelihood for universal symmetric property estimation. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 780–791. ACM, 2019.

[6] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968.

[7] R. K. Colwell, A. Chao, N. J. Gotelli, S. Y. Lin, C. X. Mao, R. L. Chazdon, and J. T. Longino. Models and estimators linking individual-based and sample-based rarefaction, extrapolation and comparison of assemblages. Journal of Plant Ecology, 5(1):3–21, 2012.

[8] T. M. Cover. Universal portfolios. The Kelly Capital Growth Investment Criterion: Theory and Practice, pages 181–209, 2011.

[9] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.

[10] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media, 2013.

[11] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer, Berlin, Heidelberg, 2006.

[12] B. Efron and R. Thisted. Estimating the number of unseen species: How many words did Shakespeare know? Biometrika, 63(3):435–447, 1976.

[13] W. Gerstner and W. M. Kistler. Spiking Neuron Models: Single Neurons, Populations, Plasticity. Cambridge University Press, 2002.

[14] P. J. Haas, J. F. Naughton, S.
Seshadri, and L. Stokes. Sampling-based estimation of the number of distinct values of an attribute. VLDB, 95:311–322, 1995.

[15] Y. Han, J. Jiao, and T. Weissman. Local moment matching: A unified methodology for symmetric functional estimation and distribution estimation under Wasserstein distance. arXiv preprint arXiv:1802.08405, 2018.

[16] Y. Hao and A. Orlitsky. Data amplification: Instance-optimal property estimation. arXiv preprint arXiv:1903.01432, 2019.

[17] Y. Hao and A. Orlitsky. Doubly-competitive distribution estimation. In International Conference on Machine Learning, pages 2614–2623, 2019.

[18] Y. Hao and A. Orlitsky. The broad optimality of profile maximum likelihood. arXiv preprint arXiv:1906.03794, 2019.

[19] Y. Hao, A. Orlitsky, A. T. Suresh, and Y. Wu. Data amplification: A unified and competitive approach to property estimation. In Advances in Neural Information Processing Systems, pages 8848–8857, 2018.

[20] M. Hoffman, F. R. Bach, and D. M. Blei. Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, pages 856–864, 2010.

[21] J. Jiao, K. Venkat, Y. Han, and T. Weissman. Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory, 61(5):2835–2885, 2015.

[22] J. Jiao, Y. Han, and T. Weissman. Minimax estimation of the ℓ1 distance. IEEE Transactions on Information Theory, 2018.

[23] I. Kroes, P. W. Lepp, and D. A. Relman. Bacterial diversity within the human subgingival crevice. Proceedings of the National Academy of Sciences, 96(25):14547–14552, 1999.

[24] E. L. Lehmann and J. P. Romano. Testing Statistical Hypotheses. Springer Science & Business Media, 2006.

[25] A. Orlitsky, N. P. Santhanam, K. Viswanathan, and J. Zhang. On modeling profiles instead of values. In UAI '04, pages 426–435, 2004.

[26] A.
Orlitsky, A. T. Suresh, and Y. Wu. Optimal prediction of the number of unseen species. Proceedings of the National Academy of Sciences, 113(47):13283–13288, 2016.

[27] R. Pachón and L. N. Trefethen. Barycentric-Remez algorithms for best polynomial approximation in the Chebfun system. BIT Numerical Mathematics, 49(4):721, 2009.

[28] L. Paninski. Estimation of entropy and mutual information. Neural Computation, 15:1191–1254, 2003.

[29] R. Thisted and B. Efron. Did Shakespeare write a newly-discovered poem? Biometrika, 74:445–455, 1987.

[30] L. N. Trefethen. Approximation Theory and Approximation Practice, volume 128. SIAM, 2013.

[31] G. Valiant and P. Valiant. Estimating the unseen: An n/log(n)-sample estimator for entropy, support size, and other distribution properties, with a proof of optimality via two new central limit theorems. In STOC '11: Proceedings of the 43rd Annual ACM Symposium on Theory of Computing, 2011.

[32] G. Valiant and P. Valiant. The power of linear estimators. In 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science (FOCS), pages 403–412, 2011.

[33] G. Valiant and P. Valiant. Estimating the unseen: Improved estimators for entropy and other properties. In Advances in Neural Information Processing Systems, pages 2157–2165, 2013.

[34] Y. Wu and P. Yang. Chebyshev polynomials, moment matching, and optimal estimation of the unseen. arXiv preprint arXiv:1504.01227, 2015.

[35] Y. Wu and P. Yang. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory, 62(6):3702–3720, 2016.