{"title": "The Nearest Neighbor Information Estimator is Adaptively Near Minimax Rate-Optimal", "book": "Advances in Neural Information Processing Systems", "page_first": 3156, "page_last": 3167, "abstract": "We analyze the Kozachenko\u2013Leonenko (KL) fixed k-nearest neighbor estimator for the differential entropy. We obtain the first uniform upper bound on its performance for any fixed k over H\\\"{o}lder balls on a torus without assuming any conditions on how close the density could be from zero. Accompanying a recent minimax lower bound over the H\\\"{o}lder ball, we show that the KL estimator for any fixed k is achieving the minimax rates up to logarithmic factors without cognizance of the smoothness parameter s of the H\\\"{o}lder ball for $s \\in (0,2]$ and arbitrary dimension d, rendering it the first estimator that provably satisfies this property.", "full_text": "The Nearest Neighbor Information Estimator is Adaptively Near Minimax Rate-Optimal\n\nJiantao Jiao\n\nDepartment of Electrical Engineering and Computer Sciences\n\nUniversity of California, Berkeley\n\njiantao@berkeley.edu\n\nWeihao Gao\n\nDepartment of ECE\n\nCoordinated Science Laboratory\n\nUniversity of Illinois at Urbana-Champaign\n\nwgao9@illinois.edu\n\nYanjun Han\n\nDepartment of Electrical Engineering\n\nStanford University\n\nyjhan@stanford.edu\n\nAbstract\n\nWe analyze the Kozachenko\u2013Leonenko (KL) fixed k-nearest neighbor estimator\nfor the differential entropy. We obtain the first uniform upper bound on its performance\nfor any fixed k over H\u00f6lder balls on a torus without assuming any conditions\non how close the density could be from zero. 
Accompanying a recent minimax\nlower bound over the H\u00f6lder ball, we show that the KL estimator for any fixed\nk is achieving the minimax rates up to logarithmic factors without cognizance of\nthe smoothness parameter s of the H\u00f6lder ball for s \u2208 (0, 2] and arbitrary dimension\nd, rendering it the first estimator that provably satisfies this property.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n1 Introduction\n\nInformation theoretic measures such as entropy, Kullback-Leibler divergence and mutual information\nquantify the amount of information among random variables. They have many applications in\nmodern machine learning tasks, such as classification [48], clustering [46, 58, 10, 41] and feature\nselection [1, 17]. Information theoretic measures and their variants can also be applied in several\ndata science domains such as causal inference [18], sociology [49] and computational biology [36].\nEstimating information theoretic measures from data is a crucial sub-routine in the aforementioned\napplications and has attracted much interest in the statistics community. In this paper, we study the problem\nof estimating Shannon differential entropy, which is the basis of estimating other information\ntheoretic measures for continuous random variables.\nSuppose we observe n independent identically distributed random vectors X = {X1, . . . , Xn}\ndrawn from density function f where $X_i \\in \\mathbb{R}^d$. We consider the problem of estimating the differential\nentropy\n\n$$h(f) = -\\int f(x) \\ln f(x)\\,dx, \\tag{1}$$\n\nfrom the empirical observations X. The fundamental limit of estimating the differential entropy is\ngiven by the minimax risk\n\n$$\\inf_{\\hat{h}} \\sup_{f \\in \\mathcal{F}} \\left( \\mathbb{E}\\big(\\hat{h}(X) - h(f)\\big)^2 \\right)^{1/2}, \\tag{2}$$\n\nwhere the infimum is taken over all estimators $\\hat{h}$ that are functions of the empirical data X. 
Here $\\mathcal{F}$\ndenotes a (nonparametric) class of density functions.\nThe problem of differential entropy estimation has been investigated extensively in the literature.\nAs discussed in [2], there exist two main approaches, where one is based on kernel density estimators\n[30], and the other is based on the nearest neighbor methods [56, 53, 52, 11, 3], which were\npioneered by the work of [33].\nThe problem of differential entropy estimation lies in the general problem of estimating nonparametric\nfunctionals. Unlike the parametric counterparts, the problem of estimating nonparametric\nfunctionals is challenging even for smooth functionals. Initial efforts have focused on inference of\nlinear, quadratic, and cubic functionals in Gaussian white noise and density models and have laid the\nfoundation for the ensuing research. We do not attempt to survey the extensive literature in this area,\nbut instead refer the interested reader to, e.g., [24, 5, 12, 16, 6, 32, 37, 47, 8, 9, 54] and the references\ntherein. For non-smooth functionals such as entropy, there is some recent progress [38, 26, 27]\non designing theoretically minimax optimal estimators, but these estimators typically require the\nknowledge of the smoothness parameters, and the practical performances of these estimators are not\nyet known.\nThe k-nearest neighbor differential entropy estimator, or Kozachenko-Leonenko (KL) estimator, is\ncomputed in the following way. Let $R_{i,k}$ be the distance between $X_i$ and its k-th nearest neighbor\namong $\\{X_1, \\ldots, X_{i-1}, X_{i+1}, \\ldots, X_n\\}$. Precisely, $R_{i,k}$ equals the k-th smallest number in the list\n$\\{\\|X_i - X_j\\| : j \\neq i, j \\in [n]\\}$, where $[n] = \\{1, 2, \\ldots, n\\}$. Let $B(x, \\rho)$ denote the closed $\\ell_2$ ball\ncentered at x of radius $\\rho$ and $\\lambda$ be the Lebesgue measure on $\\mathbb{R}^d$. 
The KL differential entropy\nestimator is defined as\n\n$$\\hat{h}_{n,k}(X) = \\ln k - \\psi(k) + \\frac{1}{n} \\sum_{i=1}^{n} \\ln\\left( \\frac{n}{k} \\lambda(B(X_i, R_{i,k})) \\right), \\tag{3}$$\n\nwhere $\\psi(x)$ is the digamma function with $\\psi(1) = -\\gamma$, and $\\gamma = -\\int_0^\\infty e^{-t} \\ln t\\,dt = 0.5772156\\ldots$ is the\nEuler\u2013Mascheroni constant.\nThere exists an intuitive explanation behind the construction of the KL differential entropy estimator.\nWriting informally, we have\n\n$$h(f) = \\mathbb{E}_f[-\\ln f(X)] \\approx \\frac{1}{n} \\sum_{i=1}^{n} -\\ln f(X_i) \\approx \\frac{1}{n} \\sum_{i=1}^{n} -\\ln \\hat{f}(X_i), \\tag{4}$$\n\nwhere the first approximation is based on the law of large numbers, and in the second approximation\nwe have replaced f by a nearest neighbor density estimator $\\hat{f}$. The nearest neighbor density\nestimator $\\hat{f}(X_i)$ follows from the \u201cintuition\u201d\u00b9 that\n\n$$\\hat{f}(X_i) \\lambda(B(X_i, R_{i,k})) \\approx \\frac{k}{n}. \\tag{5}$$\n\nHere the final additive bias correction term ln k \u2212 \u03c8(k) follows from a detailed analysis of the bias\nof the KL estimator, which will become apparent later.\nWe focus on the regime where k is fixed: in other words, it does not grow as the number of samples\nn increases. The fixed k version of the KL estimator is widely applied in practice and enjoys smaller\ncomputational complexity, see [52].\nThere exists extensive literature on the analysis of the KL differential entropy estimator; we\nrefer to [4] for a recent survey. One of the major difficulties in analyzing the KL estimator is that\nthe nearest neighbor density estimator exhibits a huge bias when the density is small. Indeed, it was\nshown in [42] that the bias of the nearest neighbor density estimator in fact does not vanish even\nwhen n \u2192 \u221e and deteriorates as f (x) gets close to zero. In the literature, a large body of work\nassumes that the density is uniformly bounded away from zero [23, 29, 57, 30, 53], while others put\nvarious assumptions quantifying on average how close the density is to zero [25, 40, 56, 14, 20, 52,\n11]. In this paper, we focus on removing assumptions on how close the density is to zero.\n\n\u00b9 Precisely, we have $\\int_{B(X_i, R_{i,k})} f(u)\\,du \\sim \\mathrm{Beta}(k, n-k)$ [4, Chap. 1.2]. A Beta(k, n \u2212 k) distributed\nrandom variable has mean k/n.\n\n1.1 Main Contribution\nLet $\\mathcal{H}_d^s(L; [0,1]^d)$ be the H\u00f6lder ball in the unit cube (torus) (formally defined later in Definition 2\nin Appendix A), where s \u2208 (0, 2] is the H\u00f6lder smoothness parameter. Then, the worst case risk of\nthe fixed k-nearest neighbor differential entropy estimator over $\\mathcal{H}_d^s(L; [0,1]^d)$ is controlled by the\nfollowing theorem.\nTheorem 1 Let X = {X1, . . . , Xn} be i.i.d. samples from density function f. Then, for 0 < s \u2264 2,\nthe fixed k-nearest neighbor KL differential entropy estimator $\\hat{h}_{n,k}$ in (3) satisfies\n\n$$\\sup_{f \\in \\mathcal{H}_d^s(L; [0,1]^d)} \\left( \\mathbb{E}_f \\big( \\hat{h}_{n,k}(X) - h(f) \\big)^2 \\right)^{1/2} \\le C \\left( n^{-\\frac{s}{s+d}} \\ln(n+1) + n^{-\\frac{1}{2}} \\right), \\tag{6}$$\n\nwhere C is a constant that depends only on s, L, k and d.\n\nThe KL estimator is in fact nearly minimax up to logarithmic factors, as shown in the following\nresult from [26].\nTheorem 2 [26] Let X = {X1, . . . , Xn} be i.i.d. samples from density function f. 
Then, there\nexists a constant $L_0$ depending on s, d only such that for all $L \\ge L_0$, s > 0,\n\n$$\\inf_{\\hat{h}} \\sup_{f \\in \\mathcal{H}_d^s(L; [0,1]^d)} \\left( \\mathbb{E}_f \\big( \\hat{h}(X) - h(f) \\big)^2 \\right)^{1/2} \\ge c \\left( n^{-\\frac{s}{s+d}} (\\ln(n+1))^{-\\frac{s+2d}{s+d}} + n^{-\\frac{1}{2}} \\right), \\tag{7}$$\n\nwhere c is a constant that depends only on s, L and d.\nRemark 1 We emphasize that one cannot remove the condition $L \\ge L_0$ in Theorem 2. Indeed, if the\nH\u00f6lder ball has a too small width, then the density itself is bounded away from zero, which makes\nthe differential entropy a smooth functional, with minimax rates $n^{-\\frac{4s}{4s+d}} + n^{-1/2}$ [51, 50, 43].\nTheorems 1 and 2 imply that for any fixed k, the KL estimator achieves the minimax rates up to\nlogarithmic factors without knowing s for all s \u2208 (0, 2], which implies that it is near minimax\nrate-optimal (within logarithmic factors) when the dimension d \u2264 2. We cannot expect the vanilla\nversion of the KL estimator to adapt to higher order of smoothness since the nearest neighbor density\nestimator can be viewed as a variable width kernel density estimator with the box kernel, and it is\nwell known in the literature (see, e.g., [55, Chapter 1]) that any positive kernel cannot exploit the\nsmoothness s > 2. We refer to [26] for a more detailed discussion on this difficulty and potential\nsolutions. The Jackknife idea, such as the one presented in [11, 3], might be useful for adapting to\ns > 2.\nThe significance of our work is multi-fold:\n\n\u2022 We obtain the first uniform upper bound on the performance of the fixed k-nearest neighbor\nKL differential entropy estimator over H\u00f6lder balls without assuming how close the\ndensity could be from zero. We emphasize that assuming conditions of this type, such as\nthat the density is bounded away from zero, could make the problem significantly easier. 
For\nexample, if the density f is assumed to satisfy f (x) \u2265 c for some constant c > 0, then the\ndifferential entropy becomes a smooth functional and consequently, the general technique\nfor estimating smooth nonparametric functionals [51, 50, 43] can be directly applied here\nto achieve the minimax rates $n^{-\\frac{4s}{4s+d}} + n^{-1/2}$. The main technical tools that enabled us\nto remove the conditions on how close the density could be from zero are the Besicovitch\ncovering lemma (Lemma 4) and the generalized Hardy\u2013Littlewood maximal inequality.\n\u2022 We show that, for any fixed k, the k-nearest neighbor KL entropy estimator nearly achieves\nthe minimax rates without knowing the smoothness parameter s. In the functional estimation\nliterature, designing estimators that can be theoretically proved to adapt to unknown\nlevels of smoothness is usually achieved using the Lepski method [39, 22, 45, 44, 27],\nwhich is not known to perform well in general in practice. On the other hand, a simple\nplug-in approach can achieve the rate of $n^{-s/(s+d)}$, but only when s is known [26].\nThe KL estimator is well known to exhibit excellent empirical performance, but existing\ntheory has not yet demonstrated its near-\u201coptimality\u201d when the smoothness parameter s is\nnot known. Recent works [3, 52, 11] analyzed the performance of the KL estimator under\nvarious assumptions on how close the density could be to zero, with no matching lower\nbound up to logarithmic factors in general. Our work makes a step towards closing this gap\nand provides a theoretical explanation for the wide usage of the KL estimator in practice.\n\nThe rest of the paper is organized as follows. Section 2 is dedicated to the proof of Theorem 1. 
We\ndiscuss some future directions in Section 3.\n\n1.2 Notations\nFor positive sequences $a_\\gamma, b_\\gamma$, we use the notation $a_\\gamma \\lesssim_\\alpha b_\\gamma$ to denote that there exists a\nconstant C that depends only on $\\alpha$ such that $\\sup_\\gamma \\frac{a_\\gamma}{b_\\gamma} \\le C$, and $a_\\gamma \\gtrsim_\\alpha b_\\gamma$ is equivalent to $b_\\gamma \\lesssim_\\alpha a_\\gamma$.\nNotation $a_\\gamma \\asymp_\\alpha b_\\gamma$ is equivalent to $a_\\gamma \\lesssim_\\alpha b_\\gamma$ and $b_\\gamma \\lesssim_\\alpha a_\\gamma$. We write $a_\\gamma \\lesssim b_\\gamma$ if the constant is\nuniversal and does not depend on any parameters. Notation $a_\\gamma \\gg b_\\gamma$ means that $\\liminf_\\gamma \\frac{a_\\gamma}{b_\\gamma} = \\infty$,\nand $a_\\gamma \\ll b_\\gamma$ is equivalent to $b_\\gamma \\gg a_\\gamma$. We write a \u2227 b = min{a, b} and a \u2228 b = max{a, b}.\n\n2 Proof of Theorem 1\n\nIn this section, we will prove that\n\n$$\\left( \\mathbb{E}\\big( \\hat{h}_{n,k}(X) - h(f) \\big)^2 \\right)^{1/2} \\lesssim_{s,L,d,k} n^{-\\frac{s}{s+d}} \\ln(n+1) + n^{-\\frac{1}{2}}, \\tag{8}$$\n\nfor any $f \\in \\mathcal{H}_d^s(L; [0,1]^d)$ and s \u2208 (0, 2]. The proof consists of two parts: (i) the upper bound\nof the bias in the form of $O_{s,L,d,k}(n^{-s/(s+d)} \\ln(n+1))$; (ii) the upper bound of the variance in the form of\n$O_{s,L,d,k}(n^{-1})$. Below we show the bias proof and relegate the variance proof to Appendix B.\nFirst, we introduce the following notation\n\n$$f_t(x) = \\frac{\\mu(B(x,t))}{\\lambda(B(x,t))} = \\frac{1}{V_d t^d} \\int_{u : |u - x| \\le t} f(u)\\,du. \\tag{9}$$\n\nHere $\\mu$ is the probability measure specified by density function f on the torus, $\\lambda$ is the Lebesgue\nmeasure on $\\mathbb{R}^d$, and $V_d = \\pi^{d/2}/\\Gamma(1 + d/2)$ is the Lebesgue measure of the unit ball in d-dimensional\nEuclidean space. Hence $f_t(x)$ is the average density of a neighborhood near x. 
We first state two\nmain lemmas about $f_t(x)$ which will be used later in the proof.\nLemma 1 If $f \\in \\mathcal{H}_d^s(L; [0,1]^d)$ for some 0 < s \u2264 2, then for any $x \\in [0,1]^d$ and t > 0, we have\n\n$$|f_t(x) - f(x)| \\le \\frac{d L t^s}{s+d}. \\tag{10}$$\n\nLemma 2 If $f \\in \\mathcal{H}_d^s(L; [0,1]^d)$ for some 0 < s \u2264 2 and f (x) \u2265 0 for all $x \\in [0,1]^d$, then for any\nx and any t > 0, we have\n\n$$f(x) \\lesssim_{s,L,d} \\max\\left\\{ f_t(x), \\left( f_t(x) V_d t^d \\right)^{s/(s+d)} \\right\\}. \\tag{11}$$\n\nFurthermore, $f(x) \\lesssim_{s,L,d} 1$.\n\nWe relegate the proof of Lemma 1 and Lemma 2 to Appendix C. Now we investigate the bias\nof $\\hat{h}_{n,k}(X)$. The following argument reduces the bias analysis of $\\hat{h}_{n,k}(X)$ to a function analytic\nproblem. For notational simplicity, we introduce a new random variable X \u223c f independent of\n$\\{X_1, \\ldots, X_n\\}$ and study $\\hat{h}_{n+1,k}(\\{X_1, \\ldots, X_n, X\\})$. For every $x \\in \\mathbb{R}^d$, denote by $R_k(x)$ the k-nearest\nneighbor distance from x to $\\{X_1, X_2, \\ldots, X_n\\}$ under distance $d(x, y) = \\min_{m \\in \\mathbb{Z}^d} \\|m + x - y\\|$,\ni.e., the k-nearest neighbor distance on the torus. Then,\n\n$$\\mathbb{E}[\\hat{h}_{n+1,k}(\\{X_1, \\ldots, X_n, X\\})] - h(f) \\tag{12}$$\n$$= -\\psi(k) + \\mathbb{E}\\left[ \\ln\\big( (n+1) \\lambda(B(X, R_k(X))) \\big) \\right] + \\mathbb{E}[\\ln f(X)] \\tag{13}$$\n$$= \\mathbb{E}\\left[ \\ln\\left( \\frac{f(X) \\lambda(B(X, R_k(X)))}{\\mu(B(X, R_k(X)))} \\right) \\right] + \\mathbb{E}\\left[ \\ln\\big( (n+1) \\mu(B(X, R_k(X))) \\big) \\right] - \\psi(k) \\tag{14}$$\n$$= \\mathbb{E}\\left[ \\ln \\frac{f(X)}{f_{R_k(X)}(X)} \\right] + \\left( \\mathbb{E}\\left[ \\ln\\big( (n+1) \\mu(B(X, R_k(X))) \\big) \\right] - \\psi(k) \\right). \\tag{15}$$\n\nWe first show that the second term $\\mathbb{E}[\\ln((n+1)\\mu(B(X, R_k(X))))] - \\psi(k)$ can be universally\ncontrolled regardless of the smoothness of f. Indeed, the random variable $\\mu(B(X, R_k(X))) \\sim \\mathrm{Beta}(k, n+1-k)$\n[4, Chap. 1.2] and it was shown in [4, Theorem 7.2] that there exists a universal\nconstant C > 0 such that\n\n$$\\left| \\mathbb{E}\\left[ \\ln\\big( (n+1) \\mu(B(X, R_k(X))) \\big) \\right] - \\psi(k) \\right| \\le \\frac{C}{n}. \\tag{16}$$\n\nHence, it suffices to show that for 0 < s \u2264 2,\n\n$$\\left| \\mathbb{E}\\left[ \\ln \\frac{f_{R_k(X)}(X)}{f(X)} \\right] \\right| \\lesssim_{s,L,d,k} n^{-\\frac{s}{s+d}} \\ln(n+1). \\tag{17}$$\n\nWe split our analysis into two parts. Section 2.1 shows that $\\mathbb{E}\\left[ \\ln \\frac{f_{R_k(X)}(X)}{f(X)} \\right] \\lesssim_{s,L,d,k} n^{-\\frac{s}{s+d}}$ and\nSection 2.2 shows that $\\mathbb{E}\\left[ \\ln \\frac{f(X)}{f_{R_k(X)}(X)} \\right] \\lesssim_{s,L,d,k} n^{-\\frac{s}{s+d}} \\ln(n+1)$, which completes the proof.\n\n2.1 Upper bound on $\\mathbb{E}\\left[ \\ln \\frac{f_{R_k(X)}(X)}{f(X)} \\right]$\n\nBy the fact that ln y \u2264 y \u2212 1 for any y > 0, we have\n\n$$\\mathbb{E}\\left[ \\ln \\frac{f_{R_k(X)}(X)}{f(X)} \\right] \\le \\mathbb{E}\\left[ \\frac{f_{R_k(X)}(X) - f(X)}{f(X)} \\right] \\tag{18}$$\n$$= \\int_{[0,1]^d \\cap \\{x : f(x) \\neq 0\\}} \\left( \\mathbb{E}[f_{R_k(x)}(x)] - f(x) \\right) dx. \\tag{19}$$\n\nHere the expectation is taken with respect to the randomness in $R_k(x) = \\min_{1 \\le i \\le n, m \\in \\mathbb{Z}^d} \\|m + X_i - x\\|$,\n$x \\in \\mathbb{R}^d$. Define function g(x; f, n) as\n\n$$g(x; f, n) = \\sup\\left\\{ u \\ge 0 : V_d u^d f_u(x) \\le \\frac{1}{n} \\right\\}. \\tag{20}$$\n\nIntuitively, g(x; f, n) is the distance R such that the probability mass $\\mu(B(x, R))$ within R is\n1/n. Then for any $x \\in [0,1]^d$, we can split $\\mathbb{E}[f_{R_k(x)}(x)] - f(x)$ into three terms as\n\n$$\\mathbb{E}[f_{R_k(x)}(x)] - f(x) = \\mathbb{E}\\left[ (f_{R_k(x)}(x) - f(x)) \\mathbf{1}(R_k(x) \\le n^{-1/(s+d)}) \\right] \\tag{21}$$\n$$+ \\mathbb{E}\\left[ (f_{R_k(x)}(x) - f(x)) \\mathbf{1}(n^{-1/(s+d)} < R_k(x) \\le g(x; f, n)) \\right] \\tag{22}$$\n$$+ \\mathbb{E}\\left[ (f_{R_k(x)}(x) - f(x)) \\mathbf{1}(R_k(x) > g(x; f, n) \\vee n^{-1/(s+d)}) \\right] \\tag{23}$$\n$$= C_1 + C_2 + C_3. \\tag{24}$$\n\nNow we handle the three terms separately. Our goal is to show that for every $x \\in [0,1]^d$, $C_i \\lesssim_{s,L,d} n^{-s/(s+d)}$\nfor i \u2208 {1, 2, 3}. 
Then, taking the integral with respect to x leads to the desired bound.\n\n1. Term $C_1$: whenever $R_k(x) \\le n^{-1/(s+d)}$, by Lemma 1, we have\n\n$$|f_{R_k(x)}(x) - f(x)| \\le \\frac{d L R_k(x)^s}{s+d} \\lesssim_{s,L,d} n^{-s/(s+d)}, \\tag{25}$$\n\nwhich implies that\n\n$$C_1 \\le \\mathbb{E}\\left[ \\left| f_{R_k(x)}(x) - f(x) \\right| \\mathbf{1}(R_k(x) \\le n^{-1/(s+d)}) \\right] \\lesssim_{s,L,d} n^{-s/(s+d)}. \\tag{26}$$\n\n2. Term $C_2$: whenever $R_k(x)$ satisfies $n^{-1/(s+d)} < R_k(x) \\le g(x; f, n)$, by definition of\ng(x; f, n), we have $V_d R_k(x)^d f_{R_k(x)}(x) \\le \\frac{1}{n}$, which implies that\n\n$$f_{R_k(x)}(x) \\le \\frac{1}{n V_d R_k(x)^d} \\le \\frac{1}{n V_d n^{-d/(s+d)}} \\lesssim_{s,L,d} n^{-s/(s+d)}. \\tag{27}$$\n\nIt follows from Lemma 2 that in this case\n\n$$f(x) \\lesssim_{s,L,d} f_{R_k(x)}(x) \\vee \\left( f_{R_k(x)}(x) V_d R_k(x)^d \\right)^{s/(s+d)} \\tag{28}$$\n$$\\le n^{-s/(s+d)} \\vee n^{-s/(s+d)} = n^{-s/(s+d)}. \\tag{29}$$\n\nHence, we have\n\n$$C_2 = \\mathbb{E}\\left[ (f_{R_k(x)}(x) - f(x)) \\mathbf{1}\\left( n^{-1/(s+d)} < R_k(x) \\le g(x; f, n) \\right) \\right] \\tag{30}$$\n$$\\le \\mathbb{E}\\left[ (f_{R_k(x)}(x) + f(x)) \\mathbf{1}\\left( n^{-1/(s+d)} < R_k(x) \\le g(x; f, n) \\right) \\right] \\tag{31}$$\n$$\\lesssim_{s,L,d} n^{-s/(s+d)}. \\tag{32}$$\n\n3. Term $C_3$: we have\n\n$$C_3 \\le \\mathbb{E}\\left[ (f_{R_k(x)}(x) + f(x)) \\mathbf{1}\\left( R_k(x) > g(x; f, n) \\vee n^{-1/(s+d)} \\right) \\right]. \\tag{33}$$\n\nFor any x such that $R_k(x) > n^{-1/(s+d)}$, we have\n\n$$f_{R_k(x)}(x) \\lesssim_{s,L,d} V_d R_k(x)^d f_{R_k(x)}(x) n^{d/(s+d)}, \\tag{34}$$\n\nand by Lemma 2,\n\n$$f(x) \\lesssim_{s,L,d} f_{R_k(x)}(x) \\vee \\left( V_d R_k(x)^d f_{R_k(x)}(x) \\right)^{s/(s+d)} \\tag{35}$$\n$$\\le f_{R_k(x)}(x) + \\left( V_d R_k(x)^d f_{R_k(x)}(x) \\right)^{s/(s+d)}. \\tag{36}$$\n\nHence,\n\n$$f(x) + f_{R_k(x)}(x) \\lesssim_{s,L,d} 2 f_{R_k(x)}(x) + \\left( V_d R_k(x)^d f_{R_k(x)}(x) \\right)^{s/(s+d)} \\tag{37}$$\n$$\\lesssim_{s,L,d} V_d R_k(x)^d f_{R_k(x)}(x) n^{d/(s+d)} + \\left( V_d R_k(x)^d f_{R_k(x)}(x) \\right)^{s/(s+d)} \\tag{38}$$\n$$\\lesssim_{s,L,d} V_d R_k(x)^d f_{R_k(x)}(x) n^{d/(s+d)}, \\tag{39}$$\n\nwhere in the last step we have used the fact that $V_d R_k(x)^d f_{R_k(x)}(x) > n^{-1}$ since $R_k(x) > g(x; f, n)$.\nFinally, we have\n\n$$C_3 \\lesssim_{s,L,d} n^{d/(s+d)} \\mathbb{E}\\left[ \\left( V_d R_k(x)^d f_{R_k(x)}(x) \\right) \\mathbf{1}(R_k(x) > g(x; f, n)) \\right] \\tag{40}$$\n$$= n^{d/(s+d)} \\mathbb{E}\\left[ \\left( V_d R_k(x)^d f_{R_k(x)}(x) \\right) \\mathbf{1}\\left( V_d R_k(x)^d f_{R_k(x)}(x) > 1/n \\right) \\right]. \\tag{41}$$\n\nNote that $V_d R_k(x)^d f_{R_k(x)}(x) \\sim \\mathrm{Beta}(k, n+1-k)$, and if $Y \\sim \\mathrm{Beta}(k, n+1-k)$, we\nhave\n\n$$\\mathbb{E}[Y^2] = \\left( \\frac{k}{n+1} \\right)^2 + \\frac{k(n+1-k)}{(n+1)^2 (n+2)} \\lesssim_k \\frac{1}{n^2}. \\tag{42}$$\n\nNotice that $\\mathbb{E}[Y \\mathbf{1}(Y > 1/n)] \\le n \\mathbb{E}[Y^2]$. Hence, we have\n\n$$C_3 \\lesssim_{s,L,d} n^{d/(s+d)} \\, n \\, \\mathbb{E}\\left[ \\left( V_d R_k(x)^d f_{R_k(x)}(x) \\right)^2 \\right] \\tag{43}$$\n$$\\lesssim_{s,L,d,k} n^{d/(s+d)} \\frac{n}{n^2} = n^{-s/(s+d)}. \\tag{44}$$\n\n2.2 Upper bound on $\\mathbb{E}\\left[ \\ln \\frac{f(X)}{f_{R_k(X)}(X)} \\right]$\n\nBy splitting the term into two parts, we have\n\n$$\\mathbb{E}\\left[ \\ln \\frac{f(X)}{f_{R_k(X)}(X)} \\right] = \\mathbb{E}\\left[ \\int_{[0,1]^d \\cap \\{x : f(x) \\neq 0\\}} f(x) \\ln \\frac{f(x)}{f_{R_k(x)}(x)} \\, dx \\right] \\tag{45}$$\n$$= \\mathbb{E}\\left[ \\int_A f(x) \\ln \\frac{f(x)}{f_{R_k(x)}(x)} \\mathbf{1}(f_{R_k(x)}(x) > n^{-s/(s+d)}) \\, dx \\right] \\tag{46}$$\n$$+ \\mathbb{E}\\left[ \\int_A f(x) \\ln \\frac{f(x)}{f_{R_k(x)}(x)} \\mathbf{1}(f_{R_k(x)}(x) \\le n^{-s/(s+d)}) \\, dx \\right] \\tag{47}$$\n$$= C_4 + C_5, \\tag{48}$$\n\nwhere we denote $A = [0,1]^d \\cap \\{x : f(x) \\neq 0\\}$ for simplicity of notation. For the term $C_4$, we have\n\n$$C_4 \\le \\mathbb{E}\\left[ \\int_A f(x) \\left( \\frac{f(x) - f_{R_k(x)}(x)}{f_{R_k(x)}(x)} \\right) \\mathbf{1}(f_{R_k(x)}(x) > n^{-s/(s+d)}) \\, dx \\right] \\tag{49}$$\n$$= \\mathbb{E}\\left[ \\int_A \\frac{(f(x) - f_{R_k(x)}(x))^2}{f_{R_k(x)}(x)} \\mathbf{1}(f_{R_k(x)}(x) > n^{-s/(s+d)}) \\, dx \\right] \\tag{50}$$\n$$+ \\mathbb{E}\\left[ \\int_A \\left( f(x) - f_{R_k(x)}(x) \\right) \\mathbf{1}(f_{R_k(x)}(x) > n^{-s/(s+d)}) \\, dx \\right] \\tag{51}$$\n$$\\le n^{s/(s+d)} \\mathbb{E}\\left[ \\int_A \\left( f(x) - f_{R_k(x)}(x) \\right)^2 dx \\right] + \\mathbb{E}\\left[ \\int_A \\left( f(x) - f_{R_k(x)}(x) \\right) dx \\right]. \\tag{52}$$\n\nIn the proof of the upper bound of $\\mathbb{E}\\left[ \\ln \\frac{f_{R_k(X)}(X)}{f(X)} \\right]$, we have shown that $\\mathbb{E}[f_{R_k(x)}(x) - f(x)] \\lesssim_{s,L,d,k} n^{-s/(s+d)}$\nfor any $x \\in A$. Similarly as in the proof of the upper bound of $\\mathbb{E}\\left[ \\ln \\frac{f_{R_k(X)}(X)}{f(X)} \\right]$, we have\n$\\mathbb{E}\\left[ (f_{R_k(x)}(x) - f(x))^2 \\right] \\lesssim_{s,L,d,k} n^{-2s/(s+d)}$ for every $x \\in A$. Therefore, we have\n\n$$C_4 \\lesssim_{s,L,d,k} n^{s/(s+d)} n^{-2s/(s+d)} + n^{-s/(s+d)} \\lesssim_{s,L,d,k} n^{-s/(s+d)}. \\tag{53}$$\n\nNow we consider $C_5$. We conjecture that $C_5 \\lesssim_{s,L,d,k} n^{-s/(s+d)}$ in this case, but we were not able\nto prove it. Below we prove that $C_5 \\lesssim_{s,L,d,k} n^{-s/(s+d)} \\ln(n+1)$. 
Define the function\n\n$$M(x) = \\sup_{t > 0} \\frac{1}{f_t(x)}. \\tag{54}$$\n\nSince $f_{R_k(x)}(x) \\le n^{-s/(s+d)}$, we have $M(x) = \\sup_{t>0} (1/f_t(x)) \\ge 1/f_{R_k(x)}(x) \\ge n^{s/(s+d)}$.\nDenote $\\ln^+(y) = \\max\\{\\ln(y), 0\\}$ for any y > 0. Therefore, we have\n\n$$C_5 \\le \\mathbb{E}\\left[ \\int_A f(x) \\ln^+\\left( \\frac{f(x)}{f_{R_k(x)}(x)} \\right) \\mathbf{1}(f_{R_k(x)}(x) \\le n^{-s/(s+d)}) \\, dx \\right] \\tag{55}$$\n$$\\le \\mathbb{E}\\left[ \\int_A f(x) \\ln^+\\left( \\frac{f(x)}{f_{R_k(x)}(x)} \\right) \\mathbf{1}(M(x) \\ge n^{s/(s+d)}) \\, dx \\right] \\tag{56}$$\n$$\\le \\int_A f(x) \\mathbb{E}\\left[ \\ln^+\\left( \\frac{1}{(n+1) V_d R_k(x)^d f_{R_k(x)}(x)} \\right) \\right] \\mathbf{1}(M(x) \\ge n^{s/(s+d)}) \\, dx \\tag{57}$$\n$$+ \\int_A f(x) \\mathbb{E}\\left[ \\ln^+\\left( (n+1) V_d R_k(x)^d f(x) \\right) \\right] \\mathbf{1}(M(x) \\ge n^{s/(s+d)}) \\, dx \\tag{58}$$\n$$= C_{51} + C_{52}, \\tag{59}$$\n\nwhere the last inequality uses the fact $\\ln^+(xy) \\le \\ln^+ x + \\ln^+ y$ for all x, y > 0. As for $C_{51}$, since\n$V_d R_k(x)^d f_{R_k(x)}(x) \\sim \\mathrm{Beta}(k, n+1-k)$, and for $Y \\sim \\mathrm{Beta}(k, n+1-k)$, we have\n\n$$\\mathbb{E}\\left[ \\ln^+\\left( \\frac{1}{(n+1)Y} \\right) \\right] = \\int_0^{\\frac{1}{n+1}} \\ln\\left( \\frac{1}{(n+1)x} \\right) p_Y(x) \\, dx \\tag{60}$$\n$$= \\mathbb{E}\\left[ \\ln\\left( \\frac{1}{(n+1)Y} \\right) \\right] + \\int_{\\frac{1}{n+1}}^{1} \\ln\\left( (n+1)x \\right) p_Y(x) \\, dx \\tag{61}$$\n$$\\le \\mathbb{E}\\left[ \\ln\\left( \\frac{1}{(n+1)Y} \\right) \\right] + \\ln(n+1) \\int_{\\frac{1}{n+1}}^{1} p_Y(x) \\, dx \\tag{62}$$\n$$\\le \\mathbb{E}\\left[ \\ln\\left( \\frac{1}{(n+1)Y} \\right) \\right] + \\ln(n+1) \\tag{63}$$\n$$\\le \\ln(n+1), \\tag{64}$$\n\nwhere in the last inequality we used the fact that $\\mathbb{E}\\left[ \\ln\\left( \\frac{1}{(n+1)Y} \\right) \\right] = \\psi(n+1) - \\psi(k) - \\ln(n+1) \\le 0$\nfor any k \u2265 1. Hence,\n\n$$C_{51} \\lesssim_{s,L,d} \\ln(n+1) \\int_A f(x) \\mathbf{1}(M(x) \\ge n^{s/(s+d)}) \\, dx. \\tag{65}$$\n\nNow we introduce the following lemma, which is proved in Appendix C.\nLemma 3 Let $\\mu_1, \\mu_2$ be two Borel measures that are finite on the bounded Borel sets of $\\mathbb{R}^d$. Then,\nfor all t > 0 and any Borel set $A \\subset \\mathbb{R}^d$,\n\n$$\\mu_1\\left( \\left\\{ x \\in A : \\sup_{0 < \\rho \\le D} \\left( \\frac{\\mu_2(B(x, \\rho))}{\\mu_1(B(x, \\rho))} \\right) > t \\right\\} \\right) \\le \\frac{C_d}{t} \\mu_2(A_D). \\tag{66}$$\n\nHere $C_d > 0$ is a constant that depends only on the dimension d and\n\n$$A_D = \\{x : \\exists y \\in A, |y - x| \\le D\\}. \\tag{67}$$\n\nApplying the second part of Lemma 3 with $\\mu_2$ being the Lebesgue measure and $\\mu_1$ being the measure\nspecified by f (x) on the torus, we can view the function M (x) as\n\n$$M(x) = \\sup_{0 < \\rho \\le 1/2} \\frac{\\mu_2(B(x, \\rho))}{\\mu_1(B(x, \\rho))}. \\tag{68}$$\n\nTaking $A = [0,1]^d \\cap \\{x : f(x) \\neq 0\\}$, $t = n^{s/(s+d)}$, then $\\mu_2(A_{1/2}) \\le 2^d$, so we know that\n\n$$C_{51} \\lesssim_{s,L,d} \\ln(n+1) \\cdot \\int_A f(x) \\mathbf{1}(M(x) \\ge n^{s/(s+d)}) \\, dx \\tag{69}$$\n$$= \\ln(n+1) \\cdot \\mu_1\\left( x \\in [0,1]^d, f(x) \\neq 0, M(x) \\ge n^{s/(s+d)} \\right) \\tag{70}$$\n$$\\le \\ln(n+1) \\cdot C_d n^{-s/(s+d)} \\mu_2(A_{1/2}) \\lesssim_{s,L,d} n^{-s/(s+d)} \\ln(n+1). \\tag{71}$$\n\nNow we deal with $C_{52}$. Recall that in Lemma 2, we know that $f(x) \\lesssim_{s,L,d} 1$ for any x, and\n$R_k(x) \\le 1$, so $\\ln^+((n+1) V_d R_k(x)^d f(x)) \\lesssim_{s,L,d} \\ln(n+1)$. 
Therefore,\n\n$$C_{52} \\lesssim_{s,L,d} \\ln(n+1) \\cdot \\int_A f(x) \\mathbf{1}(M(x) \\ge n^{s/(s+d)}) \\, dx \\tag{72}$$\n$$\\lesssim_{s,L,d} n^{-s/(s+d)} \\ln(n+1). \\tag{73}$$\n\nTherefore, we have proved that $C_5 \\le C_{51} + C_{52} \\lesssim_{s,L,d} n^{-s/(s+d)} \\ln(n+1)$, which completes the\nproof of the upper bound on $\\mathbb{E}\\left[ \\ln \\frac{f(X)}{f_{R_k(X)}(X)} \\right]$.\n\n3 Future directions\n\nIt is a tempting question to ask whether one can close the logarithmic gap between Theorems 1 and 2.\nWe believe that neither the upper bound nor the lower bound is tight. In fact, we conjecture that the\nupper bound in Theorem 1 could be improved to $n^{-\\frac{s}{s+d}} + n^{-1/2}$ via a more careful analysis of the\nbias, since Hardy\u2013Littlewood maximal inequalities apply to arbitrary measurable functions but we\nhave assumed regularity properties of the underlying density. We conjecture that the minimax lower\nbound could be improved to $(n \\ln n)^{-\\frac{s}{s+d}} + n^{-1/2}$, since a kernel density estimator based differential\nentropy estimator was constructed in [26] which achieves upper bound $(n \\ln n)^{-\\frac{s}{s+d}} + n^{-1/2}$ over\n$\\mathcal{H}_d^s(L; [0,1]^d)$ with the knowledge of s.\nIt would be interesting to extend our analysis to that of the k-nearest neighbor based Kullback\u2013Leibler\ndivergence estimator [59]. The discrete case has been studied recently [28, 7].\nIt is also interesting to analyze k-nearest neighbor based mutual information estimators, such as the\nKSG estimator [34], and show that they are \u201cnear\u201d-optimal and adaptive to both the smoothness\nand the dimension of the distributions. There exists some analysis of the KSG estimator [21] but we\nsuspect the upper bound is not tight. Moreover, a slightly revised version of KSG estimator is proved\nto be consistent even if the underlying distribution is not purely continuous nor purely discrete [19],\nbut the optimality properties are not yet well understood.\n\nReferences\n[1] R. Battiti. 
Using mutual information for selecting features in supervised neural net learning.\nNeural Networks, IEEE Transactions on, 5(4):537\u2013550, 1994.\n\n[2] Jan Beirlant, Edward J Dudewicz, L\u00e1szl\u00f3 Gy\u00f6rfi, and Edward C Van der Meulen. Nonparametric\nentropy estimation: An overview. International Journal of Mathematical and Statistical\nSciences, 6(1):17\u201339, 1997.\n\n[3] Thomas B Berrett, Richard J Samworth, and Ming Yuan. Efficient multivariate entropy estimation\nvia k-nearest neighbour distances. arXiv preprint arXiv:1606.00304, 2016.\n\n[4] G\u00e9rard Biau and Luc Devroye. Lectures on the nearest neighbor method. Springer, 2015.\n\n[5] Peter J Bickel and Yaacov Ritov. Estimating integrated squared density derivatives: sharp best\norder of convergence estimates. Sankhy\u0101: The Indian Journal of Statistics, Series A, pages\n381\u2013393, 1988.\n\n[6] Lucien Birg\u00e9 and Pascal Massart. Estimation of integral functionals of a density. The Annals\nof Statistics, pages 11\u201329, 1995.\n\n[7] Yuheng Bu, Shaofeng Zou, Yingbin Liang, and Venugopal V Veeravalli. Estimation of KL\ndivergence between large-alphabet distributions. In 2016 IEEE International Symposium on\nInformation Theory (ISIT), pages 1118\u20131122. IEEE, 2016.\n\n[8] T Tony Cai and Mark G Low. A note on nonparametric estimation of linear functionals. Annals\nof statistics, pages 1140\u20131153, 2003.\n\n[9] T Tony Cai and Mark G Low. Nonquadratic estimators of a quadratic functional. The Annals\nof Statistics, pages 2930\u20132956, 2005.\n\n[10] C. Chan, A. Al-Bashabsheh, J. B. Ebrahimi, T. Kaced, and T. Liu. Multivariate mutual information\ninspired by secret-key agreement. Proceedings of the IEEE, 103(10):1883\u20131913,\n2015.\n\n[11] Sylvain Delattre and Nicolas Fournier. On the Kozachenko\u2013Leonenko entropy estimator. Journal\nof Statistical Planning and Inference, 185:69\u201393, 2017.\n\n[12] David L Donoho and Michael Nussbaum. 
Minimax quadratic estimation of a quadratic functional.\nJournal of Complexity, 6(3):290\u2013323, 1990.\n\n[13] Bradley Efron and Charles Stein. The jackknife estimate of variance. The Annals of Statistics,\npages 586\u2013596, 1981.\n\n[14] Fidah El Haje Hussein and Yu Golubev. On entropy estimation by m-spacing method. Journal\nof Mathematical Sciences, 163(3):290\u2013309, 2009.\n\n[15] Lawrence Craig Evans and Ronald F Gariepy. Measure theory and fine properties of functions.\nCRC press, 2015.\n\n[16] Jianqing Fan. On the estimation of quadratic functionals. The Annals of Statistics, pages\n1273\u20131294, 1991.\n\n[17] F. Fleuret. Fast binary feature selection with conditional mutual information. The Journal of\nMachine Learning Research, 5:1531\u20131555, 2004.\n\n[18] Weihao Gao, Sreeram Kannan, Sewoong Oh, and Pramod Viswanath. Conditional dependence\nvia shannon capacity: Axioms, estimators and applications. In International Conference on\nMachine Learning, pages 2780\u20132789, 2016.\n\n[19] Weihao Gao, Sreeram Kannan, Sewoong Oh, and Pramod Viswanath. Estimating mutual information\nfor discrete-continuous mixtures. In Advances in Neural Information Processing\nSystems, pages 5988\u20135999, 2017.\n\n[20] Weihao Gao, Sewoong Oh, and Pramod Viswanath. Breaking the bandwidth barrier: Geometrical\nadaptive entropy estimation. In Advances in Neural Information Processing Systems,\npages 2460\u20132468, 2016.\n\n[21] Weihao Gao, Sewoong Oh, and Pramod Viswanath. Demystifying fixed k-nearest neighbor\ninformation estimators. In Information Theory (ISIT), 2017 IEEE International Symposium\non, pages 1267\u20131271. IEEE, 2017.\n\n[22] Evarist Gin\u00e9 and Richard Nickl. A simple adaptive estimator of the integrated square of a\ndensity. Bernoulli, pages 47\u201361, 2008.\n\n[23] Peter Hall. 
Limit theorems for sums of general functions of m-spacings. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 96, pages 517\u2013532. Cambridge University Press, 1984.\n\n[24] Peter Hall and James Stephen Marron. Estimation of integrated squared density derivatives. Statistics & Probability Letters, 6(2):109\u2013115, 1987.\n\n[25] Peter Hall and Sally C Morton. On the estimation of entropy. Annals of the Institute of Statistical Mathematics, 45(1):69\u201388, 1993.\n\n[26] Yanjun Han, Jiantao Jiao, Tsachy Weissman, and Yihong Wu. Optimal rates of entropy estimation over Lipschitz balls. arXiv preprint arXiv:1711.02141, 2017.\n\n[27] Yanjun Han, Jiantao Jiao, Rajarshi Mukherjee, and Tsachy Weissman. On estimation of Lr-norms in Gaussian white noise models. arXiv preprint arXiv:1710.03863, 2017.\n\n[28] Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Minimax rate-optimal estimation of divergences between discrete distributions. arXiv preprint arXiv:1605.09124, 2016.\n\n[29] Harry Joe. Estimation of entropy and other functionals of a multivariate density. Annals of the Institute of Statistical Mathematics, 41(4):683\u2013697, 1989.\n\n[30] Kirthevasan Kandasamy, Akshay Krishnamurthy, Barnabas Poczos, Larry Wasserman, et al. Nonparametric von Mises estimators for entropies, divergences and mutual informations. In Advances in Neural Information Processing Systems, pages 397\u2013405, 2015.\n\n[31] Rhoana J Karunamuni and Tom Alberts. On boundary correction in kernel density estimation. Statistical Methodology, 2(3):191\u2013212, 2005.\n\n[32] G\u00e9rard Kerkyacharian and Dominique Picard. Estimating nonquadratic functionals of a density using Haar wavelets. The Annals of Statistics, 24(2):485\u2013507, 1996.\n\n[33] LF Kozachenko and Nikolai N Leonenko. 
Sample estimate of the entropy of a random vector. Problemy Peredachi Informatsii, 23(2):9\u201316, 1987.\n\n[34] Alexander Kraskov, Harald St\u00f6gbauer, and Peter Grassberger. Estimating mutual information. Physical Review E, 69(6):066138, 2004.\n\n[35] Akshay Krishnamurthy, Kirthevasan Kandasamy, Barnabas Poczos, and Larry Wasserman. Nonparametric estimation of R\u00e9nyi divergence and friends. In International Conference on Machine Learning, pages 919\u2013927, 2014.\n\n[36] Smita Krishnaswamy, Matthew H Spitzer, Michael Mingueneau, Sean C Bendall, Oren Litvin, Erica Stone, Dana Pe\u2019er, and Garry P Nolan. Conditional density-based analysis of T cell signaling in single-cell data. Science, 346(6213):1250689, 2014.\n\n[37] B\u00e9atrice Laurent. Efficient estimation of integral functionals of a density. The Annals of Statistics, 24(2):659\u2013681, 1996.\n\n[38] Oleg Lepski, Arkady Nemirovski, and Vladimir Spokoiny. On estimation of the Lr norm of a regression function. Probability Theory and Related Fields, 113(2):221\u2013253, 1999.\n\n[39] Oleg V Lepski. On problems of adaptive estimation in white Gaussian noise. Topics in nonparametric estimation, 12:87\u2013106, 1992.\n\n[40] Boris Ya Levit. Asymptotically efficient estimation of nonlinear functionals. Problemy Peredachi Informatsii, 14(3):65\u201372, 1978.\n\n[41] Pan Li and Olgica Milenkovic. Inhomogeneous hypergraph clustering with applications. arXiv preprint arXiv:1709.01249, 2017.\n\n[42] YP Mack and Murray Rosenblatt. Multivariate k-nearest neighbor density estimates. Journal of Multivariate Analysis, 9(1):1\u201315, 1979.\n\n[43] Rajarshi Mukherjee, Whitney K Newey, and James M Robins. Semiparametric efficient empirical higher order influence function estimators. arXiv preprint arXiv:1705.07577, 2017.\n\n[44] Rajarshi Mukherjee, Eric Tchetgen Tchetgen, and James Robins. On adaptive estimation of nonparametric functionals. 
arXiv preprint arXiv:1608.01364, 2016.\n\n[45] Rajarshi Mukherjee, Eric Tchetgen Tchetgen, and James Robins. Lepski\u2019s method and adaptive estimation of nonlinear integral functionals of density. arXiv preprint arXiv:1508.00249, 2015.\n\n[46] A. C. M\u00fcller, S. Nowozin, and C. H. Lampert. Information theoretic clustering using minimum spanning trees. Springer, 2012.\n\n[47] Arkadi Nemirovski. Topics in non-parametric statistics. Ecole d\u2019Et\u00e9 de Probabilit\u00e9s de Saint-Flour, 28:85, 2000.\n\n[48] H. Peng, F. Long, and C. Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(8):1226\u20131238, 2005.\n\n[49] David N Reshef, Yakir A Reshef, Hilary K Finucane, Sharon R Grossman, Gilean McVean, Peter J Turnbaugh, Eric S Lander, Michael Mitzenmacher, and Pardis C Sabeti. Detecting novel associations in large data sets. Science, 334(6062):1518\u20131524, 2011.\n\n[50] James Robins, Lingling Li, Rajarshi Mukherjee, Eric Tchetgen Tchetgen, and Aad van der Vaart. Higher order estimating equations for high-dimensional models. The Annals of Statistics (To Appear), 2016.\n\n[51] James Robins, Lingling Li, Eric Tchetgen, and Aad van der Vaart. Higher order influence functions and minimax estimation of nonlinear functionals. In Probability and Statistics: Essays in Honor of David A. Freedman, pages 335\u2013421. Institute of Mathematical Statistics, 2008.\n\n[52] Shashank Singh and Barnab\u00e1s P\u00f3czos. Finite-sample analysis of fixed-k nearest neighbor density functional estimators. In Advances in Neural Information Processing Systems, pages 1217\u20131225, 2016.\n\n[53] Kumar Sricharan, Raviv Raich, and Alfred O Hero. Estimation of nonlinear functionals of densities with confidence. 
IEEE Transactions on Information Theory, 58(7):4135\u20134159, 2012.\n\n[54] Eric Tchetgen, Lingling Li, James Robins, and Aad van der Vaart. Minimax estimation of the integral of a power of a density. Statistics & Probability Letters, 78(18):3307\u20133311, 2008.\n\n[55] A. Tsybakov. Introduction to Nonparametric Estimation. Springer-Verlag, 2008.\n\n[56] Alexandre B Tsybakov and EC Van der Meulen. Root-n consistent estimators of entropy for densities with unbounded support. Scandinavian Journal of Statistics, pages 75\u201383, 1996.\n\n[57] Bert Van Es. Estimating functionals related to a density by a class of statistics based on spacings. Scandinavian Journal of Statistics, pages 61\u201372, 1992.\n\n[58] G. Ver Steeg and A. Galstyan. Maximally informative hierarchical representations of high-dimensional data. stat, 1050:27, 2014.\n\n[59] Qing Wang, Sanjeev R Kulkarni, and Sergio Verd\u00fa. Divergence estimation for multidimensional densities via k-nearest-neighbor distances. Information Theory, IEEE Transactions on, 55(5):2392\u20132405, 2009.", "award": [], "sourceid": 1614, "authors": [{"given_name": "Jiantao", "family_name": "Jiao", "institution": "University of California, Berkeley"}, {"given_name": "Weihao", "family_name": "Gao", "institution": "UIUC"}, {"given_name": "Yanjun", "family_name": "Han", "institution": "Stanford University"}]}