{"title": "Breaking the Bandwidth Barrier: Geometrical Adaptive Entropy Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 2460, "page_last": 2468, "abstract": "Estimators of information theoretic measures such as entropy and mutual information from samples are a basic workhorse for many downstream applications in modern data science. State of the art approaches have been either geometric (nearest neighbor (NN) based) or kernel based (with bandwidth chosen to be data independent and vanishing sub linearly in the sample size). In this paper we combine both these approaches to design new estimators of entropy and mutual information that strongly outperform all state of the art methods. Our estimator uses bandwidth choice of fixed $k$-NN distances; such a choice is both data dependent and linearly vanishing in the sample size and necessitates a bias cancellation term that  is  universal and independent of the underlying distribution. As a byproduct, we obtain a unified way of obtaining both kernel and NN estimators.  The corresponding theoretical contribution relating the geometry of NN distances to asymptotic order statistics  is of independent mathematical interest.", "full_text": "Breaking the Bandwidth Barrier:\n\nGeometrical Adaptive Entropy Estimation\n\nWeihao Gao\u2217, Sewoong Oh\u2020,\n\nand Pramod Viswanath\u2217\n\n{wgao9,swoh,pramodv}@illinois.edu\n\nUniversity of Illinois at Urbana-Champaign\n\nUrbana, IL 61801\n\nAbstract\n\nEstimators of information theoretic measures such as entropy and mutual infor-\nmation are a basic workhorse for many downstream applications in modern data\nscience. State of the art approaches have been either geometric (nearest neighbor\n(NN) based) or kernel based (with a globally chosen bandwidth). In this paper, we\ncombine both these approaches to design new estimators of entropy and mutual\ninformation that outperform state of the art methods. 
Our estimator uses local\nbandwidth choices of k-NN distances with a \ufb01nite k, independent of the sample\nsize. Such a local and data dependent choice improves performance in practice, but\nthe bandwidth is vanishing at a fast rate, leading to a non-vanishing bias. We show\nthat the asymptotic bias of the proposed estimator is universal; it is independent of\nthe underlying distribution. Hence, it can be precomputed and subtracted from the\nestimate. As a byproduct, we obtain a uni\ufb01ed way of obtaining both kernel and\nNN estimators. The corresponding theoretical contribution relating the asymptotic\ngeometry of nearest neighbors to order statistics is of independent mathematical\ninterest.\n\n1\n\nIntroduction\n\nUnsupervised representation learning is one of the major themes of modern data science; a common\ntheme among the various approaches is to extract maximally \u201cinformative\" features via information-\ntheoretic metrics (entropy, mutual information and their variations) \u2013 the primary reason for the\npopularity of information theoretic measures is that they are invariant to one-to-one transformations\nand that they obey natural axioms such as data processing. Such an approach is evident in many\napplications, as varied as computational biology [11], sociology [20] and information retrieval [17],\nwith the citations representing a mere smattering of recent works. Within mainstream machine\nlearning, a systematic effort at unsupervised clustering and hierarchical information extraction is\nconducted in recent works of [25, 23]. The basic workhorse in all these methods is the computation\nof mutual information (pairwise and multivariate) from i.i.d. samples. 
Indeed, sample-ef\ufb01cient\nestimation of mutual information emerges as the central scienti\ufb01c question of interest in a variety\nof applications, and is also of fundamental interest to statistics, machine learning and information\ntheory communities.\nWhile these estimation questions have been studied in the past three decades (and summarized in [28]),\nthe renewed importance of estimating information theoretic measures in a sample-ef\ufb01cient manner\nis persuasively argued in a recent work [2], where the authors note that existing estimators perform\npoorly in several key scenarios of central interest (especially when the high dimensional random\nvariables are strongly related to each other). The most common estimators (featured in scienti\ufb01c\n\n\u2217Coordinated Science Lab and Department of Electrical and Computer Engineering\n\u2020Coordinated Science Lab and Department of Industrial and Enterprise Systems Engineering\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fsoftware packages) are nonparametric and involve k nearest neighbor (NN) distances between the\nsamples. The widely used estimator of mutual information is the one by Kraskov and St\u00f6gbauer and\nGrassberger [10] and christened the KSG estimator (nomenclature based on the authors, cf. [2]) \u2013\nwhile this estimator works well in practice (and performs much better than other approaches such as\nthose based on kernel density estimation procedures), it still suffers in high dimensions. The basic\nissue is that the KSG estimator (and the underlying differential entropy estimator based on nearest\nneighbor distances by Kozachenko and Leonenko (KL) [9]) does not take advantage of the fact that\nthe samples could lie in a smaller dimensional subspace (more generally, manifold) despite the high\ndimensionality of the data itself. 
Such lower dimensional structures effectively act as boundaries,\ncausing the estimator to suffer from what is known as boundary biases.\nAmeliorating this de\ufb01ciency is the central theme of recent works [3, 2, 16], each of which aims\nto improve upon the classical KL (differential) entropy estimator of [9]. A local SVD is used\nto heuristically improve the density estimate at each sample point in [2], while a local Gaussian\ndensity (with empirical mean and covariance weighted by NN distances) is heuristically used for\nthe same purpose in [16]. Both these approaches, while inspired and intuitive, come with no\ntheoretical guarantees (even consistency) and from a practical perspective involve delicate choice\nof key hyper parameters. An effort towards a systematic study is initiated in [3] which connects the\naforementioned heuristic efforts of [2, 16] to the local log-likelihood density estimation methods\n[6, 15] from theoretical statistics.\nThe local density estimation method is a strong generalization of the traditional kernel density\nestimation methods, but requires a delicate normalization which necessitates the solution of certain\nintegral equations (cf. Equation (9) of [15]). Indeed, such an elaborate numerical effort is one of the\nkey impediments for the entropy estimator of [3] to be practically valuable. A second key impediment\nis that theoretical guarantees (such as consistency) can only be provided when the bandwidth is chosen\nglobally (leading to poor sample complexity in practice) and consistency requires the bandwidth h\nto be chosen such that nhd \u2192 \u221e and h \u2192 0, where n is the sample size and d is the dimension\nof the random variable of interest. More generally, it appears that a systematic application of local\nlog-likelihood methods to estimate functionals of the unknown density from i.i.d. 
samples is missing\nin the theoretical statistics literature (despite local log-likelihood methods for regression and density\nestimation being standard textbook fare [29, 14]). We resolve each of these de\ufb01ciencies in this paper\nby undertaking a comprehensive study of estimating the (differential) entropy and mutual information\nfrom i.i.d. samples using sample dependent bandwidth choices (typically \ufb01xed k-NN distances). This\neffort allows us to connect disparate threads of ideas from seemingly different arenas: NN methods,\nlocal log-likelihood methods, asymptotic order statistics and sample-dependent heuristic, but inspired,\nmethods for mutual information estimation suggested in the work of [10].\nMain Results: We make the following contributions.\n\n1. Density estimation: Parameterizing the log density by a polynomial of degree p, we derive\nsimple closed form expressions for the local log-likelihood maximization problem for the\ncases of p \u2264 2 for arbitrary dimensions, with Gaussian kernel choices. This derivation, posed\nas an exercise in [14, Exercise 5.2], signi\ufb01cantly improves the computational ef\ufb01ciency\nupon similar endeavors in the recent efforts of [3, 16, 26].\n\n2. Entropy estimation: Using resubstitution of the local density estimate, we derive a simple\nclosed form estimator of the entropy using a sample dependent bandwidth choice (of k-NN\ndistance, where k is a \ufb01xed small integer independent of the sample size): this estimator\noutperforms state of the art entropy estimators in a variety of settings. 
Since the bandwidth\nis data dependent and vanishes too fast (because k is \ufb01xed), the estimator has a bias; we derive\na closed form expression for this bias and show that it is independent of the underlying\ndistribution and hence can be easily corrected: this is our main theoretical contribution, and\ninvolves new theorems on asymptotic statistics of nearest neighbors generalizing classical\nwork in probability theory [19], which might be of independent mathematical interest.\n\n3. Generalized view: We show that seemingly very different approaches to entropy estimation\n\u2013 recent works of [2, 3, 16] and the classical work of \ufb01xed k-NN estimator of Kozachenko and\nLeonenko [9] \u2013 can all be cast in the local log-likelihood framework as speci\ufb01c kernel and\nsample dependent bandwidth choices. This allows for a uni\ufb01ed view, which we theoretically\njustify by showing that resubstitution entropy estimation for any kernel choice using \ufb01xed\nk-NN distances as bandwidth involves a bias term that is independent of the underlying\ndistribution (but depends on the speci\ufb01c choice of kernel and parametric density family).\nThus our work is a strict mathematical generalization of the classical work of [9].\n\n4. Mutual Information estimation: The inspired work of [10] constructs a mutual information\nestimator that subtly alters (in a sample dependent way) the three KL entropy estimation\nterms, leading to superior empirical performance. We show that the underlying idea behind\nthis change can be incorporated in our framework as well, leading to a novel mutual\ninformation estimator that combines the two ideas and outperforms state of the art estimators\nin a variety of settings.\n\nIn the rest of this paper we describe these main results, the sections organized in roughly the same\norder as the enumerated list.\n\n2 Local likelihood density estimation (LLDE)\nGiven n i.i.d. samples X1, . . . 
, Xn, estimating the unknown density fX (\u00b7) in Rd is a very basic\nstatistical task. Local likelihood density estimators [15, 6] constitute the state of the art. They are speci\ufb01ed\nby a weight function K : Rd \u2192 R (also called a kernel), a degree p \u2208 Z+ of the polynomial\napproximation, and a bandwidth h \u2208 R, and they maximize the local log-likelihood:\n\n$$L_x(f) = \\sum_{j=1}^{n} K\\left(\\frac{X_j - x}{h}\\right) \\log f(X_j) - n \\int K\\left(\\frac{u - x}{h}\\right) f(u)\\, du , \\tag{1}$$\n\nwhere maximization is over an exponential polynomial family, locally approximating f (u) near x:\n\n$$\\log_e f_{a,x}(u) = a_0 + \\langle a_1, u - x \\rangle + \\langle u - x, a_2 (u - x) \\rangle + \\cdots + a_p[u - x, u - x, \\ldots, u - x] , \\tag{2}$$\n\nparameterized by $a = (a_0, \\ldots, a_p) \\in \\mathbb{R}^{1 \\times d \\times d^2 \\times \\cdots \\times d^p}$, where $\\langle \\cdot, \\cdot \\rangle$ denotes the inner product and\n$a_p[u, \\ldots, u]$ the p-th order tensor projection. The local likelihood density estimate (LLDE) is de\ufb01ned\nas $\\hat{f}_n(x) = f_{\\hat{a}(x),x}(x) = e^{\\hat{a}_0(x)}$, where $\\hat{a}(x) \\in \\arg\\max_a L_x(f_{a,x})$. The maximizer is represented\nby a series of nonlinear equations, and does not have a closed form in general. We present below a\nfew choices of the degrees and the weight functions that admit closed form solutions. 
Concretely, for\np = 0, it is known that LLDE reduces to the standard Kernel Density Estimator (KDE) [15]:\n\n$$\\hat{f}_n(x) = \\frac{1}{n} \\sum_{i=1}^{n} K\\left(\\frac{x - X_i}{h}\\right) \\Big/ \\int K\\left(\\frac{u - x}{h}\\right) du . \\tag{3}$$\n\nIf we choose the step function $K(u) = \\mathbb{I}(\\|u\\| \\le 1)$ with a local and data-dependent choice of the\nbandwidth $h = \\rho_{k,x}$, where $\\rho_{k,x}$ is the k-NN distance from x, then the above estimator recovers the\npopular k-NN density estimate as a special case, namely, for $C_d = \\pi^{d/2}/\\Gamma(d/2 + 1)$,\n\n$$\\hat{f}_n(x) = \\frac{\\frac{1}{n} \\sum_{i=1}^{n} \\mathbb{I}(\\|X_i - x\\| \\le \\rho_{k,x})}{\\mathrm{Vol}\\{u \\in \\mathbb{R}^d : \\|u - x\\| \\le \\rho_{k,x}\\}} = \\frac{k}{n\\, C_d\\, \\rho_{k,x}^d} . \\tag{4}$$\n\nFor higher degree local likelihood, we provide simple closed form solutions and provide a proof\nin Section D. Somewhat surprisingly, this result has eluded prior works [16, 26] and [3] which\nspeci\ufb01cally attempted the evaluation for p = 2. Part of the subtlety in the result is to critically\nuse the fact that the parametric family (e.g., the polynomial family in (2)) need not be normalized\nitself; the local log-likelihood maximization ensures that the resulting density estimate is\ncorrectly normalized so that it integrates to 1.\nProposition 2.1. [14, Exercise 5.2] For a degree p \u2208 {1, 2}, the maximizer of local likelihood (1)\nadmits a closed form solution, when using the Gaussian kernel $K(u) = e^{-\\|u\\|^2/2}$. In case of p = 1,\n\n$$\\hat{f}_n(x) = \\frac{S_0}{n (2\\pi)^{d/2} h^d} \\exp\\left\\{ -\\frac{1}{2} \\frac{1}{S_0^2} \\|S_1\\|^2 \\right\\} , \\tag{5}$$\n\nwhere $S_0 \\in \\mathbb{R}$ and $S_1 \\in \\mathbb{R}^d$ are de\ufb01ned for given $x \\in \\mathbb{R}^d$ and $h \\in \\mathbb{R}$ as\n\n$$S_0 \\equiv \\sum_{j=1}^{n} e^{-\\frac{\\|X_j - x\\|^2}{2h^2}} , \\qquad S_1 \\equiv \\sum_{j=1}^{n} \\frac{1}{h} (X_j - x)\\, e^{-\\frac{\\|X_j - x\\|^2}{2h^2}} . \\tag{6}$$\n\nIn case of p = 2, for $S_0$ and $S_1$ de\ufb01ned as above,\n\n$$\\hat{f}_n(x) = \\frac{S_0}{n (2\\pi)^{d/2} h^d |\\Sigma|^{1/2}} \\exp\\left\\{ -\\frac{1}{2} \\frac{1}{S_0^2} S_1^T \\Sigma^{-1} S_1 \\right\\} , \\tag{7}$$\n\nwhere $|\\Sigma|$ is the determinant and $S_2 \\in \\mathbb{R}^{d \\times d}$ and $\\Sigma \\in \\mathbb{R}^{d \\times d}$ are de\ufb01ned as\n\n$$S_2 \\equiv \\sum_{j=1}^{n} \\frac{1}{h^2} (X_j - x)(X_j - x)^T e^{-\\frac{\\|X_j - x\\|^2}{2h^2}} , \\qquad \\Sigma \\equiv \\frac{S_0 S_2 - S_1 S_1^T}{S_0^2} , \\tag{8}$$\n\nwhere it follows from Cauchy-Schwarz that \u03a3 is positive semide\ufb01nite.\n\nOne of the major drawbacks of the KDE and k-NN methods is the increased bias near the boundaries.\nLLDE provides a principled approach to automatically correct for the boundary bias, which takes\neffect only for p \u2265 2 [6, 21]. This explains the performance improvement for p = 2 in the \ufb01gure\nbelow (left panel), and the gap increases with the correlation as the boundary effect becomes more\nprominent. We use the proposed estimators with p \u2208 {0, 1, 2} to estimate the mutual information\nbetween two jointly Gaussian random variables with correlation r, from n = 500 samples, using\nresubstitution methods explained in the next sections. Each point is averaged over 100 instances.\nIn the right panel, we generate i.i.d. samples from a 2-dimensional Gaussian with correlation 0.9, and\n\ufb01nd the local approximation $\\hat{f}(u - x^*)$ around $x^*$, denoted by the blue \u2217 in the center. 
Standard k-NN\napproach \ufb01ts a uniform distribution over a circle enclosing k = 20 nearest neighbors (red circle). The\ngreen lines are the contours of the degree-2 polynomial approximation with bandwidth $h = \\rho_{20,x}$.\nThe \ufb01gure illustrates that the k-NN method suffers from the boundary effect, where it underestimates the\nprobability by overestimating the volume in (4). However, degree-2 LLDE is able to correctly\ncapture the local structure of the pdf, correcting for boundary biases.\nDespite the advantages of the LLDE, it requires the bandwidth to be data independent and vanishingly\nsmall (sublinearly in sample size) for consistency almost everywhere \u2013 both of these are impediments\nto practical use since there is no obvious systematic way of choosing these hyperparameters. On the\nother hand, if we restrict our focus to functionals of the density, then both these issues are resolved:\nthis is the focus of the next section where we show that the bandwidth can be chosen to be based on\n\ufb01xed k-NN distances and the resulting universal bias easily corrected.\n\n[Figure 1 plots. Left panel: $E[(I - \\hat{I})^2]$ versus (1 \u2212 r), where r is the correlation, with curves for p = 0, 1, 2. Right panel axes: X1 and X2.]\n\nFigure 1: The boundary bias becomes less signi\ufb01cant and the gap closes as correlation decreases for\nestimating the mutual information (left). Local approximation around the blue \u2217 in the center. 
The\ndegree-2 local likelihood approximation (contours in green) automatically captures the local structure\nwhereas the standard k-NN approach (uniform distribution in red circle) fails (right).\n\n3 k-LNN Entropy Estimator\n\nWe consider resubstitution entropy estimators of the form $\\hat{H}(X) = -(1/n) \\sum_{i=1}^{n} \\log \\hat{f}_n(X_i)$ and\npropose to use the local likelihood density estimator in (7) and a choice of bandwidth that is local\n(varying for each point x) and adaptive (based on the data). Concretely, we choose, for each sample\npoint $X_i$, the bandwidth $h_{X_i}$ to be the distance to its k-th nearest neighbor $\\rho_{k,i}$. Precisely, we\npropose the following k-Local Nearest Neighbor (k-LNN) entropy estimator of degree 2:\n\n$$\\hat{H}^{(n)}_{\\mathrm{kLNN}}(X) = -\\frac{1}{n} \\sum_{i=1}^{n} \\left\\{ \\log \\frac{S_{0,i}}{n (2\\pi)^{d/2} \\rho_{k,i}^d |\\Sigma_i|^{1/2}} - \\frac{1}{2} \\frac{1}{S_{0,i}^2} S_{1,i}^T \\Sigma_i^{-1} S_{1,i} \\right\\} - B_{k,d} , \\tag{9}$$\n\nwhere subtracting $B_{k,d}$ de\ufb01ned in Theorem 1 removes the asymptotic bias, and k \u2208 Z+ is the only\nhyperparameter determining the bandwidth. In practice k is a small integer \ufb01xed to be in the range\n4 \u223c 8. 
We only use the $\\lceil \\log n \\rceil$ nearest subset of samples $T_{i,m} = \\{ j \\in [n] : j \\neq i \\text{ and } \\|X_i - X_j\\| \\le \\rho_{m,i} \\}$, with $m = \\lceil \\log n \\rceil$, in computing the quantities below:\n\n$$S_{0,i} \\equiv \\sum_{j \\in T_{i,m}} e^{-\\frac{\\|X_j - X_i\\|^2}{2\\rho_{k,i}^2}} , \\qquad S_{1,i} \\equiv \\sum_{j \\in T_{i,m}} \\frac{1}{\\rho_{k,i}} (X_j - X_i)\\, e^{-\\frac{\\|X_j - X_i\\|^2}{2\\rho_{k,i}^2}} ,$$\n\n$$S_{2,i} \\equiv \\sum_{j \\in T_{i,m}} \\frac{1}{\\rho_{k,i}^2} (X_j - X_i)(X_j - X_i)^T e^{-\\frac{\\|X_j - X_i\\|^2}{2\\rho_{k,i}^2}} , \\qquad \\Sigma_i \\equiv \\frac{S_{0,i} S_{2,i} - S_{1,i} S_{1,i}^T}{S_{0,i}^2} . \\tag{10}$$\n\nThe truncation is important for computational ef\ufb01ciency, but the analysis works as long as $m = O(n^{1/(2d) - \\varepsilon})$\nfor any positive \u03b5 that can be arbitrarily small. For a larger m, for example of \u2126(n),\nthose neighbors that are further away have a different asymptotic behavior. We show in Theorem 1\nthat the asymptotic bias is independent of the underlying distribution and hence can be precomputed\nand removed, under mild conditions on a twice continuously differentiable pdf f (x) (cf. Lemma 3.1\nbelow).\nTheorem 1. For k \u2265 3, if X1, X2, . . . , Xn \u2208 Rd are i.i.d. samples from a twice continuously\ndifferentiable pdf f (x), then\n\n$$\\lim_{n \\to \\infty} E[\\hat{H}^{(n)}_{\\mathrm{kLNN}}(X)] = H(X) , \\tag{11}$$\n\nwhere $B_{k,d}$ in (9) is a constant that only depends on k and d. Further, if $E[(\\log f(X))^2] < \\infty$ then\nthe variance of the proposed estimator is bounded by $\\mathrm{Var}[\\hat{H}^{(n)}_{\\mathrm{kLNN}}(X)] = O((\\log n)^2/n)$.\n\nThis proves the L1 and L2 consistency of the k-LNN estimator; we relegate the proof to Section F\nfor ease of reading the main part of the paper. The proof assumes Ansatz 1 (also stated in Section F),\nwhich states that a certain exchange of limits holds. 
As noted in [18], such an assumption is common\nin the literature on consistency of k-NN estimators, where it has been implicitly assumed in existing\nanalyses of entropy estimators including [9, 5, 12, 27], without explicitly stating that such assumptions\nare being made. Our choice of a local adaptive bandwidth $h_{X_i} = \\rho_{k,i}$ is crucial in ensuring that\nthe asymptotic bias $B_{k,d}$ does not depend on the underlying distribution f (x). This relies on a\nfundamental connection to the theory of asymptotic order statistics made precise in Lemma 3.1,\nwhich also gives the explicit formula for the bias below.\nThe main idea is that the empirical quantities used in the estimate (10) converge in the large n limit to\nsimilar quantities de\ufb01ned over order statistics. We make this intuition precise in the next section.\nWe de\ufb01ne order statistics over i.i.d. standard exponential random variables E1, E2, . . . , Em and i.i.d.\nrandom variables \u03be1, \u03be2, . . . , \u03bem drawn uniformly (the Haar measure) over the unit sphere in Rd, for\na variable m \u2208 Z+. We de\ufb01ne for \u03b1 \u2208 {0, 1, 2},\n\n$$\\tilde{S}^{(m)}_\\alpha \\equiv \\sum_{j=1}^{m} \\xi_j^{(\\alpha)} \\frac{(\\sum_{\\ell=1}^{j} E_\\ell)^\\alpha}{(\\sum_{\\ell=1}^{k} E_\\ell)^\\alpha} \\exp\\left\\{ -\\frac{(\\sum_{\\ell=1}^{j} E_\\ell)^2}{2 (\\sum_{\\ell=1}^{k} E_\\ell)^2} \\right\\} , \\tag{12}$$\n\nwhere $\\xi_j^{(0)} = 1$, $\\xi_j^{(1)} = \\xi_j \\in \\mathbb{R}^d$, and $\\xi_j^{(2)} = \\xi_j \\xi_j^T \\in \\mathbb{R}^{d \\times d}$, and let $\\tilde{S}_\\alpha = \\lim_{m \\to \\infty} \\tilde{S}^{(m)}_\\alpha$ and\n$\\tilde{\\Sigma} = (1/\\tilde{S}_0)^2 (\\tilde{S}_0 \\tilde{S}_2 - \\tilde{S}_1 \\tilde{S}_1^T)$. We show that the limiting $\\tilde{S}_\\alpha$\u2019s are well-de\ufb01ned (in the proof of\nTheorem 1) and are directly related to the bias terms in the resubstitution estimator of entropy:\n\n$$B_{k,d} = E\\Big[ \\log\\Big(\\sum_{\\ell=1}^{k} E_\\ell\\Big) + \\frac{d}{2} \\log 2\\pi - \\log C_d - \\log \\tilde{S}_0 + \\frac{1}{2} \\log |\\tilde{\\Sigma}| + \\frac{1}{2 \\tilde{S}_0^2} \\big( \\tilde{S}_1^T \\tilde{\\Sigma}^{-1} \\tilde{S}_1 \\big) \\Big] . \\tag{13}$$\n\nIn practice, we propose using a \ufb01xed small k such as \ufb01ve. For k \u2264 3 the estimator has a very large\nvariance, and numerical evaluation of the corresponding bias also converges slowly. For some typical\nchoices of k, we provide approximate evaluations in Table 1, where \u22120.0183(\u00b16) indicates empirical mean\n\u00b5 = \u2212183 \u00d7 10^{\u22124} with con\ufb01dence interval 6 \u00d7 10^{\u22124}. In these numerical evaluations, we truncated\nthe summation at m = 50,000. Although we prove that $B_{k,d}$ converges in m, in practice, one can\nchoose m based on the number of samples and $B_{k,d}$ can be evaluated for that m.\n\nTable 1: Numerical evaluation of $B_{k,d}$, via sampling 1,000,000 instances for each pair (k, d).\n\nk      4             5             6             7             8             9\nd = 1  -0.0183(\u00b16)  -0.0233(\u00b16)  -0.0220(\u00b14)  -0.0200(\u00b14)  -0.0181(\u00b14)  -0.0171(\u00b13)\nd = 2  -0.1023(\u00b15)  -0.0765(\u00b14)  -0.0628(\u00b14)  -0.0528(\u00b13)  -0.0448(\u00b13)  -0.0401(\u00b13)\n\nTheoretical contribution: Our key technical innovation is a fundamental connection between nearest\nneighbor statistics and asymptotic order statistics, stated below as Lemma 3.1: we show that the\n(normalized) distances $\\rho_{\\ell,i}$\u2019s jointly converge to the standardized uniform order statistics and the\ndirections $(X_{j_\\ell} - X_i)/\\|X_{j_\\ell} - X_i\\|$\u2019s converge to independent uniform distribution (Haar measure)\nover the unit sphere.\n\nConditioned on Xi = x, the proposed estimator uses nearest neighbor 
statistics on $Z_{\\ell,i} \\equiv X_{j_\\ell} - x$,\nwhere $X_{j_\\ell}$ is the $\\ell$-th nearest neighbor from x such that $Z_{\\ell,i} = ((X_{j_\\ell} - X_i)/\\|X_{j_\\ell} - X_i\\|)\\, \\rho_{\\ell,i}$.\nNaturally, all the techniques we develop in this paper generalize to any estimators that depend on the\nnearest neighbor statistics $\\{Z_{\\ell,i}\\}_{i,\\ell \\in [n]}$ \u2013 and the value of such a general result is demonstrated later\n(in Section 4) when we evaluate the bias in similarly inspired entropy estimators [2, 3, 16, 9].\nLemma 3.1. Let E1, E2, . . . , Em be i.i.d. standard exponential random variables and \u03be1, \u03be2, . . . , \u03bem\nbe i.i.d. random variables drawn uniformly over the unit (d \u2212 1)-dimensional sphere in d dimensions,\nindependent of the Ei\u2019s. Suppose f is twice continuously differentiable and x \u2208 Rd satis\ufb01es that\nthere exists \u03b5 > 0 such that f (a) > 0, $\\|\\nabla f(a)\\| = O(1)$ and $\\|H_f(a)\\| = O(1)$ for any $\\|a - x\\| < \\varepsilon$.\nThen for any m = O(log n), we have the following convergence conditioned on Xi = x:\n\n$$\\lim_{n \\to \\infty} d_{TV}\\Big( (c_d\\, n\\, f(x))^{1/d} ( Z_{1,i}, \\ldots, Z_{m,i} ) ,\\; \\big( \\xi_1 E_1^{1/d}, \\ldots, \\xi_m (\\textstyle\\sum_{\\ell=1}^{m} E_\\ell)^{1/d} \\big) \\Big) = 0 , \\tag{14}$$\n\nwhere dTV(\u00b7,\u00b7) is the total variation and cd is the volume of unit Euclidean ball in Rd.\nEmpirical contribution: Numerical experiments suggest that the proposed estimator outperforms\nstate-of-the-art entropy estimators, and the gap increases with correlation. The idea of using k-NN\ndistance as bandwidth for entropy estimation was originally proposed by Kozachenko and Leonenko\nin [9], and is a special case of the k-LNN method we propose with degree 0 and a step kernel. We\nrefer to Section 4 for a formal comparison. 
Another popular resubstitution entropy estimator uses\nthe KDE in (3) [7], which is a special case of the k-LNN method with degree 0; the Gaussian\nkernel is used in simulations. As a comparison, we also study a new estimator [8] based on von Mises\nexpansion (as opposed to simple re-substitution) which has an improved convergence rate in the large\nsample regime. We relegate simulation results to Section B in the appendix.\n\n4 Universality of the k-LNN approach\n\nIn this section, we show that Theorem 1 holds universally for a general family of entropy estimators,\nspeci\ufb01ed by the choice of k \u2208 Z+, degree p \u2208 Z+, and a kernel K : Rd \u2192 R, thus allowing a uni\ufb01ed\nview of several seemingly disparate entropy estimators [9, 2, 3, 16]. The template of the entropy\nestimator is the following: given n i.i.d. samples, we \ufb01rst compute the local density estimate by\nmaximizing the local likelihood (1) with bandwidth $\\rho_{k,i}$, and then resubstitute it to estimate entropy:\n$\\hat{H}^{(n)}_{k,p,K}(X) = -(1/n) \\sum_{i=1}^{n} \\log \\hat{f}_n(X_i)$.\n\nTheorem 2. For the family of estimators described above, under the hypotheses of Theorem 1, if\nthe solution to the maximization $\\hat{a}(x) = \\arg\\max_a L_x(f_{a,x})$ exists for all x \u2208 {X1, . . . , Xn}, then\nfor any choice of k \u2265 p + 1, p \u2208 Z+, and K : Rd \u2192 R, the asymptotic bias is independent of the\nunderlying distribution:\n\n$$\\lim_{n \\to \\infty} E[\\hat{H}^{(n)}_{k,p,K}(X)] = H(X) + \\tilde{B}_{k,p,K,d} , \\tag{15}$$\n\nfor some constant $\\tilde{B}_{k,p,K,d}$ that only depends on k, p, K and d.\n\nWe provide a proof in Section G. Although in general there is no simple analytical characterization of\nthe asymptotic bias $\\tilde{B}_{k,p,K,d}$, it can be readily numerically computed: since $\\tilde{B}_{k,p,K,d}$ is independent\nof the underlying distribution, one can run the estimator over i.i.d. samples from any distribution and\nnumerically approximate the bias for any choice of the parameters. However, when the maximization\n$\\hat{a}(x) = \\arg\\max_a L_x(f_{a,x})$ admits a closed form solution, as is the case with the proposed k-LNN, then\n$\\tilde{B}_{k,p,K,d}$ can be characterized explicitly in terms of uniform order statistics.\n\nThis family of estimators is general: for instance, the popular KL estimator is a special case with\np = 0 and a step kernel $K(u) = \\mathbb{I}(\\|u\\| \\le 1)$. [9] showed (in a remarkable result at the time)\nthat the asymptotic bias is independent of the dimension d and can be computed exactly to be\n$\\log n - \\psi(n) + \\psi(k) - \\log k$, where $\\psi$ is the digamma function de\ufb01ned as $\\psi(x) = \\Gamma(x)^{-1}\\, d\\Gamma(x)/dx$.\nThe dimension independent nature of this asymptotic bias term (of $O(n^{-1/2})$ for d = 1 in [24,\nTheorem 1] and $O(n^{-1/d})$ for general d in [4]) is special to the choice of p = 0 and the step kernel;\nwe explain this in detail in Section G, later in the paper. Analogously, the estimator in [2] can be\nviewed as a special case with p = 0 and an ellipsoidal step kernel.\n\n5 k-LNN Mutual information estimator\n\nGiven an entropy estimator $\\hat{H}_{\\mathrm{KL}}$, mutual information can be estimated: $\\hat{I}_{\\mathrm{3KL}} = \\hat{H}_{\\mathrm{KL}}(X) + \\hat{H}_{\\mathrm{KL}}(Y) - \\hat{H}_{\\mathrm{KL}}(X, Y)$.\nIn [10], Kraskov, St\u00f6gbauer and Grassberger introduced $\\hat{I}_{\\mathrm{KSG}}(X; Y)$\nby coupling the choices of the bandwidths. 
The joint entropy is estimated in the usual way, but for the\nmarginal entropy, instead of using kNN distances from {Xj}, the bandwidth $h_{X_i} = \\rho_{k,i}(X, Y)$ is\nchosen, which is the k nearest neighbor distance from (Xi, Yi) for the joint data {(Xj, Yj)}. Consider\n$\\hat{I}_{\\mathrm{3LNN}}(X; Y) = \\hat{H}_{\\mathrm{kLNN}}(X) + \\hat{H}_{\\mathrm{kLNN}}(Y) - \\hat{H}_{\\mathrm{kLNN}}(X, Y)$. Inspired by [10], we introduce the\nfollowing novel mutual information estimator, which we denote by $\\hat{I}_{\\mathrm{LNN-KSG}}(X; Y)$: for the joint\n(X, Y ) we use the LNN entropy estimator we proposed in (9), and for the marginal entropy we\nuse the bandwidth $h_{X_i} = \\rho_{k,i}(X, Y)$ coupled to the joint estimator. Empirically, we observe that $\\hat{I}_{\\mathrm{KSG}}$\noutperforms $\\hat{I}_{\\mathrm{3KL}}$ everywhere, validating the use of correlated bandwidths. However, the performance\nof $\\hat{I}_{\\mathrm{LNN-KSG}}$ is similar to that of $\\hat{I}_{\\mathrm{3LNN}}$, sometimes better and sometimes worse.\n\nEmpirical Contribution: Numerical experiments show that for most regimes of correlation, both\n3LNN and LNN-KSG outperform other state-of-the-art estimators, and the gap increases with\ncorrelation r. In the large sample limit, all estimators \ufb01nd the correct mutual information, but both\nLNN and LNN-KSG are signi\ufb01cantly more robust compared to other approaches. Mutual information\nestimators have been recently proposed in [2, 3, 16] based on local likelihood maximization. However,\nthey involve heuristic choices of hyper-parameters or solving elaborate optimization and numerical\nintegrations, which are far from being easy to implement. Simulation results can be found in\nSection 
C in the appendix.\n\n6 Breaking the bandwidth barrier\n\nWhile k-NN distance based bandwidths are routine in practical usage [21], the main \ufb01nding of this work\nis that they also turn out to be the \u201ccorrect\u201d mathematical choice for the purpose of asymptotically\nunbiased estimation of an integral functional such as the entropy, $-\\int f(x) \\log f(x)\\, dx$; we brie\ufb02y\ndiscuss the rami\ufb01cations below. Traditionally, when the goal is to estimate f (x), it is well known\nthat the bandwidth should satisfy $h \\to 0$ and $n h^d \\to \\infty$ for KDEs to be consistent. As a rule of\nthumb, $h = 1.06\\, \\hat{\\sigma}\\, n^{-1/5}$ is suggested when d = 1, where $\\hat{\\sigma}$ is the sample standard deviation [29,\nChapter 6.3]. On the other hand, when estimating entropy, as well as other integral functionals, it is\nknown that resubstitution estimators of the form $-(1/n) \\sum_{i=1}^{n} \\log \\hat{f}(X_i)$ achieve variances scaling\nas O(1/n) independent of the bandwidth [13]. This allows for a bandwidth as small as $O(n^{-1/d})$.\nThe bottleneck in choosing such a small bandwidth is the bias, scaling as $O(h^2 + (n h^d)^{-1} + E_n)$ [13],\nwhere the lower order dependence on n, dubbed $E_n$, is generally not known. The barrier in choosing a\nglobal bandwidth of $h = O(n^{-1/d})$ is the strictly positive bias whose value depends on the unknown\ndistribution and cannot be subtracted off. However, perhaps surprisingly, the proposed local and\nadaptive choice of the k-NN distance admits an asymptotic bias that is independent of the unknown\nunderlying distribution. Manually subtracting off the non-vanishing bias gives an asymptotically\nunbiased estimator, with a potentially faster convergence as numerically compared below. Figure 2\nillustrates how k-NN based bandwidth signi\ufb01cantly improves upon, say, a rule-of-thumb choice of\n$O(n^{-1/(d+4)})$ explained above and another choice of $O(n^{-1/(d+2)})$. 
In the left figure, we use the setting from Figure 3 (right) but with correlation r = 0.999. On the right, we generate X \u223c N(0, 1) and U from uniform [0, 0.01] and let Y = X + U and estimate I(X; Y). Following recent advances in [12, 22], the proposed local estimator has a potential to be extended to, for example, R\u00e9nyi entropy, but with a multiplicative bias as opposed to additive.\n\n[Figure 2 plots, left and right panels: mean squared error E[(\u00ce \u2212 I)^2] versus number of samples n. Curves: kNN bandwidth, fixed bandwidth N^{\u22121/(d+4)}, fixed bandwidth N^{\u22121/(d+2)}.]\n\nFigure 2: Local and adaptive bandwidth significantly improves over rule-of-thumb fixed bandwidth.\n\nAcknowledgement\n\nThis work is supported by NSF SaTC award CNS-1527754, NSF CISE award CCF-1553452, NSF CISE award CCF-1617745. We thank the anonymous reviewers for their constructive feedback.\n\nReferences\n\n[1] G. Biau and L. Devroye. Lectures on the Nearest Neighbor Method. Springer, 2016.\n\n[2] S. Gao, G. Ver Steeg, and A. Galstyan. Efficient estimation of mutual information for strongly dependent variables. International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.\n\n[3] S. Gao, G. Ver Steeg, and A. Galstyan. Estimating mutual information by local Gaussian approximation. 31st Conference on Uncertainty in Artificial Intelligence (UAI), 2015.\n\n[4] W. Gao, S. Oh, and P. Viswanath. Demystifying fixed k-nearest neighbor information estimators. arXiv preprint arXiv:1604.03006, 2016.\n\n[5] M. N. Goria, N. N. Leonenko, V. V. Mergel, and P. L. Novi Inverardi. A new class of random vector entropy estimators and its applications in testing statistical hypotheses. Nonparametric Statistics, 17(3):277\u2013297, 2005.\n\n[6] N. Hjort and M. Jones. Locally parametric nonparametric density estimation. The Annals of Statistics, pages 1619\u20131647, 1996.\n\n[7] H. Joe. Estimation of entropy and other functionals of a multivariate density. 
Annals of the Institute of Statistical Mathematics, 41(4):683\u2013697, 1989.\n\n[8] K. Kandasamy, A. Krishnamurthy, B. P\u00f3czos, and L. Wasserman. Nonparametric von Mises estimators for entropies, divergences and mutual informations. In NIPS, pages 397\u2013405, 2015.\n\n[9] L. F. Kozachenko and N. N. Leonenko. Sample estimate of the entropy of a random vector. Problemy Peredachi Informatsii, 23(2):9\u201316, 1987.\n\n[10] A. Kraskov, H. St\u00f6gbauer, and P. Grassberger. Estimating mutual information. Physical Review E, 69(6):066138, 2004.\n\n[11] S. Krishnaswamy, M. Spitzer, M. Mingueneau, S. Bendall, O. Litvin, E. Stone, D. Peer, and G. Nolan. Conditional density-based analysis of T cell signaling in single-cell data. Science, 346:1250689, 2014.\n\n[12] N. Leonenko, L. Pronzato, and V. Savani. A class of R\u00e9nyi information estimators for multidimensional densities. The Annals of Statistics, 36(5):2153\u20132182, 2008.\n\n[13] H. Liu, L. Wasserman, and J. D. Lafferty. Exponential concentration for mutual information estimation with application to forests. In NIPS, pages 2537\u20132545, 2012.\n\n[14] C. Loader. Local regression and likelihood. Springer Science & Business Media, 2006.\n\n[15] C. R. Loader. Local likelihood density estimation. The Annals of Statistics, 24(4):1602\u20131618, 1996.\n\n[16] D. Lombardi and S. Pant. Nonparametric k-nearest-neighbor entropy estimator. Physical Review E, 93(1):013310, 2016.\n\n[17] C. D. Manning, P. Raghavan, and H. Sch\u00fctze. Introduction to information retrieval, volume 1. Cambridge University Press, Cambridge, 2008.\n\n[18] D. P\u00e1l, B. P\u00f3czos, and C. Szepesv\u00e1ri. 
Estimation of R\u00e9nyi entropy and mutual information based on generalized nearest-neighbor graphs. In Advances in Neural Information Processing Systems, pages 1849\u20131857, 2010.\n\n[19] R.-D. Reiss. Approximate distributions of order statistics: with applications to nonparametric statistics. Springer Science & Business Media, 2012.\n\n[20] D. Reshef, Y. Reshef, H. Finucane, S. Grossman, G. McVean, P. Turnbaugh, E. Lander, M. Mitzenmacher, and P. Sabeti. Detecting novel associations in large data sets. Science, 334(6062):1518\u20131524, 2011.\n\n[21] S. J. Sheather. Density estimation. Statistical Science, 19(4):588\u2013597, 2004.\n\n[22] S. Singh and B. P\u00f3czos. Analysis of k-nearest neighbor distances with application to entropy estimation. arXiv preprint arXiv:1603.08578, 2016.\n\n[23] G. Ver Steeg and A. Galstyan. The information sieve. To appear in ICML, arXiv:1507.02284, 2016.\n\n[24] A. B. Tsybakov and E. C. Van der Meulen. Root-n consistent estimators of entropy for densities with unbounded support. Scandinavian Journal of Statistics, pages 75\u201383, 1996.\n\n[25] G. Ver Steeg and A. Galstyan. Discovering structure in high-dimensional data through correlation explanation. In Advances in Neural Information Processing Systems, pages 577\u2013585, 2014.\n\n[26] P. Vincent and Y. Bengio. Locally weighted full covariance Gaussian density estimation. Technical report 1240, 2003.\n\n[27] Q. Wang, S. R. Kulkarni, and S. Verd\u00fa. Divergence estimation for multidimensional densities via k-nearest-neighbor distances. IEEE Transactions on Information Theory, 55(5):2392\u20132405, 2009.\n\n[28] Q. Wang, S. R. Kulkarni, and S. Verd\u00fa. Universal estimation of information measures for analog sources. Foundations and Trends in Communications and Information Theory, 5(3):265\u2013353, 2009.\n\n[29] L. Wasserman. All of nonparametric statistics. 
Springer Science & Business Media, 2006.", "award": [], "sourceid": 1292, "authors": [{"given_name": "Weihao", "family_name": "Gao", "institution": "UIUC"}, {"given_name": "Sewoong", "family_name": "Oh", "institution": "UIUC"}, {"given_name": "Pramod", "family_name": "Viswanath", "institution": "UIUC"}]}