{"title": "Ensemble weighted kernel estimators for multivariate entropy estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 566, "page_last": 574, "abstract": "The problem of estimation of entropy functionals of probability densities has received much attention in the information theory, machine learning and statistics communities. Kernel density plug-in estimators are simple, easy to implement and widely used for estimation of entropy. However, kernel plug-in estimators suffer from the curse of dimensionality, wherein the MSE rate of convergence is glacially slow - of order $O(T^{-{\\gamma}/{d}})$, where $T$ is the number of samples, and $\\gamma>0$ is a rate parameter. In this paper, it is shown that for sufficiently smooth densities, an ensemble of kernel plug-in estimators can be combined via a weighted convex combination, such that the resulting weighted estimator has a superior parametric MSE rate of convergence of order $O(T^{-1})$. Furthermore, it is shown that these optimal weights can be determined by solving a convex optimization problem which does not require training data or knowledge of the underlying density, and therefore can be performed offline. This novel result is remarkable in that, while each of the individual kernel plug-in estimators belonging to the ensemble suffer from the curse of dimensionality, by appropriate ensemble averaging we can achieve parametric convergence rates.", "full_text": "Ensemble weighted kernel estimators\n\nfor multivariate entropy estimation\n\nKumar Sricharan, Alfred O. Hero III\n\nDepartment of EECS\nUniversity of Michigan\nAnn Arbor, MI 48104\n\n{kksreddy,hero}@umich.edu\n\nAbstract\n\nThe problem of estimation of entropy functionals of probability densities\nhas received much attention in the information theory, machine learning\nand statistics communities. Kernel density plug-in estimators are simple,\neasy to implement and widely used for estimation of entropy. 
However, for large feature dimension d, kernel plug-in estimators suffer from the curse of dimensionality: the MSE rate of convergence is glacially slow - of order $O(T^{-\gamma/d})$, where T is the number of samples, and $\gamma > 0$ is a rate parameter. In this paper, it is shown that for sufficiently smooth densities, an ensemble of kernel plug-in estimators can be combined via a weighted convex combination, such that the resulting weighted estimator has a superior parametric MSE rate of convergence of order $O(T^{-1})$. Furthermore, it is shown that these optimal weights can be determined by solving a convex optimization problem which does not require training data or knowledge of the underlying density, and therefore can be performed offline. This novel result is remarkable in that, while each of the individual kernel plug-in estimators belonging to the ensemble suffer from the curse of dimensionality, by appropriate ensemble averaging we can achieve parametric convergence rates.

1 Introduction

Non-linear entropy functionals of a multivariate density f of the form $\int g(f(x), x) f(x) dx$ arise in applications including machine learning, signal processing, mathematical statistics, and statistical communication theory. Important examples of such functionals include Shannon and Rényi entropy. Entropy based applications include image registration and texture classification, ICA, anomaly detection, data and image compression, testing of statistical models and parameter estimation. For details and other applications, see, for example, Beirlant et al. [2] and Leonenko et al. [18]. In these applications, the functional of interest must be estimated empirically from sample realizations of the underlying densities. Several estimators of entropy measures have been proposed for general multivariate densities f.
These include consistent estimators based on histograms [10, 2], kernel density plug-in estimators, entropic graphs [5, 20], gap estimators [24] and nearest neighbor distances [8, 18, 19].

Kernel density plug-in estimators [1, 6, 11, 15, 12] are simple, easy to implement, computationally fast and therefore widely used for estimation of entropy [2, 23, 14, 4, 13]. However, these estimators suffer from mean squared error (MSE) rates which typically grow with feature dimension d as $O(T^{-\gamma/d})$, where T is the number of samples and $\gamma$ is a positive rate parameter.

In this paper, we propose a novel weighted ensemble kernel density plug-in estimator of entropy $\hat{G}_w$ that achieves parametric MSE rates of $O(T^{-1})$ when the feature density is smooth. The estimator is constructed as a weighted convex combination $\hat{G}_w = \sum_{l \in \bar{l}} w(l) \hat{G}_{k(l)}$ of individual kernel density plug-in estimators $\hat{G}_{k(l)}$ wrt the weights $\{w(l); l \in \bar{l}\}$. Here, $\bar{l}$ is a vector of indices $\{l_1, .., l_L\}$ and $k(l) = l\sqrt{T/2}$ is proportional to the volume of the kernel bins used in evaluating $\hat{G}_{k(l)}$. The individual kernel estimators $\hat{G}_{k(l)}$ are similar to the data-split kernel estimator of Györfi and van der Meulen [11], and have slow MSE rates of convergence of order $O(T^{-1/(1+d)})$. Please refer to Section 2 for the exact definition of $\hat{G}_{k(l)}$.

The principal result presented in this paper is as follows. It is shown that the weights $\{w(l); l \in \bar{l}\}$ can be chosen so as to significantly improve the rate of MSE convergence of the weighted estimator $\hat{G}_w$. In fact our ensemble averaging method can improve MSE convergence of $\hat{G}_w$ to the parametric rate $O(T^{-1})$. These optimal weights can be determined by solving a convex optimization problem.
Furthermore, this optimization problem does not involve any density-dependent parameters and can therefore be performed offline.

1.1 Related work

Ensemble based methods have been previously proposed in the context of classification. For example, in both boosting [21] and multiple kernel learning [16] algorithms, lower complexity weak learners are combined to produce classifiers with higher accuracy. Our work differs from these methods in several ways. First and foremost, our proposed method performs estimation rather than classification. An important consequence of this is that the weights we use are data independent, while the weights in boosting and multiple kernel learning must be estimated from training data since they depend on the unknown distribution.

Birgé and Massart [3] show that for density f in a Hölder smoothness class with s derivatives, the minimax MSE rate for estimation of a smooth functional is $T^{-2\gamma}$, where $\gamma = \min\{1/2, 4s/(4s + d)\}$. This means that for $s > d/4$, parametric rates are achievable. The kernel estimators proposed in this paper require higher order smoothness conditions on the density, i.e. the density must be $s > d$ times differentiable. While there exist other estimators [17, 7] that achieve parametric MSE rates of $O(1/T)$ when $s > d/4$, these estimators are more difficult to implement than kernel density estimators, which are a staple of many toolboxes in machine learning, pattern recognition, and statistics. The proposed ensemble weighted estimator is a simple weighted combination of off-the-shelf kernel density estimators.

1.2 Organization

The remainder of the paper is organized as follows. We formally describe the kernel plug-in entropy estimators in Section 2 and discuss the MSE convergence properties of these estimators. In particular, we establish that these estimators have MSE rate which decays as $O(T^{-1/(1+d)})$.
Next, we propose the weighted ensemble of kernel entropy estimators in Section 3. Subsequently, we provide an MSE-optimal set of weights as the solution to a convex optimization (3.4) and show that the resultant optimally weighted estimator has an MSE of $O(T^{-1})$. We present simulation results in Section 4 that illustrate the superior performance of this ensemble entropy estimator in the context of (i) estimation of the Panter-Dite distortion-rate factor [9] and (ii) testing the probability distribution of a random sample. We conclude the paper in Section 5.

Notation

We will use bold face type to indicate random variables and random vectors and regular type face for constants. We denote the expectation operator by the symbol E, the variance operator as $V[X] = E[(X - E[X])^2]$, and the bias of an estimator by B.

2 Entropy estimation

This paper focuses on the estimation of general non-linear functionals G(f) of d-dimensional multivariate densities f with known support $S = [a, b]^d$, where G(f) has the form

$$G(f) = \int g(f(x), x) f(x) d\mu(x), \qquad (2.1)$$

for some smooth function g(f, x). Let B denote the boundary of S. Here, $\mu$ denotes the Lebesgue measure and E denotes statistical expectation with respect to the density f. Assume that $T = N + M$ i.i.d. realizations of feature vectors $\{X_1, \ldots, X_N, X_{N+1}, \ldots, X_{N+M}\}$ are available from the density f. In the sequel f will be called the feature density.

2.1 Plug-in estimators of entropy

A truncated kernel density estimator with uniform kernel is defined below. Our proposed weighted ensemble method applies to other types of kernels as well but we specialize to uniform kernels as it makes the derivations clearer. For integer $1 \le k \le M$, define the distance $d_k$ to be $d_k = (k/M)^{1/d}$.
De\ufb01ne the truncated kernel bin region for each X \u2208 S\nto be Sk(X) = {Y \u2208 S : ||X \u2212 Y ||1 \u2264 dk/2}, and the volume of the truncated kernel bins\nto be Vk(X) = RSk(X) dz. Note that when the smallest distance from X to S is greater\nthan dk, Vk(X) = dd\nk = k/M . Let lk(X) denotes the number of points falling in Sk(X):\nlk(X) = PM\n\ni=1 1{Xi\u2208Sk(X)}. The truncated kernel density estimator is de\ufb01ned as\n\n\u02c6fk(X) =\n\n.\n\n(2.2)\n\nlk(X)\n\nM Vk(X)\n\nThe plug-in estimator of the density functional is constructed using a data splitting ap-\nproach as follows. The data is randomly subdivided into two parts {X1, . . . , XN} and\n{XN +1, . . . , XN +M} of N and M points respectively.\nIn the \ufb01rst stage, we estimate\nthe kernel density estimate \u02c6fk at the N points {X1, . . . , XN} using the M realizations\n{XN +1, . . . , XN +M}. Subsequently, we use the N samples {X1, . . . , XN} to approximate\nthe functional G(f ) and obtain the plug-in estimator:\n\n\u02c6Gk =\n\n1\nN\n\nN\n\nXi=1\n\ng(\u02c6f k(Xi), Xi).\n\n(2.3)\n\nAlso de\ufb01ne a standard kernel density estimator with uniform kernel \u02dcfk(X), which is identical\nto \u02c6fk(X) except that the volume Vk(X) is always set to be Vk(X) = k/M . De\ufb01ne\n\n\u02dcGk =\n\n1\nN\n\nN\n\nXi=1\n\ng(\u02dcf k(Xi), Xi).\n\n(2.4)\n\nThe estimator \u02dcGk is identical to the estimator of Gy\u00a8or\ufb01 and van der Muelen [11]. Observe\nthat the implementation of \u02dcGk, unlike \u02c6Gk, does not require knowledge about the support\nof the density.\n\n2.1.1 Assumptions\n\nWe make a number of technical assumptions that will allow us to obtain tight MSE con-\nvergence rates for the kernel density estimators de\ufb01ned in above. These assumptions are\ncomparable to other rigorous treatments of entropy estimation. Please refer to Section\nII, [2] for details. 
(A.0): Assume that the kernel bandwidth satisfies $k = k_0 M^{\beta}$ for any rate constant $0 < \beta < 1$, and assume that M, N and T are linearly related through the proportionality constant $\alpha_{frac}$ with: $0 < \alpha_{frac} < 1$, $M = \alpha_{frac} T$ and $N = (1 - \alpha_{frac}) T$. (A.1): Let the feature density f be uniformly bounded away from 0 and upper bounded on the set S, i.e., there exist constants $\epsilon_0$, $\epsilon_\infty$ such that $0 < \epsilon_0 \le f(x) \le \epsilon_\infty < \infty$ for all $x \in S$. (A.2): Assume that the density f has continuous partial derivatives of order d in the interior of the set S, and that these derivatives are upper bounded. (A.3): Assume that the function g(f, x) has $\max\{\lambda, d\}$ partial derivatives w.r.t. the argument f, where $\lambda$ satisfies the condition $\lambda\beta > 1$. Denote the n-th partial derivative of g(f, x) wrt the argument f by $g^{(n)}(f, x)$. Also, let $g'(f, x) := g^{(1)}(f, x)$ and $g''(f, x) := g^{(2)}(f, x)$. (A.4): Assume that the absolute value of the functional g(f, x) and its partial derivatives are strictly bounded away from $\infty$ in the range $\epsilon_0 < f < \epsilon_\infty$ for all x. (A.5): Let $\epsilon \in (0, 1)$ and $\delta \in (2/3, 1)$. Let C(M) be a positive function satisfying the condition $C(M) = O(\exp(-M^{\beta(1-\delta)}))$. For some fixed $0 < \epsilon < 1$, define $p_l = (1 - \epsilon)\epsilon_0$ and $p_u = (1 + \epsilon)\epsilon_\infty$. Assume that the following four conditions are satisfied by $h(f, x) = g(f, x)$, $g^{(3)}(f, x)$ and $g^{(\lambda)}(f, x)$: (i) $\sup_x |h(0, x)| = G_1 < \infty$, (ii) $\sup_{f \in (p_l, p_u), x} |h(f, x)| = G_2/4 < \infty$, (iii) $\sup_{f \in (1/k, p_u), x} |h(f, x)| C(M) = G_3 < \infty$, and (iv) $E[\sup_{f \in (p_l, 2^d M/k), x} |h(f, x)|] C(M) = G_4 < \infty$.

2.1.2 Analysis of MSE

Under these assumptions, we have shown the following (please see [22] for the proof):

Theorem 1.
The bias of the plug-in estimators $\hat{G}_k$, $\tilde{G}_k$ is given by

$$B(\hat{G}_k) = \sum_{i \in I} c_{1,i} \left(\frac{k}{M}\right)^{i/d} + \frac{c_2}{k} + o\left(\frac{1}{k} + \frac{k}{M}\right),$$

$$B(\tilde{G}_k) = c_1 \left(\frac{k}{M}\right)^{1/d} + \frac{c_2}{k} + o\left(\frac{1}{k} + \frac{k}{M}\right).$$

Theorem 2. The variance of the plug-in estimators $\hat{G}_k$, $\tilde{G}_k$ is given by

$$V(\hat{G}_k) = c_4 \left(\frac{1}{N}\right) + c_5 \left(\frac{1}{M}\right) + o\left(\frac{1}{M} + \frac{1}{N}\right),$$

$$V(\tilde{G}_k) = c_4 \left(\frac{1}{N}\right) + c_5 \left(\frac{1}{M}\right) + o\left(\frac{1}{M} + \frac{1}{N}\right).$$

In the above expressions, $c_{1,i}$, $c_1$, $c_2$, $c_4$ and $c_5$ are constants that depend only on g, f and their partial derivatives, and $I = \{1, \ldots, d\}$. In particular, the constants $c_{1,i}$, $c_1$, $c_2$, $c_4$ and $c_5$ are independent of k, N and M.

2.1.3 Optimal MSE rate

From Theorem 1, we require $k \to \infty$ and $k/M \to 0$ for the estimators $\hat{G}_k$ and $\tilde{G}_k$ to be asymptotically unbiased. Likewise, from Theorem 2, we require $N \to \infty$ and $M \to \infty$ for the variance of the estimators to converge to 0. We can optimize the choice of bandwidth k, and the data splitting proportions N/(N + M), M/(N + M), for minimum MSE.

Minimizing the MSE over k is equivalent to minimizing the bias over k. The optimal choice of k is given by $k_{opt} = O(M^{1/(1+d)})$, and the bias evaluated at $k_{opt}$ is $O(M^{-1/(1+d)})$. Also observe that the MSE of $\hat{G}_k$ and $\tilde{G}_k$ is dominated by the squared bias ($O(M^{-2/(1+d)})$) as contrasted to the variance ($O(1/N + 1/M)$). This implies that the asymptotic MSE rate of convergence is invariant to the selected proportionality constant $\alpha_{frac}$. The optimal MSE for the estimators $\hat{G}_k$ and $\tilde{G}_k$ is therefore achieved for the choice of $k = O(M^{1/(1+d)})$, and is given by $O(T^{-2/(1+d)})$. In particular, observe that both $\hat{G}_k$ and $\tilde{G}_k$ have identical optimal rates of MSE convergence. Our goal is to reduce the estimator MSE to $O(T^{-1})$.
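The bandwidth tradeoff just described can be checked numerically: minimizing the two leading bias terms of Theorem 1, with the constants $c_1$, $c_2$ set to 1 purely for illustration, recovers $k_{opt} = O(M^{1/(1+d)})$. A small sketch (the function name is our own):

```python
import numpy as np

def k_opt_numeric(M, d):
    """Minimize the leading bias terms (k/M)^(1/d) + 1/k from Theorem 1
    over integer k, with the constants c_1, c_2 set to 1 for illustration."""
    k = np.arange(1, M, dtype=float)
    bias = (k / M) ** (1.0 / d) + 1.0 / k
    return int(k[np.argmin(bias)])
```

For M = 10^6 and d = 3 the continuous minimizer is $d^{d/(d+1)} M^{1/(1+d)} \approx 72$, i.e. of order $M^{1/4}$, and the bias at the minimizer scales as $M^{-1/(1+d)}$, matching the rate in the text.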
We do so by applying the method of weighted ensembles described next in Section 3.

3 Ensemble estimators

For a positive integer $L > d$, choose $\bar{l} = \{l_1, \ldots, l_L\}$ to be a vector of distinct positive real numbers. Define the mapping $k(l) = l\sqrt{M}$ and let $\bar{k} = \{k(l); l \in \bar{l}\}$. Observe that any $k \in \bar{k}$ corresponds to the rate constant $\beta = 0.5$, and that $N = \Theta(T)$ and $M = \Theta(T)$. Define the weighted ensemble estimator

$$\hat{G}_w = \sum_{l \in \bar{l}} w(l) \hat{G}_{k(l)}. \qquad (3.1)$$

Theorem 3. There exists a weight vector $w^*$ such that

$$E[(\hat{G}_{w^*} - G(f))^2] = O(1/T).$$

This weight vector can be found by solving a convex optimization. Furthermore, this optimal weight vector does not depend on the unknown feature density f or the samples $\{X_1, .., X_{N+M}\}$, and hence can be solved off-line.

Proof. For each $i \in I$, define $\gamma_w(i) = \sum_{l \in \bar{l}} w(l) l^{i/d}$. The bias of the ensemble estimator follows from Theorem 1 and is given by

$$B[\hat{G}_w] = \sum_{i \in I} c_{1,i} \gamma_w(i) M^{-i/2d} + O\left(\frac{1}{\sqrt{T}}\right). \qquad (3.2)$$

Denote the covariance matrix of $\{\hat{G}_{k(l)}; l \in \bar{l}\}$ by $\Sigma_L$. Let $\bar{\Sigma}_L = \Sigma_L T$. Observe that by Theorem 2 and the Cauchy-Schwarz inequality, the entries of $\bar{\Sigma}_L$ are O(1). The variance of the weighted estimator $\hat{G}_w$ can then be bounded as follows:

$$V[\hat{G}_w] = V\left[\sum_{l \in \bar{l}} w(l) \hat{G}_{k(l)}\right] = w' \Sigma_L w = \frac{w' \bar{\Sigma}_L w}{T} \le \frac{\lambda_{max}(\bar{\Sigma}_L) ||w||_2^2}{T}. \qquad (3.3)$$

We seek a weight vector w that (i) ensures that the bias of the weighted estimator is $O(T^{-1/2})$ and (ii) has low $\ell_2$ norm $||w||_2$ in order to limit the contribution of the variance of the weighted estimator.
To this end, let $w^*$ be the solution to the convex optimization problem

$$\begin{array}{ll} \underset{w}{\text{minimize}} & ||w||_2 \\ \text{subject to} & \sum_{l \in \bar{l}} w(l) = 1, \\ & \gamma_w(i) = 0, \quad i \in I. \end{array} \qquad (3.4)$$

This problem is equivalent to minimizing $||w||_2$ subject to $A_0 w = b$, where $A_0$ and b are defined below. Let $f_{IN} : I \to \{1, .., I\}$ be a bijective mapping. Let $a_0$ be the vector of ones: $[1, 1, ..., 1]_{1 \times L}$; and let $a_{f_{IN}(i)}$, for $i \in I$, be given by $a_{f_{IN}(i)} = [l_1^{i/d}, .., l_L^{i/d}]$. Define $A_0 = [a_0', a_1', ..., a_I']'$, $A_1 = [a_1', ..., a_I']'$ and $b = [1; 0; 0; ..; 0]_{(I+1) \times 1}$. Observe that the entries of $A_0$ and b are O(1), and therefore the entries of the solution $w^*$ are O(1). Consequently, by (3.2), the bias $B[\hat{G}_{w^*}] = O(1/\sqrt{T})$. Furthermore, the optimal minimum $\eta(d) := ||w^*||_2$ is given by $\eta(d) = \sqrt{\det(A_1 A_1')/\det(A_0 A_0')}$. By (3.3), the estimator variance $V[\hat{G}_{w^*}]$ is of order $O(\eta(d)/T)$. This concludes the proof.

While we have illustrated the weighted ensemble method only in the context of kernel estimators, this method can be applied to any general ensemble of estimators that satisfy bias and variance conditions C.1 and C.2 in [22].

4 Experiments

We illustrate the superior performance of the proposed weighted ensemble estimator for two applications: (i) estimation of the Panter-Dite rate distortion factor, and (ii) estimation of entropy to test for randomness of a random sample.

4.1 Panter-Dite factor estimation

For a d-dimensional source with underlying density f, the Panter-Dite distortion-rate function [9] for a q-dimensional vector quantizer with n levels of quantization is given by $\delta(n) = n^{-2/q} \int f^{q/(q+2)}(x) dx$.
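Because the constraints of (3.4) are linear, $w^*$ is the minimum-norm solution of $A_0 w = b$ and can be computed in closed form with a pseudoinverse; no iterative solver is needed. A minimal sketch (the function name and the choice of $\bar{l}$ in the usage below are illustrative):

```python
import numpy as np

def optimal_weights(l_bar, d):
    """Solve (3.4): minimize ||w||_2 subject to sum_l w(l) = 1 and
    gamma_w(i) = sum_l w(l) * l**(i/d) = 0 for i = 1, ..., d."""
    l_bar = np.asarray(l_bar, dtype=float)
    L = len(l_bar)                      # requires L > d distinct positive values
    # Stack the rows a_0 = ones and a_i = (l^{i/d})_{l in l_bar}, i = 1..d
    A0 = np.vstack([np.ones(L)] + [l_bar ** (i / d) for i in range(1, d + 1)])
    b = np.zeros(d + 1)
    b[0] = 1.0                          # b = [1; 0; ...; 0]
    # For a consistent underdetermined system A0 w = b, pinv(A0) @ b is
    # exactly the minimum-l2-norm solution, i.e. w*.
    return np.linalg.pinv(A0) @ b
```

For example, `optimal_weights([1, 2, 3, 4], d=2)` returns weights that sum to 1 while zeroing both bias constraints $\gamma_w(1)$ and $\gamma_w(2)$.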
The Panter-Dite factor corresponds to the functional G(f) with $g(f, x) = n^{-2/q} f^{-2/(q+2)} I(f > 0) + I(f = 0)$, where I(.) is the indicator function.

[Figure 1: Variation of MSE of Panter-Dite factor estimates using standard kernel plug-in estimator [12], truncated kernel plug-in estimator (2.3), histogram plug-in estimator [11], k-NN estimator [19], entropic graph estimator [6,21] and the weighted ensemble estimator (3.1). (a) Variation of MSE of Panter-Dite factor estimates as a function of sample size T. From the figure, we see that the proposed weighted estimator has the fastest MSE rate of convergence wrt sample size T. (b) Variation of MSE of Panter-Dite factor estimates as a function of dimension d. From the figure, we see that the MSE of the proposed weighted estimator has the slowest rate of growth with increasing dimension d.]
The Panter-Dite factor is directly related to the Rényi $\alpha$-entropy, for which several other estimators have been proposed.

In our simulations we compare six different choices of functional estimators - the three estimators previously introduced: (i) the standard kernel plug-in estimator $\tilde{G}_k$, (ii) the boundary truncated plug-in estimator $\hat{G}_k$ and (iii) the weighted estimator $\hat{G}_w$ with optimal weight $w = w^*$ given by (3.4), and in addition the following popular entropy estimators: (iv) histogram plug-in estimator [10], (v) k-nearest neighbor (k-NN) entropy estimator [18] and (vi) entropic k-NN graph estimator [5, 20]. For both $\tilde{G}_k$ and $\hat{G}_k$, we select the bandwidth parameter k as a function of M according to the optimal proportionality $k = M^{1/(1+d)}$ and $N = M = T/2$. To illustrate the weighted estimator of the Panter-Dite factor we assume that f is the d = 6 dimensional mixture density $f(a, b, p, d) = p f_\beta(a, b, d) + (1 - p) f_u(d)$, where $f_\beta(a, b, d)$ is a d-dimensional Beta density with parameters a = 6, b = 6, $f_u(d)$ is a d-dimensional uniform density and the mixing ratio p is 0.8.

4.1.1 Variation of MSE with sample size T

The MSE results of these different estimators are shown in Fig. 1(a) as a function of sample size T.
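For reference, an end-to-end sketch of the ensemble estimate (3.1) for Shannon entropy. For brevity it uses the simpler standard (non-truncated) estimator $\tilde{G}_k$ of (2.4) rather than the truncated one, assumes support $[0, 1]^d$ with a cubic uniform kernel, and takes N = M = T/2; the function names and the choices of $\bar{l}$ and w in the usage below are illustrative, not the paper's experimental configuration.

```python
import numpy as np

def g_tilde(samples, k, g=lambda f: -np.log(np.maximum(f, 1e-12))):
    """Standard uniform-kernel plug-in estimator (2.4): V_k(X) = k/M always."""
    T, d = samples.shape
    N = T // 2
    X, data = samples[:N], samples[N:]          # data split, N = M = T/2
    M = data.shape[0]
    half = 0.5 * (k / M) ** (1.0 / d)           # half side of the cubic bin
    f_tilde = np.empty(N)
    for j in range(N):
        inside = np.all(np.abs(data - X[j]) <= half, axis=1)
        f_tilde[j] = inside.sum() / (M * (k / M))   # l_k(X)/(M V_k), V_k = k/M
    return float(np.mean(g(f_tilde)))

def ensemble_entropy(samples, l_bar, w):
    """Weighted ensemble (3.1): G_w = sum_l w(l) G_{k(l)}, with k(l) = l sqrt(M)."""
    M = samples.shape[0] // 2
    return sum(wl * g_tilde(samples, k=li * np.sqrt(M))
               for wl, li in zip(w, l_bar))
```

On a uniform sample the true Shannon entropy is 0; any weights summing to 1 give a consistent (if suboptimally converging) estimate, and plugging in the $w^*$ of (3.4) is what accelerates the MSE rate.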
It is clear from the figure that the proposed ensemble estimator $\hat{G}_w$ has significantly faster rate of convergence while the MSE of the rest of the estimators, including the truncated kernel plug-in estimator, have similar, slow rates of convergence. It is therefore clear that the proposed optimal ensemble averaging significantly accelerates the MSE convergence rate.

[Figure 2: Entropy estimates using standard kernel plug-in estimator, truncated kernel plug-in estimator and the weighted estimator, for random samples corresponding to hypothesis H0 and H1. The weighted estimator provided better discrimination ability by suppressing the bias, at the cost of some additional variance. (a) Entropy estimates for random samples corresponding to hypothesis H0 and H1. (b) Histogram envelopes of entropy estimates for random samples corresponding to hypothesis H0 (blue) and H1 (red).]

4.1.2 Variation of MSE with dimension d

The MSE results of these different estimators are shown in Fig. 
1(b) as a function of dimension d, for fixed sample size T = 3000. For the standard kernel plug-in estimator and truncated kernel plug-in estimator, the MSE varies exponentially with d as expected. The MSE of the histogram and k-NN estimators increases at a similar rate, indicating that these estimators suffer from the curse of dimensionality as well. The MSE of the weighted estimator on the other hand increases at a slower rate, which is in agreement with our theory that the MSE is $O(\eta(d)/T)$ and observing that $\eta(d)$ is an increasing function of d. Also observe that the MSE of the weighted estimator is significantly smaller than the MSE of the other estimators for all dimensions d > 3.

4.2 Distribution testing

In this section, Shannon differential entropy is estimated using the function $g(f, x) = -\log(f) I(f > 0) + I(f = 0)$ and used as a test statistic to test for the underlying probability distribution of a random sample. In particular, we draw 500 instances each of random samples of size $10^3$ from the probability distribution f(a, b, p, d), described in Sec. 4.1, with fixed d = 6, p = 0.75 for two sets of values of a, b under the null and alternate hypothesis, $H_0 : a = a_0, b = b_0$ versus $H_1 : a = a_1, b = b_1$.

First, we fix $a_0 = b_0 = 6$ and $a_1 = b_1 = 5$. We note that the underlying density under the null hypothesis f(6, 6, 0.75, 6) has greater curvature relative to f(5, 5, 0.75, 6) and therefore has smaller entropy (randomness). The true entropy, and entropy estimates using $\tilde{G}_k$, $\hat{G}_k$ and $\hat{G}_w$ for the cases corresponding to each of the 500 instances of hypothesis $H_0$ and $H_1$ are shown in Fig. 2(a).
From this figure, it is apparent that the weighted estimator provides better discrimination ability by suppressing the bias, at the cost of some additional variance.

To demonstrate that the weighted estimator provides better discrimination, we plot the histogram envelope of the entropy estimates using standard kernel plug-in estimator, truncated kernel plug-in estimator and the weighted estimator for the cases corresponding to the hypothesis $H_0$ (color coded blue) and $H_1$ (color coded red) in Fig. 2(b). Furthermore, we quantitatively measure the discriminative ability of the different estimators using the deflection statistic $d_s = |\mu_1 - \mu_0| / \sqrt{\sigma_0^2 + \sigma_1^2}$, where $\mu_0$ and $\sigma_0$ (respectively $\mu_1$ and $\sigma_1$) are the sample mean and standard deviation of the entropy estimates. The deflection statistic was found to be 1.49, 1.60 and 1.89 for the standard kernel plug-in estimator, truncated kernel plug-in estimator and the weighted estimator respectively. The receiver operating characteristic (ROC) curves for this test using these three different estimators are shown in Fig. 3(a). The corresponding areas under the ROC curves (AUC) are given by 0.9271, 0.9459 and 0.9619.

[Figure 3: Comparison of performance in terms of ROC for the distribution testing problem. The weighted estimator uniformly outperforms the individual plug-in estimators. (a) ROC curves corresponding to entropy estimates obtained using standard and truncated kernel plug-in estimator and the weighted estimator. The corresponding AUC are given by 0.9271, 0.9459 and 0.9619. (b) Variation of AUC curves vs $\delta$ (= $a_0 - a_1$, $b_0 - b_1$) corresponding to Neyman-Pearson omniscient test, entropy estimates using the standard and truncated kernel plug-in estimator and the weighted estimator.]

In our final experiment, we fix $a_0 = b_0 = 10$ and set $a_1 = b_1 = 10 - \delta$, draw 500 instances each of random samples of size $5 \times 10^3$ under the null and alternate hypothesis, and plot the AUC as $\delta$ varies from 0 to 1 in Fig. 3(b). For comparison, we also plot the AUC for the Neyman-Pearson likelihood ratio test. The Neyman-Pearson likelihood ratio test, unlike the Shannon entropy based tests, is an omniscient test that assumes knowledge of both the underlying beta-uniform mixture parametric model of the density and the parameter values $a_0, b_0$ and $a_1, b_1$ under the null and alternate hypothesis respectively. Figure 3(b) shows that the weighted estimator uniformly and significantly outperforms the individual plug-in estimators and is closest to the performance of the omniscient Neyman-Pearson likelihood test. The relatively superior performance of the Neyman-Pearson likelihood test is due to the fact that the weighted estimator is a nonparametric estimator that has marginally higher variance (proportional to $||w^*||_2^2$) compared to the underlying parametric model for which the Neyman-Pearson test statistic provides the most powerful test.

5 Conclusions

A novel method of weighted ensemble estimation was proposed in this paper.
This method combines slowly converging individual estimators to produce a new estimator with faster MSE rate of convergence. In this paper, we applied weighted ensembles to improve the MSE of a set of uniform kernel density estimators with different kernel width parameters. We showed by theory and in simulation that the improved ensemble estimator achieves the parametric MSE convergence rate $O(T^{-1})$. The optimal weights are determined by solving a convex optimization problem which does not require training data and can be performed offline. The superior performance of the weighted ensemble entropy estimator was verified in the context of two important problems: (i) estimation of the Panter-Dite factor and (ii) non-parametric hypothesis testing.

Acknowledgments

This work was partially supported by ARO grant W911NF-12-1-0443.

References

[1] I. Ahmad and Pi-Erh Lin. A nonparametric estimation of the entropy for absolutely continuous distributions (corresp.). Information Theory, IEEE Trans. on, 22(3):372–375, May 1976.

[2] J. Beirlant, EJ Dudewicz, L. Györfi, and EC Van der Meulen. Nonparametric entropy estimation: An overview. Intl. Journal of Mathematical and Statistical Sciences, 6:17–40, 1997.

[3] L. Birgé and P. Massart. Estimation of integral functions of a density. The Annals of Statistics, 23(1):11–29, 1995.

[4] D. Chauveau and P. Vandekerkhove. Selection of a MCMC simulation strategy via an entropy convergence criterion. ArXiv Mathematics e-prints, May 2006.

[5] J.A. Costa and A.O. Hero. Geodesic entropic graphs for dimension and entropy estimation in manifold learning. Signal Processing, IEEE Transactions on, 52(8):2210–2221, 2004.

[6] P. B. Eggermont and V. N. LaRiccia. Best asymptotic normality of the kernel density entropy estimator for smooth densities. Information Theory, IEEE Trans. on, 45(4):1321–1326, May 1999.

[7] E. 
Gin\u00b4e and D.M. Mason. Uniform in bandwidth estimation of integral functionals of the\n\ndensity function. Scandinavian Journal of Statistics, 35:739761, 2008.\n\n[8] M. Goria, N. Leonenko, V. Mergel, and P. L. Novi Inverardi. A new class of random vec-\ntor entropy estimators and its applications in testing statistical hypotheses. Nonparametric\nStatistics, 2004.\n\n[9] R. Gupta. Quantization Strategies for Low-Power Communications. PhD thesis, University of\n\nMichigan, Ann Arbor, 2001.\n\n[10] L. Gy\u00a8or\ufb01 and E. C. van der Meulen. Density-free convergence properties of various estimators\n\nof entropy. Comput. Statist. Data Anal., pages 425\u2013436, 1987.\n\n[11] L. Gy\u00a8or\ufb01 and E. C. van der Meulen. An entropy estimate based on a kernel density estimation.\n\nLimit Theorems in Probability and Statistics, pages 229\u2013240, 1989.\n\n[12] P. Hall and S. C. Morton. On the estimation of the entropy. Ann. Inst. Statist. Meth., 45:69\u201388,\n\n1993.\n\n[13] K. Hlav\u00b4a\u02c7ckov\u00b4a-Schindler, M. Palu\u02c7s, M. Vejmelka, and J. Bhattacharya. Causality detection\nbased on information-theoretic approaches in time series analysis. Physics Reports, 441(1):1\u2013\n46, 2007.\n\n[14] A.T. Ihler, J.W. Fisher III, and A.S. Willsky. Nonparametric estimators for online signature\nauthentication. In Acoustics, Speech, and Signal Processing, 2001. Proceedings.(ICASSP\u201901).\n2001 IEEE International Conference on, volume 6, pages 3473\u20133476. IEEE, 2001.\n\n[15] H. Joe. Estimation of entropy and other functionals of a multivariate density. Annals of the\n\nInstitute of Statistical Mathematics, 41(4):683\u2013697, 1989.\n\n[16] G. Lanckriet, N. Cristianini, P. Bartlett, and L. El Ghaoui. Learning the kernel matrix with\n\nsemi-de\ufb01nite programming. Journal of Machine Learning Research, 5:2004, 2002.\n\n[17] B. Laurent. E\ufb03cient estimation of integral functionals of a density. 
The Annals of Statistics, 24(2):659–681, 1996.

[18] N. Leonenko, L. Prozanto, and V. Savani. A class of Rényi information estimators for multidimensional densities. Annals of Statistics, 36:2153–2182, 2008.

[19] E. Liitiäinen, A. Lendasse, and F. Corona. On the statistical estimation of Rényi entropies. In Proceedings of IEEE/MLSP 2009 International Workshop on Machine Learning for Signal Processing, Grenoble (France), September 2-4 2009.

[20] D. Pal, B. Poczos, and C. Szepesvari. Estimation of Rényi entropy and mutual information based on generalized nearest-neighbor graphs. In Proc. Advances in Neural Information Processing Systems (NIPS). MIT Press, 2010.

[21] Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227, June 1990.

[22] K. Sricharan and A. O. Hero, III. Ensemble estimators for multivariate entropy estimation. ArXiv e-prints, March 2012.

[23] C. Studholme, C. Drapaca, B. Iordanova, and V. Cardenas. Deformation-based mapping of volume change from serial brain mri in the presence of local tissue contrast change. Medical Imaging, IEEE Transactions on, 25(5):626–639, 2006.

[24] B. van Es. Estimating functionals related to a density by class of statistics based on spacing. Scandinavian Journal of Statistics, 1992.
", "award": [], "sourceid": 280, "authors": [{"given_name": "Kumar", "family_name": "Sricharan", "institution": null}, {"given_name": "Alfred", "family_name": "Hero", "institution": null}]}