{"title": "Multivariate f-divergence Estimation With Confidence", "book": "Advances in Neural Information Processing Systems", "page_first": 2420, "page_last": 2428, "abstract": "The problem of f-divergence estimation is important in the fields of machine learning, information theory, and statistics. While several divergence estimators exist, relatively few have known convergence properties. In particular, even for those estimators whose MSE convergence rates are known, the asymptotic distributions are unknown. We establish the asymptotic normality of a recently proposed ensemble estimator of f-divergence between two distributions from a finite number of samples. This estimator has MSE convergence rate of O(1/T), is simple to implement, and performs well in high dimensions. This theory enables us to perform divergence-based inference tasks such as testing equality of pairs of distributions based on empirical samples. We experimentally validate our theoretical results and, as an illustration, use them to empirically bound the best achievable classification error.", "full_text":

Multivariate f-Divergence Estimation With Confidence

Kevin R. Moon
Department of EECS
University of Michigan
Ann Arbor, MI
krmoon@umich.edu

Alfred O. Hero III
Department of EECS
University of Michigan
Ann Arbor, MI
hero@eecs.umich.edu

Abstract

The problem of f-divergence estimation is important in the fields of machine learning, information theory, and statistics. While several nonparametric divergence estimators exist, relatively few have known convergence properties. In particular, even for those estimators whose MSE convergence rates are known, the asymptotic distributions are unknown. We establish the asymptotic normality of a recently proposed ensemble estimator of f-divergence between two distributions from a finite number of samples.
This estimator has MSE convergence rate of O(1/T), is simple to implement, and performs well in high dimensions. This theory enables us to perform divergence-based inference tasks such as testing equality of pairs of distributions based on empirical samples. We experimentally validate our theoretical results and, as an illustration, use them to empirically bound the best achievable classification error.

1 Introduction

This paper establishes the asymptotic normality of a nonparametric estimator of the f-divergence between two distributions from a finite number of samples. For many nonparametric divergence estimators, large-sample consistency has already been established, and the mean squared error (MSE) convergence rates are known for some. However, there are few results on the asymptotic distribution of nonparametric divergence estimators. Here we show that the asymptotic distribution is Gaussian for the class of ensemble f-divergence estimators [1], extending theory for entropy estimation [2, 3] to divergence estimation. The f-divergence is a measure of the difference between distributions and is important to the fields of machine learning, information theory, and statistics [4]. The f-divergence generalizes several measures including the Kullback-Leibler (KL) [5] and Rényi-α [6] divergences. Divergence estimation is useful for empirically estimating the decay rates of error probabilities in hypothesis testing [7], extending machine learning algorithms to distributional features [8, 9], and other applications such as text/multimedia clustering [10]. Additionally, a special case of the KL divergence is mutual information, which gives the capacities in data compression and channel coding [7]. Mutual information estimation has also been used in machine learning applications such as feature selection [11], fMRI data processing [12], clustering [13], and neuron classification [14].
Entropy is also a special case of divergence where one of the distributions is the uniform distribution. Entropy estimation is useful for intrinsic dimension estimation [15], texture classification and image registration [16], and many other applications.

However, one must go beyond entropy and divergence estimation in order to perform inference tasks on the divergence. An example of an inference task is detection: testing the null hypothesis that the divergence is zero, i.e., that the two populations have identical distributions. Prescribing a p-value on the null hypothesis requires specifying the null distribution of the divergence estimator. Another statistical inference problem is to construct a confidence interval on the divergence based on the divergence estimator. This paper provides solutions to these inference problems by establishing large-sample asymptotics on the distribution of divergence estimators. In particular, we consider the asymptotic distribution of the nonparametric weighted ensemble estimator of f-divergence from [1]. This estimator estimates the f-divergence from two finite populations of i.i.d. samples drawn from some unknown, nonparametric, smooth, d-dimensional distributions. The estimator [1] achieves an MSE convergence rate of O(1/T), where T is the sample size. See [17] for proof details.

1.1 Related Work

Estimators for some f-divergences already exist. For example, Póczos & Schneider [8] and Wang et al. [18] provided consistent k-nn estimators for the Rényi-α and KL divergences, respectively. Consistency has been proven for other mutual information and divergence estimators based on plug-in histogram schemes [19, 20, 21, 22]. Hero et al. [16] provided an estimator for the Rényi-α divergence but assumed that one of the densities was known.
However, none of these works study the convergence rates of their estimators, nor do they derive the asymptotic distributions.

Recent work has focused on deriving convergence rates for divergence estimators. Nguyen et al. [23], Singh and Póczos [24], and Krishnamurthy et al. [25] each proposed divergence estimators that achieve the parametric convergence rate (O(1/T)) under weaker conditions than those given in [1]. However, solving the convex problem of [23] can be more demanding for large sample sizes than the estimator given in [1], which depends only on simple density plug-in estimates and an offline convex optimization problem. Singh and Póczos only provide an estimator for Rényi-α divergences that requires several computations at each boundary of the support of the densities, which becomes difficult to implement as d gets large. Also, this method requires knowledge of the support of the densities, which may not be available for some problems. In contrast, while the convergence results for the estimator in [1] require the support to be bounded, knowledge of the support is not required for implementation. Finally, the estimators given in [25] estimate divergences that include functionals of the form $\int f_1^\alpha(x) f_2^\beta(x)\, d\mu(x)$ for given α, β. While a suitable α-β indexed sequence of divergence functionals of the form in [25] can be made to converge to the KL divergence, this does not guarantee convergence of the corresponding sequence of divergence estimates, whereas the estimator in [1] can be used to estimate the KL divergence. Also, for some divergences of the specified form, numerical integration is required for the estimators in [25], which can be computationally difficult.
In any case, the asymptotic distributions of the estimators in [23, 24, 25] are currently unknown.

Asymptotic normality has been established for certain appropriately normalized divergences between a specific density estimator and the true density [26, 27, 28]. However, this differs from our setting, where we assume that both densities are unknown. Under the assumption that the two densities are smooth, lower bounded, and have bounded support, we show that an appropriately normalized weighted ensemble average of kernel density plug-in estimators of f-divergence converges in distribution to the standard normal distribution. This is accomplished by constructing a sequence of interchangeable random variables and then showing (by concentration inequalities and Taylor series expansions) that the random variables and their squares are asymptotically uncorrelated. The theory developed to accomplish this can also be used to derive a central limit theorem for a weighted ensemble estimator of entropy such as the one given in [3]. We verify the theory by simulation. We then apply the theory to the practical problem of empirically bounding the Bayes classification error probability between two population distributions, without having to construct estimates for these distributions or implement the Bayes classifier.

Boldface type is used in this paper for random variables and random vectors. Let f_1 and f_2 be densities and define L(x) = f_1(x)/f_2(x). The conditional expectation given a random variable Z is E_Z.

2 The Divergence Estimator

Moon and Hero [1] focused on estimating divergences that include the form [4]

$$G(f_1, f_2) = \int g\left(\frac{f_1(x)}{f_2(x)}\right) f_2(x)\, dx, \qquad (1)$$

for a smooth function g. (Note that although g must be convex for (1) to be a divergence, the estimator in [1] does not require convexity.) The divergence estimator is constructed using k-nn density estimators as follows.
Assume that the d-dimensional multivariate densities f_1 and f_2 have finite support S = [a, b]^d. Assume that T = N + M_2 i.i.d. realizations {X_1, ..., X_N, X_{N+1}, ..., X_{N+M_2}} are available from the density f_2 and M_1 i.i.d. realizations {Y_1, ..., Y_{M_1}} are available from the density f_1. Assume that k_i ≤ M_i. Let ρ_{2,k_2}(i) be the distance of the k_2th nearest neighbor of X_i in {X_{N+1}, ..., X_T} and let ρ_{1,k_1}(i) be the distance of the k_1th nearest neighbor of X_i in {Y_1, ..., Y_{M_1}}. Then the k-nn density estimate is [29]

$$\hat{f}_{i,k_i}(X_j) = \frac{k_i}{M_i\, \bar{c}\, \rho_{i,k_i}^d(j)},$$

where c̄ is the volume of a d-dimensional unit ball.

To construct the plug-in divergence estimator, the data from f_2 are randomly divided into two parts {X_1, ..., X_N} and {X_{N+1}, ..., X_{N+M_2}}. The k-nn density estimate f̂_{2,k_2} is calculated at the N points {X_1, ..., X_N} using the M_2 realizations {X_{N+1}, ..., X_{N+M_2}}. Similarly, the k-nn density estimate f̂_{1,k_1} is calculated at the N points {X_1, ..., X_N} using the M_1 realizations {Y_1, ..., Y_{M_1}}. Define L̂_{k_1,k_2}(x) = f̂_{1,k_1}(x)/f̂_{2,k_2}(x). The functional G(f_1, f_2) is then approximated as

$$\hat{G}_{k_1,k_2} = \frac{1}{N} \sum_{i=1}^{N} g\left(\hat{L}_{k_1,k_2}(X_i)\right). \qquad (2)$$

The principal assumptions on the densities f_1 and f_2 and the functional g are that: 1) f_1, f_2, and g are smooth; 2) f_1 and f_2 have a common bounded support set S; 3) f_1 and f_2 are strictly lower bounded. The full assumptions (A.0)-(A.5) are given in the supplemental material [30] and in [17]. Moon and Hero [1] showed that under these assumptions, the MSE convergence rate of the estimator in Eq. 2 to the quantity in Eq. 1 depends exponentially on the dimension d of the densities.
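The plug-in construction above (k-nn density estimates substituted into Eq. 2) can be sketched in a few lines. This is an illustrative sketch only, not the authors' code: the function names are ours, and scipy's `cKDTree` stands in for any nearest-neighbor search.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gamma

def knn_density(points, ref, k):
    """k-nn density estimate f_hat(x) = k / (M * c_bar * rho_k(x)^d),
    where rho_k(x) is the distance from x to its k-th nearest reference sample."""
    M, d = ref.shape
    c_bar = np.pi ** (d / 2) / gamma(d / 2 + 1)  # volume of the d-dim unit ball
    dist = cKDTree(ref).query(points, k=k)[0]
    rho = dist[:, -1] if dist.ndim == 2 else dist  # distance to the k-th neighbor
    return k / (M * c_bar * rho ** d)

def plugin_divergence(g, x_eval, x_ref, y_ref, k1, k2):
    """Plug-in estimate (1/N) * sum_i g( f1_hat(X_i) / f2_hat(X_i) ), as in Eq. 2."""
    f1_hat = knn_density(x_eval, y_ref, k1)  # estimate of f1 from the Y samples
    f2_hat = knn_density(x_eval, x_ref, k2)  # estimate of f2 from held-out X samples
    return float(np.mean(g(f1_hat / f2_hat)))
```

For example, with `g(u) = u log u` this estimates the KL divergence; when the two sample sets come from the same distribution, the estimate should be near zero.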
However, Moon and Hero also showed that an estimator with the parametric convergence rate O(1/T) can be derived by applying the theory of optimally weighted ensemble estimation as follows.

Let l̄ = {l_1, ..., l_L} be a set of index values and T the number of samples available. For an indexed ensemble of estimators {Ê_l}_{l∈l̄} of the parameter E, the weighted ensemble estimator with weights w = {w(l_1), ..., w(l_L)} satisfying Σ_{l∈l̄} w(l) = 1 is defined as Ê_w = Σ_{l∈l̄} w(l) Ê_l. The key idea in reducing MSE is that by choosing appropriate weights w, we can greatly decrease the bias in exchange for some increase in variance. Consider the following conditions on {Ê_l}_{l∈l̄} [3]:

• C.1 The bias is given by

$$\mathrm{Bias}\left(\hat{E}_l\right) = \sum_{i \in J} c_i \psi_i(l) T^{-i/2d} + O\left(\frac{1}{\sqrt{T}}\right),$$

where c_i are constants depending on the underlying density, J = {i_1, ..., i_I} is a finite index set with I < L, min(J) > 0 and max(J) ≤ d, and ψ_i(l) are basis functions depending only on the parameter l.

• C.2 The variance is given by

$$\mathrm{Var}\left(\hat{E}_l\right) = c_v \frac{1}{T} + o\left(\frac{1}{T}\right).$$

Theorem 1. [3] Assume conditions C.1 and C.2 hold for an ensemble of estimators {Ê_l}_{l∈l̄}.
Then there exists a weight vector w_0 such that

$$E\left[\left(\hat{E}_{w_0} - E\right)^2\right] = O\left(\frac{1}{T}\right).$$

The weight vector w_0 is the solution to the following convex optimization problem:

$$\min_w \|w\|_2 \quad \text{subject to} \quad \sum_{l \in \bar{l}} w(l) = 1, \quad \gamma_w(i) = \sum_{l \in \bar{l}} w(l)\psi_i(l) = 0,\ i \in J.$$

Algorithm 1 Optimally weighted ensemble divergence estimator
Input: α, η, L positive real numbers l̄, samples {Y_1, ..., Y_{M_1}} from f_1, samples {X_1, ..., X_T} from f_2, dimension d, function g, c̄
Output: The optimally weighted divergence estimator Ĝ_{w_0}
1: Solve for w_0 using Eq. 3 with basis functions ψ_i(l) = l^{i/d}, l ∈ l̄ and i ∈ {1, ..., d−1}
2: M_2 ← αT, N ← T − M_2
3: for all l ∈ l̄ do
4:   k(l) ← l√M_2
5:   for i = 1 to N do
6:     ρ_{j,k(l)}(i) ← the distance of the k(l)th nearest neighbor of X_i in {Y_1, ..., Y_{M_1}} and {X_{N+1},
..., X_T} for j = 1, 2, respectively
7:     f̂_{j,k(l)}(X_i) ← k(l)/(M_j c̄ ρ^d_{j,k(l)}(i)) for j = 1, 2; L̂_{k(l)}(X_i) ← f̂_{1,k(l)}(X_i)/f̂_{2,k(l)}(X_i)
8:   end for
9:   Ĝ_{k(l)} ← (1/N) Σ_{i=1}^N g(L̂_{k(l)}(X_i))
10: end for
11: Ĝ_{w_0} ← Σ_{l∈l̄} w_0(l) Ĝ_{k(l)}

In order to achieve the rate of O(1/T), it is not necessary for the weights to zero out the lower-order bias terms, i.e., that γ_w(i) = 0, i ∈ J. It was shown in [3] that solving the following convex optimization problem in place of the optimization problem in Theorem 1 retains the MSE convergence rate of O(1/T):

$$\min_w \epsilon \quad \text{subject to} \quad \sum_{l\in\bar{l}} w(l) = 1, \quad \left|\gamma_w(i)\, T^{\frac{1}{2} - \frac{i}{2d}}\right| \le \epsilon,\ i \in J, \quad \|w\|_2^2 \le \eta, \qquad (3)$$

where the parameter η is chosen to trade off between bias and variance. Instead of forcing γ_w(i) = 0, the relaxed optimization problem uses the weights to decrease the bias terms at the rate of O(1/√T), which gives an MSE rate of O(1/T).

Theorem 1 was applied in [3] to obtain an entropy estimator with convergence rate O(1/T). Moon and Hero [1] similarly applied Theorem 1 to obtain a divergence estimator with the same rate in the following manner. Let L > I = d − 1 and choose l̄ = {l_1, ..., l_L} to be positive real numbers. Assume that M_1 = O(M_2). Let k(l) = l√M_2, M_2 = αT with 0 < α < 1, Ĝ_{k(l)} := Ĝ_{k(l),k(l)}, and Ĝ_w := Σ_{l∈l̄} w(l) Ĝ_{k(l)}. Note that the parameter l indexes over different neighborhood sizes for the k-nn density estimates. From [1], the biases of the ensemble estimators {Ĝ_{k(l)}}_{l∈l̄} satisfy the condition C.1 when ψ_i(l) = l^{i/d} and J = {1, ..., d−1}. The general form of the variance of Ĝ_{k(l)} also follows C.2.
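For intuition, the unrelaxed weight problem in Theorem 1 (minimize ||w||_2 subject to Σ_l w(l) = 1 and γ_w(i) = 0) consists only of linear equality constraints, so its minimum-norm solution can be computed directly with a pseudoinverse. This NumPy sketch, with names of our choosing, handles only that equality-constrained version; Algorithm 1 instead solves the relaxed problem of Eq. 3.

```python
import numpy as np

def theorem1_weights(l_values, d):
    """Minimum-norm solution of: sum_l w(l) = 1 and
    gamma_w(i) = sum_l w(l) * l^(i/d) = 0 for i = 1, ..., d-1."""
    l = np.asarray(l_values, dtype=float)
    # Stack the linear constraints into A w = b.
    A = np.vstack([np.ones_like(l)] + [l ** (i / d) for i in range(1, d)])
    b = np.zeros(A.shape[0])
    b[0] = 1.0
    # For L > d - 1 the system is underdetermined; the pseudoinverse returns
    # the minimum-Euclidean-norm solution of A w = b.
    return np.linalg.pinv(A) @ b
```

For example, with l̄ = {1, ..., 10} and d = 3 this produces weights that sum to one while zeroing γ_w(1) and γ_w(2).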
The optimal weight w_0 is found by using Theorem 1 to obtain a plug-in f-divergence estimator with convergence rate of O(1/T). The estimator is summarized in Algorithm 1.

3 Asymptotic Normality of the Estimator

The following theorem shows that the appropriately normalized ensemble estimator Ĝ_w converges in distribution to a normal random variable.

Theorem 2. Assume that assumptions (A.0)-(A.5) hold and let M = O(M_1) = O(M_2) and k(l) = l√M with l ∈ l̄. If G(f_1, f_2) ≠ 0, then the asymptotic distribution of the weighted ensemble estimator Ĝ_w is given by

$$\lim_{M,N \to \infty} Pr\left(\frac{\hat{G}_w - E\left[\hat{G}_w\right]}{\sqrt{\mathrm{Var}\left[\hat{G}_w\right]}} \le t\right) = Pr(S \le t),$$

where S is a standard normal random variable. Also E[Ĝ_w] → G(f_1, f_2) and Var[Ĝ_w] → 0.

The mean and variance results come from [1]. Based on our experiments, we suspect that the central limit theorem holds for the case when G(f_1, f_2) = 0 as well. This will be explored in future work. The proof of the distributional convergence when G(f_1, f_2) ≠ 0 is outlined below and is based on constructing a sequence of interchangeable random variables {Y_{M,i}}_{i=1}^N with zero mean and unit variance. We then show that the Y_{M,i} are asymptotically uncorrelated and that the Y²_{M,i} are asymptotically uncorrelated as M → ∞. This is similar to what was done in [31] to prove a central limit theorem for a density plug-in estimator of entropy. Our analysis for the ensemble estimator of divergence is more complicated since we are dealing with a functional of two densities and a weighted ensemble of estimators.
In fact, some of the equations we use to prove Theorem 2 can be used to prove a central limit theorem for a weighted ensemble of entropy estimators such as that given in [3].

3.1 Proof Sketch of Theorem 2

The full proof is included in the supplemental material [30]. We use the following lemma [31, 32]:

Lemma 3. Let the random variables {Y_{M,i}}_{i=1}^N belong to a zero mean, unit variance, interchangeable process for all values of M. Assume that Cov(Y_{M,1}, Y_{M,2}) and Cov(Y²_{M,1}, Y²_{M,2}) are O(1/M). Then the random variable

$$S_{N,M} = \frac{\sum_{i=1}^{N} Y_{M,i}}{\sqrt{\mathrm{Var}\left[\sum_{i=1}^{N} Y_{M,i}\right]}} \qquad (4)$$

converges in distribution to a standard normal random variable.

This lemma is an extension of work by Blum et al. [33], which showed that if {Z_i; i = 1, 2, ...} is an interchangeable process with zero mean and unit variance, then S_N = (1/√N) Σ_{i=1}^N Z_i converges in distribution to a standard normal random variable if and only if Cov[Z_1, Z_2] = 0 and Cov[Z²_1, Z²_2] = 0. In other words, the central limit theorem holds if and only if the interchangeable process is uncorrelated and the squares are uncorrelated. Lemma 3 shows that for a correlated interchangeable process, a sufficient condition for a central limit theorem is for the interchangeable process and the squared process to be asymptotically uncorrelated with rate O(1/M).

For simplicity, let M_1 = M_2 = M and L̂_{k(l)} := L̂_{k(l),k(l)}.
Define

$$Y_{M,i} = \frac{\sum_{l\in\bar{l}} w(l)\, g\left(\hat{L}_{k(l)}(X_i)\right) - E\left[\sum_{l\in\bar{l}} w(l)\, g\left(\hat{L}_{k(l)}(X_i)\right)\right]}{\sqrt{\mathrm{Var}\left[\sum_{l\in\bar{l}} w(l)\, g\left(\hat{L}_{k(l)}(X_i)\right)\right]}}.$$

Then from Eq. 4, we have that

$$S_{N,M} = \frac{\hat{G}_w - E\left[\hat{G}_w\right]}{\sqrt{\mathrm{Var}\left[\hat{G}_w\right]}}.$$

Thus it is sufficient to show from Lemma 3 that Cov(Y_{M,1}, Y_{M,2}) and Cov(Y²_{M,1}, Y²_{M,2}) are O(1/M). To do this, it is necessary to show that the denominator of Y_{M,i} converges to a nonzero constant or to zero sufficiently slowly. It is also necessary to show that the covariance of the numerator is O(1/M). Therefore, to bound Cov(Y_{M,1}, Y_{M,2}), we require bounds on the quantity Cov[g(L̂_{k(l)}(X_i)), g(L̂_{k(l')}(X_j))] where l, l' ∈ l̄.

Define $\mathcal{M}(Z) := Z - EZ$, $\hat{F}_{k(l)}(Z) := \hat{L}_{k(l)}(Z) - E_Z \hat{L}_{k(l)}(Z)$, and $\hat{e}_{i,k(l)}(Z) := \hat{f}_{i,k(l)}(Z) - E_Z \hat{f}_{i,k(l)}(Z)$. Assuming g is sufficiently smooth, a Taylor series expansion of g(L̂_{k(l)}(Z)) around E_Z L̂_{k(l)}(Z) gives

$$g\left(\hat{L}_{k(l)}(Z)\right) = \sum_{i=0}^{\lambda-1} \frac{g^{(i)}\left(E_Z \hat{L}_{k(l)}(Z)\right)}{i!}\, \hat{F}^i_{k(l)}(Z) + \frac{g^{(\lambda)}(\xi_Z)}{\lambda!}\, \hat{F}^\lambda_{k(l)}(Z),$$

where ξ_Z lies between E_Z L̂_{k(l)}(Z) and L̂_{k(l)}(Z). We use this expansion to bound the covariance. The expected value of the terms containing the derivatives of g is controlled by assuming that the densities are lower bounded.
By assuming the densities are sufficiently smooth, an expression for F̂^q_{k(l)}(Z) in terms of powers and products of the density error terms ê_{1,k(l)} and ê_{2,k(l)} is obtained by expanding L̂_{k(l)}(Z) around E_Z f̂_{1,k(l)}(Z) and E_Z f̂_{2,k(l)}(Z) and applying the binomial theorem. The expected value of products of these density error terms is bounded by applying concentration inequalities and conditional independence. Then the covariance between F̂^q_{k(l)}(Z) terms is bounded by bounding the covariance between powers and products of the density error terms by applying Cauchy-Schwarz and other concentration inequalities. This gives the following lemma, which is proved in the supplemental material [30].

Lemma 4. Let l, l' ∈ l̄ be fixed, M_1 = M_2 = M, and k(l) = l√M. Let γ_1(x), γ_2(x) be arbitrary functions with 1 partial derivative wrt x and sup_x |γ_i(x)| < ∞, i = 1, 2, and let 1{·} be the indicator function. Let X_i and X_j be realizations of the density f_2, independent of f̂_{1,k(l)}, f̂_{1,k(l')}, f̂_{2,k(l)}, and f̂_{2,k(l')}, and independent of each other when i ≠ j. Then

$$\mathrm{Cov}\left[\gamma_1(X_i)\hat{F}^q_{k(l)}(X_i),\ \gamma_2(X_j)\hat{F}^r_{k(l')}(X_j)\right] = \begin{cases} o(1), & i = j \\ 1_{\{q,r=1\}}\, c_8\left(\gamma_1(x), \gamma_2(x)\right)\frac{1}{M} + o\left(\frac{1}{M}\right), & i \ne j. \end{cases}$$

Note that k(l) is required to grow with √M for Lemma 4 to hold.
Define h_{l,g}(X) := g(E_X L̂_{k(l)}(X)). Lemma 4 can then be used to show that

$$\mathrm{Cov}\left[g\left(\hat{L}_{k(l)}(X_i)\right),\ g\left(\hat{L}_{k(l')}(X_j)\right)\right] = \begin{cases} E\left[\mathcal{M}\left(h_{l,g}(X_i)\right)\mathcal{M}\left(h_{l',g}(X_i)\right)\right] + o(1), & i = j \\ c_8\left(h_{l,g'}(x), h_{l',g'}(x)\right)\frac{1}{M} + o\left(\frac{1}{M}\right), & i \ne j. \end{cases}$$

For the covariance of Y²_{M,i} and Y²_{M,j}, assume WLOG that i = 1 and j = 2. Then for l, l', j, j' we need to bound the term

$$\mathrm{Cov}\left[\mathcal{M}\left(g\left(\hat{L}_{k(l)}(X_1)\right)\right)\mathcal{M}\left(g\left(\hat{L}_{k(l')}(X_1)\right)\right),\ \mathcal{M}\left(g\left(\hat{L}_{k(j)}(X_2)\right)\right)\mathcal{M}\left(g\left(\hat{L}_{k(j')}(X_2)\right)\right)\right]. \qquad (5)$$

For the case where l = l' and j = j', we can simply apply the previous results to the functional d(x) = (\mathcal{M}(g(x)))². For the more general case, we need to show that

$$\mathrm{Cov}\left[\gamma_1(X_1)\hat{F}^s_{k(l)}(X_1)\hat{F}^q_{k(l')}(X_1),\ \gamma_2(X_2)\hat{F}^t_{k(j)}(X_2)\hat{F}^r_{k(j')}(X_2)\right] = O\left(\frac{1}{M}\right). \qquad (6)$$

To do this, bounds are required on the covariance of up to eight distinct density error terms. Previous results can be applied by using Cauchy-Schwarz when the sum of the exponents of the density error terms is greater than or equal to 4. When the sum is equal to 3, we use the fact that k(l) = O(k(l')) combined with Markov's inequality to obtain a bound of O(1/M). Applying Eq. 6 to the term in Eq. 5 gives the required bound to apply Lemma 3.

3.2 Broad Implications of Theorem 2

To the best of our knowledge, Theorem 2 provides the first results on the asymptotic distribution of an f-divergence estimator with MSE convergence rate of O(1/T) under the setting of a finite number of samples from two unknown, non-parametric distributions.
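Since Theorem 2 gives asymptotic standard normality of the centered, scaled estimator, a normal-approximation confidence interval follows immediately. A minimal sketch (our naming; the standard-error estimate is assumed supplied, e.g. from bootstrapping):

```python
from scipy.stats import norm

def divergence_ci(estimate, std_err, level=0.95):
    """Normal-approximation confidence interval for the divergence,
    justified by asymptotic normality (Theorem 2)."""
    z = norm.ppf(0.5 + level / 2.0)  # two-sided critical value
    return estimate - z * std_err, estimate + z * std_err
```

For example, `divergence_ci(0.8, 0.1)` gives roughly (0.604, 0.996); lowering the confidence level narrows the interval.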
This enables us to perform inference tasks on the class of f-divergences (defined with smooth functions g) on smooth, strictly lower bounded densities with finite support. Such tasks include hypothesis testing and constructing a confidence interval on the error exponents of the Bayes probability of error for a classification problem. This greatly increases the utility of these divergence estimators.

Although we focused on a specific divergence estimator, we suspect that our approach of showing that the components of the estimator and their squares are asymptotically uncorrelated can be adapted to derive central limit theorems for other divergence estimators that satisfy similar assumptions (smooth g, and smooth, strictly lower bounded densities with finite support). We speculate that this would be easiest for estimators that are also based on k-nearest neighbors such as in [8] and [18]. It is also possible that the approach can be adapted to other plug-in estimator approaches such as in [24] and [25]. However, the qualitatively different convex optimization approach to divergence estimation in [23] may require different methods.

Figure 1: Q-Q plot comparing quantiles from the normalized weighted ensemble estimator of the KL divergence (vertical axis) to the quantiles from the standard normal distribution (horizontal axis). The linearity of the Q-Q plot validates the central limit theorem (Theorem 2) for the estimator.

4 Experiments

We first apply the weighted ensemble estimator of divergence to simulated data to verify the central limit theorem.
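A numerical stand-in for reading a Q-Q plot by eye: the correlation between ordered sample quantiles and standard-normal quantiles is close to 1 exactly when the plot is close to a straight line. A sketch using scipy's `probplot` (the array of normalized estimates is assumed to come from repeated runs of the estimator; here we only demonstrate the check itself):

```python
import numpy as np
from scipy import stats

def qq_linearity(samples):
    """Correlation between empirical and standard-normal quantiles;
    values near 1 mean the Q-Q plot is nearly a straight line."""
    z = (samples - samples.mean()) / samples.std(ddof=1)  # normalize
    theo, ordered = stats.probplot(z, dist="norm", fit=False)
    return float(np.corrcoef(theo, ordered)[0, 1])
```

Gaussian data should score very near 1, while clearly skewed data scores lower.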
We then use the estimator to obtain confidence intervals on the error exponents of the Bayes probability of error for the Iris data set from the UCI machine learning repository [34, 35].

4.1 Simulation

To verify the central limit theorem of the ensemble method, we estimated the KL divergence between two truncated normal densities restricted to the unit cube. The densities have means μ̄_1 = 0.7·1̄_d, μ̄_2 = 0.3·1̄_d and covariance matrices σ_i I_d, where σ_1 = 0.1, σ_2 = 0.3, 1̄_d is a d-dimensional vector of ones, and I_d is a d-dimensional identity matrix. We show the Q-Q plot of the normalized optimally weighted ensemble estimator of the KL divergence with d = 6 and 1000 samples from each density in Fig. 1. The linear relationship between the quantiles of the normalized estimator and the standard normal distribution validates Theorem 2.

4.2 Probability of Error Estimation

Our ensemble divergence estimator can be used to estimate a bound on the Bayes probability of error [7]. Suppose we have two classes C_1 and C_2 and a random observation x. Let the a priori class probabilities be w_1 = Pr(C_1) > 0 and w_2 = Pr(C_2) = 1 − w_1 > 0. Then f_1 and f_2 are the densities corresponding to the classes C_1 and C_2, respectively. The Bayes decision rule classifies x as C_1 if and only if w_1 f_1(x) > w_2 f_2(x). The Bayes error P*_e is the minimum average probability of error and is equivalent to

$$P_e^* = \int \min\left(Pr(C_1|x), Pr(C_2|x)\right) p(x)\, dx = \int \min\left(w_1 f_1(x), w_2 f_2(x)\right) dx, \qquad (7)$$

where p(x) = w_1 f_1(x) + w_2 f_2(x). For a, b > 0, we have

$$\min(a, b) \le a^\alpha b^{1-\alpha}, \quad \forall \alpha \in (0, 1).$$

Replacing the minimum function in Eq. 7 with this bound gives

$$P_e^* \le w_1^\alpha w_2^{1-\alpha}\, c_\alpha(f_1\|f_2), \qquad (8)$$

where $c_\alpha(f_1\|f_2) = \int f_1^\alpha(x) f_2^{1-\alpha}(x)\, dx$ is the Chernoff α-coefficient. The Chernoff coefficient is found by choosing the value of α that minimizes the right hand side of Eq.
8:

$$c^*(f_1\|f_2) = c_{\alpha^*}(f_1\|f_2) = \min_{\alpha\in(0,1)} \int f_1^\alpha(x) f_2^{1-\alpha}(x)\, dx.$$

Thus if α* = argmin_{α∈(0,1)} c_α(f_1||f_2), an upper bound on the Bayes error is

$$P_e^* \le w_1^{\alpha^*} w_2^{1-\alpha^*}\, c^*(f_1\|f_2). \qquad (9)$$

                                 Setosa-Versicolor   Setosa-Virginica   Versicolor-Virginica
Estimated Confidence Interval    (0, 0.0013)         (0, 0.0002)        (0, 0.0726)
QDA Misclassification Rate       0                   0                  0.04

Table 1: Estimated 95% confidence intervals for the bound on the pairwise Bayes error and the misclassification rate of a QDA classifier with 5-fold cross validation applied to the Iris dataset. The right endpoint of the confidence intervals is nearly zero when comparing the Setosa class to the other two classes, while the right endpoint is much higher when comparing the Versicolor and Virginica classes. This is consistent with the QDA performance and the fact that the Setosa class is linearly separable from the other two classes.

Equation 9 includes the form in Eq. 1 (g(x) = x^α). Thus we can use the optimally weighted ensemble estimator described in Sec. 2 to estimate a bound on the Bayes error. In practice, we estimate c_α(f_1||f_2) for multiple values of α (e.g. 0.01, 0.02, . . . , 0.99) and choose the minimum. We estimated a bound on the pairwise Bayes error between the three classes (Setosa, Versicolor, and Virginica) in the Iris data set [34, 35] and used bootstrapping to calculate confidence intervals. We compared the bounds to the performance of a quadratic discriminant analysis classifier (QDA) with 5-fold cross validation.
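The minimization over α in Eq. 9 is a one-dimensional grid search. As an illustrative sanity check with a closed form (not the paper's estimator): for two equal-variance Gaussians N(μ_1, σ²) and N(μ_2, σ²), the Chernoff coefficient is c_α = exp(−α(1−α)(μ_1−μ_2)²/(2σ²)), and with equal priors the minimizer is α* = 1/2 (the Bhattacharyya bound).

```python
import numpy as np

def chernoff_bayes_bound(c_alpha, w1, alphas=None):
    """min over alpha of w1^a * (1-w1)^(1-a) * c_alpha(a), i.e. the bound of Eq. 9."""
    if alphas is None:
        alphas = np.linspace(0.01, 0.99, 99)  # grid as in the experiments
    w2 = 1.0 - w1
    vals = np.array([w1 ** a * w2 ** (1 - a) * c_alpha(a) for a in alphas])
    i = int(np.argmin(vals))
    return float(alphas[i]), float(vals[i])

# Closed-form Chernoff coefficient for N(mu1, s^2) vs N(mu2, s^2) (illustrative).
mu1, mu2, s = 0.0, 2.0, 1.0
c_alpha = lambda a: np.exp(-a * (1 - a) * (mu1 - mu2) ** 2 / (2 * s ** 2))
alpha_star, bound = chernoff_bayes_bound(c_alpha, w1=0.5)
```

In this symmetric example the grid search recovers α* = 0.5, and the resulting bound 0.5·e^{-1/2} ≈ 0.303 indeed upper-bounds the true Bayes error Φ(−1) ≈ 0.159.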
The pairwise estimated 95% confidence intervals and the misclassification rates of the QDA are given in Table 1. Note that the right endpoint of the confidence interval is less than 1/50 when comparing the Setosa class to either of the other two classes. This is consistent with the performance of the QDA and the fact that the Setosa class is linearly separable from the other two classes. In contrast, the right endpoint of the confidence interval is higher when comparing the Versicolor and Virginica classes, which are not linearly separable. This is also consistent with the QDA performance. Thus the estimated bounds provide a measure of the relative difficulty of distinguishing between the classes, even though the small number of samples for each class (50) limits the accuracy of the estimated bounds.

5 Conclusion

In this paper, we established the asymptotic normality for a weighted ensemble estimator of f-divergence using d-dimensional truncated k-nn density estimators. To the best of our knowledge, this gives the first results on the asymptotic distribution of an f-divergence estimator with MSE convergence rate of O(1/T) under the setting of a finite number of samples from two unknown, non-parametric distributions. Future work includes simplifying the constants in front of the convergence rates given in [1] for certain families of distributions, deriving Berry-Esseen bounds on the rate of distributional convergence, extending the central limit theorem to other divergence estimators, and deriving the nonasymptotic distribution of the estimator.

Acknowledgments

This work was partially supported by NSF grant CCF-1217880 and a NSF Graduate Research Fellowship to the first author under Grant No. F031543.

References

[1] K. R. Moon and A. O. Hero III, "Ensemble estimation of multivariate f-divergence," in IEEE International Symposium on Information Theory, pp. 356-360, 2014.

[2] K.
Sricharan and A. O. Hero III, “Ensemble weighted kernel estimators for multivariate entropy estimation,” in Adv. Neural Inf. Process. Syst., pp. 575–583, 2012.

[3] K. Sricharan, D. Wei, and A. O. Hero III, “Ensemble estimators for multivariate entropy estimation,” IEEE Trans. Inform. Theory, vol. 59, no. 7, pp. 4374–4388, 2013.

[4] I. Csiszár, “Information-type measures of difference of probability distributions and indirect observations,” Studia Sci. Math. Hungar., vol. 2, pp. 299–318, 1967.

[5] S. Kullback and R. A. Leibler, “On information and sufficiency,” The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951.

[6] A. Rényi, “On measures of entropy and information,” in Fourth Berkeley Sympos. on Mathematical Statistics and Probability, pp. 547–561, 1961.

[7] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2006.

[8] B. Póczos and J. G. Schneider, “On the estimation of alpha-divergences,” in International Conference on Artificial Intelligence and Statistics, pp. 609–617, 2011.

[9] J. B. Oliva, B. Póczos, and J. Schneider, “Distribution to distribution regression,” in International Conference on Machine Learning, pp. 1049–1057, 2013.

[10] I. S. Dhillon, S. Mallela, and R. Kumar, “A divisive information theoretic feature clustering algorithm for text classification,” The Journal of Machine Learning Research, vol. 3, pp. 1265–1287, 2003.

[11] H. Peng, F. Long, and C. Ding, “Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.

[12] B. Chai, D. Walther, D. Beck, and L.
Fei-Fei, “Exploring functional connectivities of the human brain using multivariate information analysis,” in Adv. Neural Inf. Process. Syst., pp. 270–278, 2009.

[13] J. Lewi, R. Butera, and L. Paninski, “Real-time adaptive information-theoretic optimization of neurophysiology experiments,” in Adv. Neural Inf. Process. Syst., pp. 857–864, 2006.

[14] E. Schneidman, W. Bialek, and M. J. Berry, “An information theoretic approach to the functional classification of neurons,” in Adv. Neural Inf. Process. Syst., pp. 197–204, 2002.

[15] K. M. Carter, R. Raich, and A. O. Hero III, “On local intrinsic dimension estimation and its applications,” IEEE Transactions on Signal Processing, vol. 58, no. 2, pp. 650–663, 2010.

[16] A. O. Hero III, B. Ma, O. J. Michel, and J. Gorman, “Applications of entropic spanning graphs,” IEEE Signal Processing Magazine, vol. 19, no. 5, pp. 85–95, 2002.

[17] K. R. Moon and A. O. Hero III, “Ensemble estimation of multivariate f-divergence,” CoRR, vol. abs/1404.6230, 2014.

[18] Q. Wang, S. R. Kulkarni, and S. Verdú, “Divergence estimation for multidimensional densities via k-nearest-neighbor distances,” IEEE Trans. Inform. Theory, vol. 55, no. 5, pp. 2392–2405, 2009.

[19] G. A. Darbellay, I. Vajda, et al., “Estimation of the information by an adaptive partitioning of the observation space,” IEEE Trans. Inform. Theory, vol. 45, no. 4, pp. 1315–1321, 1999.

[20] Q. Wang, S. R. Kulkarni, and S. Verdú, “Divergence estimation of continuous distributions based on data-dependent partitions,” IEEE Trans. Inform. Theory, vol. 51, no. 9, pp. 3064–3074, 2005.

[21] J. Silva and S. S. Narayanan, “Information divergence estimation based on data-dependent partitions,” Journal of Statistical Planning and Inference, vol. 140, no. 11, pp.
3180–3198, 2010.

[22] T. K. Le, “Information dependency: Strong consistency of Darbellay–Vajda partition estimators,” Journal of Statistical Planning and Inference, vol. 143, no. 12, pp. 2089–2100, 2013.

[23] X. Nguyen, M. J. Wainwright, and M. I. Jordan, “Estimating divergence functionals and the likelihood ratio by convex risk minimization,” IEEE Trans. Inform. Theory, vol. 56, no. 11, pp. 5847–5861, 2010.

[24] S. Singh and B. Póczos, “Generalized exponential concentration inequality for Rényi divergence estimation,” in International Conference on Machine Learning, pp. 333–341, 2014.

[25] A. Krishnamurthy, K. Kandasamy, B. Póczos, and L. Wasserman, “Nonparametric estimation of Rényi divergence and friends,” in International Conference on Machine Learning, vol. 32, 2014.

[26] A. Berlinet, L. Devroye, and L. Györfi, “Asymptotic normality of L1 error in density estimation,” Statistics, vol. 26, pp. 329–343, 1995.

[27] A. Berlinet, L. Györfi, and I. Dénes, “Asymptotic normality of relative entropy in multivariate density estimation,” Publications de l’Institut de Statistique de l’Université de Paris, vol. 41, pp. 3–27, 1997.

[28] P. J. Bickel and M. Rosenblatt, “On some global measures of the deviations of density function estimates,” The Annals of Statistics, pp. 1071–1095, 1973.

[29] D. O. Loftsgaarden and C. P. Quesenberry, “A nonparametric estimate of a multivariate density function,” The Annals of Mathematical Statistics, pp. 1049–1051, 1965.

[30] K. R. Moon and A. O. Hero III, “Supplemental material,” NIPS, 2014.

[31] K. Sricharan, R. Raich, and A. O. Hero III, “Estimation of nonlinear functionals of densities with confidence,” IEEE Trans. Inform. Theory, vol. 58, no. 7, pp.
4135–4159, 2012.

[32] K. Sricharan, Neighborhood Graphs for Estimation of Density Functionals. PhD thesis, Univ. Michigan, 2012.

[33] J. Blum, H. Chernoff, M. Rosenblatt, and H. Teicher, “Central limit theorems for interchangeable processes,” Canad. J. Math., vol. 10, pp. 222–229, 1958.

[34] K. Bache and M. Lichman, “UCI machine learning repository,” 2013.

[35] R. A. Fisher, “The use of multiple measurements in taxonomic problems,” Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936.
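As an illustration of the inference the asymptotic normality result enables, the following sketch shows how a normal-approximation confidence interval for a divergence-based error bound can be compared against an empirical classifier error, in the spirit of the Table 1 comparison. This is a minimal sketch: the `normal_ci` helper, the estimate, its standard error, and the QDA error below are hypothetical placeholders, not the paper's actual values or implementation.

```python
def normal_ci(estimate, std_err, level=0.95):
    """Two-sided normal-approximation confidence interval.

    Justified here by the asymptotic Gaussianity of the ensemble
    divergence estimator; 1.959964 is the 97.5% standard-normal quantile.
    """
    z = {0.90: 1.644854, 0.95: 1.959964, 0.99: 2.575829}[level]
    return (estimate - z * std_err, estimate + z * std_err)

# Hypothetical numbers for one class pair (illustrative only):
# an estimated upper bound on the Bayes error and its standard error.
est_bound, se = 0.015, 0.002
lo, hi = normal_ci(est_bound, se)

# Empirical QDA misclassification rate for the same pair (placeholder).
qda_error = 0.0

# The comparison mirrors Table 1: an informative bound should have the
# empirical error at or below the interval's right endpoint.
print(f"95% CI for error bound: ({lo:.4f}, {hi:.4f}); QDA error = {qda_error}")
```

The small interval width here reflects a small standard error; with only 50 samples per Iris class, the actual standard errors, and hence the intervals, would be wider.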