{"title": "Estimation of R\u00e9nyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 1849, "page_last": 1857, "abstract": "We present simple and computationally efficient nonparametric estimators of R\u00e9nyi entropy and mutual information based on an i.i.d. sample drawn from an unknown, absolutely continuous distribution over $\\R^d$. The estimators are calculated as the sum of $p$-th powers of the Euclidean lengths of the edges of the `generalized nearest-neighbor' graph of the sample and the empirical copula of the sample, respectively. For the first time, we prove the almost sure consistency of these estimators and upper bounds on their rates of convergence, the latter under the assumption that the density underlying the sample is Lipschitz continuous. Experiments demonstrate their usefulness in independent subspace analysis.", "full_text": "Estimation of R\u00e9nyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs\n\nD\u00e1vid P\u00e1l\nDepartment of Computing Science\nUniversity of Alberta\nEdmonton, AB, Canada\ndpal@cs.ualberta.ca\n\nBarnab\u00e1s P\u00f3czos\nSchool of Computer Science\nCarnegie Mellon University\nPittsburgh, PA, USA\npoczos@ualberta.ca\n\nCsaba Szepesv\u00e1ri\nDepartment of Computing Science\nUniversity of Alberta\nEdmonton, AB, Canada\nszepesva@ualberta.ca\n\nAbstract\n\nWe present simple and computationally efficient nonparametric estimators of R\u00e9nyi entropy and mutual information based on an i.i.d. sample drawn from an unknown, absolutely continuous distribution over R^d. The estimators are calculated as the sum of p-th powers of the Euclidean lengths of the edges of the 'generalized nearest-neighbor' graph of the sample and the empirical copula of the sample, respectively.
For the first time, we prove the almost sure consistency of these estimators and upper bounds on their rates of convergence, the latter under the assumption that the density underlying the sample is Lipschitz continuous. Experiments demonstrate their usefulness in independent subspace analysis.\n\n1 Introduction\n\nWe consider the nonparametric problem of estimating R\u00e9nyi \u03b1-entropy and mutual information (MI) based on a finite sample drawn from an unknown, absolutely continuous distribution over R^d. There are many applications that make use of such estimators, of which we list a few to give the reader a taste: Entropy estimators can be used for goodness-of-fit testing (Vasicek, 1976; Goria et al., 2005), parameter estimation in semi-parametric models (Wolsztynski et al., 2005), studying fractal random walks (Alemany and Zanette, 1994), and texture classification (Hero et al., 2002b,a). Mutual information estimators have been used in feature selection (Peng and Ding, 2005), clustering (Aghagolzadeh et al., 2007), causality detection (Hlav\u00e1ckova-Schindler et al., 2007), optimal experimental design (Lewi et al., 2007; P\u00f3czos and L\u0151rincz, 2009), fMRI data processing (Chai et al., 2009), prediction of protein structures (Adami, 2004), and boosting and facial expression recognition (Shan et al., 2005). Both entropy estimators and mutual information estimators have been used for independent component and subspace analysis (Learned-Miller and Fisher, 2003; P\u00f3czos and L\u0151rincz, 2005; Hulle, 2008; Szab\u00f3 et al., 2007), and for image registration (Kybic, 2006; Hero et al., 2002b,a). For further applications, see Leonenko et al. (2008) and Wang et al. (2009a).\n\nIn a na\u00efve approach to R\u00e9nyi entropy and mutual information estimation, one could use the so-called \u201cplug-in\u201d estimates.
These are based on the obvious idea that since entropy and mutual information are determined solely by the density f (and its marginals), it suffices to first estimate the density using one's favorite density estimate, which is then \u201cplugged in\u201d to the formulas defining entropy and mutual information. The density is, however, a nuisance parameter which we do not want to estimate. Density estimators have tunable parameters, and we may need cross validation to achieve good performance.\n\nThe entropy estimation algorithm considered here is direct\u2014it does not build on density estimators. It is based on k-nearest-neighbor (NN) graphs with a fixed k. A variant of these estimators, where each sample point is connected to its k-th nearest neighbor only, was recently studied by Goria et al. (2005) for Shannon entropy estimation (i.e., the special case \u03b1 = 1) and by Leonenko et al. (2008) for R\u00e9nyi \u03b1-entropy estimation. They proved the weak consistency of their estimators under certain conditions. However, their proofs contain some errors, and it is not obvious how to fix them. Namely, Leonenko et al. (2008) apply the generalized Helly-Bray theorem, while Goria et al. (2005) apply the inverse Fatou lemma, under conditions when these theorems do not hold. This latter error originates from the article of Kozachenko and Leonenko (1987), and the same mistake can also be found in Wang et al. (2009b).\n\nThe first main contribution of this paper is to give a correct proof of consistency of these estimators. Employing very different proof techniques than the papers mentioned above, we show that these estimators are, in fact, strongly consistent provided that the unknown density f has bounded support and \u03b1 \u2208 (0, 1).
At the same time, we allow for more general nearest-neighbor graphs, wherein, as opposed to connecting each point only to its k-th nearest neighbor, we allow each point to be connected to an arbitrary subset of its k nearest neighbors. Besides adding generality, our numerical experiments seem to suggest that connecting each sample point to all of its k nearest neighbors improves the rate of convergence of the estimator.\n\nThe second major contribution of our paper is that we prove a finite-sample high-probability bound on the error (i.e., the rate of convergence) of our estimator, provided that f is Lipschitz. To the best of our knowledge, this is the very first result that gives a rate for the estimation of R\u00e9nyi entropy. The closest to our result in this respect is the work of Tsybakov and van der Meulen (1996), who proved the root-n consistency of an estimator of the Shannon entropy, and only in one dimension.\n\nThe third contribution is a strongly consistent estimator of R\u00e9nyi mutual information that is based on NN graphs and the empirical copula transformation (Dedecker et al., 2007). This result is proved for d \u2265 3 (see footnote 1) and \u03b1 \u2208 (1/2, 1). It builds upon and extends the previous work of P\u00f3czos et al. (2010), where instead of NN graphs the minimum spanning tree (MST) and the shortest tour through the sample (i.e., the traveling salesman problem, TSP) were used, but it was only conjectured that NN graphs can be applied as well.\n\nThere are several advantages of using the k-NN graph over MST and TSP (besides the obvious conceptual simplicity of k-NN): First, on a serial computer the k-NN graph can be computed somewhat faster than the MST and much faster than the TSP tour; furthermore, in contrast to MST and TSP, the computation of k-NN can be easily parallelized.
Secondly, for different values of \u03b1, the MST and TSP need to be recomputed, since the distance between two points is the p-th power of their Euclidean distance, where p = d(1 \u2212 \u03b1). The k-NN graph, in contrast, does not change for different values of p, since the p-th power is a monotone transformation, and hence the estimates for multiple values of \u03b1 can be calculated without the extra penalty incurred by the recomputation of the graph. This can be advantageous, e.g., in intrinsic dimension estimators of manifolds (Costa and Hero, 2003), where p is a free parameter, and thus one can calculate the estimates efficiently for a few different parameter values.\n\nThe fourth major contribution is a proof of a finite-sample high-probability error bound (i.e., the rate of convergence) for our mutual information estimator, which holds under the assumption that the copula of f is Lipschitz. To the best of our knowledge, this is the first result that gives a rate for the estimation of R\u00e9nyi mutual information.\n\nThe toolkit for proving our results derives from the deep literature of Euclidean functionals; see Steele (1997) and Yukich (1998). In particular, our strong consistency result uses a theorem due to Redmond and Yukich (1996) that essentially states that any quasi-additive power-weighted Euclidean functional can be used as a strongly consistent estimator of R\u00e9nyi entropy (see also Hero and Michel, 1999). We also make use of a result due to Koo and Lee (2007), who proved a rate of convergence result that holds under more stringent conditions. Thus, the main thrust of the present work is showing that these conditions hold for p-power weighted nearest-neighbor graphs. Curiously enough, up to now, no one has shown this, except for the case p = 1, which is studied in Section 8.3 of (Yukich, 1998). However, the condition p = 1 gives results only for \u03b1 = 1 \u2212 1/d.\n\nFootnote 1: Our result for R\u00e9nyi entropy estimation holds for d = 1 and d = 2, too.\n\nUnfortunately, space limitations do not allow us to present any of our proofs, so we relegate them to the extended version of this paper (P\u00e1l et al., 2010). We instead try to give a clear explanation of the R\u00e9nyi entropy and mutual information estimation problems, the estimation algorithms, and the statements of our convergence results.\n\nAdditionally, we report on two numerical experiments. In the first experiment, we compare the empirical rates of convergence of our estimators with our theoretical results and with plug-in estimates. Empirically, the NN methods are the clear winner. The second experiment is an illustrative application of mutual information estimation to an Independent Subspace Analysis (ISA) task.\n\nThe paper is organized as follows: In the next section, we formally define R\u00e9nyi entropy and R\u00e9nyi mutual information and the problem of their estimation. Section 3 explains the 'generalized nearest-neighbor' graphs. This graph is then used in Section 4 to define our R\u00e9nyi entropy estimator. In the same section, we state a theorem containing our convergence results for this estimator (strong consistency and rates). In Section 5, we explain the copula transformation, which connects R\u00e9nyi entropy with R\u00e9nyi mutual information. The copula transformation together with the R\u00e9nyi entropy estimator from Section 4 is used to build an estimator of R\u00e9nyi mutual information. We conclude that section with a theorem stating the convergence properties of the estimator (strong consistency and rates). Section 6 contains the numerical experiments.
We conclude the paper with a detailed discussion of further related work in Section 7 and a list of open problems and directions for future research in Section 8.\n\n2 The Formal Definition of the Problem\n\nThe R\u00e9nyi entropy and R\u00e9nyi mutual information of d real-valued random variables (see footnote 2) X = (X^1, X^2, ..., X^d) with joint density f : R^d \u2192 R and marginal densities f_i : R \u2192 R, 1 \u2264 i \u2264 d, are defined for any real parameter \u03b1, assuming the underlying integrals exist. For \u03b1 \u2260 1, the R\u00e9nyi entropy and R\u00e9nyi mutual information are defined respectively as (see footnote 3)\n\n  H_\u03b1(X) = H_\u03b1(f) = (1/(1 \u2212 \u03b1)) log \u222b_{R^d} f^\u03b1(x^1, x^2, ..., x^d) d(x^1, x^2, ..., x^d),   (1)\n\n  I_\u03b1(X) = I_\u03b1(f) = (1/(\u03b1 \u2212 1)) log \u222b_{R^d} f^\u03b1(x^1, x^2, ..., x^d) ( \u220f_{i=1}^{d} f_i(x^i) )^{1\u2212\u03b1} d(x^1, x^2, ..., x^d).   (2)\n\nFor \u03b1 = 1 they are defined by the limits H_1 = lim_{\u03b1\u21921} H_\u03b1 and I_1 = lim_{\u03b1\u21921} I_\u03b1. In fact, Shannon (differential) entropy and Shannon mutual information are just the special cases of R\u00e9nyi entropy and R\u00e9nyi mutual information with \u03b1 = 1.\n\nThe goal of this paper is to present estimators of R\u00e9nyi entropy (1) and R\u00e9nyi mutual information (2) and to study their convergence properties. To be more explicit, we consider the problem where we are given i.i.d. random variables X_{1:n} = (X_1, X_2, ..., X_n), where each X_j = (X_j^1, X_j^2, ..., X_j^d) has density f : R^d \u2192 R and marginal densities f_i : R \u2192 R, and our task is to construct an estimate H\u0302_\u03b1(X_{1:n}) of H_\u03b1(f) and an estimate I\u0302_\u03b1(X_{1:n}) of I_\u03b1(f) using the sample X_{1:n}.
3 Generalized Nearest-Neighbor Graphs\n\nThe basic tool used to define our estimators is the generalized nearest-neighbor graph, and more specifically the sum of the p-th powers of the Euclidean lengths of its edges.\n\nFootnote 2: We use superscripts for indexing dimension coordinates.\nFootnote 3: The base of the logarithms in the definition is not important; any base strictly bigger than 1 is allowed. As with Shannon entropy and mutual information, one traditionally uses either base 2 or e. In this paper, for definiteness, we stick to base e.\n\nFormally, let V be a finite set of points in a Euclidean space R^d and let S be a finite non-empty set of positive integers; we denote by k the maximum element of S. We define the generalized nearest-neighbor graph NN_S(V) as a directed graph on V. The edge set of NN_S(V) contains, for each i \u2208 S, an edge from each vertex x \u2208 V to its i-th nearest neighbor. That is, if we sort V \\ {x} = {y_1, y_2, ..., y_{|V|\u22121}} according to the Euclidean distance to x (breaking ties arbitrarily), so that \u2016x \u2212 y_1\u2016 \u2264 \u2016x \u2212 y_2\u2016 \u2264 ... \u2264 \u2016x \u2212 y_{|V|\u22121}\u2016, then y_i is the i-th nearest neighbor of x, and for each i \u2208 S there is an edge from x to y_i in the graph.\n\nFor p \u2265 0, let us denote by L_p(V) the sum of the p-th powers of the Euclidean lengths of its edges. Formally,\n\n  L_p(V) = \u2211_{(x,y) \u2208 E(NN_S(V))} \u2016x \u2212 y\u2016^p,   (3)\n\nwhere E(NN_S(V)) denotes the edge set of NN_S(V). We intentionally hide the dependence on S in the notation L_p(V). For the rest of the paper, the reader should think of S as a fixed but otherwise arbitrary finite non-empty set of positive integers, say, S = {1, 3, 4}.\n\nThe following is a basic result about L_p. The proof can be found in P\u00e1l et al. (2010).\n\nTheorem 1 (Constant \u03b3). Let X_{1:n} = (X_1, X_2, ..., X_n) be an i.i.d.
sample from the uniform distribution over the d-dimensional unit cube [0, 1]^d. For any p \u2265 0 and any finite non-empty set S of positive integers there exists a constant \u03b3 > 0 such that\n\n  lim_{n\u2192\u221e} L_p(X_{1:n}) / n^{1\u2212p/d} = \u03b3   a.s.   (4)\n\nThe value of \u03b3 depends on d, p, and S, and, except for special cases, an analytical formula for its value is not known. This causes a minor problem, since the constant \u03b3 appears in our estimators. A simple and effective way to deal with this problem is to generate a large i.i.d. sample X_{1:n} from the uniform distribution over [0, 1]^d and to estimate \u03b3 by the empirical value of L_p(X_{1:n})/n^{1\u2212p/d}.\n\n4 An Estimator of R\u00e9nyi Entropy\n\nWe are now ready to present an estimator of R\u00e9nyi entropy based on the generalized nearest-neighbor graph. Suppose we are given an i.i.d. sample X_{1:n} = (X_1, X_2, ..., X_n) from a distribution \u03bc over R^d with density f. We estimate the entropy H_\u03b1(f) for \u03b1 \u2208 (0, 1) by\n\n  H\u0302_\u03b1(X_{1:n}) = (1/(1 \u2212 \u03b1)) log ( L_p(X_{1:n}) / (\u03b3 n^{1\u2212p/d}) ),   where p = d(1 \u2212 \u03b1),   (5)\n\nand L_p(\u00b7) is the sum of the p-th powers of the Euclidean lengths of the edges of the nearest-neighbor graph NN_S(\u00b7) for some finite non-empty S \u2282 N_+, as defined by equation (3). The constant \u03b3 is the same as in Theorem 1.\n\nThe following theorem is our main result about the estimator H\u0302_\u03b1. It states that H\u0302_\u03b1 is strongly consistent and gives upper bounds on its rate of convergence. The proof of the theorem is in P\u00e1l et al. (2010).\n\nTheorem 2 (Consistency and Rate for H\u0302_\u03b1). Let \u03b1 \u2208 (0, 1). Let \u03bc be an absolutely continuous distribution over R^d with bounded support and let f be its density. If X_{1:n} = (X_1, X_2, ..., X_n) is an i.i.d.
sample from \u03bc, then\n\n  lim_{n\u2192\u221e} H\u0302_\u03b1(X_{1:n}) = H_\u03b1(f)   a.s.   (6)\n\nMoreover, if f is Lipschitz, then for any \u03b4 > 0, with probability at least 1 \u2212 \u03b4,\n\n  |H\u0302_\u03b1(X_{1:n}) \u2212 H_\u03b1(f)| \u2264 O( n^{\u2212(d\u2212p)/(d(2d\u2212p))} (log(1/\u03b4))^{1/2\u2212p/(2d)} )   if 0 < p < d \u2212 1;\n  |H\u0302_\u03b1(X_{1:n}) \u2212 H_\u03b1(f)| \u2264 O( n^{\u2212(d\u2212p)/(d(d+1))} (log(1/\u03b4))^{1/2\u2212p/(2d)} )   if d \u2212 1 \u2264 p < d.   (7)\n\n5 Copulas and Estimator of Mutual Information\n\nEstimating mutual information is slightly more complicated than estimating entropy. We start with a basic property of mutual information which we call rescaling. It states that if h_1, h_2, ..., h_d : R \u2192 R are arbitrary strictly increasing functions, then\n\n  I_\u03b1(h_1(X^1), h_2(X^2), ..., h_d(X^d)) = I_\u03b1(X^1, X^2, ..., X^d).   (8)\n\nA particularly clever choice is h_j = F_j for all 1 \u2264 j \u2264 d, where F_j is the cumulative distribution function (c.d.f.) of X^j. With this choice, the marginal distribution of h_j(X^j) is the uniform distribution over [0, 1], assuming that F_j, the c.d.f. of X^j, is continuous. Looking at the definitions of H_\u03b1 and I_\u03b1, we see that\n\n  I_\u03b1(X^1, X^2, ..., X^d) = I_\u03b1(F_1(X^1), F_2(X^2), ..., F_d(X^d)) = \u2212H_\u03b1(F_1(X^1), F_2(X^2), ..., F_d(X^d)).\n\nIn other words, the calculation of mutual information can be reduced to the calculation of entropy, provided that the marginal c.d.f.'s F_1, F_2, ..., F_d are known. The problem is, of course, that these are not known and need to be estimated from the sample. We will use the empirical c.d.f.'s (F\u0302_1, F\u0302_2, ..., F\u0302_d) as their estimates. Given an i.i.d. sample X_{1:n} = (X_1, X_2, ...
, X_n) from the distribution \u03bc with density f, the empirical c.d.f.'s are defined as\n\n  F\u0302_j(x) = (1/n) |{ i : 1 \u2264 i \u2264 n, X_i^j \u2264 x }|   for x \u2208 R, 1 \u2264 j \u2264 d.\n\nIntroduce the compact notation F : R^d \u2192 [0, 1]^d and F\u0302 : R^d \u2192 [0, 1]^d,\n\n  F(x^1, x^2, ..., x^d) = (F_1(x^1), F_2(x^2), ..., F_d(x^d))   for (x^1, x^2, ..., x^d) \u2208 R^d;   (9)\n  F\u0302(x^1, x^2, ..., x^d) = (F\u0302_1(x^1), F\u0302_2(x^2), ..., F\u0302_d(x^d))   for (x^1, x^2, ..., x^d) \u2208 R^d.   (10)\n\nLet us call the maps F and F\u0302 the copula transformation and the empirical copula transformation, respectively. The joint distribution of F(X) = (F_1(X^1), F_2(X^2), ..., F_d(X^d)) is called the copula of \u03bc, and the sample (Z\u0302_1, Z\u0302_2, ..., Z\u0302_n) = (F\u0302(X_1), F\u0302(X_2), ..., F\u0302(X_n)) is called the empirical copula (Dedecker et al., 2007). Note that the j-th coordinate of Z\u0302_i equals\n\n  Z\u0302_i^j = (1/n) rank(X_i^j, {X_1^j, X_2^j, ..., X_n^j}),\n\nwhere rank(x, A) is the number of elements of A less than or equal to x. Also, observe that the random variables Z\u0302_1, Z\u0302_2, ..., Z\u0302_n are not even independent! Nonetheless, the empirical copula (Z\u0302_1, Z\u0302_2, ..., Z\u0302_n) is a good approximation of an i.i.d. sample (Z_1, Z_2, ..., Z_n) = (F(X_1), F(X_2), ..., F(X_n)) from the copula of \u03bc. Hence, we estimate the R\u00e9nyi mutual information I_\u03b1 by\n\n  I\u0302_\u03b1(X_{1:n}) = \u2212H\u0302_\u03b1(Z\u0302_1, Z\u0302_2, ..., Z\u0302_n),   (11)\n\nwhere H\u0302_\u03b1 is defined by (5). The following theorem is our main result about the estimator I\u0302_\u03b1. It states that I\u0302_\u03b1 is strongly consistent and gives upper bounds on its rate of convergence. The proof of this theorem can be found in P\u00e1l et al. (2010).\n\nTheorem 3 (Consistency and Rate for I\u0302_\u03b1). Let d \u2265 3 and \u03b1 = 1 \u2212 p/d \u2208 (1/2, 1). Let \u03bc be an
Let \u00b5 be an\n\nabsolutely continuous distribution over Rd with density f. If X1:n = (X1, X2, . . . , Xn) is an i.i.d.\nsample from \u00b5 then\n\nthis theorem can be found in P\u00b4al et al. (2010).\n\n(11)\n\nMoreover, if the density of the copula of \u00b5 is Lipschitz, then for any \u03b4 > 0 with probability at least\n1 \u2212 \u03b4,\n\nlim\n\na.s.\n\nn\u2192\u221e(cid:98)I\u03b1(X1:n) = I\u03b1(f )\n(cid:16)\n(cid:16)\n(cid:16)\n\n\u2212 d\u2212p\n\u2212 d\u2212p\n\u2212 d\u2212p\n\nmax{n\nmax{n\nmax{n\n\nd(2d\u2212p) , n\u2212p/2+p/d}(log(1/\u03b4))1/2(cid:17)\nd(2d\u2212p) , n\u22121/2+p/d}(log(1/\u03b4))1/2(cid:17)\nd(d+1) , n\u22121/2+p/d}(log(1/\u03b4))1/2(cid:17)\n\n,\n\n\uf8f1\uf8f4\uf8f4\uf8f4\uf8f2\uf8f4\uf8f4\uf8f4\uf8f3\n\nO\n\nO\n\nO\n\n,\n\n,\n\nif 0 < p \u2264 1 ;\nif 1 \u2264 p \u2264 d \u2212 1 ;\nif d \u2212 1 \u2264 p < d .\n\n(cid:12)(cid:12)(cid:12)(cid:98)I\u03b1(X1:n) \u2212 I\u03b1(f )\n(cid:12)(cid:12)(cid:12) \u2264\n\n6 Experiments\n\nIn this section we show two numerical experiments to support our theoretical results about the con-\nvergence rates, and to demonstrate the applicability of the proposed R\u00b4enyi mutual information esti-\n\nmator,(cid:98)I\u03b1.\n\n5\n\n\f6.1 The Rate of Convergence\n\nIn our \ufb01rst experiment (Fig. 1), we demonstrate that the derived rate is indeed an upper bound on\n\nthe convergence rate. Figure 1a-1c show the estimation error of (cid:98)I\u03b1 as a function of the sample\n\nsize. Here, the underlying distribution was a 3D uniform, a 3D Gaussian, and a 20D Gaussian with\nrandomly chosen nontrivial covariance matrices, respectively. In these experiments \u03b1 was set to 0.7.\nFor the estimation we used S = {3} (kth) and S = {1, 2, 3} (knn) sets. Our results also indicate that\nthese estimators achieve better performances than the histogram based plug-in estimators (hist). The\nnumber and the sizes of the bins were determined with the rule of Scott (1979). 
The histogram-based estimator is not shown in the 20D case, as in such a high dimension it is not applicable in practice. The figures are based on averaging 25 independent runs, and they also show the theoretical upper bound ('Theoretical') on the rate derived in Theorem 3. It can be seen that the theoretical rates are rather conservative. We think that this is because the theory allows for quite irregular densities, while the densities considered in this experiment are very nice.\n\n(a) 3D uniform  (b) 3D Gaussian  (c) 20D Gaussian\n\nFigure 1: Error of the estimated R\u00e9nyi informations as a function of the number of samples.\n\n6.2 Application to Independent Subspace Analysis\n\nAn important application of dependence estimators is the Independent Subspace Analysis (ISA) problem (Cardoso, 1998). This problem is a generalization of Independent Component Analysis (ICA), where we assume that the independent sources are multidimensional vector-valued random variables. The formal description of the problem is as follows. We have S = (S_1; ...; S_m) \u2208 R^{dm}, m independent d-dimensional sources, i.e., S_i \u2208 R^d and I(S_1, ..., S_m) = 0 (see footnote 4). In the ISA statistical model we assume that S is hidden, and only n i.i.d. samples from X = AS are available for observation, where A \u2208 R^{q\u00d7dm} is an unknown invertible matrix with full rank and q \u2265 dm. Based on the n i.i.d. observations of X, our task is to estimate the hidden sources S_i and the mixing matrix A. Let the estimate of S be denoted by Y = (Y_1; ...; Y_m) \u2208 R^{dm}, where Y = WX. The goal of ISA is to calculate argmin_W I(Y_1, ..., Y_m), where W \u2208 R^{dm\u00d7q} is a matrix with full rank.
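To make the generative model X = AS concrete, a toy instance with square mixing (q = dm) can be sampled as follows. This is an illustrative sketch only; the source distribution and the mixing are our own toy choices, not the benchmark dataset of Section 6.2.

```python
import random

def sample_isa(n, m, d, seed=0):
    # Draw n observations from the ISA model X = A S, where S consists of
    # m independent d-dimensional source blocks and A is a random square
    # (q x q, q = m*d) mixing matrix, invertible with probability 1.
    rng = random.Random(seed)
    q = m * d

    def one_source():
        # Each block is sampled independently of the others; inside a
        # block the coordinates are deliberately dependent (shared u).
        s = []
        for _ in range(m):
            u = rng.gauss(0.0, 1.0)
            s.extend(u + 0.3 * rng.gauss(0.0, 1.0) for _ in range(d))
        return s

    A = [[rng.gauss(0.0, 1.0) for _ in range(q)] for _ in range(q)]
    S = [one_source() for _ in range(n)]
    X = [[sum(A[r][c] * s[c] for c in range(q)) for r in range(q)] for s in S]
    return A, S, X
```

The paper allows rectangular mixing with q >= dm; the square case is used here purely for brevity.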
Following the ideas of Cardoso (1998), this ISA problem can be solved by first preprocessing the observed quantities X with a traditional ICA algorithm, which provides an estimated separation matrix W_ICA (see footnote 5), and then simply grouping the estimated ICA components into ISA subspaces by maximizing the sum of the MIs within the estimated subspaces; that is, we have to find a permutation matrix P \u2208 {0, 1}^{dm\u00d7dm} which solves\n\n  max_P \u2211_{j=1}^{m} I(Y_1^j, Y_2^j, ..., Y_d^j),   (12)\n\nwhere Y = P W_ICA X. We used the proposed copula-based information estimator I\u0302_\u03b1 with \u03b1 = 0.99 to approximate the Shannon mutual information, and we chose S = {1, 2, 3}. Our experiment shows that this ISA algorithm using the proposed MI estimator can indeed provide a good estimation of the ISA subspaces.\n\nFootnote 4: Here we need the generalization of MI to multidimensional quantities, but that is obvious: simply replace the 1D marginals by d-dimensional ones.\nFootnote 5: For simplicity we used the FastICA algorithm in our experiments (Hyv\u00e4rinen et al., 2001).\n\nWe used a standard ISA benchmark dataset from Szab\u00f3 et al. (2007); we generated 2,000 i.i.d. sample points on 3D geometric wireframe distributions from 6 different sources independently from each other. These sampled points can be seen in Fig. 2a, and they represent the sources S. Then we mixed these sources by a randomly chosen invertible matrix A \u2208 R^{18\u00d718}. The six 3-dimensional projections of the observed quantities X = AS are shown in Fig. 2b. Our task was to estimate the original sources S using the sample of the observed quantity X only. By estimating the MI in (12), we could recover the original subspaces, as can be seen in Fig. 2c.
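The grouping step (12) can be illustrated by brute force for a small number of components. This sketch makes two simplifying assumptions that are ours, not the paper's: absolute correlation serves as a stand-in dependence measure for the copula-based I\u0302_\u03b1, and the permutation search is exhaustive rather than clever.

```python
import itertools
import math
import random

def within_group_score(samples, groups, dep):
    # Sum of pairwise dependence scores inside each candidate group; a
    # stand-in for the sum of within-subspace mutual informations in (12).
    total = 0.0
    for g in groups:
        for a, b in itertools.combinations(g, 2):
            total += dep(samples[a], samples[b])
    return total

def best_grouping(samples, d, dep):
    # Exhaustive search: permute the component indices, cut the permutation
    # into consecutive groups of size d, and keep the grouping with the
    # largest within-group score -- a brute-force stand-in for the search
    # over permutation matrices P in (12).
    n_comp = len(samples)
    best, best_val = None, -math.inf
    for perm in itertools.permutations(range(n_comp)):
        groups = [tuple(sorted(perm[i:i + d])) for i in range(0, n_comp, d)]
        val = within_group_score(samples, groups, dep)
        if val > best_val:
            best_val, best = val, tuple(sorted(groups))
    return best

def abs_corr(x, y):
    # Absolute Pearson correlation: a crude dependence proxy used here in
    # place of the copula-based Renyi MI estimator of Section 5.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    c = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return abs(c / (sx * sy))
```

In the experiment above, the dependence measure is I\u0302_\u03b1 with \u03b1 = 0.99; the exhaustive loop is feasible only when dm is small.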
The success of the subspace separation is also shown in the form of a Hinton diagram of the product of the estimated ISA separation matrix W = P W_ICA and A; this product is a block permutation matrix if and only if the subspace separation is perfect (Fig. 2d).\n\n(a) Original  (b) Mixed  (c) Estimated  (d) Hinton\n\nFigure 2: ISA experiment for six 3-dimensional sources.\n\n7 Further Related Work\n\nAs was pointed out earlier, in this paper we build heavily on results known from the theory of Euclidean functionals (Steele, 1997; Redmond and Yukich, 1996; Koo and Lee, 2007). However, now we can be more precise about earlier work concerning nearest-neighbor based Euclidean functionals: The closest to our work is Section 8.3 of Yukich (1998), where the case of NN_S graph based p-power weighted Euclidean functionals with S = {1, 2, ..., k} and p = 1 was investigated. Nearest-neighbor graphs were first proposed for Shannon entropy estimation by Kozachenko and Leonenko (1987); in that work, only the case of NN_S graphs with S = {1} was considered. More recently, Goria et al. (2005) generalized this approach to S = {k} and proved the resulting estimator's weak consistency under some conditions on the density. The estimator in that paper has a form quite similar to that of ours:\n\n  H\u0303_1 = log(n \u2212 1) \u2212 \u03c8(k) + log( 2\u03c0^{d/2} / (d \u0393(d/2)) ) + (d/n) \u2211_{i=1}^{n} log \u2016e_i\u2016.\n\nHere \u03c8 stands for the digamma function, and e_i is the directed edge pointing from X_i to its k-th nearest neighbor. Comparing this with (5), unsurprisingly, we find that the main difference is the use of the logarithm function instead of |\u00b7|^p and the different normalization. As mentioned before, Leonenko et al. (2008) proposed an estimator that uses the NN_S graph with S = {k} for the purpose of estimating the R\u00e9nyi entropy.
Their estimator takes the form\n\n  H\u0303_\u03b1 = (1/(1 \u2212 \u03b1)) log( ((n \u2212 1)/n) C_k^{1\u2212\u03b1} V_d^{1\u2212\u03b1} \u2211_{i=1}^{n} \u2016e_i\u2016^{d(1\u2212\u03b1)} / (n \u2212 1)^{\u03b1} ),\n\nwhere \u0393 stands for the Gamma function, C_k = [\u0393(k)/\u0393(k + 1 \u2212 \u03b1)]^{1/(1\u2212\u03b1)}, and V_d = \u03c0^{d/2}/\u0393(d/2 + 1) is the volume of the d-dimensional unit ball; again, e_i is the directed edge in the NN_S graph starting from node X_i and pointing to its k-th nearest node. Comparing this estimator with (5), it is apparent that it is (essentially) a special case of our NN_S based estimator. From the results of Leonenko et al. (2008) it follows that the constant \u03b3 in (5) can be found in analytical form when S = {k}. However, we kindly warn the reader again that the proofs of the last three cited articles (Kozachenko and Leonenko, 1987; Goria et al., 2005; Leonenko et al., 2008) contain a few errors, just like the Wang et al. (2009b) paper on KL divergence estimation from two samples. Kraskov et al. (2004) also proposed a k-nearest-neighbor based estimator for Shannon mutual information estimation, but the theoretical properties of their estimator are unknown.\n\n8 Conclusions and Open Problems\n\nWe have studied R\u00e9nyi entropy and mutual information estimators based on NN_S graphs. The estimators were shown to be strongly consistent. In addition, we derived upper bounds on their convergence rates under some technical conditions.
Several open problems remain unanswered. An important one is to understand how the choice of the set S \u2282 N_+ affects our estimators. Perhaps there exists a way to choose S as a function of the sample size n (and d, p) which strikes the optimal balance between the bias and the variance of our estimators.\n\nOur method can be used for the estimation of Shannon entropy and mutual information by simply using \u03b1 close to 1. The open problem is to come up with a way of choosing \u03b1, approaching 1, as a function of the sample size n (and d, p) such that the resulting estimator is consistent and converges as rapidly as possible. An alternative is to use the logarithm function in place of the power function. However, the theory would need to be changed significantly to show that the resulting estimator remains strongly consistent.\n\nIn the proof of consistency of our mutual information estimator I\u0302_\u03b1, we used the Dvoretzky-Kiefer-Wolfowitz theorem to handle the effect of the inaccuracy of the empirical copula transformation (see P\u00e1l et al. (2010) for details). Our particular use of the theorem seems to restrict \u03b1 to the interval (1/2, 1) and the dimension to values larger than 2. Is there a better way to estimate the error caused by the empirical copula transformation and to prove consistency of the estimator for a larger range of \u03b1's and for d = 1, 2?\n\nFinally, it is an important open problem to prove bounds on convergence rates for densities that have higher-order smoothness (i.e., \u03b2-H\u00f6lder smooth densities). A related open problem, in the context of the theory of Euclidean functionals, is stated in Koo and Lee (2007).\n\nAcknowledgements\n\nThis work was supported in part by AICML, AITF (formerly iCore and AIF), NSERC, the PASCAL2 Network of Excellence under EC grant no. 216886, and by the Department of Energy under grant number DESC0002607. Cs.
Szepesv\u00e1ri is on leave from SZTAKI, Hungary.\n\nReferences\n\nC. Adami. Information theory in molecular biology. Physics of Life Reviews, 1:3\u201322, 2004.\n\nM. Aghagolzadeh, H. Soltanian-Zadeh, B. Araabi, and A. Aghagolzadeh. A hierarchical clustering based on mutual information maximization. In IEEE ICIP, pages 277\u2013280, 2007.\n\nP. A. Alemany and D. H. Zanette. Fractal random walks from a variational formalism for Tsallis entropies. Phys. Rev. E, 49(2):R956\u2013R958, Feb 1994.\n\nJ. Cardoso. Multidimensional independent component analysis. In Proc. ICASSP'98, Seattle, WA, 1998.\n\nB. Chai, D. B. Walther, D. M. Beck, and L. Fei-Fei. Exploring functional connectivity of the human brain using multivariate information analysis. In NIPS, 2009.\n\nJ. A. Costa and A. O. Hero. Entropic graphs for manifold learning. In IEEE Asilomar Conf. on Signals, Systems, and Computers, 2003.\n\nJ. Dedecker, P. Doukhan, G. Lang, J. R. Leon, S. Louhichi, and C. Prieur. Weak Dependence: With Examples and Applications, volume 190 of Lecture Notes in Statistics. Springer, 2007.\n\nM. N. Goria, N. N. Leonenko, V. V. Mergel, and P. L. Novi Inverardi. A new class of random vector entropy estimators and its applications in testing statistical hypotheses. Journal of Nonparametric Statistics, 17:277\u2013297, 2005.\n\nA. O. Hero and O. J. Michel. Asymptotic theory of greedy approximations to minimal k-point random graphs. IEEE Trans. on Information Theory, 45(6):1921\u20131938, 1999.\n\nA. O. Hero, B. Ma, O. Michel, and J. Gorman. Alpha-divergence for classification, indexing and retrieval, 2002a. Communications and Signal Processing Laboratory Technical Report CSPL-328.\n\nA. O. Hero, B. Ma, O. Michel, and J. Gorman. Applications of entropic spanning graphs. IEEE Signal Processing Magazine, 19(5):85\u201395, 2002b.\n\nK. Hlav\u00e1ckova-Schindler, M. Palu\u0161, M. Vejmelka, and J. Bhattacharya.
Causality detection based on information-theoretic approaches in time series analysis. Physics Reports, 441:1–46, 2007.

M. M. Van Hulle. Constrained subspace ICA based on mutual information optimization directly. Neural Computation, 20:964–973, 2008.

A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley, New York, 2001.

Y. Koo and S. Lee. Rates of convergence of means of Euclidean functionals. Journal of Theoretical Probability, 20(4):821–841, 2007.

L. F. Kozachenko and N. N. Leonenko. A statistical estimate for the entropy of a random vector. Problems of Information Transmission, 23:9–16, 1987.

A. Kraskov, H. Stögbauer, and P. Grassberger. Estimating mutual information. Phys. Rev. E, 69:066138, 2004.

J. Kybic. Incremental updating of nearest neighbor-based high-dimensional entropy estimation. In Proc. Acoustics, Speech and Signal Processing, 2006.

E. Learned-Miller and J. W. Fisher. ICA using spacings estimates of entropy. Journal of Machine Learning Research, 4:1271–1295, 2003.

N. Leonenko, L. Pronzato, and V. Savani. A class of Rényi information estimators for multidimensional densities. Annals of Statistics, 36(5):2153–2182, 2008.

J. Lewi, R. Butera, and L. Paninski. Real-time adaptive information-theoretic optimization of neurophysiology experiments. In Advances in Neural Information Processing Systems, volume 19, 2007.

D. Pál, Cs. Szepesvári, and B. Póczos. Estimation of Rényi entropy and mutual information based on generalized nearest-neighbor graphs, 2010. http://arxiv.org/abs/1003.1954.

H. Peng and C. Ding. Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. on Pattern Analysis and Machine Intelligence, 27, 2005.

B. Póczos and A.
Lőrincz. Independent subspace analysis using geodesic spanning trees. In ICML, pages 673–680, 2005.

B. Póczos and A. Lőrincz. Identification of recurrent neural networks by Bayesian interrogation techniques. Journal of Machine Learning Research, 10:515–554, 2009.

B. Póczos, S. Kirshner, and Cs. Szepesvári. REGO: Rank-based estimation of Rényi information using Euclidean graph optimization. In AISTATS, 2010.

C. Redmond and J. E. Yukich. Asymptotics for Euclidean functionals with power-weighted edges. Stochastic Processes and their Applications, 61(2):289–304, 1996.

D. W. Scott. On optimal and data-based histograms. Biometrika, 66:605–610, 1979.

C. Shan, S. Gong, and P. W. McOwan. Conditional mutual information based boosting for facial expression recognition. In British Machine Vision Conference (BMVC), 2005.

J. M. Steele. Probability Theory and Combinatorial Optimization. Society for Industrial and Applied Mathematics, 1997.

Z. Szabó, B. Póczos, and A. Lőrincz. Undercomplete blind subspace deconvolution. Journal of Machine Learning Research, 8:1063–1095, 2007.

A. B. Tsybakov and E. C. van der Meulen. Root-n consistent estimators of entropy for densities with unbounded support. Scandinavian Journal of Statistics, 23:75–83, 1996.

O. Vasicek. A test for normality based on sample entropy. Journal of the Royal Statistical Society, Series B, 38:54–59, 1976.

Q. Wang, S. R. Kulkarni, and S. Verdú. Universal estimation of information measures for analog sources. Foundations and Trends in Communications and Information Theory, 5(3):265–352, 2009a.

Q. Wang, S. R. Kulkarni, and S. Verdú. Divergence estimation for multidimensional densities via k-nearest-neighbor distances. IEEE Transactions on Information Theory, 55(5):2392–2405, 2009b.

E. Wolsztynski, E. Thierry, and L. Pronzato.
Minimum-entropy estimation in semi-parametric models. Signal Processing, 85(5):937–949, 2005.

J. E. Yukich. Probability Theory of Classical Euclidean Optimization Problems. Springer, 1998.