{"title": "Meta-Gaussian Information Bottleneck", "book": "Advances in Neural Information Processing Systems", "page_first": 1916, "page_last": 1924, "abstract": "We present a reformulation of the information bottleneck (IB) problem in terms of copula, using the equivalence between mutual information and negative copula entropy. Focusing on the Gaussian copula we extend the analytical IB solution available for the multivariate Gaussian case to distributions with a Gaussian dependence structure but arbitrary marginal densities, also called meta-Gaussian distributions. This opens new possibles applications of IB to continuous data and provides a solution more robust to outliers.", "full_text": "Meta-Gaussian Information Bottleneck\n\nM\u00b4elanie Rey\n\nDepartment of Mathematics and Computer Science\n\nUniversity of Basel\n\nmelanie.rey@unibas.ch\n\nVolker Roth\n\nDepartment of Mathematics and Computer Science\n\nUniversity of Basel\n\nvolker.roth@unibas.ch\n\nAbstract\n\nWe present a reformulation of the information bottleneck (IB) problem in terms\nof copula, using the equivalence between mutual information and negative cop-\nula entropy. Focusing on the Gaussian copula we extend the analytical IB solu-\ntion available for the multivariate Gaussian case to distributions with a Gaussian\ndependence structure but arbitrary marginal densities, also called meta-Gaussian\ndistributions. This opens new possibles applications of IB to continuous data and\nprovides a solution more robust to outliers.\n\n1\n\nIntroduction\n\nThe information bottleneck method (IB) [1] considers the concept of relevant information in the\ndata compression problem, and takes a new perspective to signal compression which was classically\ntreated using rate distortion theory. The IB method formalizes the idea of relevance, or meaning-\nful information, by introducing a relevance variable Y . 
The problem is then to obtain an optimal\ncompression T of the data X which preserves a maximum of information about Y . Although the\nIB method beautifully formalizes the compression problem under relevance constraints, the prac-\ntical solution of this problem remains dif\ufb01cult, particularly in high dimensions, since the mutual\ninformations I(X; T ), I(Y ; T ) must be estimated. The IB optimization problem has no available\nanalytical solution in the general case. It can be solved iteratively using the generalized Blahut-\nArimoto algorithm which, however, requires us to estimate the joint distribution of the potentially\nhigh-dimensional variables X and Y . A formal analysis of the dif\ufb01culties of this estimation problem\nwas conducted in [2]. In the continuous case, estimation of multivariate densities becomes arduous\nand can be a major impediment to the practical application of IB. A notable exception is the case of\njointly Gaussian (X, Y ) for which an analytical solution for the optimal representation T exists [3].\nThe optimal T is jointly Gaussian with (X, Y ) [4] and takes the form of a noisy linear projection\nto eigenvectors of the normalised conditional covariance matrix. The existence of an analytical so-\nlution opens new application possibilities and IB becomes practically feasible in higher dimensions\n[5]. Finding closed-form solutions for other continuous distribution families remains an open chal-\nlenge. The practical usefulness of the Gaussian IB (GIB), on the other hand, suffers from its limited\n\ufb02exibility and the statistical problem of \ufb01nding a robust estimate of the joint covariance matrix of\n(X, Y ) in high-dimensional spaces.\nCompression and relevance in IB are de\ufb01ned in terms of mutual information (MI) of two random\nvectors V and W , which is de\ufb01ned as the reduction in the entropy of V by the conditional entropy\nof V given W . 
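As a concrete illustration of this definition (a sketch of ours, not part of the original paper): for jointly Gaussian scalars with correlation rho, the reduction H(V) - H(V | W) has the closed form -(1/2) log(1 - rho^2):

```python
import numpy as np

def gaussian_mi(rho):
    """MI as entropy reduction: I(V; W) = H(V) - H(V | W) for a bivariate
    Gaussian with unit variances and correlation rho."""
    h_v = 0.5 * np.log(2 * np.pi * np.e)                           # H(V), V ~ N(0, 1)
    h_v_given_w = 0.5 * np.log(2 * np.pi * np.e * (1 - rho ** 2))  # H(V | W)
    return h_v - h_v_given_w                                       # = -0.5 * log(1 - rho^2)
```

Independence (rho = 0) gives zero mutual information, and the value diverges as |rho| approaches 1.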
MI bears an interesting relationship to copulas: mutual information equals negative\ncopula entropy [6]. This relation between two seemingly unrelated concepts might appear surprising,\nbut it directly follows from the de\ufb01nition of a copula as the object that captures the \u201cpure\u201d\ndependency structure of random variables [7]: a multivariate distribution consists of univariate ran-\ndom variables related to each other by a dependence mechanism, and copulas provide a framework\nto separate the dependence structure from the marginal distributions. In this work we reformulate\nthe IB problem for continuous variables in terms of copulas and show that IB is completely\nindependent of the marginal distributions of X, Y . The IB problem in the continuous case is in fact\nto \ufb01nd the optimal copula (or dependence structure) of T and X, knowing the copula of X and the\nrelevance variable Y . We focus on the case of the Gaussian copula and on the consequences of the\nIB reformulation for the Gaussian IB. We show that the analytical solution available for GIB can\nnaturally be extended to multivariate distributions with Gaussian copula and arbitrary marginal den-\nsities, also called meta-Gaussian densities. Moreover, we show that the GIB solution depends only on a\ncorrelation matrix, and not on the variances. This allows us to use robust rank correlation estimators\ninstead of unstable covariance estimators, and gives a robust version of GIB.\n\n2\n\nInformation Bottleneck and Gaussian IB\n\n2.1 General Information Bottleneck.\nConsider two random variables X and Y with values in the measurable spaces X and Y. Their\njoint distribution pXY (x, y) will also be denoted p(x, y) for simplicity. 
We construct a compressed\nrepresentation T of X that is most informative about Y by solving the following variational problem:\n\nmin_{p(t|x)} L,   L \u2261 I(X; T ) \u2212 \u03b2I(T ; Y ),   (1)\n\nwhere the Lagrange parameter \u03b2 > 0 determines the trade-off between compression of X and\npreservation of information about Y . Since the compressed representation is conditionally indepen-\ndent of Y given X as illustrated in Figure 1, to fully characterize T we only need to specify its joint\ndistribution with X, i.e. p(x, t). No analytical solution is available for the general problem de\ufb01ned\nby (1) and this joint distribution must be calculated with an iterative procedure. In the case of dis-\ncrete variables X and Y , p(x, t) is obtained iteratively by self-consistent determination of p(t|x),\np(t) and p(y|t) in the generalized Blahut-Arimoto algorithm. The resulting discrete T then de\ufb01nes\n(soft) clusters of X. In the case of continuous X and Y , the same set of self-consistent equations\nfor p(t|x), p(t) and p(y|t) is obtained. These equations also translate into two coupled eigenvector\nproblems for \u2202 log p(x|t)/\u2202t and \u2202 log p(y|t)/\u2202t, but a direct solution of these problems is very\ndif\ufb01cult in practice. However, when X and Y are jointly multivariate Gaussian distributed, this\nproblem becomes analytically tractable.\n\nFigure 1: Graphical representation of the conditional independence structure of IB.\n\n2.2 Gaussian IB.\n\nConsider two jointly Gaussian random vectors (rv) X and Y with zero mean:\n\n(X, Y ) \u223c N( 0_{p+q}, \u03a3 = [ \u03a3x, \u03a3xy^T ; \u03a3xy, \u03a3y ] ),   (2)\n\nwhere p is the dimension of X, q is the dimension of Y and 0_{p+q} is the zero vector of dimension\np + q. In [4] it is proved that the optimal compression T is also jointly Gaussian with X and Y . This
This\nimplies that T can be expressed as a noisy linear transformation of X:\n\nT = AX + \u03be,\n\n2\n\n(3)\n\n\fwhere \u03be \u223c N (0p, \u03a3\u03be) is independent of X and A \u2208 Rp\u00d7p. The minimization problem (1) is then\nreduced to solving:\n\n(4)\nFor a given trade-off parameter \u03b2, the optimal compression is given by T \u223c N (0p, \u03a3t) with \u03a3t =\nA\u03a3xAT + \u03a3\u03be and the noise variance can be \ufb01xed to the identity matrix \u03a3\u03be = Ip, as shown in [3].\nThe transformation matrix A is given by:\n\nL|L \u2261 I(X; T ) \u2212 \u03b2I(T ; Y ).\n\nmin\nA,\u03a3\u03be\n\n\uf8eb\uf8ec\uf8ec\uf8ec\uf8ed\n\nA =\n\n(cid:2)0T ; . . . ; 0T(cid:3)\n(cid:2)\u03b11vT\n1 ; 0T ; . . . , 0T(cid:3)\n2 ; 0T ; . . . ; 0T(cid:3) \u03b2c\n(cid:2)\u03b11vT\n\n1 ; \u03b12vT\n\n0 \u2264 \u03b2 \u2264 \u03b2c\n1 \u2264 \u03b2 \u2264 \u03b2c\n\u03b2c\n2 \u2264 \u03b2 \u2264 \u03b2c\n\n3\n\n1\n\n2\n\n\uf8f6\uf8f7\uf8f7\uf8f7\uf8f8\n\n...\n\n(5)\n\n1 , . . . , vT\n\n(cid:113) \u03b2(1\u2212\u03bbi)\u22121\n\np are left eigenvectors of \u03a3x|y\u03a3\u22121\n\nwhere vT\nx sorted by their corresponding increasing eigen-\ni = (1 \u2212 \u03bbi)\u22121, and the \u03b1i coef\ufb01cients are de\ufb01ned\nvalues \u03bb1, . . . , \u03bbp. The critical \u03b2 values are \u03b2c\ni \u03a3xvi. In the above, 0T is a p-dimensional row vector and\nby \u03b1i =\nsemicolons separate rows of A. We can see from equation (5) that the optimal projection of X is a\ncombination of weighted eigenvectors of \u03a3x|y\u03a3\u22121\nx . The number of selected eigenvectors, and thus\nthe effective dimension of T , depends on the parameter \u03b2.\n\nwith ri = vT\n\n\u03bbiri\n\n3 Copula and Information Bottleneck\n\n3.1 Copula and Gaussian copula.\n\nA multivariate distribution consists of univariate random variables related to each other by a de-\npendence mechanism. Copulas provide a framework to separate the dependence structure from\nthe marginal distributions. 
Formally, a d-dimensional copula is a multivariate distribution function\nC : [0, 1]^d \u2192 [0, 1] with standard uniform margins. Sklar\u2019s theorem [7] states the relationship be-\ntween copulas and multivariate distributions. Any joint distribution function F can be represented\nusing its marginal univariate distribution functions and a copula:\n\nF (z1, . . . , zd) = C (F1 (z1) , . . . , Fd (zd)) .   (6)\n\nIf the margins are continuous, then this copula is unique. Conversely, if C is a copula and F1, . . . , Fd\nare univariate distribution functions, then F de\ufb01ned as in (6) is a valid multivariate distribution\nfunction with margins F1, . . . , Fd. Assuming that C has d-th order partial derivatives we can de\ufb01ne\nthe copula density function: c(u1, . . . , ud) = \u2202^d C(u1, . . . , ud) / (\u2202u1 . . . \u2202ud), u1, . . . , ud \u2208 [0, 1]. The density corre-\nsponding to (6) can then be rewritten as a product of the marginal densities and the copula density\nfunction: f (z1, . . . , zd) = c (F1(z1), . . . , Fd(zd)) \u220f_{j=1}^{d} fj(zj).\n\nGaussian copulas constitute an important class of copulas. If F is a Gaussian distribution N_d (\u00b5, \u03a3)\nthen the corresponding C ful\ufb01lling equation (6) is a Gaussian copula. Due to basic invariance\nproperties (cf. [8]), the copula of N_d (\u00b5, \u03a3) is the same as the copula of N_d (0, P ), where P is the\ncorrelation matrix corresponding to the covariance matrix \u03a3. Thus a Gaussian copula is uniquely\ndetermined by a correlation matrix P and we denote a Gaussian copula by CP . Using equation\n(6) with CP , we can construct multivariate distributions with arbitrary margins and a Gaussian\ndependence structure. These distributions are called meta-Gaussian distributions. 
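As an illustration of this construction (our own sketch, not from the paper), a meta-Gaussian sample can be drawn by mapping correlated Gaussian variates through the Gaussian cdf and then through arbitrary quantile functions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
P = np.array([[1.0, 0.7],
              [0.7, 1.0]])                  # correlation matrix of the Gaussian copula

z = rng.multivariate_normal(np.zeros(2), P, size=5000)
u = stats.norm.cdf(z)                       # uniform margins, dependence C_P
x1 = stats.expon.ppf(u[:, 0])               # exponential margin
x2 = stats.t.ppf(u[:, 1], df=4)             # Student-t margin
# (x1, x2) is meta-Gaussian: Gaussian dependence structure, non-Gaussian margins
```

Because the quantile transforms are strictly increasing, (x1, x2) keeps exactly the copula C_P of the Gaussian draws.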
Gaussian copulas\nconveniently have a copula density function:\n\ncP (u) = |P|\u2212 1\n\n(7)\nwhere \u03a6\u22121(u) is a short notation for the univariate Gaussian quantile function applied to each com-\nponent \u03a6\u22121(u) = (\u03a6\u22121(u1), . . . , \u03a6\u22121(ud)).\n\n2 exp\n\n,\n\n\u03a6\u22121(u)T (P \u22121 \u2212 I)\u03a6\u22121(u)\n\n(cid:26)\n\n\u2212 1\n2\n\n(cid:27)\n\n3.2 Copula formulation of IB.\n\nAt the heart of the copula formulation of IB is the following identity: for a continuous random vector\nZ = (Z1, . . . , Zd) with density f (z) and copula density cZ(u) the multivariate mutual information\n\n3\n\n\f(cid:90)\n\nor multi-information is the negative differential entropy of the copula density:\n\nI(Z) \u2261 Dkl(f (z) (cid:107) f0(z)) =\n\ncZ(u) log cZ(u)du = \u2212H(cZ),\n\n(8)\n\nwhere u = (u1, . . . , ud) \u2208 [0, 1]d, Dkl denotes the Kullback-Leibler divergence, and f0(z) =\nf1(z1)f2(z2) . . . fd(zd). For continuous multivariate X, Y and T , equation (8) implies that:\n\n[0,1]d\n\nI(X; T ) = Dkl(f (x, t) (cid:107) f0(x, t)) \u2212 Dkl(f (x)||f0(x)) \u2212 Dkl(f (t)||f0(t)),\n\n= \u2212H(cXT ) + H(cX ) + H(cT ),\nI(Y ; T ) = \u2212H(cY T ) + H(cY ) + H(cT ),\n\nwhere cXT is the copula density of the vector (X1, . . . , Xp, T1, . . . , Tp). The above derivation then\nleads to the following proposition.\nProposition 3.1. Copula formulation of IB\nThe Information Bottleneck minimization problem (1) can be reformulated as:\n\nL | L = \u2212H(cXT ) + H(cX ) + H(cT ) \u2212 \u03b2{\u2212H(cY T ) + H(cY ) + H(cT )}.\n\n(9)\n\nmin\ncXT\n\nThe minimization problem de\ufb01ned in (1) is solved under the assumption that the joint distribution\nof (X, Y ) is known, this now translates in the assumption that the copula copula density cXY (and\nthus cX) is assumed to be known. The density cT is entirely determined by cXT , and using the\nconditional independence structure it is clear that cY T is also determined by cXT when cXY is\nknown. 
Since the joint density of (X, Y, T ) decomposes as:\n\nf (x, y, t) = f (t, y|x)f (x) = f (t|x)f (y|x)f (x),   (10)\n\nthe corresponding copula density then also decomposes as:\n\ncXY T (ux, uy, ut) = RT|X (ux, ut) RY|X (ux, uy) cX (ux),   (11)\n\nwhere\n\nRT|X (ux, ut) = cXT (ux, ut) / cX (ux),   ux \u2208 [0, 1]^p, uy \u2208 [0, 1]^q, ut \u2208 [0, 1]^p,   (12)\n\nas shown in [9]. We can \ufb01nally rewrite the copula density of (Y, T ) as:\n\ncY T (uy, ut) = \u222b cXY T (ux, uy, ut) dux = \u222b cXT (ux, ut) cXY (ux, uy) / cX (ux) dux.   (13)\n\nThe IB optimization problem actually reduces to \ufb01nding an optimal copula density cXT . This im-\nplies that in order to construct the compression variable T , the only relevant aspect is the copula\ndependence structure between X, T and Y .\n\n4 Meta-Gaussian IB\n\n4.1 Meta-Gaussian IB formulation.\n\nThe above reformulation of IB is of great practical interest when we focus on the special case of\nthe Gaussian copula. The only known case for which a simple analytical solution to the IB problem\nexists is when (X, Y ) is jointly Gaussian. Equation (9) shows that an optimal solution\ndoes not depend on the margins but only on the copula density cXY . From this observation the idea\nnaturally follows that an analytical solution should also exist for any joint distribution of (X, Y )\nwhich has a Gaussian copula, regardless of its margins. We show below in Proposition 4.1\nthat this is indeed the case. The notation \u02dcX and \u02dcY is used to represent the normal scores:\n\n\u02dcX = (\u03a6^{\u22121} \u25e6 FX1(X1), . . . , \u03a6^{\u22121} \u25e6 FXp (Xp)).   (14)\n\nSince copulas are invariant to strictly increasing transformations the normal scores have the same\ncopulas as the original variables X and Y .\n\nProposition 4.1. Optimality of meta-Gaussian IB\nConsider rv X, Y with a Gaussian dependence structure and arbitrary margins:\n\nFX,Y (x, y) = CP (FX1 (x1), . 
. . , FXp (xp), FY1(y1), . . . , FYq (yq)),   (15)\n\nwhere FXi, FYi are the marginal distributions of X, Y and CP is a Gaussian copula parametrized by\na correlation matrix P . Then the optimum of the minimization problem (1) is obtained for T \u2208 T ,\nwhere T is the set of all rv T such that (X, Y, T ) has a Gaussian copula and T has Gaussian\nmargins.\n\nBefore proving Proposition 4.1 we give a short lemma.\nLemma 4.1. T \u2208 T \u21d4 ( \u02dcX, \u02dcY , T ) are jointly Gaussian.\n\nProof.\n\n1. If T \u2208 T then (X, Y, T ) has a Gaussian copula, which implies that ( \u02dcX, \u02dcY , T ) also\nhas a Gaussian copula. Since \u02dcX, \u02dcY , T all have normally distributed margins it follows that\n( \u02dcX, \u02dcY , T ) has a joint Gaussian distribution.\n\n2. If ( \u02dcX, \u02dcY , T ) are jointly Gaussian then ( \u02dcX, \u02dcY , T ) has a Gaussian copula, which implies\nthat (X, Y, T ) has again a Gaussian copula. Since T has normally distributed margins, it\nfollows that T \u2208 T .\n\nProposition 4.1 can now be proven by contradiction.\nProof of Proposition 4.1. Assume there exists T \u2217 \u2209 T such that:\n\nL(X, Y, T \u2217) := I(X; T \u2217) \u2212 \u03b2I(Y ; T \u2217) < min_{p(t|x), T \u2208 T} I(X; T ) \u2212 \u03b2I(T ; Y ).   (16)\n\nSince ( \u02dcX, \u02dcY , T ) has the same copula as (X, Y, T ), we have that I( \u02dcX; T ) = I(X; T ) and I( \u02dcY ; T ) =\nI(Y ; T ). 
Using Lemma 4.1 the right hand side of inequality (16) can be rewritten as:\n\nmin_{p(t|x), T \u2208 T} L(X, Y, T ) = min_{p(t|x), T \u2208 T} L( \u02dcX, \u02dcY , T ) = min_{p(t|\u02dcx), ( \u02dcX, \u02dcY , T ) \u223c N} L( \u02dcX, \u02dcY , T ).   (17)\n\nCombining equations (16) and (17) we obtain:\n\nI( \u02dcX; T \u2217) \u2212 \u03b2I( \u02dcY ; T \u2217) < min_{p(t|\u02dcx), ( \u02dcX, \u02dcY , T ) \u223c N} I( \u02dcX; T ) \u2212 \u03b2I(T ; \u02dcY ).\n\nThis is in contradiction with the optimality of the Gaussian information bottleneck, which states that the\noptimal T is jointly Gaussian with (X, Y ). Thus the optimum for meta-Gaussian (X, Y ) is attained\nfor T with normal margins such that (X, Y, T ) is also meta-Gaussian.\n\nCorollary 4.1. The optimal projection T o obtained for ( \u02dcX, \u02dcY ) is also optimal for (X, Y ).\n\nProof. By the above we know that an optimal compression for (X, Y ) can be obtained in the set of\nvariables T such that ( \u02dcX, \u02dcY , T ) is jointly Gaussian; since \u02dcL = L it is clear that T o is also optimal\nfor (X, Y ).\n\nAs a consequence of Proposition 4.1, for any random vector (X, Y ) having a Gaussian copula depen-\ndence structure, an optimal projection T can be obtained by \ufb01rst calculating the vector of the normal\nscores ( \u02dcX, \u02dcY ) and then computing T = A \u02dcX + \u03be. A is here entirely determined by the covariance\nmatrix of the vector ( \u02dcX, \u02dcY ), which also equals its correlation matrix (the normal scores have unit\nvariance by de\ufb01nition), and thus by the correlation matrix P parametrizing the Gaussian copula CP . In\npractice the problem is reduced to the estimation of the Gaussian copula of (X, Y ). 
In particular, for\nthe traditional Gaussian case where (X, Y ) \u223c N (0, \u03a3), this means that we actually do not need to\nestimate the full covariance \u03a3 but only the correlations.\n\n4.2 Meta-Gaussian mutual information.\n\nThe multi-information for a meta-Gaussian random vector Z = (Z1, . . . , Zd) with copula CPz is:\n\nI(Z) = I( \u02dcZ) = \u2212(1/2) log |cov( \u02dcZ)| = \u2212(1/2) log |\u03a3\u02dcz| = \u2212(1/2) log |corr( \u02dcZ)| = \u2212(1/2) log |Pz|,   (18)\n\nwhere |.| denotes the determinant. A direct derivation of the multi-information for meta-Gaussian\nrandom variables is also given in the supplementary material. The mutual information between\nX and Y is then I(X; Y ) = \u2212(1/2) log |P| + (1/2) log |Px| + (1/2) log |Py|, where P = [ Px, Pxy ; Pyx, Py ]. It\nis obvious that the formula for the meta-Gaussian case is similar to the formula for the Gaussian case\nIGauss(X; Y ) = \u2212(1/2) log |\u03a3| + (1/2) log |\u03a3x| + (1/2) log |\u03a3y|, but uses the correlation matrix parametrizing\nthe copula instead of the data covariance matrix. The two formulas are equivalent when X, Y are\njointly Gaussian.\n\n4.3 Semi-parametric copula estimation.\n\nSemi-parametric copula estimation has been studied in [10], [11] and [12]. The main idea is to\ncombine non-parametric estimation of the margins with a parametric copula model, in our case the\nGaussian copula family. If the margins F1, . . . , Fd of a random vector Z are known, P can be\nestimated by the matrix \u02c6P with elements given by:\n\n\u02c6P(k,l) = [ (1/n) \u2211_{i=1}^{n} \u03a6^{\u22121}(Fk(zik)) \u03a6^{\u22121}(Fl(zil)) ] / [ (1/n) \u2211_{i=1}^{n} [\u03a6^{\u22121}(Fk(zik))]^2 \u00b7 (1/n) \u2211_{i=1}^{n} [\u03a6^{\u22121}(Fl(zil))]^2 ]^{1/2},   (19)\n\nwhere zik denotes the i-th observation of dimension k. \u02c6P is assured to be positive semi-de\ufb01nite. 
If\nthe margins are unknown we can instead use the rescaled empirical cumulative distributions:\n\n\u02c6Fj(t) = (n/(n + 1)) \u00b7 (1/n) \u2211_{i=1}^{n} I_{zij \u2264 t}.   (20)\n\nThe estimator resulting from using the rescaled empirical distributions (20) in equation (19) is given\nin the following de\ufb01nition.\nDe\ufb01nition 4.1 (Normal scores rank correlation coef\ufb01cient). The normal scores rank correlation\ncoef\ufb01cient is the matrix \u02c6P n with elements:\n\n\u02c6P n(k,l) = \u2211_{i=1}^{n} \u03a6^{\u22121}(R(zik)/(n + 1)) \u03a6^{\u22121}(R(zil)/(n + 1)) / \u2211_{i=1}^{n} [\u03a6^{\u22121}(i/(n + 1))]^2,   (21)\n\nwhere R(zik) denotes the rank of the i-th observation for dimension k. Robustness properties of\nthe estimator (21) have been studied in [13]. Using (21) we compute an estimate of the correlation\nmatrix P parametrizing cXY and obtain the transformation matrix A as detailed in Algorithm 1.\n\nAlgorithm 1 Construction of the transformation matrix A\n\n1. Compute the normal scores rank correlation estimate \u02c6P n of the correlation matrix P\nparametrizing cXY :\nfor k, l = 1, . . . , p + q do\nSet the (k, l)-th element of \u02c6P n as in equation (21), where\nthe i-th row of z is the concatenation of the i-th rows of x and y: zi\u2217 = (xi\u2217, yi\u2217) \u2208 R^{p+q}.\nend for\n2. Compute the estimated conditional covariance matrix of the normal scores: \u02c6\u03a3\u02dcx|\u02dcy = \u02c6P n_x \u2212 \u02c6P n_xy ( \u02c6P n_y )^{\u22121} \u02c6P n_yx.\n3. Find the eigenvectors and eigenvalues of \u02c6\u03a3\u02dcx|\u02dcy ( \u02c6P n_x )^{\u22121}.\n4. 
Construct the transformation matrix A as in equation (5).\n\n5 Results\n\n5.1 Simulations\n\nWe tested meta-Gaussian IB (MGIB) in two different settings: \ufb01rst when the data is Gaussian but con-\ntains outliers, and second when the data has a Gaussian copula but non-Gaussian margins. We generated\na training sample with n = 1000 observations of X and Y with dimensions \ufb01xed to dx = 15 and\ndy = 15. A covariance matrix was drawn from a Wishart distribution centered at a correlation matrix\npopulated with a few high correlation values to ensure some dependency between X and Y . This\nmatrix was then scaled to obtain the correlation matrix parametrizing the copula. In the \ufb01rst setting\nthe data was sampled with N (0, 1) margins. A \ufb01xed percentage of outliers, 8%, was then introduced\ninto the sample by randomly drawing a row and a column in the data matrix and replacing the current\nvalue with a random draw from the set [\u22126,\u22123] \u222a [3, 6]. In the second setting data points were\ndrawn from meta-Gaussian distributions with three different types of margins: Student with df = 4,\nexponential with \u03bb = 1, and beta with \u03b11 = 0.5 = \u03b12. For each training sample two projection\nmatrices AG and AC were computed: AG was calculated based on the sample covariance \u02c6\u03a3n and\nAC was obtained using the normal scores rank correlation \u02c6P n. The compression quality of the pro-\njection was then tested on a test sample of n = 10,000 observations generated independently from\nthe same distribution (without outliers). Each experiment was repeated 50 times. Figure 2 shows the\ninformation curves obtained by varying \u03b2 from 0.1 to 200. The mutual informations I(X; T ) and\nI(Y ; T ) can be reliably estimated on the test sample using (18) and (21). 
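Concretely, this estimation step can be sketched as follows (our code, not the authors'). Because the columns of normal scores have exactly zero mean and identical sums of squares, the estimator of equation (21) coincides with the Pearson correlation of the scores:

```python
import numpy as np
from scipy import stats

def normal_scores_corr(data):
    """Normal scores rank correlation (eq. 21): correlation of the scores
    Phi^{-1}(rank / (n + 1)), computed column-wise on an (n, d) array."""
    n = data.shape[0]
    scores = stats.norm.ppf(stats.rankdata(data, axis=0) / (n + 1))
    return np.corrcoef(scores, rowvar=False)

def meta_gaussian_mi(P, p):
    """I(X; Y) from the copula correlation matrix P via eq. (18),
    where the first p dimensions belong to X."""
    _, ld = np.linalg.slogdet(P)
    _, ldx = np.linalg.slogdet(P[:p, :p])
    _, ldy = np.linalg.slogdet(P[p:, p:])
    return 0.5 * (ldx + ldy - ld)
```

Since the estimator only uses ranks, any strictly increasing marginal transformation of the data leaves the estimate unchanged, which is the source of its robustness.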
The information curves start\nwith a very steep slope, meaning that a small increase in I(X; T ) leads to a signi\ufb01cant increase in\nI(Y ; T ), and then slowly saturate to reach their asymptotic limit in I(Y ; T ). The best information\ncurves are situated in the upper left corner of the \ufb01gure, since for a \ufb01xed compression value I(X; T )\nwe want to achieve the highest relevant information content I(Y ; T ). We clearly see in Figure 2 that\nMGIB consistently outperforms GIB in that it achieves higher compression rates.\n\nFigure 2: Information curves for Gaussian data with outliers and for data with Student, exponential and\nbeta margins. Each panel shows 50 curves obtained for repetitions of the MGIB (red) and the GIB\n(black). The curves stop when they come close to saturation. For higher values of \u03b2 the information\nI(X; T ) would continue to grow while I(Y ; T ) would reach its limit, leading to horizontal lines,\nbut such high \u03b2 values lead to numerical instability. Since GIB suffers from a model mismatch\nproblem when the margins are not Gaussian, its curves saturate at smaller values of I(Y ; T ).\n\n5.2 Real data\n\nWe further applied MGIB to the Communities and Crime data set from the UCI repository 1. The\ndata set contains observations of predictive and target variables. After removing missing values we\nretained n = 2195 observations. In a pre-processing step we selected the dx = 10 dimensions\nwith the strongest absolute rank correlation to one of the relevance variables. 
Plotting empirical\ninformation curves as in the synthetic examples above was impossible, because even for this setting\nwith drastically decreased dimensionality all mutual information estimates we tried (including the\nnearest-neighbor graph method in [14]) were too unstable to draw empirical information curves. To\nstill give a graphical representation of our results, we show in Figure 3 non-parametric density esti-\nmates of the one-dimensional compression T split into 5 groups according to corresponding values of\nthe \ufb01rst relevance variable. We used GIB, MGIB and Principal Component Analysis (PCA) to reduce\nX to a 1-dimensional variable. For PCA this is the \ufb01rst principal component; for GIB and MGIB\nwe independently selected the highest value of \u03b2 leading to a 1-dimensional compression. It is obvi-\nous from Figure 3 that the one-dimensional MGIB compression nicely separates the different target\nclasses, whereas the GIB and PCA projections seem to contain much less information about the\ntarget variable. We conclude that, similar to our synthetic examples above, the MGIB compression\ncontains more information about the relevance variable than GIB at the same compression rate.\n\nFigure 3: Parzen density estimates of the univariate projection of X split into 5 groups according to\nvalues of the \ufb01rst relevance variable. We see more separation between groups for MGIB than for\nGIB or PCA, which indicates that the projection is more informative about the relevance variable.\n\n6 Conclusion\n\nWe present a reformulation of the IB problem in terms of copulas which gives new insights into data\ncompression with relevance constraints and opens new possible applications of IB for continuous\nmultivariate data. Meta-Gaussian IB naturally extends the analytical solution of Gaussian IB to\nmultivariate distributions with Gaussian copula and arbitrary marginal densities. 
It can be applied\nto any type of continuous data, provided the assumption of a Gaussian dependence structure is\nreasonable, in which case the optimal compression can easily be obtained by semi-parametric copula\nestimation. Simulated experiments showed that MGIB clearly outperforms GIB when the marginal\ndensities are not Gaussian, and even in the Gaussian case with a tiny amount of outliers MGIB has\nbeen shown to signi\ufb01cantly bene\ufb01t from the robustness properties of rank estimators. In future work,\nit would be interesting to see if the copula formulation of IB admits analytical solutions for other\ncopula families.\n\nAcknowledgments\n\nM. Rey is partially supported by the Swiss National Science Foundation, grant CR32I2 127017 / 1.\n\n1http://archive.ics.uci.edu/ml/\n\nReferences\n[1] N. Tishby, F.C. Pereira, and W. Bialek. The information bottleneck method. The 37th Annual Allerton\nConference on Communication, Control, and Computing, pages 368\u2013377, 1999.\n[2] O. Shamir, S. Sabato, and N. Tishby. Learning and generalization with the information bottleneck. Theor.\nComput. Sci., 411(29-30):2696\u20132711, 2010.\n[3] G. Chechik, A. Globerson, N. Tishby, and Y. Weiss. Information bottleneck for Gaussian variables.\nJournal of Machine Learning Research, 6:165\u2013188, 2005.\n[4] A. Globerson and N. Tishby. On the optimality of the Gaussian information bottleneck curve. Hebrew\nUniversity Technical Report, 2004.\n[5] R.M. Hecht, E. Noor, and N. Tishby. Speaker recognition by Gaussian information bottleneck. INTER-\nSPEECH, pages 1567\u20131570, 2009.\n[6] J. Ma and Z. Sun. Mutual information is copula entropy. arXiv:0808.0845v1, 2008.\n[7] A. Sklar. Fonctions de r\u00e9partition \u00e0 n dimensions et leurs marges. Publications de l\u2019Institut de Statistique
Publications de l\u2019Institut de Statistique\n\nde l\u2019Universit\u00b4e de Paris, 8:229\u2013231, 1959.\n\n[8] A. J. McNeil, R. Frey, and P. Embrechts. Quantitative Risk Management. Princeton Series in Finance.\n\nPrinceton University Press, 2005.\n\n[9] G. Elidan. Copula bayesian networks. Proceedings of the Neural Information Processing Systems (NIPS),\n\n2010.\n\n[10] C. Genest, K. Ghoudhi, and L.P. Rivet. A semiparametric estimation procedure of dependence parameters\n\nin multivariate families of distributions. Biometrika, 82(3):543\u2013552, 1995.\n\n[11] H. Tsukahara. Semiparametric estimation in copula models. The Canadian Journal of Statistics,\n\n33(3):357\u2013375, 2005.\n\n[12] Peter D. Hoff. Extending the rank likelihood for semiparametric copula estimation. Annals of Applied\n\nStatistics, 1(1):273, 2007.\n\n[13] K. Boudt, J. Cornelissen, and C. Croux. The gaussian rank correlation estimator: Robustness properties.\n\nStatistics and Computing, 22:471\u2013483, 2012.\n\n[14] D. P\u00b4al, B. P\u00b4oczos, and C. Szepesv\u00b4ari. Estimation of R\u00b4enyi entropy and mutual information based on\ngeneralized nearest-neighbor graphs. Proceedings of the Neural Information Processing Systems (NIPS),\n2010.\n\n9\n\n\f", "award": [], "sourceid": 949, "authors": [{"given_name": "Melanie", "family_name": "Rey", "institution": null}, {"given_name": "Volker", "family_name": "Roth", "institution": null}]}