{"title": "Divisive Normalization: Justification and Effectiveness as Efficient Coding Transform", "book": "Advances in Neural Information Processing Systems", "page_first": 1522, "page_last": 1530, "abstract": "Divisive normalization (DN) has been advocated as an effective nonlinear {\\em efficient coding} transform for natural sensory signals with applications in biology and engineering. In this work, we aim to establish a connection between the DN transform and the statistical properties of natural sensory signals. Our analysis is based on the use of multivariate {\\em t} model to capture some important statistical properties of natural sensory signals. The multivariate {\\em t} model justifies DN as an approximation to the transform that completely eliminates its statistical dependency. Furthermore, using the multivariate {\\em t} model and measuring statistical dependency with multi-information, we can precisely quantify the statistical dependency that is reduced by the DN transform. We compare this with the actual performance of the DN transform in reducing statistical dependencies of natural sensory signals. Our theoretical analysis and quantitative evaluations confirm DN as an effective efficient coding transform for natural sensory signals. On the other hand, we also observe a previously unreported phenomenon that DN may increase statistical dependencies when the size of pooling is small.", "full_text": "Divisive Normalization: Justification and Effectiveness as Efficient Coding Transform

Siwei Lyu ∗
Computer Science Department
University at Albany, State University of New York
Albany, NY 12222, USA

Abstract

Divisive normalization (DN) has been advocated as an effective nonlinear efficient coding transform for natural sensory signals with applications in biology and engineering. In this work, we aim to establish a connection between the DN transform and the statistical properties of natural sensory signals. 
Our analysis is based on the use of multivariate t model to capture some important statistical properties of natural sensory signals. The multivariate t model justifies DN as an approximation to the transform that completely eliminates its statistical dependency. Furthermore, using the multivariate t model and measuring statistical dependency with multi-information, we can precisely quantify the statistical dependency that is reduced by the DN transform. We compare this with the actual performance of the DN transform in reducing statistical dependencies of natural sensory signals. Our theoretical analysis and quantitative evaluations confirm DN as an effective efficient coding transform for natural sensory signals. On the other hand, we also observe a previously unreported phenomenon that DN may increase statistical dependencies when the size of pooling is small.

1 Introduction

It has been widely accepted that biological sensory systems are adapted to match the statistical properties of the signals in their natural environments. Among the different ways this may be achieved, the efficient coding hypothesis [2, 3] asserts that a sensory system might be understood as a transform that reduces redundancies in its responses to the input sensory stimuli (e.g., odors, sounds, and time-varying images). Such signal transforms, termed efficient coding transforms, are also important to applications in engineering – with reduced statistical dependencies, sensory signals can be more efficiently stored, transmitted and processed. Over the years, many works, most notably the ICA methodology, have aimed to find linear efficient coding transforms for natural sensory signals [20, 4, 15]. These efforts were widely regarded as a confirmation of the efficient coding hypothesis, as they led to localized linear bases that are similar to receptive fields found physiologically in the cortex. 
Nonetheless, it has also been noted that there are statistical dependencies in natural images and sounds that linear transforms are not effective at reducing or eliminating [5, 17]. This motivates the study of nonlinear efficient coding transforms.

Divisive normalization (DN) is perhaps the simplest nonlinear efficient coding transform that has been extensively studied recently. The output of the DN transform is obtained from the response of a linear basis function divided by the square root of a biased and weighted sum of the squared responses of neighboring basis functions at adjacent spatial locations, orientations and scales. In biology, initial interest in DN focused on its ability to model dynamic gain control in the retina [24] and the "masking" behavior in perception [11, 33], and to fit neural recordings from the mammalian visual cortex [12, 19]. 

∗This work is supported by an NSF CAREER Award (IIS-0953373).

Figure 1: Statistical properties of natural images in a band-pass domain and their representations with the multivariate t model. (a): Marginal densities in the log domain (images: red solid curve, t model: blue dashed curve). (b): Contour plot of the joint density, p(x1, x2), of adjacent pairs of band-pass filter responses. (c): Contour plot of the optimally fitted multivariate t model of p(x1, x2). (d): Each column of the image corresponds to a conditional density p(x1|x2) for a different value of x2. (e): The three red solid curves correspond to E(x1|x2) and E(x1|x2) ± std(x1|x2). Blue dashed curves correspond to E(x1|x2) and E(x1|x2) ± std(x1|x2) from the multivariate t model optimally fitted to p(x1, x2).

In image processing, nonlinear image representations based on DN have been applied to image compression and contrast enhancement [18, 16], showing improved performance over linear representations.

Given its ubiquity as a nonlinear transform, it has been of great interest to find the underlying principle from which DN originates. Based on empirical observations, Schwartz and Simoncelli [23] suggested that DN can reduce statistical dependencies in natural sensory signals and is thus justified by the efficient coding hypothesis. More recent works on statistical models and efficient coding transforms of natural sensory signals (e.g., [17, 26]) have also hinted that DN may be an approximation to the optimal efficient coding transform. However, this claim needs to be rigorously validated based on the statistical properties of natural sensory signals, and quantitatively evaluated with DN's performance in reducing statistical dependencies of natural sensory signals.

In this work, we aim to establish a connection between the DN transform and the statistical properties of natural sensory signals. Our analysis is based on the use of the multivariate t model to capture some important statistical properties of natural sensory signals. The multivariate t model justifies DN as an approximation to the transform that completely eliminates its statistical dependency. Furthermore, using the multivariate t model and measuring statistical dependency with multi-information, we can precisely quantify the statistical dependency that is reduced by the DN transform. We compare this with the actual performance of the DN transform in reducing statistical dependencies of natural sensory signals. Our theoretical analysis and quantitative evaluations confirm DN as an effective efficient coding transform for natural sensory signals. 
On the other hand, we also observe a previously unreported phenomenon that DN may increase statistical dependencies when the size of pooling is small.

2 Statistical Properties of Natural Sensory Signals and the Multivariate t Model

Sensory signals in natural environments are highly structured and non-random. Their regularities manifest as statistical properties that distinguish them from the rest of the ensemble of all possible signals. Over the years, many distinct statistical properties of natural sensory signals have been observed. In particular, in band-pass filtered domains where local means are removed, three statistical characteristics have been commonly observed across different signal ensembles¹:

- symmetric and sparse non-Gaussian marginal distributions with high kurtosis [7, 10], Fig.1(a);
- joint densities of neighboring responses that have elliptically symmetric (spherically symmetric after whitening) contours of equal probability [34, 32], Fig.1(b);
- conditional distributions of one response given its neighboring responses that exhibit a "bow-tie" shape when visualized as an image [25, 6], Fig.1(d).

It has been noted that the higher order statistical dependencies in the joint and conditional densities (Fig.1 (b) and (d)) cannot be effectively reduced with linear transforms [17].

¹The results in Fig.1 are obtained with spatial neighbors in images. Similar behaviors have also been observed for orientation and scale neighbors [6], as well as for other types of sensory signals such as audio [23, 17].

A compact mathematical form that can capture all three aforementioned statistical properties is the multivariate Student's t model. 
Formally, the probability density function of a d-dimensional t random vector x is defined as²:

    p_t(x; α, β) = α^β Γ(β + d/2) / ( Γ(β) √det(πΣ) ) · (α + x′Σ⁻¹x)^(−β−d/2),   (1)

where α > 0 is the scale parameter and β > 1 is the shape parameter, Σ is a symmetric and positive definite matrix, and Γ(·) is the Gamma function. From data of neighboring responses of natural sensory signals in the band-pass domain, the parameters (α, β) of the multivariate t model can be obtained numerically with maximum likelihood, the details of which are given in the supplementary material. The joint density of the fitted multivariate t model has elliptically symmetric level curves of equal probability, and its marginals are 1D Student's t densities that are non-Gaussian and kurtotic [14], all resembling those of natural sensory signals, Fig.1(a) and (c). It is due to this heavy-tailed property that the multivariate t model has been used as a model of natural images [35, 22].

Furthermore, we provide another property of the multivariate t model that captures the bow-tie dependency exhibited by the conditional distributions of natural sensory signals.

Lemma 1 Denote x_∖i as the vector formed by excluding the ith element from x. For a d-dimensional isotropic t vector x (i.e., Σ = I), we have

    E(x_i | x_∖i) = 0, and var(x_i | x_∖i) = (α + x′_∖i x_∖i) / (2β + d − 3),

where E(·) and var(·) denote expectation and variance, respectively.

This is proved in the supplementary material. Lemma 1 can be extended to anisotropic t models by incorporating a non-diagonal Σ using a linear "un-whitening" procedure, the result of which is demonstrated in Fig.1(e). 
The three red solid curves correspond to E(x_i|x_∖i) and E(x_i|x_∖i) ± √var(x_i|x_∖i) for pairs of adjacent band-pass filtered responses of a natural image, and the three blue dashed curves are the same quantities of the optimally fitted t model. The bow-tie phenomenon comes directly from the dependencies in the conditional variances, which is precisely captured by the fitted multivariate t model³.

3 DN as Efficient Coding Transform for Multivariate t Model

Using the multivariate t model as a compact representation of the statistical properties of natural sensory signals in linear band-pass domains, our aim is to find an efficient coding transform that can effectively reduce its statistical dependencies. This is based on an important property of the multivariate t model – it is a special case of the Gaussian scale mixture (GSM) [1]. More specifically, the joint density p_t(x; α, β) can be written as an infinite mixture of Gaussians with zero mean and covariance matrix Σ, as

    p_t(x; α, β) = ∫₀^∞ 1/√det(2πzΣ) · exp( −x′Σ⁻¹x / (2z) ) p_γ⁻¹(z; α, β) dz,

where p_γ⁻¹(z) = α^β / (2^β Γ(β)) · z^(−β−1) exp(−α/(2z)) is the inverse Gamma distribution. Equivalently, for a d-dimensional t vector x, we can decompose it into the product of two independent variables u and z, as x = u · √z, where u is a d-dimensional Gaussian vector with zero mean and covariance matrix Σ, and z > 0 is a scalar variable following an inverse Gamma law with parameters (α, β). To simplify the discussion, hereafter we will assume that the signals have been whitened so that there are no second-order dependencies in x. 
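Under the assumptions above, the GSM decomposition gives a direct way to draw samples from the isotropic multivariate t model and to check Lemma 1 numerically. The sketch below is only a minimal illustration (the parameter values α = 4, β = 2, d = 2 and the conditioning bins are arbitrary choices, not part of the original experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_t_gsm(n, d, alpha, beta):
    """Draw samples of an isotropic multivariate t vector via its GSM form
    x = u * sqrt(z), with u ~ N(0, I) and z inverse-Gamma: 1/z ~ Gamma(beta, 2/alpha)."""
    z = alpha / (2.0 * rng.gamma(beta, 1.0, size=n))
    u = rng.standard_normal((n, d))
    return u * np.sqrt(z)[:, None]

# Lemma 1 predicts var(x1 | x2) = (alpha + x2^2) / (2*beta + d - 3):
# the conditional variance grows with the neighbor's magnitude (the "bow-tie").
d, alpha, beta = 2, 4.0, 2.0
x = sample_t_gsm(400_000, d, alpha, beta)
x1, x2 = x[:, 0], x[:, 1]

near = np.abs(x2) < 0.5                      # small-magnitude neighbors
far = (np.abs(x2) > 2) & (np.abs(x2) < 3)    # large-magnitude neighbors
pred_near = (alpha + (x2[near] ** 2).mean()) / (2 * beta + d - 3)
pred_far = (alpha + (x2[far] ** 2).mean()) / (2 * beta + d - 3)
print(x1[near].var(), pred_near)   # sample conditional variance vs. Lemma 1
print(x1[far].var(), pred_far)
```

With these (hypothetical) parameters the samples reproduce the conditional-variance growth of Fig.1(e): the large-|x2| bin has a markedly larger conditional variance, close to (α + x2²)/(2β + d − 3).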
Correspondingly, the Gaussian vector u has covariance Σ = I. According to the GSM equivalence of the multivariate t model, we have u = x/√z. As an isotropic Gaussian vector has mutually independent components, there is no statistical dependency among the elements of u. In other words, x/√z amounts to a transform that completely eliminates all statistical dependencies in x. Unfortunately, this optimal efficient coding transform is not realizable, because z is a latent variable that we do not have direct access to.

To overcome this difficulty, we can use an estimator ẑ of z based on the visible data vector x to approximate the true value of z, and obtain an approximation to the optimal efficient coding transform as x/√ẑ. For the multivariate t model, it turns out that the two most common choices for the estimator of z, namely the maximum a posteriori (MAP) and the Bayesian least squares (BLS) estimators, and a third estimator, all have similar forms, a result formally stated in the following lemma (a proof is given in the supplementary material).

²Eq.(1) can be shown to be equivalent to the standard definition of the multivariate t density in [14].
³The dependencies illustrated are nonlinear because we use conditional standard deviations.

Lemma 2 For the d-dimensional isotropic t vector x with parameters (α, β), we consider three estimators of z: (i) the MAP estimator, ẑ₁ = argmax_z p(z|x), which is the mode of the posterior density, (ii) the BLS estimator, which is the mean of the posterior density, ẑ₂ = E_z|x(z|x), and (iii) the inverse of the conditional mean of 1/z, ẑ₃ = (E_z|x(1/z|x))⁻¹, which are:

    ẑ₁ = (α + x′x)/(2β + d + 2), ẑ₂ = (α + x′x)/(2β + d − 2), and ẑ₃ = (α + x′x)/(2β + d).

If we drop the irrelevant scaling factors from each of these estimators and plug them into x/√ẑ, we obtain a nonlinear transform of x as

    y = φ(x), where φ(x) ≡ x/√(α + x′x) = ‖x‖/√(α + ‖x‖²) · x/‖x‖.   (2)

This is the standard form of divisive normalization that will be used throughout this paper. Lemma 2 shows that the DN transform is justified as an approximation to the optimal efficient coding transform given a multivariate t model of natural sensory signals. Our result also shows that the DN transform approximately "gaussianizes" the input data, a phenomenon that has been empirically observed by several authors (e.g., [6, 23]).

3.1 Properties of DN Transform

The standard DN transform given by Eq.(2) has some important properties. In particular, the following lemma shows that it is invertible and that its Jacobian determinant has a closed form.

Lemma 3 For the standard DN transform given in Eq.(2), its inverse for y ∈ R^d with ‖y‖ < 1 is φ⁻¹(y) = √α y/√(1 − ‖y‖²) = √α‖y‖/√(1 − ‖y‖²) · y/‖y‖. The determinant of its Jacobian matrix is also in closed form, given by det(∂φ(x)/∂x) = α(α + x′x)^(−(d/2+1)).

Further, the DN transform of a multivariate t vector also has a closed form density function.

Lemma 4 If x ∈ R^d has an isotropic t density with parameters (α, β), then its DN transform, y = φ(x), follows an isotropic r model, whose probability density function is

    p_r(y) = Γ(β + d/2)/(π^(d/2) Γ(β)) · (1 − y′y)^(β−1) for ‖y‖ < 1, and p_r(y) = 0 for ‖y‖ ≥ 1.   (3)

Lemma 4 suggests a duality between the t and r models with regard to the DN transform. Proofs of Lemma 3 and Lemma 4 can be found in [8]. For completeness, we also provide our proofs in the supplementary material.

3.2 Equivalent Forms of DN Transform

In the current literature, the DN transform has been defined in many forms other than Eq.(2). However, if we are merely interested in their ability to reduce statistical dependencies, many of the different forms of the DN transform based on the l2 norm of the input vector x become equivalent. To be more specific, we quantify the statistical dependency of a random vector x using the multi-information (MI) [27], defined as

    I(x) = ∫_x p(x) log( p(x) / ∏_{k=1}^d p(x_k) ) dx = Σ_{k=1}^d H(x_k) − H(x),   (4)

where H(·) denotes the Shannon differential entropy. MI is non-negative, and is zero if and only if the components of x are mutually independent. MI is a generalization of mutual information, and the two become identical when measuring the dependency of a two-dimensional x. Furthermore, MI is invariant to any operation that acts on individual components of x (e.g., element-wise rescaling), since such operations produce an equal effect on the two terms Σ_{k=1}^d H(x_k) and H(x) (see [27]).

Now consider four different definitions of the DN transform, expressed in terms of the individual elements of the output vector as

    y_i = x_i/√(α + x′x), s_i = x_i²/(α + x′x), v_i = x_i/√(α + x′_∖i x_∖i), t_i = x_i²/(α + x′_∖i x_∖i).

Here x_∖i denotes the vector formed from x without its ith component. Specifically, y_i is the output of Eq.(2). s_i is the output of the original DN transform used by Heeger [12]. v_i corresponds to the DN transform used by Schwartz and Simoncelli [23]; the main difference from Eq.(2) is that the denominator is formed without the element x_i. Last, t_i is the output of the DN transform used in [31]. These forms of DN⁴ are related to each other by element-wise operations, as we have

    s_i = y_i², v_i = x_i/√(α + x′x − x_i²) = y_i/√(1 − y_i²), and t_i = v_i² = y_i²/(1 − y_i²).

As element-wise operations do not affect MI, all of these transforms are equivalent to the standard form of Eq.(2) in terms of reducing statistical dependencies. Therefore, the subsequent analysis applies to all these equivalent forms of the DN transform.

4 Quantifying DN Transform as Efficient Coding Transform

We have set up a relation between the DN transform and the statistical properties of natural sensory signals through the multivariate t model. However, its effectiveness as an efficient coding transform for natural sensory signals has yet to be quantified, for two reasons. 
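Before turning to that quantification, the algebra of the standard DN transform in Eq.(2) and of Lemma 3 is easy to verify numerically. A minimal sketch (the dimension, α, and finite-difference step size are illustrative choices):

```python
import numpy as np

alpha, d = 2.0, 5

def dn(x):
    """Standard DN transform of Eq.(2): y = x / sqrt(alpha + x'x)."""
    return x / np.sqrt(alpha + x @ x)

def dn_inv(y):
    """Inverse from Lemma 3: x = sqrt(alpha) * y / sqrt(1 - y'y), for ||y|| < 1."""
    return np.sqrt(alpha) * y / np.sqrt(1.0 - y @ y)

rng = np.random.default_rng(1)
x = rng.standard_normal(d)
y = dn(x)

assert np.linalg.norm(y) < 1            # DN maps into the open unit ball
assert np.allclose(dn_inv(y), x)        # phi^{-1}(phi(x)) = x

# Jacobian determinant: central differences vs. the closed form of Lemma 3.
eps = 1e-6
J = np.empty((d, d))
for j in range(d):
    e = np.zeros(d)
    e[j] = eps
    J[:, j] = (dn(x + e) - dn(x - e)) / (2 * eps)
closed = alpha * (alpha + x @ x) ** (-(d / 2 + 1))
assert np.isclose(np.linalg.det(J), closed, rtol=1e-5)

# Element-wise equivalence of Section 3.2: Heeger's form s_i equals y_i^2.
s = x ** 2 / (alpha + x @ x)
assert np.allclose(s, y ** 2)
```

The same few lines also make the invariance argument concrete: s, v and t are obtained from y by element-wise maps, so they leave MI unchanged.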
First, DN is only an approximation to the optimal transform that eliminates statistical dependencies in a multivariate t model. Further, the multivariate t model itself is a surrogate for the true statistical model of natural sensory signals. It is our goal in this section to quantify the effectiveness of the DN transform in reducing statistical dependencies. We start with a study of applying DN to the multivariate t model, whose closed form density permits a theoretical analysis of DN's performance in dependency reduction. We then apply DN to real natural sensory signal data, and compare its effectiveness as an efficient coding transform with the theoretical prediction obtained with the multivariate t model.

4.1 Results with Multivariate t Model

For simplicity, we consider isotropic models whose second order dependencies are removed with whitening. The density functions of the multivariate t and r models lead to closed form solutions for MI, as formally stated in the following lemma (proved in the supplementary material).

Lemma 5 The MI of a d-dimensional isotropic t vector x is

    I(x) = (d − 1) log Γ(β) − d log Γ(β + 1/2) + log Γ(β + d/2) − (d − 1)βΨ(β) + d(β + 1/2)Ψ(β + 1/2) − (β + d/2)Ψ(β + d/2).

Similarly, the MI of a d-dimensional r vector y = φ(x), which is the DN transform of x, is

    I(y) = d log Γ(β + (d − 1)/2) − log Γ(β) − (d − 1) log Γ(β + d/2) + (β − 1)Ψ(β) + (d − 1)(β + d/2 − 1)Ψ(β + d/2) − d(β + (d − 3)/2)Ψ(β + (d − 1)/2).

In both cases, Ψ denotes the Digamma function, defined as Ψ(β) = d/dβ log Γ(β). Note that α does not appear in these formulas, as it can be removed by re-scaling the data and has no effect on MI. 
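Both formulas of Lemma 5 involve only log-Gamma and Digamma functions, so ∆I = I(x) − I(y) is straightforward to evaluate to high precision; a minimal sketch of such an evaluation (the value β = 1.5 is an illustrative choice):

```python
import numpy as np
from scipy.special import gammaln, digamma

def mi_t(d, beta):
    """MI of a d-dimensional isotropic t vector, first formula of Lemma 5."""
    return ((d - 1) * gammaln(beta) - d * gammaln(beta + 0.5)
            + gammaln(beta + d / 2) - (d - 1) * beta * digamma(beta)
            + d * (beta + 0.5) * digamma(beta + 0.5)
            - (beta + d / 2) * digamma(beta + d / 2))

def mi_r(d, beta):
    """MI of the DN output, an isotropic r vector, second formula of Lemma 5."""
    return (d * gammaln(beta + (d - 1) / 2) - gammaln(beta)
            - (d - 1) * gammaln(beta + d / 2) + (beta - 1) * digamma(beta)
            + (d - 1) * (beta + d / 2 - 1) * digamma(beta + d / 2)
            - d * (beta + (d - 3) / 2) * digamma(beta + (d - 1) / 2))

beta = 1.5
for d in (2, 4, 8, 16):
    print(d, mi_t(d, beta) - mi_r(d, beta))
```

For β = 1.5 the printed ∆I is negative at d = 2 and d = 4 but positive at d = 8, matching the sign behavior of the curves in Fig.2.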
Using Lemma 5, for a d-dimensional t vector, if I(x) > I(y), the DN transform reduces its statistical dependency; conversely, if I(x) < I(y), it increases dependency. As both the Gamma and Digamma functions can be computed to high numerical precision, we can evaluate ∆I = I(x) − I(y) for different shape parameters β and data dimensionalities d. The left panel of Fig.2 illustrates the surface of ∆I/I(x), which measures the relative change in MI between an isotropic t vector and its DN transform. The right panel of Fig.2 shows one-dimensional curves of ∆I/I(x) corresponding to different d values with varying β.

Figure 2: left: Surface plot of [I(x) − I(φ(x))]/I(x), measuring MI changes after applying the DN transform φ(·) to an isotropic t vector x. I(x) and I(φ(x)) are computed numerically using Lemma 5. The two coordinates correspond to the data dimensionality (d) and the shape parameter of the multivariate t model (β). right: one-dimensional curves of ∆I/I(x) corresponding to different d values with varying β.

These plots illustrate several interesting aspects of the DN transform as an approximate efficient coding transform of the multivariate t model. First, with data dimensionality d > 4, using DN leads to significant reduction of statistical dependency, but such reductions become weaker as β increases. On the other hand, our experiment also showed an unexpected behavior that has not been reported before: for d ≤ 4, the change of MI caused by the use of DN is negative, i.e., DN increases statistical dependency in such cases. 

⁴There are usually weights on each x_i² in the denominator, but re-scaling the data can remove the different weights and leads to no change in terms of MI.

Therefore, though effective for high dimensional models, DN is not an efficient coding transform for low dimensional multivariate t models.

4.2 Results with Natural Sensory Signals

As mentioned previously, the multivariate t model is an approximation to the source model of natural sensory signals. Therefore, we would like to compare our analysis in the previous section with the actual dependency reduction performance of the DN transform on real natural sensory signal data.

4.2.1 Non-parametric Estimation of MI Changes

To this end, we need to evaluate MI changes after applying DN without relying on any specific parametric density model. This has been achieved previously for two dimensional data using straightforward nonparametric estimation of MI based on histograms [28]. However, estimations obtained this way are prone to strong bias due to the binning scheme used in generating the histograms [21], and cannot be generalized to higher data dimensions due to the "curse of dimensionality", as the number of bins increases exponentially with the data dimension.

Instead, in this work, we directly compute the difference of MI after DN is applied without explicitly binning the data. To see how this is possible, we first express the computation of the MI change as

    I(x) − I(y) = Σ_{k=1}^d H(x_k) − Σ_{k=1}^d H(y_k) − H(x) + H(y).   (5)

Next, the entropy of y = φ(x) is related to the entropy of x as H(y) = H(x) + ∫_x p(x) log |det(∂φ(x)/∂x)| dx, where det(∂φ(x)/∂x) is the Jacobian determinant of φ(x) [9]. For DN, det(∂φ(x)/∂x) has a closed form (Lemma 3), and substituting it into Eq.(5) yields

    I(y) − I(x) = Σ_{k=1}^d H(y_k) − Σ_{k=1}^d H(x_k) − log α + (d/2 + 1) ∫_x p(x) log(α + x′x) dx.   (6)

Once we determine α, the last term in Eq.(6) can be approximated by the average of log(α + x′x) over the input data. The first two terms require direct estimation of the differential entropies of scalar random variables, H(y_k) and H(x_k). For a more reliable estimation, we use the nonparametric "bin-less" m-spacing estimator [30]. As a simple sanity check, Fig.3(a) shows the theoretical evaluation of (I(y) − I(x))/d obtained with Lemma 5 for isotropic t models with β = 1.10 and varying d (blue solid curve). The red dashed curve shows the same quantity computed using Eq.(6) with 10,000 random samples drawn from the same multivariate t models. The small difference between the two curves in this plot confirms the quality of the non-parametric estimation.

(a) t model    (b) audio data    (c) image data

Figure 3: (a) Comparison of the theoretical prediction of MI reduction for the isotropic t model with β = 1.1 and different dimensions (blue solid curve) with the non-parametric estimation using Eq.(6) and the m-spacing estimator [30] on 10,000 random samples drawn from the corresponding multivariate t models (red dashed curve). 
(b) Top row: the mean and standard deviation of the estimated shape parameter β for natural audio data with different local window sizes. Bottom row: comparison of MI changes (∆I/d); the blue solid curve corresponds to the prediction with Lemma 5, the red dashed curve to the non-parametric estimation of Eq.(6). (c) Same results as (b) for natural image data with different local block sizes.

4.2.2 Experimental Evaluation and Comparison

We next experiment with natural audio and image data. For audio, we used 20 sound clips of animal vocalizations and recordings in natural environments, which have a sampling frequency of 44.1 kHz and typical lengths of 15−20 seconds. These sound clips were filtered with a band-pass gamma-tone filter with a 3 kHz center frequency [13]. For image data, we used eight images from the van Hateren database [29]. These images contain natural scenes such as woods and greens, with linearized intensity values. Each image was first cropped to the central 1024 × 1024 region and then subjected to a log transform. The log pixel intensities were further adjusted to have zero mean, and then processed by convolving with an isotropic band-pass filter that captures an annulus of frequencies in the Fourier domain ranging from π/4 to π radians/pixel. Finally, the data used in our experiments were obtained by extracting adjacent samples in localized 1D temporal (for audio) or 2D spatial (for images) windows of different sizes. We further whitened the data to remove second order dependencies.

With these data, we first fit multivariate t models using maximum likelihood (detailed procedure given in the supplementary material), from which we compute the theoretical prediction of the MI difference using Lemma 5. 
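The non-parametric evaluation of Eq.(6) can be sketched compactly with a Vasicek-style m-spacing entropy estimator. The following is a minimal illustration on synthetic t samples, not the actual audio or image data; the choice m = √n, the seed, and the parameter values (d = 8, β = 1.2, α = 2β) are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def mspacing_entropy(s, m=None):
    """Vasicek-style m-spacing estimate of the differential entropy of a scalar sample:
    mean of log( n/(2m) * (s_(i+m) - s_(i-m)) ), with order statistics clamped at the ends."""
    s = np.sort(np.asarray(s))
    n = len(s)
    if m is None:
        m = max(1, int(round(np.sqrt(n))))
    lo = np.concatenate([np.full(m, s[0]), s[:-m]])   # s_(i-m), clamped
    hi = np.concatenate([s[m:], np.full(m, s[-1])])   # s_(i+m), clamped
    return np.mean(np.log(n / (2.0 * m) * (hi - lo)))

def delta_mi_eq6(x, alpha):
    """I(y) - I(x) via Eq.(6): marginal entropies by m-spacing, last term by a sample mean."""
    n, d = x.shape
    sq = np.sum(x * x, axis=1)
    y = x / np.sqrt(alpha + sq)[:, None]              # DN transform, Eq.(2)
    h_y = sum(mspacing_entropy(y[:, k]) for k in range(d))
    h_x = sum(mspacing_entropy(x[:, k]) for k in range(d))
    return h_y - h_x - np.log(alpha) + (d / 2 + 1) * np.mean(np.log(alpha + sq))

# Sanity check in the spirit of Fig.3(a): t samples drawn via the GSM form.
d, alpha, beta, n = 8, 2.4, 1.2, 50_000
z = alpha / (2.0 * rng.gamma(beta, 1.0, size=n))
x = rng.standard_normal((n, d)) * np.sqrt(z)[:, None]
print(delta_mi_eq6(x, alpha))   # should be negative here: DN reduces MI at d = 8
```

The estimate should land close to the closed-form value of Lemma 5 for the same (d, β), mirroring the agreement between the two curves in Fig.3(a).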
Shown in the top rows of Fig.3 (b) and (c) are the means and standard deviations of the estimated shape parameters for different sizes of local windows for the audio and image data, respectively. These plots suggest two properties of the fitted multivariate t models. First, the estimated β values are typically close to one, due to the high kurtosis of these signal ensembles. Second, the shape parameter in general decreases as the data dimension increases.

Using the same data, we obtain the optimal DN transform by searching for the optimal α in Eq.(2) that maximizes the change in MI given by Eq.(6). However, as the entropies are estimated non-parametrically, we cannot use gradient based optimization for α. Instead, over a range of possible α values, we perform a binary search, at each step of which we evaluate Eq.(6) using the current α and the non-parametric estimation of entropy based on the data set.

In the bottom rows of Fig.3 (b) (for audio) and (c) (for images), we show the MI changes from using DN on natural sensory data as predicted by the optimally fitted t model (blue solid curves) and as obtained with optimized DN parameters using the nonparametric estimation of Eq.(6) (red dashed curves). For robustness, these results are averages over the data sets from the 20 audio signals and 8 images, respectively. In general, the changes in statistical dependencies obtained with the optimal DN transforms are in accordance with those predicted by the multivariate t model. The model-based predictions also tend to be upper bounds on the actual DN performance. 
Some discrepancies between the two start to show as dimensionality increases, as the dependency reductions achieved with DN become smaller even though the model-based predictions tend to keep increasing. This may be caused by the approximate nature of the multivariate t model for natural sensory data. As such, more complex structures in the natural sensory signals, especially with larger local windows, cannot be effectively captured by the multivariate t models, which renders DN less effective.

On the other hand, our observation based on the multivariate t model that the DN transform tends to increase statistical dependency for small pooling sizes also holds for real data. Indeed, the increase in MI becomes more severe for d ≤ 4. On the surface, our finding seems to be in contradiction with [23], where it was empirically shown that applying an equivalent form of the DN transform of Eq.(2) (see Section 3.2) over a pair of input neurons can reduce statistical dependencies. However, one key yet subtle difference is that statistical dependency is defined in [23] as the correlations in the conditional variances, i.e., the bow-tie behavior as in Fig.1(d). The observation made in [23] is then based on the empirical finding that after applying the DN transform, such dependencies in the transformed variables become weaker, while our results show that the statistical dependency measured by MI in that case actually increases.

5 Conclusion

In this work, based on the use of the multivariate t model of natural sensory signals, we have presented a theoretical analysis showing that DN emerges as an approximate efficient coding transform. Furthermore, we provide a quantitative analysis of the effectiveness of DN as an efficient coding transform for the multivariate t model and natural sensory signal data. These analyses confirm the ability of DN to reduce the statistical dependency of natural sensory signals. 
More interestingly, we observe a previously unreported result that DN can actually increase statistical dependency when the size of pooling is small. As a future direction, we would like to extend this study to a generalized DN transform whose denominator and numerator can have different degrees.
Acknowledgement The author would like to thank Eero Simoncelli for helpful discussions, and the three anonymous reviewers for their constructive comments.

References
[1] D. F. Andrews and C. L. Mallows. Scale mixtures of normal distributions. Journal of the Royal Statistical Society, Series B (Methodological), 36(1):99–102, 1974.
[2] F. Attneave. Some informational aspects of visual perception. Psych. Rev., 61:183–193, 1954.
[3] H. B. Barlow. Possible principles underlying the transformation of sensory messages. In W. A. Rosenblith, editor, Sensory Communication, pages 217–234. MIT Press, Cambridge, MA, 1961.
[4] A. J. Bell and T. J. Sejnowski. The 'independent components' of natural scenes are edge filters. Vision Research, 37(23):3327–3338, 1997.
[5] M. Bethge. Factorial coding of natural images: how effective are linear models in removing higher-order dependencies? J. Opt. Soc. Am. A, 23(6):1253–1268, 2006.
[6] R. W. Buccigrossi and E. P. Simoncelli. Image compression via joint statistical characterization in the wavelet domain. IEEE Transactions on Image Processing, 8(12):1688–1701, 1999.
[7] P. J. Burt and E. H. Adelson. The Laplacian pyramid as a compact image code. IEEE Transactions on Communications, 31(4):532–540, 1983.
[8] J. Costa, A. Hero, and C. Vignat. On solutions to multivariate maximum α-entropy problems. In EMMCVPR, 2003.
[9] T. Cover and J. Thomas. Elements of Information Theory. Wiley-Interscience, 2nd edition, 2006.
[10] D. J. Field. Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A, 4(12):2379–2394, 1987.
[11] J. Foley.
Human luminance pattern-vision mechanisms: Masking experiments require a new model. J. Opt. Soc. Am. A, 11(6):1710–1719, 1994.
[12] D. J. Heeger. Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9:181–198, 1992.
[13] P. Johannesma. The pre-response stimulus ensemble of neurons in the cochlear nucleus. In Symposium on Hearing Theory, pages 58–69, Eindhoven, Holland, 1972.
[14] S. Kotz and S. Nadarajah. Multivariate t Distributions and Their Applications. Cambridge University Press, 2004.
[15] M. S. Lewicki. Efficient coding of natural sounds. Nature Neuroscience, 5(4):356–363, 2002.
[16] S. Lyu and E. P. Simoncelli. Nonlinear image representation using divisive normalization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Anchorage, AK, June 2008.
[17] S. Lyu and E. P. Simoncelli. Nonlinear extraction of 'independent components' of natural images using radial Gaussianization. Neural Computation, 18(6):1–35, 2009.
[18] J. Malo, I. Epifanio, R. Navarro, and E. P. Simoncelli. Non-linear image representation for efficient perceptual coding. IEEE Transactions on Image Processing, 15(1):68–80, January 2006.
[19] V. Mante, V. Bonin, and M. Carandini. Functional mechanisms shaping lateral geniculate responses to artificial and natural stimuli. Neuron, 58:625–638, May 2008.
[20] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607–609, 1996.
[21] L. Paninski. Estimation of entropy and mutual information. Neural Comput., 15(6):1191–1253, 2003.
[22] S. Roth and M. Black. Fields of experts: A framework for learning image priors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 860–867, 2005.
[23] O. Schwartz and E. P. Simoncelli. Natural signal statistics and sensory gain control.
Nature Neuroscience, 4(8):819–825, August 2001.
[24] R. Shapley and C. Enroth-Cugell. Visual adaptation and retinal gain control. Progress in Retinal Research, 3:263–346, 1984.
[25] E. P. Simoncelli and R. W. Buccigrossi. Embedded wavelet image compression based on a joint probability model. In Proc. 4th IEEE Int'l Conf. on Image Proc., volume I, pages 640–643, Santa Barbara, October 26–29 1997. IEEE Sig. Proc. Society.
[26] F. H. Sinz and M. Bethge. The conjoint effect of divisive normalization and orientation selectivity on redundancy reduction. In NIPS, 2009.
[27] M. Studeny and J. Vejnarova. The multiinformation function as a tool for measuring stochastic dependence. In M. I. Jordan, editor, Learning in Graphical Models, pages 261–297. Dordrecht: Kluwer, 1998.
[28] R. Valerio and R. Navarro. Input–output statistical independence in divisive normalization models of V1 neurons. Network: Computation in Neural Systems, 14(4):733–745, 2003.
[29] A. van der Schaaf and J. H. van Hateren. Modelling the power spectra of natural images: Statistics and information. Vision Research, 36(17):2759–2770, 1996.
[30] O. Vasicek. A test for normality based on sample entropy. Journal of the Royal Statistical Society, Series B, 38(1):54–59, 1976.
[31] M. J. Wainwright, O. Schwartz, and E. P. Simoncelli. Natural image statistics and divisive normalization: Modeling nonlinearity and adaptation in cortical neurons. In Probabilistic Models of the Brain: Perception and Neural Function, pages 203–222. MIT Press, 2002.
[32] M. J. Wainwright and E. P. Simoncelli. Scale mixtures of Gaussians and the statistics of natural images. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Adv. Neural Information Processing Systems (NIPS*99), volume 12, pages 855–861, Cambridge, MA, May 2000. MIT Press.
[33] A. Watson and J. Solomon.
A model of visual contrast gain control and pattern masking. J. Opt. Soc. Am. A, 14(9):2379–2391, 1997.
[34] B. Wegmann and C. Zetzsche. Statistical dependence between orientation filter outputs used in a human-vision-based image code. In Proc. Visual Comm. and Image Processing, volume 1360, pages 909–922, Lausanne, Switzerland, 1990.
[35] M. Welling, G. E. Hinton, and S. Osindero. Learning sparse topographic representations with products of Student-t distributions. In Adv. Neural Information Processing Systems (NIPS), pages 1359–1366, 2002.