{"title": "Optimal Neural Population Codes for High-dimensional Stimulus Variables", "book": "Advances in Neural Information Processing Systems", "page_first": 297, "page_last": 305, "abstract": "How does neural population process sensory information? Optimal coding theories assume that neural tuning curves are adapted to the prior distribution of the stimulus variable. Most of the previous work has discussed optimal solutions for only one-dimensional stimulus variables. Here, we expand some of these ideas and present new solutions that define optimal tuning curves for high-dimensional stimulus variables. We consider solutions for a minimal case where the number of neurons in the population is equal to the number of stimulus dimensions (diffeomorphic). In the case of two-dimensional stimulus variables, we analytically derive optimal solutions for different optimal criteria such as minimal L2 reconstruction error or maximal mutual information. For higher dimensional case, the learning rule to improve the population code is provided.", "full_text": "Fisher-Optimal Neural Population Codes for\nHigh-Dimensional Diffeomorphic Stimulus Representations\n\nZhuo Wang\nDepartment of Mathematics\nUniversity of Pennsylvania\nPhiladelphia, PA 19104\nwangzhuo@sas.upenn.edu\n\nAlan A. Stocker\nDepartment of Psychology\nUniversity of Pennsylvania\nPhiladelphia, PA 19104\nastocker@sas.upenn.edu\n\nDaniel D. Lee\nDepartment of Electrical and Systems Engineering\nUniversity of Pennsylvania\nPhiladelphia, PA 19104\nddlee@seas.upenn.edu\n\nAbstract\n\nIn many neural systems, information about stimulus variables is often represented\nin a distributed manner by means of a population code. It is generally assumed that\nthe responses of the neural population are tuned to the stimulus statistics, and most\nprior work has investigated the optimal tuning characteristics of one or a small\nnumber of stimulus variables. 
In this work, we investigate the optimal tuning for\ndiffeomorphic representations of high-dimensional stimuli. We analytically derive\nthe solution that minimizes the L2 reconstruction loss. We compare our solution\nwith those obtained under other well-known criteria such as maximal mutual information. Our solution\nsuggests that the optimal weights do not necessarily decorrelate the inputs, and the\noptimal nonlinearity differs from the conventional equalization solution. Results\nillustrating these optimal representations are shown for some input distributions\nthat may be relevant for understanding the coding of perceptual pathways.\n\n1 Introduction\n\nThere has been much work investigating how information about stimulus variables is represented by\na population of neurons in the brain [1]. Studies on motion perception [2, 3] and sound localization\n[4, 5] have demonstrated that these representations adapt to the stimulus statistics on various time\nscales [6, 7, 8, 9]. This raises the natural question: what encoding scheme underlies this\nadaptive process?\nTo address this question, several assumptions about the neural representation and its overall objective\nneed to be made. In the case of a one-dimensional stimulus, a number of theoretical approaches have\npreviously been investigated. Some work has focused on the scenario with a single neuron [10, 11,\n12, 13, 14, 15], while other work has focused on the population level [16, 17, 18, 19, 20, 21, 22, 23],\nwith different model and noise assumptions. However, the question becomes more difficult when\nconsidering adaptation to high dimensional stimuli. An interesting class of solutions to this question\nis related to independent component analysis (ICA) [24, 25, 26], which considers maximizing the\namount of information in the encoding given a distribution of stimulus inputs. 
The use of mutual\ninformation as a metric to measure neural coding quality has also been discussed in [27].\nIn this paper, we study Fisher-optimal population codes for the diffeomorphic encoding of stimuli\nwith multivariate Gaussian distributions. Using Fisher information, we investigate the properties of\nrepresentations that would minimize the L2 reconstruction error assuming an optimal decoder. The\noptimization problem is derived under a diffeomorphic assumption, i.e. the number of encoding\nneurons matches the dimensionality of the input and the nonlinearity is monotonic. In this case, the\noptimal solution can be found analytically and can be given a geometric interpretation. Qualitative\ndifferences between this solution and the previously studied information maximization solutions are\ndemonstrated and discussed.\n\n2 Model and Methods\n\n2.1 Encoding and Decoding Model\n\nWe consider an n-dimensional stimulus input s = (s_1, . . . , s_n) with prior distribution p(s). In general,\na population with m neurons can have m individual activation functions h_1(s), . . . , h_m(s), which\ndetermine the average firing rate of each neuron in response to the stimulus. However, the encoding\nprocess is affected by neural noise. Two commonly used models are the Poisson noise model and the\nconstant Gaussian noise model, for which the observed firing rate vector r = (r_1, . . . , r_m) follows the\nprobability distribution p(r|s), where\n\nr_k T \\sim \\mathrm{Poisson}(h_k(s) T)   (Poisson noise)   (1)\nr_k T \\sim \\mathrm{Gaussian}(h_k(s) T, V T)   (Gaussian noise)   (2)\n\nAs opposed to encoding, the decoding process involves constructing an estimator \\hat{s}(r), which\ndeterministically maps the response r to an estimate \\hat{s} of the true stimulus s. We choose a maximum\nlikelihood estimator \\hat{s}_{MLE}(r) = arg max_s p(r|s) because it simplifies the calculation due to its nice\nstatistical properties as discussed in section 2.3.\n\n2.2 Fisher Information Matrix\n\nThe Fisher information is a key concept widely used in optimal coding theory. For multiple\ndimensions, the Fisher information matrix is defined element-wise for each s, as in [28],\n\nI_F(s)_{i,j} = \\left\\langle \\frac{\\partial}{\\partial s_i} \\log p(r|s) \\cdot \\frac{\\partial}{\\partial s_j} \\log p(r|s) \\,\\middle|\\, s \\right\\rangle_r   (3)\n\nIn the supplementary section A we prove that the Fisher information matrix for a population of m\nneurons is\n\nI_F(s) = T \\cdot \\sum_{k=1}^{m} h_k(s)^{-1} \\nabla h_k(s) \\cdot \\nabla h_k(s)^T   (Poisson noise)   (4)\nI_F(s) = T \\cdot \\sum_{k=1}^{m} V^{-1} \\nabla \\tilde{h}_k(s) \\cdot \\nabla \\tilde{h}_k(s)^T   (Gaussian noise)   (5)\n\nwhere the scalar factor T is the length of the encoding time window (the superscript T denotes the transpose) and V represents the variance of the constant\nGaussian noise. The equivalence of the two noise models can be established via the variance stabilizing\ntransformation \\tilde{h}_k = 2\\sqrt{h_k} [29]. Without loss of generality, throughout the paper we assume the\nGaussian noise model for mathematical convenience. We also simply assume V = 1, T = 1\nbecause they do not change the optimal solution for any Fisher information-related quantities.\n\n2.3 Cramer-Rao Lower Bound\n\nIdeally, a good neural population code should produce estimates \\hat{s} that are close to the true value of\nthe stimulus s. However, multiple measures exist for how well an estimate matches the true value.\nOne possibility is the L2 loss, which is related to the Fisher information matrix via the Cramer-Rao\nlower bound [28]. 
For any unbiased estimator \\hat{s}, including the MLE,\n\n\\mathrm{cov}[\\hat{s} - s] \\ge I_F(s)^{-1}   (6)\n\nin the sense that cov[\\hat{s} - s] - I_F(s)^{-1} is a positive semidefinite matrix. Being only a lower bound,\nthe Cramer-Rao bound can be attained by the MLE \\hat{s} because it is asymptotically efficient. The local\nL2 decoding error \\langle \\|\\hat{s} - s\\|^2 | s \\rangle_r = tr(cov(\\hat{s} - s)) \\ge tr(I_F(s)^{-1}). In order to minimize the overall\nL2 decoding error, one should minimize the attainable lower bound on the right side of Eq.(7), under\nappropriate constraints on h_k(\u00b7).\n\n\\left\\langle \\|\\hat{s} - s\\|^2 \\right\\rangle_s \\ge \\left\\langle \\mathrm{tr}(I_F(s)^{-1}) \\right\\rangle_s   (7)\n\n2.4 Mutual Information Limit\n\nAnother possible measurement of neural coding quality is the mutual information. This quantity\ndoes not explicitly rely on an estimator \\hat{s}(r) but directly measures the mutual information between\nthe response and the stimulus.\nThe link between mutual information and the Fisher information matrix was established in [16]. One\ngoal (infomax) is to maximize the mutual information I(r, s) = H(r) - H(r|s). Assuming perfect\nintegration, the first term H(r) asymptotically converges to a constant H(s) for long encoding\ntime because the noise is Gaussian. The second term H(r|s) = \\langle H(r|s^*) \\rangle_{s^*} because the noise is\nindependent. For each s^*, the conditional entropy H(r|s = s^*) \\propto \\frac{1}{2} \\log \\det I_F(s^*) since r|s^* is\nasymptotically a Gaussian variable with covariance I_F(s^*). Therefore the mutual information is\n\nI(r, s) = \\mathrm{const} + \\frac{1}{2} \\left\\langle \\log \\det I_F(s) \\right\\rangle_s   (8)\n\n2.5 Diffeomorphic Population\n\nBefore one can formalize the optimal coding problem, some assumptions about the neural population\nneed to be made. 
Under a diffeomorphic assumption, the number of neurons (m) in the population\nmatches the dimensionality (n) of the input stimulus. Each neuron projects the signal s onto its basis\nw_k and passes the one-dimensional projection t_k = w_k^T s through a sigmoidal tuning curve h_k(\u00b7)\nwhich is bounded 0 \u2264 h_k(\u00b7) \u2264 1. The tuning curve is\n\nr_k = h_k(w_k^T s).   (9)\n\nWe would like to optimize for the nonlinear functions h_1(\u00b7), . . . , h_n(\u00b7) and the basis {w_k}_{k=1}^n\nsimultaneously. We may assume \\|w_k\\| = 1 since the scale can be compensated by the nonlinearity.\nSuch an encoding scheme is called diffeomorphic because the population establishes a smooth and\ninvertible mapping from the stimulus space s \u2208 S to the rate space r \u2208 R. An arbitrary observation\nof the firing rate r can be first inverted to calculate the hidden variables t_k = h_k^{-1}(r_k) and then\nlinearly decoded to obtain \\hat{s}_{MLE}.\nFig.1a shows how the encoding scheme is implemented by a neural network. Fig.1b illustrates\nexplicitly how a 2D stimulus s is encoded by two neurons with basis w_1, w_2 and nonlinear mappings\nh_1, h_2.\n\nFigure 1: (a) Illustration of a neural network with diffeomorphic encoding. (b) The Linear-Nonlinear (LN)\nencoding process of 2D stimulus for a stimulus s.\n\n3 Review of One Dimensional Solution\n\nIn the case of encoding a one-dimensional stimulus, the diffeomorphic population is just one neuron\nwith sigmoidal tuning curve r = h(w \u00b7 s). The only two options w = \u00b11 are determined by whether\nthe sigmoidal tuning curve is increasing or decreasing. Here we simply assume w = 1.\nFor the L2-minimization problem, we want to minimize \\langle tr(I_F(s)^{-1}) \\rangle = \\langle h'(s)^{-2} \\rangle because of\nEq.(5) and (7). Now apply H\u00f6lder\u2019s inequality [30] to non-negative functions p(s)/h'(s)^2 and h'(s),\n\n\\underbrace{\\left( \\int \\frac{p(s)}{h'(s)^2} ds \\right)}_{\\text{overall L2 loss}} \\cdot \\underbrace{\\left( \\int h'(s) ds \\right)^2}_{=1} \\ge \\left( \\int p(s)^{1/3} ds \\right)^3   (10)\n\nThe minimum L2 loss is attained by the optimal h^*(s) \\propto \\int_{-\\infty}^{s} p(t)^{1/3} dt. For a one dimensional\nGaussian with variance Var[s], the right side of Eq.(10) is 6\\sqrt{3}\\pi Var[s]. This preliminary result\nwill be useful for the high dimensional case discussed in Section 4 and 5.\nOn the other hand, for the infomax problem we want to maximize I(r, s) because of Eq.(5) and (8).\nNote that \\langle \\log \\det I_F(s) \\rangle = 2 \\langle \\log h'(s) \\rangle. By treating the sigmoidal activation function h(s) as a\ncumulative probability distribution [10], we have\n\n\\int p(s) \\log h'(s) ds \\le \\int p(s) \\log p(s) ds   (11)\n\nbecause the KL-divergence D_{KL}(p||h') = \\int p(s) \\log p(s) ds - \\int p(s) \\log h'(s) ds is non-negative.\nThe optimal solution is h^*(s) = \\int_{-\\infty}^{s} p(t) dt and the optimal value is 2H(p), where H(p) is the\ndifferential entropy of the distribution p(s). This h^*(s) is exactly obtained by equalizing the output\nprobability to maximize the entropy. For a one dimensional Gaussian with variance Var[s], the\noptimal value is log Var[s] + const.\n\n4 Optimal Diffeomorphic Population\n\nIn the case of encoding a high-dimensional random stimulus using a diffeomorphic population code,\nn neurons encode n stimulus dimensions. The gradient of the k-th neuron\u2019s tuning curve is \\nabla_k =\nh_k'(w_k^T s) w_k and the Fisher information matrix is thus\n\nI_F(s) = \\sum_{k=1}^{n} \\nabla_k \\nabla_k^T = \\sum_{k=1}^{n} h_k'(w_k^T s)^2 w_k w_k^T = W H^2 W^T   (12)\n\nwhere W = (w_1, . . . , w_n) and H = diag(h_1'(w_1^T s), . . . , h_n'(w_n^T s)). Using the fact that\ntr(AB) = tr(BA) for any matrices A, B, we know tr(I_F(s)^{-1}) = tr((W^T)^{-1} H^{-2} W^{-1}) =\ntr((W^T W)^{-1} H^{-2}). Because H^{-2} is diagonal, the L2-min problem is simplified as\n\n\\underset{\\{w_k, h_k(\\cdot)\\},\\, k=1 \\ldots n}{\\mathrm{minimize}} \\quad L(W, H) = \\langle \\mathrm{tr}(I_F(s)^{-1}) \\rangle = \\sum_{k=1}^{n} [(W^T W)^{-1}]_{kk} \\int \\frac{p(s)}{h_k'(w_k^T s)^2} ds   (13)\n\nIf we define the marginal distribution\n\np_k(t) = \\int p(s) \\delta(t - w_k^T s) ds   (14)\n\nthen the optimization over w_k and h_k can be decoupled in the following way. For any fixed W,\nthe integral term can be evaluated by marginalizing out all those directions perpendicular to w_k. As\ndiscussed in section 3, the optimal value (\\int p_k(t)^{1/3} dt)^3 is attained when {h_k^*}'(t) \\propto p_k(t)^{1/3}. The\noptimization problem is now\n\n\\underset{\\{w_k\\},\\, k=1 \\ldots n}{\\mathrm{minimize}} \\quad L_{h^*}(W) = \\sum_{k=1}^{n} [(W^T W)^{-1}]_{kk} \\left( \\int p_k(t)^{1/3} dt \\right)^3   (15)\n\nIn general, analytically optimizing such a term for arbitrary prior distribution p(s) is intractable.\nHowever if p(s) is multivariate Gaussian then the optimization can be further simplified and solved\nanalytically, as discussed in the following section.\n\n5 Stimulus with Gaussian Prior\n\nWe consider the case when the stimulus prior is Gaussian N(0, \u03a3). This assumption allows us to\ncalculate the marginal distribution along any direction w_k as a one-dimensional Gaussian with\nmean zero and variance w_k^T \u03a3 w_k = (W^T \u03a3 W)_{kk}. 
By plugging in the Gaussian density p_k(t) and\nusing the fact we derived in Section 3, we can further simplify the L2-optimization problem as\n\n\\underset{\\{w_k\\},\\, k=1 \\ldots n}{\\mathrm{minimize}} \\quad L_{h^*}(W) = 6\\sqrt{3}\\pi \\cdot \\sum_{k=1}^{n} [(W^T W)^{-1}]_{kk} (W^T \\Sigma W)_{kk}   (16)\n\n5.1 Geometric Interpretation\n\nIn the above optimization problem, (W^T \u03a3 W)_{kk} has a clear and simple meaning \u2013 it is the variance\nof the marginal distribution p_k(t). For the term [(W^T W)^{-1}]_{kk}, notice that W^T W is the inner product\nmatrix of the basis {w_k}_{k=1}^n, i.e. (W^T W)_{ij} = w_i^T w_j. Using the adjoint method we can calculate\nthe diagonal elements of (W^T W)^{-1},\n\n[(W^T W)^{-1}]_{kk} = \\frac{\\det(W_k^T W_k)}{\\det(W^T W)}   (17)\n\nwhere W_k^T W_k is the inner product matrix of the leave-w_k-out basis {w_1, . . . , w_{k-1}, w_{k+1}, . . . , w_n}.\nLet \u03b8_k be the angle between w_k and the hyperplane spanned by all other basis vectors (see Fig.2).\nThe diagonal element is just [(W^T W)^{-1}]_{kk} = (det W_k / det W)^2 = (sin \u03b8_k)^{-2} simply because\n\n\\underbrace{\\mathrm{Volume}(\\{w_1, \\ldots, w_n\\})}_{n \\text{ dim parallelogram}} = \\underbrace{\\mathrm{Volume}(\\{w_1, \\ldots, w_{k-1}, w_{k+1}, \\ldots, w_n\\})}_{n-1 \\text{ dim base parallelogram}} \\cdot \\underbrace{|w_k| \\cdot \\sin \\theta_k}_{\\text{height}}   (18)\n\nFigure 2: Illustration of \u03b8_k. In this example, w_1\nand w_2 are on the s_1-s_2 plane. \u03b8_3 is just the angle\nbetween w_3 and its projection on the s_1-s_2 plane.\n\nThe optimization involves two competing parts. Minimizing (W^T \u03a3 W)_{kk} makes all those directions\nwith small variance favorable. Meanwhile, minimizing [(W^T W)^{-1}]_{kk} = (sin \u03b8_k)^{-2} strongly\npenalizes neurons having similar tuning directions with the rest of the population. 
To qualitatively summarize, the optimal population would tend to encode those directions with small variance while\nkeeping a certain degree of population diversity.\n\n5.2 General Solution\n\nDue to space limitations, we will only present the optimal solution here and the derivation can be\nfound in Appendix C in the supplementary notes. For any covariance matrix \u03a3, the optimal solution\nfor Eq.(16) is\n\nW^* = \\Sigma^{-1/4} U, \\quad \\text{where } U^T U = I \\text{ and } (U^T \\Sigma^{1/2} U)_{kk} = \\frac{1}{n} \\mathrm{tr}(\\Sigma^{1/2}) \\text{ for all } k = 1, \\ldots, n   (19)\n\nSuch a unitary matrix U is guaranteed to exist yet may not be unique. See Appendix D for a detailed\ndiscussion. In general for dimension n, the solution has a manifold structure with dimension not\nless than (n - 1)(n - 2)/2. For n = 2 the solution can be easily derived. Let \u03a3 = diag(\u03c3_x^2, \u03c3_y^2).\nThen the optimal solution is given by\n\nU = \\frac{1}{\\sqrt{2}} \\begin{pmatrix} 1 & -1 \\\\ 1 & 1 \\end{pmatrix}, \\quad W^*_{L2} = \\Sigma^{-1/4} U = \\frac{1}{\\sqrt{2}} \\begin{pmatrix} \\frac{1}{\\sqrt{\\sigma_x}} & -\\frac{1}{\\sqrt{\\sigma_x}} \\\\ \\frac{1}{\\sqrt{\\sigma_y}} & \\frac{1}{\\sqrt{\\sigma_y}} \\end{pmatrix}   (20)\n\nThis 2D solution is special and is unique up to reflection and permutation unless the prior distribution\nis spherically symmetric, i.e. \u03a3 = aI.\n\n6 Comparison with Infomax Solution\n\nPrevious studies have focused on finding solutions that maximize the mutual information (infomax)\nbetween the stimulus and the neural population response. This is related to independent component\nanalysis (ICA) [24]. Mutual information can be maximized if and only if each neuron encodes\nan independent component of the stimulus and uses the proper nonlinear tuning curve. Ideally,\nthe joint distribution p(s) can be decomposed as the product of n one dimensional components\n\\prod_{k=1}^{n} p_k(W_k(s)). For a Gaussian prior with covariance \u03a3, the infomax solution is\n\nW^*_{info} = \\Sigma^{-1/2} U \\;\\Rightarrow\\; \\mathrm{cov}(W^{*T}_{info} s) = U^T \\Sigma^{-1/2} \\cdot \\Sigma \\cdot \\Sigma^{-1/2} U = I   (21)\n\nwhere \u03a3^{-1/2} is the whitening matrix and U is an arbitrary unitary matrix. The derivation can be\nfound in Appendix E. In the same 2D example where \u03a3 = diag(\u03c3_x^2, \u03c3_y^2), the family of optimal\nsolutions is parametrized by an angular variable \u03c6\n\nU(\\phi) = \\begin{pmatrix} \\cos\\phi & -\\sin\\phi \\\\ \\sin\\phi & \\cos\\phi \\end{pmatrix}, \\quad W^*_{info}(\\phi) = \\Sigma^{-1/2} \\cdot U(\\phi) = \\begin{pmatrix} \\frac{\\cos\\phi}{\\sigma_x} & -\\frac{\\sin\\phi}{\\sigma_x} \\\\ \\frac{\\sin\\phi}{\\sigma_y} & \\frac{\\cos\\phi}{\\sigma_y} \\end{pmatrix}   (22)\n\nIn Fig.3 we compare W^*_{info}(\u03c6) and W^*_{L2} for different prior covariances. One observation is that L2\noptimal neurons do not fully decorrelate input signals unless the Gaussian prior is spherical. By\ncorrelating the input signal and encoding redundant information, the channel signal to noise ratio\n(SNR) can be balanced to reduce the vulnerability of those independent channels with low SNR.\nAs a consequence, the overall L2 performance is improved at the cost of transferring a suboptimal\namount of information. Another important observation is that the infomax solution allows a greater\ndegree of symmetry \u2013 Eq.(21) holds for arbitrary unitary matrices while Eq.(19) holds only for a\nsubset of them.\n\n[Figure 3 panels: rows \u03c3_x = \u03c3_y, \u03c3_x = 2\u03c3_y, \u03c3_x = 3\u03c3_y; columns (a), (b) L2-min and (c), (d) infomax]\n\nFigure 3: Comparison of L2-min and infomax optimal solution for 2D case. Each row represents the result\nfor different ratio \u03c3_x/\u03c3_y for the prior distribution. 
(a) The optimal pair of basis vectors w_1, w_2 for L2-min\nwith the prior covariance ellipse is unique unless the prior distribution has rotational symmetry. (b) The loss\nfunction with \u201c+\u201d marking the optimal solution shown in (a). (c) One pair of optimal basis vectors w_1, w_2 for\ninfomax with the prior covariance ellipse. (d) The loss function with \u201c+\u201d marking the optimal solution shown\nin (c).\n\n[Figure 3 axes: basis-vector panels are drawn on the s_1-s_2 plane; loss-function panels use axes \u03b1 (degree) and \u03b2 (degree), each from 0 to 180]\n\n7 Application \u2013 16-by-16 Gaussian Images\n\nIn this section we apply our diffeomorphic coding scheme to an image representation problem. We\nassume that the intensity values of all pixels from a set of 16-by-16 images follow a 256-D Gaussian\ndistribution. Instead of directly defining the pairwise covariance between pixels of s, we calculate\nits real Fourier components \\tilde{s}\n\n\\tilde{s} = F^T s \\Leftrightarrow s = F \\tilde{s}   (23)\n\nwhere the real Fourier matrix is F = (f_1, . . . , f_256) with each filter f_a and its spatial frequency \\vec{k}_a.\nThe covariance of those Fourier components \\tilde{s} is typically assumed to be diagonal and the power\ndecays following some power law\n\n\\mathrm{cov}(\\tilde{s}) = D = \\mathrm{diag}(\\sigma_1^2, \\ldots, \\sigma_n^2), \\quad \\text{where } \\sigma_a^2 \\propto |\\vec{k}_a|^{-\\beta}, \\; \\beta > 0   (24)\n\nTherefore the original stimulus s has covariance cov(s) = \u03a3 = F D F^T. Such image statistics are\ncalled stationary because the covariance between a pair of pixels is fully determined by their relative\nposition. 
For the stimulus s with covariance \u03a3, one naive choice of L2 optimal filter is simply\n\nW^*_{L2} = \\Sigma^{-1/4} \\cdot I = F D^{-1/4} F^T   (25)\n\nbecause \u03a3^{1/2} = F D^{1/2} F^T has constant diagonal terms (See Appendix F for detailed calculation)\nand U = I qualifies for Eq.(19). The covariance matrix and one sample image generated from \u03a3 are\nplotted in Fig. 4(a)-(c) below.\n\nFigure 4: For \u03b2 = 2.5 in the power law: (a) The 256 \u00d7 256 covariance matrix \u03a3. (b) One column of \u03a3\nreshaped to 16 \u00d7 16 matrix representing the covariance between any pixels and a fixed pixel in the center. (c)\nA random sample from the Gaussian distribution with covariance \u03a3.\n\nIn addition, we have numerically computed the L2 loss using a family of filters\n\nW_\\gamma = F D^{-\\gamma} F^T, \\quad \\gamma \\in [0, 1/2]   (26)\n\nNote that when \u03b3 = 0, we have the naive filter W_0 = F F^T = I which does nothing to the input\nstimulus; when \u03b3 = 1/4 or 1/2, we revisit the L2 optimal filter or the infomax filter, respectively. As\nwe can see from Fig. 5(a)-(d), the L2 optimal filter half-decorrelates the input stimulus channels to\nkeep the balance between the simplicity of the filters and the simplicity of the correlation structure.\nIn each simulation run, a set of 10,000 16-by-16 images is randomly sampled from the multivariate\nGaussian distribution with zero mean and covariance matrix \u03a3. For each stimulus image s, we\ncalculate y = W_\u03b3^T s and z_k = h_k(y_k) + \u03b7_k to simulate the encoding process. Here h_k(y) \\propto\n\\int_{-\\infty}^{y} p_k(t)^{1/3} dt and p_k(t) is Gaussian N(0, (W_\u03b3^T \u03a3 W_\u03b3)_{kk}). The additive Gaussian noise \u03b7_k is\nindependent Gaussian N(0, 10^{-4}). To decode, we just calculate \\hat{y}_k = h_k^{-1}(z_k) and \\hat{s} = (W_\u03b3^T)^{-1} \\hat{y}.\nThen we measure the L2 loss \\|\\hat{s} - s\\|_2^2. This procedure is repeated 20 times and the result is plotted\nin Fig. 5(e).\n\nFigure 5: (a) The 2D filter W_\u03b3 of one specific neuron for \u03b3 = 0, 1/4, 1/2 from top to bottom. (b) The\ncross-section of the filter W_\u03b3 on one specific row boxed in (a), plotted as a function. (c) The correlation of the\n2D filtered stimulus, between one specific neuron and all neurons. (d) The cross-section of the 2D correlation\nof the filtered stimulus, between the neuron and other neurons on the same row. (e) The simulation result of L2\nloss for different filter W_\u03b3 and optimal nonlinearity h and the vertical bar shows the \u00b13\u03c3 interval across trials.\n\n8 Discussion and Conclusions\n\nIn this paper, we have studied an optimal diffeomorphic neural population code which minimizes\nthe L2 reconstruction error. The population of neurons is assumed to have sigmoidal activation\nfunctions encoding linear combinations of a high dimensional stimulus with a multivariate Gaussian\ndistribution. The optimal solution is provided and compared with solutions which maximize the\nmutual information.\nIn order to derive the optimal solution, we first show that the Poisson noise model is equivalent to\nthe constant Gaussian noise under the variance stabilizing transformation. Then we relate the L2\nreconstruction error to the trace of inverse Fisher information matrix via the Cramer-Rao bound.\nMinimizing this bound leads to the global optimal solution in the asymptotic limit of long integration\ntime. The general L2-minimization problem can be simplified and the optimal solution is\nanalytically derived when the stimulus distribution is Gaussian.\nCompared to the infomax solutions, a careful evaluation and calculation of the Fisher information\nmatrix is needed for L2 minimization. 
The manifold of L2 optimal solutions possesses a lower dimensional\nstructure compared to the infomax solution. Instead of decorrelating the input statistics,\nthe L2-min solution maintains a certain degree of correlation across the channels. Our result suggests\nthat maximizing mutual information and minimizing the overall decoding loss are not the same\nin general \u2013 encoding redundant information can be beneficial to improve reconstruction accuracy.\nThis principle may explain the existence of correlations at many layers in biological perception\nsystems.\nAs an example, we have applied our theory to 16-by-16 images with stationary pixel statistics. The\noptimal solution exhibits center-surround receptive fields, but with a decay differing from those\nfound by decorrelating solutions. We speculate that these solutions may better explain observed\ncorrelations measured in certain neural areas of the brain. Finally, we acknowledge the support of\nthe Office of Naval Research.\n\nReferences\n\n[1] K Kang, RM Shapley, and H Sompolinsky. Information tuning of populations of neurons in\n\nprimary visual cortex. Journal of neuroscience, 24(15):3726\u20133735, 2004.\n\n[2] AP Georgopoulos, AB Schwartz, and RE Kettner. Adaptation of the motion-sensitive neuron\n\nh1 is generated locally and governed by contrast frequency. Science, 233:1416\u20131419, 1986.\n\n[3] FE Theunissen and JP Miller. Representation of sensory information in the cricket cercal\nsensory system. II. information theoretic calculation of system accuracy and optimal tuning-\ncurve widths of four primary interneurons. J Neurophysiol, 66(5):1690\u20131703, November 1991.\n[4] DC Fitzpatrick, R Batra, TR Stanford, and S Kuwada. A neuronal population code for sound\n\nlocalization. 
Nature, 388:871\u2013874, 1997.\n\n[5] NS Harper and D McAlpine. Optimal neural population coding of an auditory spatial cue.\n\nNature, 430:682\u2013686, 2004.\n\n[6] N Brenner, W Bialek, and R de Ruyter van Steveninck. Adaptive rescaling maximizes infor-\n\nmation transmission. Neuron, 26:695\u2013702, 2000.\n\n[7] Tvd Twer and DIA MacLeod. Optimal nonlinear codes for the perception of natural colours.\n\nNetwork: Computation in Neural Systems, 12(3):395\u2013407, 2001.\n\n[8] I Dean, NS Harper, and D McAlpine. Neural population coding of sound level adapts to\n\nstimulus statistics. Nature neuroscience, 8:1684\u20131689, 2005.\n\n[9] Y Ozuysal and SA Baccus. Linking the computational structure of variance adaptation to\n\nbiophysical mechanisms. Neuron, 73:1002\u20131015, 2012.\n\n[10] SB Laughlin. A simple coding procedure enhances a neurons information capacity. Z. Natur-\n\nforschung, 36c(3):910\u2013912, 1981.\n\n[11] J-P Nadal and N Parga. Non linear neurons in the low noise limit: A factorial code maximizes\n\ninformation transfer, 1994.\n\n[12] M Bethge, D Rotermund, and K Pawelzik. Optimal short-term population coding: when Fisher\n\ninformation fails. Neural Computation, 14:2317\u20132351, 2002.\n\n[13] M Bethge, D Rotermund, and K Pawelzik. Optimal neural rate coding leads to bimodal \ufb01ring\n\nrate distributions. Netw. Comput. Neural Syst., 14:303\u2013319, 2003.\n\n[14] MD McDonnell and NG Stocks. Maximally informative stimuli and tuning curves for sig-\n\nmoidal rate-coding neurons and populations. Phys. Rev. Lett., 101:058103, 2008.\n\n[15] Z Wang, A Stocker, and DD Lee. Optimal neural tuning curves for arbitrary stimulus distribu-\ntions: Discrimax, infomax and minimum lp loss. Adv. 
Neural Information Processing Systems,\n25:2177\u20132185, 2012.\n\n[16] N Brunel and J-P Nadal. Mutual information, \ufb01sher information and population coding. Neural\n\nComputation, 10(7):1731\u20131757, 1998.\n\n[17] K Zhang and TJ Sejnowski. Neuronal tuning: To sharpen or broaden? Neural Computation,\n\n11:75\u201384, 1999.\n\n[18] A Pouget, S Deneve, J-C Ducom, and PE Latham. Narrow versus wide tuning curves: Whats\n\nbest for a population code? Neural Computation, 11:85\u201390, 1999.\n\n[19] H Sompolinsky and H Yoon. The effect of correlations on the \ufb01sher information of population\n\ncodes. Advances in Neural Information Processing Systems, 11, 1999.\n\n[20] AP Nikitin, NG Stocks, RP Morse, and MD McDonnell. Neural population coding is optimized\n\nby discrete tuning curves. Phys. Rev. Lett., 103:138101, 2009.\n\n[21] D Ganguli and EP Simoncelli. Implicit encoding of prior probabilities in optimal neural pop-\n\nulations. Adv. Neural Information Processing Systems, 23:658\u2013666, 2010.\n\n[22] S Yaeli and R Meir. Error-based analysis of optimal tuning functions explains phenomena\n\nobserved in sensory neurons. Front Comput Neurosci, 4, 2010.\n\n[23] E Doi and MS Lewicki. Characterization of minimum error linear coding with sensory and\n\nneural noise. Neural Computation, 23, 2011.\n\n[24] AJ Bell and TJ Sejnowski. An information-maximization approach to blind separation and\n\nblind deconvolution. Neural Computation, 7:1129\u20131159, 1995.\n\n[25] DJ Field BA Olshausen. Emergence of simple-cell receptive \ufb01eld properties by learning a\n\nsparse code for natural images. Nature, 381:607\u2013609, 1996.\n\n[26] A Hyvarinen and E Oja. Independent component analysis: Algorithms and applications. Neu-\n\nral Networks, 13:411\u2013430, 2000.\n\n[27] P Berens, A Ecker, S Gerwinn, AS Tolias, and M Bethge. Reassessing optimal neural pop-\nulation codes with neurometric functions. 
Proceedings of the National Academy of Sciences,\n11:4423\u20134428, 2011.\n\n[28] TM Cover and J Thomas. Elements of Information Theory. Wiley, 1991.\n[29] EL Lehmann and G Casella. Theory of point estimation. New York: Springer-Verlag., 1999.\n[30] GH Hardy, JE Littlewood, and G Polya. Inequalities, 2nd ed. Cambridge University Press,\n\n1988.", "award": [], "sourceid": 230, "authors": [{"given_name": "Zhuo", "family_name": "Wang", "institution": "University of Pennsylvania"}, {"given_name": "Alan", "family_name": "Stocker", "institution": "University of Pennsylvania"}, {"given_name": "Daniel", "family_name": "Lee", "institution": "University of Pennsylvania"}]}