{"title": "Optimal Neural Tuning Curves for Arbitrary Stimulus Distributions: Discrimax, Infomax and Minimum $L_p$ Loss", "book": "Advances in Neural Information Processing Systems", "page_first": 2168, "page_last": 2176, "abstract": "In this work we study how the stimulus distribution influences the optimal coding of an individual neuron. Closed-form solutions to the optimal sigmoidal tuning curve are provided for a neuron obeying Poisson statistics under a given stimulus distribution. We consider a variety of optimality criteria, including maximizing discriminability, maximizing mutual information and minimizing estimation error under a general $L_p$ norm.  We generalize the Cramer-Rao lower bound and show how the $L_p$ loss can be written as a functional of the Fisher Information in the asymptotic limit, by proving the moment convergence of certain functions of Poisson random variables.  In this manner, we show how the optimal tuning curve depends upon the loss function, and the equivalence of maximizing mutual information with minimizing $L_p$ loss in the limit as $p$ goes to zero.", "full_text": "Optimal Neural Tuning Curves for Arbitrary\n\nStimulus Distributions: Discrimax, Infomax and\n\nMinimum Lp Loss\n\nZhuo Wang\n\nDepartment of Mathematics\nUniversity of Pennsylvania\n\nPhiladelphia, PA 19104\n\nwangzhuo@sas.upenn.edu\n\nAlan A. Stocker\n\nDepartment of Psychology\nUniversity of Pennsylvania\n\nPhiladelphia, PA 19104\n\nastocker@sas.upenn.edu\n\nDepartment of Electrical and Systems Engineering\n\nDaniel D. Lee\n\nUniversity of Pennsylvania\n\nPhiladelphia, PA 19104\n\nddlee@seas.upenn.edu\n\nAbstract\n\nIn this work we study how the stimulus distribution in\ufb02uences the optimal coding\nof an individual neuron. Closed-form solutions to the optimal sigmoidal tuning\ncurve are provided for a neuron obeying Poisson statistics under a given stimulus\ndistribution. 
We consider a variety of optimality criteria, including maximizing discriminability, maximizing mutual information and minimizing estimation error under a general $L_p$ norm. We generalize the Cramer-Rao lower bound and show how the $L_p$ loss can be written as a functional of the Fisher Information in the asymptotic limit, by proving the moment convergence of certain functions of Poisson random variables. In this manner, we show how the optimal tuning curve depends upon the loss function, and the equivalence of maximizing mutual information with minimizing $L_p$ loss in the limit as $p$ goes to zero.\n\n1 Introduction\n\nA neuron represents sensory information via its spike train. Rate coding maps an input stimulus to a spiking rate via the neuron's tuning curve. Previous work in computational neuroscience has tried to explain this mapping via optimality criteria. An important factor determining the optimal shape of the tuning curve is the input statistics of the stimulus. It has previously been observed that environmental statistics can influence the tuning curves of sensory neurons [1, 2, 3, 4, 5]. However, most theoretical analyses have assumed the input stimulus distribution to be uniform. Only recently has theoretical work demonstrated how non-uniform prior distributions affect the optimal shape of neural tuning curves [6, 7, 8, 9, 10].\nAnother important factor in determining the optimal tuning curve is the optimality criterion [11]. 
Most previous work used the local Fisher Information [12, 13, 14], the squared estimation loss or discriminability (discrimax) [15, 16], or the mutual information (infomax) [9, 17] to evaluate neural codes. It has been shown that both the squared loss and the mutual information are related to the Fisher Information via lower bounds: the lower bound on the estimation squared loss is provided by the Cramer-Rao lower bound [18, 19], and the mutual information can be lower bounded by a functional of the Fisher Information as well [7]. It has also been proved that both lower bounds can be attained, provided that the encoding time is long enough and the estimator behaves well in the asymptotic limit. However, no previous study has integrated these two lower bounds into a more general framework.\nIn this paper, we ask which tuning curve optimally encodes a stimulus with an arbitrary prior distribution such that the $L_p$ estimation loss is minimized. We are able to provide analytical solutions to this question. With the asymptotic analysis of the maximum likelihood estimator (MLE), we can show how the $L_p$ loss converges to a functional of the Fisher Information in the limit of long encoding time. The optimization of this functional can be carried out for an arbitrary stimulus prior and for all $p \geq 0$. The special case $p = 2$ and the limit $p \to 0$ correspond to discrimax and infomax, respectively. The general result offers a framework for understanding the infomax problem from a new point of view: maximizing mutual information is equivalent to minimizing the expected $L_0$ loss.\n\n2 Model and Methods\n\n2.1 Encoding and Decoding Model\n\nThroughout this paper we denote the scalar stimulus by $s$. The stimulus follows an arbitrary prior distribution $\pi(s)$. The encoding process involves a probabilistic mapping from the stimulus to a random number of spikes. 
For each $s$, the neuron will fire at a predetermined firing rate $h(s)$, representing the neuron's tuning curve. The encoded information will contain some noise due to neural variability. According to the conventional Poisson noise model, if the available coding time is $T$, then the observed spike count $N$ has a Poisson distribution with parameter $\lambda = h(s)T$:\n\n$P[N = k] = \frac{1}{k!} (h(s)T)^k e^{-h(s)T}$ (1)\n\nThe tuning curve $h(s)$ is assumed to be sigmoidal, i.e. monotonically increasing, but limited to a certain range $h_{min} \leq h(s) \leq h_{max}$ due to biological constraints. The decoding process is the reverse of encoding. The estimator $\hat{s} = \hat{s}(N)$ should be a function of the observed count $N$. One conventional choice is the MLE. First, the MLE $\hat{\lambda}$ for the mean firing rate is $\hat{\lambda} = N/T$. Therefore the MLE for the stimulus $s$ is simply $\hat{s} = h^{-1}(\hat{\lambda})$.\n\n2.2 Fisher Information and Reversal Formula\n\nThe Fisher Information can be used to describe how well one can distinguish a specific distribution from its neighboring distributions within the same family of distributions. 
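The encoding-decoding model of section 2.1 can be sketched in a few lines (a sketch only; the logistic form of $h$, the range $[h_{min}, h_{max}]$ and all parameter values below are assumptions for the example, not taken from the paper): simulate the Poisson spike count of Eq. (1) and decode with $\hat{s} = h^{-1}(N/T)$.

```python
import numpy as np

rng = np.random.default_rng(0)
h_min, h_max, T = 1.0, 100.0, 50.0   # assumed firing-rate range (spikes/s) and coding time

def h(s):
    """Assumed sigmoidal (logistic) tuning curve, bounded in [h_min, h_max]."""
    return h_min + (h_max - h_min) / (1.0 + np.exp(-s))

def h_inv(rate):
    """Inverse tuning curve; clip so the inverse is defined for any observed rate."""
    r = np.clip(rate, h_min + 1e-9, h_max - 1e-9)
    return np.log((r - h_min) / (h_max - r))

s = 0.5                            # true stimulus
N = rng.poisson(h(s) * T)          # encoding: spike count N ~ Poisson(h(s) T), Eq. (1)
lam_hat = N / T                    # MLE of the mean firing rate
s_hat = h_inv(lam_hat)             # MLE of the stimulus, s_hat = h^{-1}(lam_hat)
print(s, s_hat)
```

For long coding times $T$, $\hat{s}$ concentrates around the true $s$; the clipping in `h_inv` only matters for the rare counts that would fall outside the admissible rate range.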
For a family of distributions with scalar parameter $s$, the Fisher Information is defined as\n\n$I(s) = \int \left( \frac{\partial}{\partial s} \log P(N|s) \right)^2 P(N|s) \, dN.$ (2)\n\nFor a tuning function $h(s)$ with the Poisson spiking model, the Fisher Information is (see [12, 7])\n\n$I_h(s) = T \frac{h'(s)^2}{h(s)}$ (3)\n\nFurther, with the sigmoidal assumption, by solving the above ordinary differential equation we can derive the inverse formula in Eq. (4) and an equivalent constraint on the Fisher Information in Eq. (5):\n\n$h(s) = \left( \sqrt{h_{min}} + \frac{1}{2\sqrt{T}} \int_{-\infty}^{s} \sqrt{I_h(t)} \, dt \right)^2$ (4)\n\n$\int_{-\infty}^{\infty} \sqrt{I_h(t)} \, dt \leq 2\sqrt{T} \left( \sqrt{h_{max}} - \sqrt{h_{min}} \right)$ (5)\n\nThis constraint is closely related to Jeffreys' prior, which claims that $\pi^*(s) \propto \sqrt{I(s)}$ is the least informative prior. The above inequality means that the normalization factor of Jeffreys' prior is finite, as long as the range of firing rates is limited, $h_{min} \leq h(s) \leq h_{max}$.\n\n3 Two Bounds on Loss Function via Fisher Information\n\n3.1 Cramer-Rao Bound\n\nThe Cramer-Rao bound [18] for unbiased estimators is\n\n$E[(\hat{s} - s)^2 | s] \geq \frac{1}{I(s)}$ (6)\n\nWe can achieve maximum discriminability $\delta^{-1}$ by minimizing the mean asymptotic squared error (MASE), defined in [15] as\n\n$\delta^2 = E[(\hat{s} - s)^2] \geq \int \frac{\pi(s)}{I_h(s)} \, ds.$ (7)\n\nAlthough Eq. (7) is only a lower bound, it is attained asymptotically by the MLE of $s$. 
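That the MLE attains the bound in Eq. (6) can be checked by Monte Carlo (a sketch under assumed settings; the logistic tuning curve and the values of $h_{min}$, $h_{max}$, $T$ and $s$ are illustrative choices, not from the paper): the conditional squared error $E[(\hat{s} - s)^2 | s]$ should approach $1/I(s)$ with $I(s) = T h'(s)^2 / h(s)$ as in Eq. (3).

```python
import numpy as np

rng = np.random.default_rng(2)
h_min, h_max, T, s = 1.0, 100.0, 200.0, 0.3   # assumed illustrative parameters

def h(x):
    """Assumed logistic tuning curve, bounded in [h_min, h_max]."""
    return h_min + (h_max - h_min) / (1.0 + np.exp(-x))

def h_prime(x):
    e = np.exp(-x)
    return (h_max - h_min) * e / (1.0 + e) ** 2

def h_inv(rate):
    r = np.clip(rate, h_min + 1e-9, h_max - 1e-9)
    return np.log((r - h_min) / (h_max - r))

# Monte Carlo estimate of E[(s_hat - s)^2 | s] for the MLE s_hat = h^{-1}(N/T)
N = rng.poisson(h(s) * T, size=100_000)
mse = np.mean((h_inv(N / T) - s) ** 2)

# Cramer-Rao limit 1/I(s), with I(s) = T h'(s)^2 / h(s) as in Eq. (3)
crb = h(s) / (T * h_prime(s) ** 2)
print(mse, crb)   # the two values agree closely for long T
```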
In order to optimize the right-hand side of Eq. (7) under the constraint Eq. (5), the variational method can be applied, and the optimality condition and the optimal solution can be written as\n\n$I_h(s) \propto \pi(s)^{2/3}, \qquad h_2(s) = \left( \sqrt{h_{min}} + \left( \sqrt{h_{max}} - \sqrt{h_{min}} \right) \frac{\int_{-\infty}^{s} \pi(t)^{1/3} \, dt}{\int_{-\infty}^{\infty} \pi(t)^{1/3} \, dt} \right)^2$ (8)\n\n3.2 Mutual Information Bound\n\nSimilarly to the Cramer-Rao bound, Brunel and Nadal [7] gave a lower bound on the mutual information between the MLE $\hat{s}$ and the environmental stimulus $s$:\n\n$I_{mutual}(\hat{s}, s) \geq H_\pi - \frac{1}{2} \int \pi(s) \log \left( \frac{2\pi e}{I_h(s)} \right) ds,$ (9)\n\nwhere $H_\pi$ is the entropy of the stimulus prior $\pi(s)$. Although this is only a lower bound on the mutual information, which we want to maximize, the equality holds asymptotically for the MLE of $s$, as stated in [7]. To maximize the mutual information, we can maximize the right-hand side of Eq. (9) under the constraint of Eq. (5), again by the variational method, and obtain the optimality condition and optimal solution\n\n$I_h(s) \propto \pi(s)^2, \qquad h_0(s) = \left( \sqrt{h_{min}} + \left( \sqrt{h_{max}} - \sqrt{h_{min}} \right) \frac{\int_{-\infty}^{s} \pi(t) \, dt}{\int_{-\infty}^{\infty} \pi(t) \, dt} \right)^2$ (10)\n\nTo see the connection between the solutions in Eq. (8) and Eq. (10), we need the results of the following section.\n\n4 Asymptotic Behavior of Estimators\n\nIn general, minimizing a lower bound does not imply that the measures of interest, e.g. the left-hand sides of Eq. (7) and Eq. (9), are optimized. In order to make the lower bounds useful, we need to know the conditions under which there exist "good" estimators that can reach these theoretical lower bounds. First we will introduce some definitions of estimator properties. 
Let $T$ be the encoding time for a neuron with Poisson noise, and let $\hat{s}_T$ be the MLE at time $T$. If we denote $Y'_T = \sqrt{T}(\hat{s}_T - s)$ and $Z' \sim N(0, T/I(s))$, then the notions mentioned above are defined as follows:\n\n$E[Y'_T] \to 0$ (asymptotic consistency) (11)\n$\mathrm{Var}[Y'_T] \to T/I(s)$ (asymptotic efficiency) (12)\n$Y'_T \stackrel{D}{\to} Z'$ (asymptotic normality) (13)\n$E[|Y'_T|^p] \to E[|Z'|^p]$ ($p$-th moment convergence) (14)\n\nGenerally, the above four conditions are listed from the weakest to the strongest, top to bottom. For the equality in Eq. (7) to hold, we need asymptotically consistent and asymptotically efficient estimators. For the equality in Eq. (9) to hold, we need asymptotically normal estimators (see [7]). If we want to generalize the problem even further, i.e. to find the tuning curve which minimizes the $L_p$ estimation loss, then we need estimators whose moments converge for all $p$.\nHere we give two theorems proving that the MLE $\hat{s}$ of the true stimulus $s$ satisfies all four properties in Eq. (11)-(14). Let $h(s)$ be the tuning curve of a neuron with Poisson spiking noise. Then the MLE of $s$ is given by $\hat{s} = h^{-1}(\hat{\lambda})$. We will show that the limiting distribution of $\sqrt{T}(\hat{s}_T - s)$ is a Gaussian distribution with mean 0 and variance $h(s)/h'(s)^2$. We will also show that any positive $p$-norm of $\sqrt{T}(\hat{s}_T - s)$ converges to the $p$-norm of the corresponding Gaussian distribution. The proofs of Theorems 1 and 2 are provided in Appendix A.\nTheorem 1. Let $X_i$ be i.i.d. Poisson random variables with mean $\lambda$, and let $S_n = \sum_{i=1}^{n} X_i$ be the partial sum. 
Then\n\n(a) $S_n$ has a Poisson distribution with mean $n\lambda$.\n(b) $Y_n = \sqrt{n}(S_n/n - \lambda)$ converges to $Z \sim N(0, \lambda)$ in distribution.\n(c) The $p$-th moment of $Y_n$ converges, and $\lim_{n \to \infty} E_\lambda[|Y_n|^p] = E[|Z|^p]$ for all $p > 0$.\n\nOne direct application of this theorem is that, if the tuning curve is $h(s) = s$ (for $s > 0$) and the encoding time is $T$, then the estimator $\hat{s} = N/T$ is asymptotically efficient, since as $T \to \infty$, $\mathrm{Var}[\hat{s}] \to E[|Z/\sqrt{T}|^2] = s/T = 1/I(s)$.\nTheorem 2. Let $X_i$, $S_n$ be defined as in Theorem 1. Let $g(x)$ be any function with bounded derivative $|g'(x)| \leq M$. Then\n\n(a) $Y'_n = \sqrt{n}(g(S_n/n) - g(\lambda))$ converges to $Z' \sim N(0, \lambda g'(\lambda)^2)$ in distribution.\n(b) The $p$-th moment of $Y'_n$ converges, and $\lim_{n \to \infty} E_\lambda[|Y'_n|^p] = E[|Z'|^p]$ for all $p > 0$.\n\nTheorem 1 indicates that we can always estimate the firing rate $\lambda = h(s)$ efficiently by the estimator $\hat{\lambda} = N/T$. Theorem 2 indicates that we can also estimate a smooth transformation of the firing rate efficiently in the asymptotic limit $T \to \infty$. Now, if we go back to the conventional setting of the tuning curve $\lambda = h(s)$, we can estimate the stimulus by the estimator $\hat{s} = h^{-1}(\hat{\lambda})$. To meet the boundedness requirement on $g$, $|g'(\lambda)| \leq M$, we need $1/g'(\lambda) = h'(s) \geq 1/M$; hence this theory only works for stimuli from a compact set $s \in [-M, M]$, although $M$ can be chosen arbitrarily large. The larger $M$ is, the longer the encoding time $T$ necessary to observe the asymptotic normality and the convergence of moments.\nThe estimator $\hat{s} = h^{-1}(\hat{\lambda})$ is biased for finite $T$, but it is asymptotically unbiased and efficient. 
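Theorem 1(b)-(c) can be checked numerically with a quick Monte Carlo sketch (the values $\lambda = 4$, $n = 2000$, $p = 1$ and the sample size are arbitrary choices for the example): the $p$-th absolute moment of $Y_n = \sqrt{n}(S_n/n - \lambda)$ approaches that of $Z \sim N(0, \lambda)$.

```python
import numpy as np

rng = np.random.default_rng(1)
lam, n, trials, p = 4.0, 2000, 200_000, 1.0   # assumed illustrative parameters

# By Theorem 1(a), S_n is itself Poisson with mean n*lam, so sample it directly.
S_n = rng.poisson(lam * n, size=trials)
Y_n = np.sqrt(n) * (S_n / n - lam)

emp = np.mean(np.abs(Y_n) ** p)     # empirical E[|Y_n|^p]
# For Z ~ N(0, lam) and p = 1, E[|Z|] = sqrt(2 * lam / pi)
gauss = np.sqrt(2.0 * lam / np.pi)
print(emp, gauss)
```

The two printed values agree to within Monte Carlo error. By Theorem 2 the same convergence carries over to $\hat{s} = h^{-1}(\hat{\lambda})$, which, as noted above, is biased for finite $T$ but asymptotically unbiased and efficient.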
This is because, as $T \to \infty$,\n\n$E_s[\sqrt{T}(\hat{s}_T - s)] \to E[Z'] = 0$ (15)\n\n$\mathrm{Var}_s[\sqrt{T}(\hat{s}_T - s)] \to E[|Z'|^2] = \lambda (h^{-1})'(\lambda)^2 = \frac{h(s)}{h'(s)^2} = \frac{T}{I(s)}$ (16)\n\nFrom the above analysis we can see that whether $L_p(\hat{s}, s) = E_s[|\hat{s}_T - s|^p]$ saturates the lower bound relies fully upon asymptotic normality. With asymptotic normality, we can do more than just optimize $I_{mutual}(N, s)$ and $L_2(\hat{s}, s)$. In general, we can find the optimal tuning curve which minimizes the expected $L_p$ loss $L_p(\hat{s}, s)$, since as $T \to \infty$\n\n$E[|\sqrt{T}(\hat{s}_T - s)|^p] \to E[|Z'|^p]$ (17)\n\nwhere $Z' = \chi / \sqrt{I(s)/T}$ and $\chi \sim N(0, 1)$. To calculate the right-hand side of the above limit, we can use the fact that for any $p \geq 0$,\n\n$K(p) = E[|\chi|^p] = (\sqrt{2})^p \, \frac{\Gamma(\frac{p+1}{2})}{\Gamma(\frac{1}{2})}$ (18)\n\nwhere $\Gamma(\cdot)$ is the gamma function\n\n$\Gamma(z) = \int_0^\infty e^{-t} t^{z-1} \, dt$ (19)\n\nThe general conclusion is that (the Cramer-Rao lower bound is the special case $p = 2$)\n\n$E_s[|\sqrt{T}(\hat{s}_T - s)|^p] \to E[|Z'|^p] = \frac{K(p)}{(I(s)/T)^{p/2}}$ (20)\n\nFigure 1: (A) Illustration of the $L_p$ loss as a function of $|\hat{s} - s|$ for different values of $p$. When $p = 2$ the loss is the squared loss, and as $p \to 0$ the $L_p$ loss converges pointwise to the 0-1 loss. 
(B) The plot of the $p$-th absolute moment $K(p) = E[|\chi|^p]$ of a standard Gaussian random variable $\chi$, for $p \in [0, 4]$.\n\n5 Optimal Tuning Curves: Infomax, Discrimax and More\n\nWith asymptotic normality and moment convergence, we know that the asymptotic expected $L_p$ loss approaches $E[|Z'|^p]$ for each $s$. Hence\n\n$E[|\hat{s} - s|^p] \to \int \pi(s) E_s[|Z'|^p] \, ds = K(p) \int \frac{\pi(s)}{I(s)^{p/2}} \, ds.$ (21)\n\nTo obtain the optimal tuning curve for the $L_p$ loss, we need to solve a simple variational problem:\n\nminimize over $h$: $\int \pi(s) f_p(I_h(s)) \, ds$ (22)\nsubject to: $\int \sqrt{I_h(s)} \, ds \leq \mathrm{const}$ (23)\n\nwith $f'_p(x) = -x^{-p/2-1}$. To solve this variational problem, the Euler-Lagrange equation and the Lagrange multiplier method can be used to derive the optimality condition\n\n$0 = \frac{\partial}{\partial I_h} \left( \pi(s) f_p(I_h(s)) - \lambda \sqrt{I_h} \right) = \pi(s) f'_p(I_h(s)) - \frac{\lambda}{2} I_h(s)^{-1/2}$ (24)\n\n$\Rightarrow \quad \sqrt{I_h(s)} \propto \pi(s)^{1/(p+1)}$ (25)\n\nTherefore the $f_p$-optimal tuning curve, which minimizes the asymptotic $L_p$ loss, follows from (4) and (25) and is given by the equation below. For some examples of $L_p$-optimal tuning curves, see Fig. 
2.\n\n$h_p(s) = \left( \sqrt{h_{min}} + \left( \sqrt{h_{max}} - \sqrt{h_{min}} \right) \frac{\int_{-\infty}^{s} \pi(t)^{1/(p+1)} \, dt}{\int_{-\infty}^{\infty} \pi(t)^{1/(p+1)} \, dt} \right)^2$ (26)\n\nThe corresponding optimal allocation of Fisher Information is\n\n$I_p(s) = 4T \left( \sqrt{h_{max}} - \sqrt{h_{min}} \right)^2 \frac{\pi(s)^{2/(p+1)}}{\left( \int \pi(t)^{1/(p+1)} \, dt \right)^2}$ (27)\n\nFollowing from (21) and (27), the optimal expected $L_p$ loss is\n\n$E[|\hat{s} - s|^p] = K(p) \cdot (4T)^{-p/2} \left( \sqrt{h_{max}} - \sqrt{h_{min}} \right)^{-p} \left( \int \pi(t)^{1/(p+1)} \, dt \right)^{p+1}$ (28)\n\nA very interesting observation is that, by taking the limit $p \to 0$, we end up with the infomax tuning curve. This shows that the infomax tuning curve simultaneously optimizes the mutual information and the expected $L_0$ norm of the error $\hat{s} - s$. The $L_0$ norm can be considered as the 0-1 loss, i.e. $L(\hat{s}, s) = 0$ if $\hat{s} = s$ and $L(\hat{s}, s) = 1$ otherwise. To see this from a different angle, we may consider the natural log function as a limit of power functions:\n\n$\log z = \lim_{p \to 0} \frac{2}{p} \left( 1 - z^{-p/2} \right)$ (29)\n\n$\Rightarrow \quad \int \pi(s) \log I(s) \, ds = \lim_{p \to 0} \frac{2}{p} \int \pi(s) \left( 1 - I(s)^{-p/2} \right) ds$ (30)\n\nand we can conclude that minimizing $\int \pi(s) I(s)^{-p/2} \, ds$ in the limit $p \to 0$ (the $L_0$ loss) is the same as maximizing $\int \pi(s) \log I(s) \, ds$, and hence the mutual information.\n\nFigure 2: For a stimulus with standard Gaussian prior distribution (inset figure) and various values of $p$, (A) shows the optimal allocation of Fisher Information $I_p(s)$ and (B) shows the $f_p$-optimal tuning curve $h_p(s)$. 
When $p = 2$ the $f_2$-optimal (discrimax) tuning curve minimizes the squared loss, and when $p = 0$ the $f_0$-optimal (infomax) tuning curve maximizes the mutual information.\n\n6 Simulation Results\n\nNumerical simulations were performed in order to validate our theory. In each iteration, a random stimulus $s$ was chosen from the standard Gaussian distribution or from the exponential distribution with mean one. A Poisson neuron was simulated to generate spikes in response to that stimulus. The difference between the MLE $\hat{s}$ and $s$ was recorded to analyze the $L_p$ loss. In one simple task, we compared the numerical value vs. the theoretical value of the $L_p$ loss for the $f_q$-optimal tuning curve:\n\n$E[|\hat{s} - s|^p] = K(p) \cdot (4T)^{-p/2} \left( \sqrt{h_{max}} - \sqrt{h_{min}} \right)^{-p} \left( \int \pi(t)^{1/(q+1)} \, dt \right)^{p} \left( \int \pi(s)^{1 - \frac{p}{q+1}} \, ds \right)$ (31)\n\nThe above theoretical prediction works well for distributions with compact support $s \in [A, B]$. For any distribution with a tail decaying faster than exponential, $\pi(s) \leq e^{-Cs}$, such as a Gaussian or exponential distribution, it also requires $q > p - 1$; otherwise the integral in the last term will blow up in general.\nThe numerical and theoretical predictions of the $L_p$ loss are plotted for both the Gaussian $N(0, 1)$ prior (Fig. 3A) and the $\mathrm{Exp}(1)$ prior (Fig. 3B). The vertical axis shows $\frac{1}{p} \log E[|\hat{s} - s|^p]$ so that all $p$-norms are displayed in the same units.\n\nFigure 3: Comparison between the numerical results (solid curves) and the theoretical predictions (dashed curves). (A) For the standard Gaussian prior. 
(B) For the exponential prior with parameter 1.\n\n7 Discussion\n\nIn this paper we have derived a closed-form solution for the optimal tuning curve of a single neuron, given an arbitrary stimulus prior $\pi(s)$, for a variety of optimality criteria. Our work offers a principled explanation for the observed non-linearity in neural tuning: each neuron should adapt its tuning curve to reallocate the limited amount of Fisher information it can carry and minimize the $L_p$ error. We have shown in section 2 that each sigmoidally tuned neuron with Poisson spiking noise has an upper bound on the integral of the square root of its Fisher information, and the $f_p$-optimal tuning curve has the form\n\n$h_p(s) = \left( \sqrt{h_{min}} + \left( \sqrt{h_{max}} - \sqrt{h_{min}} \right) \frac{\int_{-\infty}^{s} \pi(t)^{1/(p+1)} \, dt}{\int_{-\infty}^{\infty} \pi(t)^{1/(p+1)} \, dt} \right)^2$ (32)\n\nwhere the $f_p$-optimal tuning curve minimizes the estimation $L_p$ loss $E[|\hat{s} - s|^p]$ of the decoding process in the limit of long encoding time $T$. Two special and well-known cases are maximum mutual information ($p = 0$) and maximum discriminability ($p = 2$).\nTo obtain this result, we established two theorems regarding the asymptotic behavior of the MLE $\hat{s} = h^{-1}(\hat{\lambda})$. Asymptotically, the MLE converges to a Gaussian not only in distribution, but also in terms of its $p$-th moments for arbitrary $p > 0$. By calculating the $p$-th moments of the Gaussian random variable, we can predict the $L_p$ loss of the encoding-decoding process and optimize the tuning curve by minimizing the attainable limit. The Cramer-Rao lower bound and the mutual information lower bound proposed by [7] are special cases with $p = 2$ and $p = 0$, respectively.\nSo far, we have focused on a single neuron with a sigmoidal tuning curve. However, the conclusions of Theorems 1 and 2 still hold for neuronal populations of bell-shaped neurons, with correlated or uncorrelated noise. The optimality condition for the Fisher information can be calculated regardless of the format of the tuning curve(s). Depending on the assumptions about the number of neurons and the shape of the tuning curves, the optimized Fisher information can be inverted to derive the optimal tuning curve via the same type of analysis as presented in this paper.\nOne theoretical limitation of our method is that we only addressed the problem for long encoding times, which is usually not the typical scenario in real sensory systems. The long-encoding-time limit can, however, be replaced by a short encoding time with many identically tuned neurons. It remains an interesting problem to find the optimal tuning curve for an arbitrary prior, in the sense of the $L_p$ loss function, in this regime. Some work [16, 20] has been done to maximize mutual information or minimize the $L_2$ loss for uniformly distributed stimuli. Another problem is that the asymptotic behavior does not hold uniformly if the stimulus space is not compact. The asymptotic behavior takes longer to emerge where the slope of the tuning function is close to zero. In Theorem 2 we made the assumption that $|g'(s)| \leq M$, which is why we cannot evaluate the estimation error for $s$ with large absolute value, and hence why we do not have a perfect match for low $p$ values in the simulation section (see Fig. 3).\n\nA Proof of Theorems in Section 4\n\nProof of Theorem 1.\n(a) Immediately follows from the Poisson distribution; use induction on $n$.\n(b) Apply the Central Limit Theorem. 
Notice that $E[X_i] = \mathrm{Var}[X_i] = \lambda$ for Poisson random variables.\n(c) In general, convergence in distribution does not imply convergence of the $p$-th moment. However, in our case we do have the convergence property for all $p$-th moments. To show this, we need to show that for all $p > 0$, $|Y_n|^p$ is uniformly integrable, i.e. for any $\epsilon > 0$ there exists a $K$ such that\n\n$E[|Y_n|^p \cdot 1_{\{|Y_n| \geq K\}}] \leq \epsilon$ (33)\n\nThis follows from the Cauchy-Schwarz inequality and the Markov inequality:\n\n$\left( E[|Y_n|^p \cdot 1_{\{|Y_n| \geq K\}}] \right)^2 \leq E[|Y_n|^{2p}] \cdot P[|Y_n| \geq K] \leq E[|Y_n|^{2p}] \, \frac{E[|Y_n|]}{K} \to 0$ (34)\n\nTo see the last limit, we use the fact that for all $p > 0$, $\sup_n E[|Y_n|^p] < \infty$. According to [21], for even $p$,\n\n$E[|S_n - n\lambda|^p] = \sum_{a=0}^{p} (n\lambda)^a S_2(p, a),$ (35)\n\nwhere $S_2(p, a)$ denotes the number of partitions of a set of size $p$ into $a$ subsets with no singletons (i.e. no subsets with only one element). For our purpose, notice that $S_2(p, a) = 0$ for $a > p/2$ and $S_2(p, a) \leq p^a$. Therefore the supremum of $E[|Y_n|^p]$ is bounded, since\n\n$E[|Y_n|^p] = E[|\sqrt{n}(S_n/n - \lambda)|^p] \leq n^{-p/2} \sum_{a=0}^{p/2} (n\lambda)^a p^a \leq \frac{n(\lambda p)^{p/2+1}}{n\lambda p - 1} \leq C (\lambda p)^{p/2}$ (36)\n\nFor arbitrary $q$, choose any even number $p$ such that $p > q$; by Jensen's inequality, $E[|Y_n|^q] \leq E[|Y_n|^p]^{q/p}$. Thus $\sup_n E[|Y_n|^q] < \infty$ for all $q > 0$.\n\nProof of Theorem 2.\n(a) Denote $\hat{\lambda}_n = S_n/n$. Apply the mean value theorem to $g(x)$ near $\lambda$:\n\n$g(\hat{\lambda}_n) - g(\lambda) = g'(\lambda^*)(\hat{\lambda}_n - \lambda)$ (37)\n\nfor some $\lambda^*$ between $\hat{\lambda}_n$ and $\lambda$. 
Therefore\n\n$\sqrt{n} \left( g(\hat{\lambda}_n) - g(\lambda) \right) = g'(\lambda^*) \sqrt{n} (\hat{\lambda}_n - \lambda) \stackrel{D}{\to} g'(\lambda) Z$ (38)\n\nNote that $\hat{\lambda}_n \to \lambda$ in probability, hence $\lambda^* \to \lambda$ in probability and $g'(\lambda^*) \to g'(\lambda)$ in probability; together with the fact that $\sqrt{n}(\hat{\lambda}_n - \lambda) \stackrel{D}{\to} Z$, apply Slutsky's theorem and the conclusion follows.\n(b) Using Taylor's expansion and Slutsky's theorem again,\n\n$\left| \sqrt{n} \left( g(\hat{\lambda}_n) - g(\lambda) \right) \right|^p = \left| g'(\lambda^*) \sqrt{n} (\hat{\lambda}_n - \lambda) \right|^p = |g'(\lambda^*)|^p \, |Y_n|^p \to |g'(\lambda)|^p \, |Y_n|^p$ (39)\n\nTo see that $|Y'_n|^p$ is uniformly integrable, notice that $|Y'_n|^p \geq K \Rightarrow |Y_n|^p \geq K \cdot M^{-p}$. The rest follows in a similar manner as in the proof of Theorem 1(c).\n\nReferences\n[1] TM Maddess and SB Laughlin. Adaptation of the motion-sensitive neuron h1 is generated locally and governed by contrast frequency. Proc. R. Soc. Lond. B Biol. Sci., 225:251-275, 1985.\n[2] J Atick. Could information theory provide an ecological theory of sensory processing? Network, 3:213-251, 1992.\n[3] RA Harris, DC O'Carroll, and SB Laughlin. Contrast gain reduction in fly motion adaptation. Neuron, 28:595-606, 2000.\n[4] I Dean, NS Harper, and D McAlpine. Neural population coding of sound level adapts to stimulus statistics. Nature Neuroscience, 8:1684-1689, 2005.\n[5] AA Stocker and EP Simoncelli. Noise characteristics and prior expectations in human visual speed perception. Nature Neuroscience, 9:578-585, 2006.\n[6] J-P Nadal and N Parga. 
Non linear neurons in the low noise limit: a factorial code maximizes information transfer, 1994.\n[7] N Brunel and J-P Nadal. Mutual information, Fisher information and population coding. Neural Computation, 10(7):1731-1757, 1998.\n[8] T von der Twer and DIA MacLeod. Optimal nonlinear codes for the perception of natural colours. Network: Computation in Neural Systems, 12(3):395-407, 2001.\n[9] MD McDonnell and NG Stocks. Maximally informative stimuli and tuning curves for sigmoidal rate-coding neurons and populations. Phys. Rev. Lett., 101:058103, 2008.\n[10] D Ganguli and EP Simoncelli. Implicit encoding of prior probabilities in optimal neural populations. Adv. Neural Information Processing Systems, 23:658-666, 2010.\n[11] HB Barlow. Possible principles underlying the transformation of sensory messages. M.I.T. Press, 1961.\n[12] HS Seung and H Sompolinsky. Simple models for reading neuronal population codes. Proc. of the National Aca. of Sci. of the U.S.A., 90:10749-10753, 1993.\n[13] K Zhang and TJ Sejnowski. Neuronal tuning: to sharpen or broaden? Neural Computation, 11:75-84, 1999.\n[14] A Pouget, S Deneve, J-C Ducom, and PE Latham. Narrow versus wide tuning curves: what's best for a population code? Neural Computation, 11:85-90, 1999.\n[15] M Bethge, D Rotermund, and K Pawelzik. Optimal short-term population coding: when Fisher information fails. Neural Computation, 14:2317-2351, 2002.\n[16] M Bethge, D Rotermund, and K Pawelzik. Optimal neural rate coding leads to bimodal firing rate distributions. Netw. Comput. Neural Syst., 14:303-319, 2003.\n[17] S Yarrow, E Challis, and P Seriès. Fisher and Shannon information in finite neural populations. Neural Computation, in press, 2012.\n[18] TM Cover and J Thomas. Elements of Information Theory. Wiley, 1991.\n[19] SI Amari, H Nagaoka, and D Harada. Methods of Information Geometry. 
Translations of\n\nMathematical Monographs. American Mathematical Society, 2007.\n\n[20] AP Nikitin, NG Stocks, RP Morse, and MD McDonnell. Neural population coding is optimized\n\nby discrete tuning curves. Phys. Rev. Lett., 103:138101, 2009.\n\n[21] N Privault. Generalized Bell polynomials and the combinatorics of Poisson central moments.\n\nElectronic Journal of Combinatorics, 18, 2011.\n\n9\n\n\f", "award": [], "sourceid": 1077, "authors": [{"given_name": "Zhuo", "family_name": "Wang", "institution": null}, {"given_name": "Alan", "family_name": "Stocker", "institution": null}, {"given_name": "Daniel", "family_name": "Lee", "institution": null}]}