{"title": "Better than least squares: comparison of objective functions for estimating linear-nonlinear models", "book": "Advances in Neural Information Processing Systems", "page_first": 1305, "page_last": 1312, "abstract": null, "full_text": "Comparison of objective functions for estimating\n\nlinear-nonlinear models\n\nTatyana O. Sharpee\n\nComputational Neurobiology Laboratory,\n\nthe Salk Institute for Biological Studies, La Jolla, CA 92037\n\nsharpee@salk.edu\n\nAbstract\n\nThis paper compares a family of methods for characterizing neural feature selec-\ntivity with natural stimuli in the framework of the linear-nonlinear model. In this\nmodel, the neural \ufb01ring rate is a nonlinear function of a small number of relevant\nstimulus components. The relevant stimulus dimensions can be found by max-\nimizing one of the family of objective functions, R\u00b4enyi divergences of different\norders [1, 2]. We show that maximizing one of them, R\u00b4enyi divergence of or-\nder 2, is equivalent to least-square \ufb01tting of the linear-nonlinear model to neural\ndata. Next, we derive reconstruction errors in relevant dimensions found by max-\nimizing R\u00b4enyi divergences of arbitrary order in the asymptotic limit of large spike\nnumbers. We \ufb01nd that the smallest errors are obtained with R\u00b4enyi divergence of\norder 1, also known as Kullback-Leibler divergence. This corresponds to \ufb01nding\nrelevant dimensions by maximizing mutual information [2]. We numerically test\nhow these optimization schemes perform in the regime of low signal-to-noise ra-\ntio (small number of spikes and increasing neural noise) for model visual neurons.\nWe \ufb01nd that optimization schemes based on either least square \ufb01tting or informa-\ntion maximization perform well even when number of spikes is small. Information\nmaximization provides slightly, but signi\ufb01cantly, better reconstructions than least\nsquare \ufb01tting. 
This makes the problem of finding relevant dimensions, together with the problem of lossy compression [3], one of the examples where information-theoretic measures are no more data limited than those derived from least squares.\n\n1 Introduction\n\nThe application of system identification techniques to the study of sensory neural systems has a long history. One family of approaches employs the dimensionality reduction idea: while inputs are typically very high-dimensional, not all dimensions are equally important for eliciting a neural response [4, 5, 6, 7, 8]. The aim is then to find a small set of dimensions {ê1, ê2, . . .} in the stimulus space that are relevant for the neural response, without imposing, however, a particular functional dependence between the neural response and the stimulus components {s1, s2, . . .} along the relevant dimensions:\n\nP(spike|s) = P(spike) g(s1, s2, . . . , sK).   (1)\n\nIf the inputs are Gaussian, this last requirement is not important, because the relevant dimensions can be found without knowing a correct functional form for the nonlinear function g in Eq. (1). However, for non-Gaussian inputs a wrong assumption for the form of the nonlinearity g will lead to systematic errors in the estimate of the relevant dimensions themselves [9, 5, 1, 2]. The larger the deviations of the stimulus distribution from a Gaussian, the larger the effect of errors in the presumed form of the nonlinearity g on estimating the relevant dimensions. 
Because inputs derived from a natural environment, either visual or auditory, have been shown to be strongly non-Gaussian [10], we will concentrate here on system identification methods suitable for either Gaussian or non-Gaussian stimuli.\n\nTo find the relevant dimensions for neural responses probed with non-Gaussian inputs, Hunter and Korenberg proposed an iterative scheme [5] where the relevant dimensions are first found by assuming that the input–output function g is linear. Its functional form is then updated given the current estimate of the relevant dimensions. The inverse of g is then used to improve the estimate of the relevant dimensions. This procedure can be improved so as not to rely on inverting the nonlinear function g, by formulating the optimization problem exclusively with respect to the relevant dimensions [1, 2], where the nonlinear function g is taken into account in the objective function to be optimized. A family of objective functions suitable for finding relevant dimensions with natural stimuli has been proposed based on Rényi divergences [1] between the probability distributions of stimulus components along the candidate relevant dimensions, computed with respect to all inputs and to those associated with spikes. Here we show that the optimization problem based on the Rényi divergence of order 2 corresponds to least-square fitting of the linear-nonlinear model to neural spike trains. The Kullback-Leibler divergence also belongs to this family and is the Rényi divergence of order 1. It quantifies the amount of mutual information between the neural response and the stimulus components along the relevant dimension [2]. The optimization scheme based on information maximization has been previously proposed and implemented on model [2] and real cells [11]. 
Here we derive asymptotic errors for optimization strategies based on Rényi divergences of arbitrary order, and show that relevant dimensions found by maximizing the Kullback-Leibler divergence have the smallest errors in the limit of large spike numbers compared to maximizing other Rényi divergences, including the one which implements least squares. We then show in numerical simulations on model cells that this trend persists even for very low spike numbers.\n\n2 Variance as an Objective Function\n\nOne way of selecting a low-dimensional model of neural response is to minimize the χ²-difference between spike probabilities measured and predicted by the model, averaged across all inputs s:\n\nχ²[v] = ∫ ds P(s) [ P(spike|s)/P(spike) − P(spike|s·v)/P(spike) ]²,   (2)\n\nwhere v is the relevant dimension for a given model described by Eq. (1) [multiple dimensions could also be used, see below]. Using Bayes' rule and rearranging terms, we get:\n\nχ²[v] = ∫ ds P(s) [ P(s|spike)/P(s) − P(s·v|spike)/P(s·v) ]² = ∫ ds [P(s|spike)]²/P(s) − ∫ dx [Pv(x|spike)]²/Pv(x).   (3)\n\nIn the last integral, averaging has been carried out with respect to all stimulus components except those along the trial direction v, so that the integration variable is x = s·v. The probability distributions Pv(x) and Pv(x|spike) represent the result of this averaging across all presented stimuli and across those that led to a spike, respectively:\n\nPv(x) = ∫ ds P(s) δ(x − s·v),   Pv(x|spike) = ∫ ds P(s|spike) δ(x − s·v),   (4)\n\nwhere δ(x) is a delta-function. In practice, both of the averages (4) are calculated by binning the range of projection values x and computing histograms normalized to unity. 
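The binned estimates of Eq. (4) are straightforward to implement; the following is an illustrative sketch (function and variable names, and the default bin count, are choices of this sketch, not the paper's), including the spike-count weighting used when a stimulus can elicit multiple spikes:

```python
import numpy as np

def projection_histograms(stimuli, spike_counts, v, n_bins=32):
    """Binned estimates of P_v(x) and P_v(x|spike) from Eq. (4).

    stimuli: (N, D) array of stimulus frames; spike_counts: (N,) number of
    spikes elicited by each frame; v: candidate dimension, shape (D,).
    """
    x = stimuli @ v  # projections x = s . v
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    # all presented stimuli
    p_all, _ = np.histogram(x, bins=edges)
    # stimuli that led to spikes, weighted by the number of spikes elicited
    p_spike, _ = np.histogram(x, bins=edges, weights=spike_counts)
    # normalize both histograms to unity
    return p_all / p_all.sum(), p_spike / p_spike.sum(), edges
```

Both returned histograms play the role of Pv(x) and Pv(x|spike) in the objective functions below.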
Note that if multiple spikes are sometimes elicited, the probability distribution Pv(x|spike) can be constructed by weighting the contribution from each stimulus according to the number of spikes it elicited.\n\nIf neural spikes are indeed based on one relevant dimension, then this dimension will explain all of the variance, leading to χ² = 0. For all other dimensions v, χ²[v] > 0. Based on Eq. (3), in order to minimize χ² we need to maximize\n\nF[v] = ∫ dx Pv(x) [ Pv(x|spike)/Pv(x) ]²,   (5)\n\nwhich is a Rényi divergence of order 2 between the probability distributions Pv(x|spike) and Pv(x). Rényi divergences are part of the family of f-divergence measures, which are based on a convex function of the ratio of the two probability distributions (instead of the power α in a Rényi divergence of order α) [12, 13, 1]. For an optimization strategy based on Rényi divergences of order α, the relevant dimensions are found by maximizing:\n\nF(α)[v] = 1/(α − 1) ∫ dx Pv(x) [ Pv(x|spike)/Pv(x) ]^α.   (6)\n\nBy comparison, when the relevant dimension(s) are found by maximizing information [2], the goal is to maximize the Kullback-Leibler divergence, which can be obtained by taking a formal limit α → 1:\n\nI[v] = ∫ dx Pv(x) [Pv(x|spike)/Pv(x)] ln[Pv(x|spike)/Pv(x)] = ∫ dx Pv(x|spike) ln[Pv(x|spike)/Pv(x)].   (7)\n\nReturning to the variance optimization, the maximal value of F[v] that can be achieved by any dimension v is:\n\nFmax = ∫ ds [P(s|spike)]²/P(s).   (8)\n\nIt corresponds to the variance in the firing rate averaged across different inputs (see Eq. (9) below). Computation of the mutual information carried by an individual spike about the stimulus relies on similar integrals. Following the procedure outlined for computing mutual information [14], one can use Bayes' rule and the ergodic assumption to compute Fmax as a time-average:\n\nFmax = (1/T) ∫ dt [ r(t)/r̄ ]²,   (9)\n\nwhere the firing rate r(t) = P(spike|s)/Δt is measured in time bins of width Δt using multiple repetitions of the same stimulus sequence. The stimulus ensemble should be diverse enough to justify the ergodic assumption [this could be checked by computing Fmax for increasing fractions of the overall dataset size]. The average firing rate r̄ = P(spike)/Δt is obtained by averaging r(t) in time.\n\nThe fact that F[v] < Fmax can be seen either by simply noting that χ²[v] ≥ 0, or from the data processing inequality, which applies not only to the Kullback-Leibler divergence but also to Rényi divergences [12, 13, 1]. In other words, the variance in the firing rate explained by a given dimension, F[v], cannot be greater than the overall variance in the firing rate, Fmax. This is because we have averaged over all of the variations in the firing rate that correspond to inputs with the same projection value on the dimension v and differ only in projections onto other dimensions.\n\nOptimization schemes based on Rényi divergences of different orders have a very similar structure. In particular, the gradient can be evaluated in a similar way:\n\n∇vF(α) = α/(α − 1) ∫ dx Pv(x|spike) [⟨s|x, spike⟩ − ⟨s|x⟩] d/dx [ (Pv(x|spike)/Pv(x))^(α−1) ],   (10)\n\nwhere ⟨s|x, spike⟩ = ∫ ds s δ(x − s·v) P(s|spike)/Pv(x|spike), and similarly for ⟨s|x⟩. 
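Given binned histograms, the objective functions of Eqs. (5)-(7) reduce to sums over bins. A minimal sketch, assuming p_all and p_spike are histograms normalized to unity (names are illustrative, not the paper's code):

```python
import numpy as np

def renyi_objective(p_all, p_spike, alpha):
    """Evaluate F^(alpha)[v] of Eq. (6) from binned histograms.

    alpha = 2 gives the variance objective F[v] of Eq. (5) up to the
    1/(alpha - 1) prefactor; alpha -> 1 is handled via the Kullback-Leibler
    limit of Eq. (7), i.e. the mutual-information objective.
    """
    # empty bins contribute nothing to either integral
    mask = (p_all > 0) & (p_spike > 0)
    ratio = p_spike[mask] / p_all[mask]
    if np.isclose(alpha, 1.0):
        # Kullback-Leibler limit, Eq. (7)
        return np.sum(p_spike[mask] * np.log(ratio))
    return np.sum(p_all[mask] * ratio**alpha) / (alpha - 1.0)
```

When the spike-conditional and unconditional histograms coincide (a dimension carrying no information), the KL objective is zero and the α = 2 objective takes its minimal value of 1.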
The gradient is thus given by a weighted sum of spike-triggered averages ⟨s|x, spike⟩ − ⟨s|x⟩ conditional upon projection values of stimuli onto the dimension v for which the gradient is being evaluated. The similarity in structure of both the objective functions and their gradients for different Rényi divergences means that the same numeric algorithms can be used for optimization of Rényi divergences of different orders. Examples of possible algorithms have been described [1, 2, 11] and include a combination of gradient ascent and simulated annealing.\n\nA few facts are common to this family of optimization schemes. First, as was proved in the case of information maximization based on the Kullback-Leibler divergence [2], the merit function F(α)[v] does not change with the length of the vector v. Therefore v·∇vF = 0, as can also be seen directly from Eq. (10), because v·⟨s|x, spike⟩ = x and v·⟨s|x⟩ = x. Second, the gradient is zero when evaluated along the true receptive field. This is because for the true relevant dimension, according to which spikes were generated, ⟨s|s1, spike⟩ = ⟨s|s1⟩, a consequence of the fact that the relevant projections completely determine the spike probability. Third, merit functions, including variance and information, can be computed with respect to multiple dimensions by keeping track of stimulus projections on all the relevant dimensions when forming the probability distributions (4). 
For example, in the case of two dimensions v1 and v2, we would use\n\nPv1,v2(x1, x2) = ∫ ds δ(x1 − s·v1) δ(x2 − s·v2) P(s),   Pv1,v2(x1, x2|spike) = ∫ ds δ(x1 − s·v1) δ(x2 − s·v2) P(s|spike),   (11)\n\nto compute the variance with respect to the two dimensions as F[v1, v2] = ∫ dx1 dx2 [Pv1,v2(x1, x2|spike)]²/Pv1,v2(x1, x2).\n\nIf multiple stimulus dimensions are relevant for eliciting the neural response, they can always be found (provided a sufficient number of responses have been recorded) by optimizing the variance according to Eq. (11) with the correct number of dimensions. In practice this involves finding a single relevant dimension first, and then iteratively increasing the number of relevant dimensions considered while adjusting the previously found relevant dimensions. The amount by which relevant dimensions need to be adjusted is proportional to the contribution of subsequent relevant dimensions to neural spiking (the corresponding expression has the same functional form as that for relevant dimensions found by maximizing information, cf. Appendix B of [2]). If stimuli are either uncorrelated, or correlated but Gaussian, then the previously found dimensions do not need to be adjusted when additional dimensions are introduced. All of the relevant dimensions can be found one by one, by always searching only for a single relevant dimension in the subspace orthogonal to the relevant dimensions already found.\n\n3 Illustration for a model simple cell\n\nHere we illustrate how relevant dimensions can be found by maximizing variance (equivalent to least-square fitting), and compare this scheme with that of finding relevant dimensions by maximizing information, as well as with those that are based upon computing the spike-triggered average. 
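The two-dimensional distributions of Eq. (11) above can be estimated the same way as their one-dimensional counterparts, by binning pairs of projections. A minimal sketch of the corresponding variance objective (names and bin counts are illustrative, not the paper's code):

```python
import numpy as np

def variance_objective_2d(stimuli, spike_counts, v1, v2, n_bins=16):
    """Variance objective for two trial dimensions, Eq. (11):
    F[v1, v2] = sum over bins of P(x1, x2|spike)^2 / P(x1, x2)."""
    x1, x2 = stimuli @ v1, stimuli @ v2
    bins = [np.linspace(x.min(), x.max(), n_bins + 1) for x in (x1, x2)]
    # joint histograms over projections, with and without spike weighting
    p_all, _, _ = np.histogram2d(x1, x2, bins=bins)
    p_spk, _, _ = np.histogram2d(x1, x2, bins=bins, weights=spike_counts)
    p_all, p_spk = p_all / p_all.sum(), p_spk / p_spk.sum()
    mask = p_all > 0
    return np.sum(p_spk[mask] ** 2 / p_all[mask])
```

By the Cauchy-Schwarz inequality this sum is at least 1, with larger values indicating that the pair (v1, v2) explains more of the firing-rate variance.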
Our goal is to reconstruct relevant dimensions of neurons probed with inputs of arbitrary statistics. We used stimuli derived from a natural visual environment [11] that are known to strongly deviate from a Gaussian distribution. All of the studies have been carried out with respect to model neurons. The advantage of doing so is that the relevant dimensions are known. The example model neuron is taken to mimic properties of simple cells found in the primary visual cortex. It has a single relevant dimension, which we will denote as ê1. As can be seen in Fig. 1(a), it is phase and orientation sensitive. In this model, a given stimulus s leads to a spike if the projection s1 = s·ê1 reaches a threshold value θ in the presence of noise: P(spike|s)/P(spike) ≡ g(s1) = ⟨H(s1 − θ + ξ)⟩, where a Gaussian random variable ξ with variance σ² models additive noise, and the function H(x) = 1 for x > 0, and zero otherwise. The threshold θ and the noise variance σ² determine the input–output function. In what follows we will measure these parameters in units of the standard deviation of stimulus projections along the relevant dimension. In these units, the signal-to-noise ratio is given by σ.\n\nFigure 1 shows that it is possible to obtain a good estimate of the relevant dimension ê1 by maximizing either information, as shown in panel (b), or variance, as shown in panel (c). The final value of the projection depends on the size of the dataset, as will be discussed below. In the example shown in Fig. 1 there were ≈ 50,000 spikes with an average spike probability of ≈ 0.05 per frame, and the reconstructed vector has a projection v̂max·ê1 = 0.98 when maximizing either information or variance. Having estimated the relevant dimension, one can proceed to sample the nonlinear input–output function. 
This is done by constructing histograms for P(s·v̂max) and P(s·v̂max|spike) of projections onto the vector v̂max found by maximizing either information or variance, and taking their ratio. By Bayes' rule, this yields the nonlinear input–output function g of Eq. (1). In Fig. 1(d) the spike probability of the reconstructed neuron, P(spike|s·v̂max) (crosses), is compared with the probability P(spike|s1) used in the model (solid line). A good match is obtained.\n\nIn actuality, reconstructing even just one relevant dimension from neural responses to correlated non-Gaussian inputs, such as those derived from the real world, is not an easy problem. This fact can be appreciated by considering the estimates of the relevant dimension obtained from the spike-triggered average (STA) shown in panel (e). Correcting the STA for second-order correlations of the input ensemble, through multiplication by the inverse covariance matrix, results in a very noisy estimate,\n\nFigure 1: Analysis of a model visual neuron with one relevant dimension shown in (a). Panels (b) and (c) show normalized vectors v̂max found by maximizing information and variance, respectively; (d) the probability of a spike P(spike|s·v̂max) (blue crosses – information maximization, red crosses – variance maximization) is compared to P(spike|s1) used in generating spikes (solid line). Parameters of the model are σ = 0.5 and θ = 2, both given in units of the standard deviation of s1, which is also the unit for the x-axis in panels (d) and (h). The spike-triggered average (STA) is shown in (e). An attempt to remove correlations according to the reverse correlation method, C⁻¹_{a priori} v_STA (decorrelated STA), is shown in panel (f) and, with regularization (see text), in panel (g). 
In panel (h), the spike probabilities as a function of stimulus projections onto the dimensions obtained as the decorrelated STA (blue crosses) and the regularized decorrelated STA (red crosses) are compared to the spike probability used to generate spikes (solid line).\n\nshown in panel (f). It has a projection value of 0.25. An attempt to regularize the inverse of the covariance matrix results in a closer match to the true relevant dimension [15, 16, 17, 18, 19] and has a projection value of 0.8, as shown in panel (g). While it appears to be less noisy, the regularized decorrelated STA can have systematic deviations from the true relevant dimensions [9, 20, 2, 11]. The preferred orientation is less susceptible to distortions than the preferred spatial frequency [19]. In this case regularization was performed by setting aside 1/4 of the data as a test dataset, and choosing a cutoff on the eigenvalues of the input covariance matrix that gave the maximal information value on the test dataset [16, 19].\n\n4 Comparison of Performance with Finite Data\n\nIn the limit of infinite data the relevant dimensions can be found by maximizing variance, information, or other objective functions [1]. In a real experiment, with a dataset of finite size, the optimal vector v̂ found by maximizing any of the Rényi divergences will deviate from the true relevant dimension ê1. In this section we compare the robustness of optimization strategies based on Rényi divergences of various orders, including least-squares fitting (α = 2) and information maximization (α = 1), as the dataset size decreases and/or neural noise increases.\n\nThe deviation from the true relevant dimension, δv = v̂ − ê1, arises because the probability distributions (4) are estimated from experimental histograms and differ from the distributions found in the limit of infinite data size. 
The effects of noise on the reconstruction can be characterized by taking the dot product between the relevant dimension and the optimal vector for a particular data sample: v̂·ê1 = 1 − (1/2)δv², where both v̂ and ê1 are normalized, and δv is by definition orthogonal to ê1. Assuming that the deviation δv is small, we can use a quadratic approximation to expand the objective function (obtained with finite data) near its maximum. This leads to the expression δv = −[H(α)]⁻¹∇F(α), which relates the deviation δv to the gradient and Hessian of the objective function evaluated at the vector ê1. The superscript (α) denotes the order of the Rényi divergence used as the objective function. Similarly to the case of optimizing information [2], the Hessian of the Rényi divergence of arbitrary order, evaluated along the optimal dimension ê1, is given by\n\nH(α)_ij = −α ∫ dx P(x|spike) Cij(x) [ P(x|spike)/P(x) ]^(α−3) [ d/dx ( P(x|spike)/P(x) ) ]²,   (12)\n\nwhere Cij(x) = ⟨si sj|x⟩ − ⟨si|x⟩⟨sj|x⟩ are covariance matrices of inputs sorted by their projection x along the optimal dimension.\n\nWhen averaged over possible outcomes of N trials, the gradient is zero for the optimal direction. 
In other words, there is no specific direction towards which the deviations δv are biased. Next, in order to measure the expected spread of optimal dimensions around the true one, ê1, we need to evaluate ⟨δv²⟩ = Tr[ ⟨∇F(α)∇F(α)T⟩ [H(α)]⁻² ], and therefore need to know the variance of the gradient of F averaged across different equivalent datasets. Assuming that the probability of generating a spike is independent for different bins, we find that ⟨∇F(α)_i ∇F(α)_j⟩ = B(α)_ij/Nspike, where\n\nB(α)_ij = α² ∫ dx P(x|spike) Cij(x) [ P(x|spike)/P(x) ]^(2α−4) [ d/dx ( P(x|spike)/P(x) ) ]².   (13)\n\nTherefore the expected error in the reconstruction of the optimal filter is inversely proportional to the number of spikes:\n\nv̂·ê1 ≈ 1 − (1/2)⟨δv²⟩ = 1 − Tr′[BH⁻²]/(2Nspike),   (14)\n\nwhere we omitted the superscripts (α) for clarity. Tr′ denotes the trace taken in the subspace orthogonal to the relevant dimension (deviations along the relevant dimension have no meaning [2], which mathematically manifests itself in the dimension ê1 being an eigenvector of the matrices H and B with zero eigenvalue). Note that when α = 1, which corresponds to the Kullback-Leibler divergence and information maximization, A ≡ H(α=1) = B(α=1). The asymptotic errors in this case are completely determined by the trace of the Hessian of information, ⟨δv²⟩ ∝ Tr′[A⁻¹], reproducing the previously published result for maximally informative dimensions [2]. Qualitatively, the expected error ∼ D/(2Nspike) increases in proportion to the dimensionality D of the inputs and decreases as more spikes are collected. This dependence is in common with the expected errors of relevant dimensions found by maximizing information [2], as well as with methods based on computing the spike-triggered average, both for white noise [1, 21, 22] and for correlated Gaussian inputs [2].\n\nNext we examine which of the Rényi divergences provides the smallest asymptotic error (14) for estimating relevant dimensions. Representing the covariance matrix as Cij(x) = Σk γik(x)γjk(x) (the exact expression for the matrices γ will not be needed), we can express the Hessian matrix H and the covariance matrix of the gradient, B, as averages with respect to the probability distribution P(x|spike):\n\nB = ∫ dx P(x|spike) b(x)bT(x),   H = ∫ dx P(x|spike) a(x)bT(x),   (15)\n\nwhere the gain function g(x) = P(x|spike)/P(x), and the matrices bij(x) = α γij(x) g′(x) [g(x)]^(α−2) and aij(x) = γij(x) g′(x)/g(x). The Cauchy-Schwarz identity for scalar quantities states that ⟨b²⟩/⟨ab⟩² ≥ 1/⟨a²⟩, where the average is taken with respect to some probability distribution. A similar result can also be proven for matrices under a Tr operation as in Eq. (14). Applying the matrix version of the Cauchy-Schwarz identity to Eq. (14), we find that the smallest error is obtained when\n\nTr′[BH⁻²] = Tr′[A⁻¹],   with A = ∫ dx P(x|spike) a(x)aT(x).   (16)\n\nThe matrix A corresponds to the Hessian of the merit function for α = 1: A = H(α=1). Thus, among the various optimization strategies based on Rényi divergences, the one based on the Kullback-Leibler divergence (α = 1) has the smallest asymptotic errors. 
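The Cauchy-Schwarz step can be spelled out; the following derivation sketch, in the notation of Eqs. (15)-(16) with averages ⟨·⟩ taken over P(x|spike), shows why equality requires α = 1 (up to the degenerate case of a constant gain g):

```latex
% Scalar Cauchy-Schwarz: \langle ab\rangle^2 \le \langle a^2\rangle \langle b^2\rangle
% implies \langle b^2\rangle / \langle ab\rangle^2 \ge 1/\langle a^2\rangle.
% The matrix analogue under Tr' gives
\begin{align*}
  \mathrm{Tr}'\!\left[ B H^{-2} \right]
    = \mathrm{Tr}'\!\left[ \langle b\,b^{T}\rangle \, \langle a\,b^{T}\rangle^{-2} \right]
    \;\ge\; \mathrm{Tr}'\!\left[ \langle a\,a^{T}\rangle^{-1} \right]
    = \mathrm{Tr}'\!\left[ A^{-1} \right],
\end{align*}
% with equality iff b(x) \propto a(x), i.e.
\begin{align*}
  \alpha\,\gamma_{ij}(x)\,g'(x)\,[g(x)]^{\alpha-2}
    \;\propto\; \gamma_{ij}(x)\,\frac{g'(x)}{g(x)}
  \;\Longleftrightarrow\; [g(x)]^{\alpha-1} = \mathrm{const}
  \;\Longleftrightarrow\; \alpha = 1.
\end{align*}
```
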
The least-square fitting corresponds to optimization based on the Rényi divergence with α = 2, and is therefore expected to have larger errors than optimization based on the Kullback-Leibler divergence (α = 1), which implements information maximization. This result agrees with recent findings that the Kullback-Leibler divergence is the best distortion measure for performing lossy compression [3].\n\nBelow we use numerical simulations with model cells to compare the performance of the information (α = 1) and variance (α = 2) maximization strategies in the regime of relatively small numbers of spikes. We are interested in the range 0.1 ≲ D/Nspike ≲ 1, where the asymptotic results do not necessarily apply. The results of simulations are shown in Fig. 2 as a function of D/Nspike, as well as with varying neural noise levels. To estimate the sharper (less noisy) input–output functions with σ = 1.5, 1.0, 0.5, 0.25, we used larger numbers of bins (16, 21, 32, 64, respectively). Identical numerical algorithms, including the number of bins, were used for maximizing variance and information. The relevant dimension for each simulated spike train was obtained as an average of 4 jackknife estimates computed by setting aside 1/4 of the data as a test set. Results are shown after 1000 line optimizations (D = 900), and performance on the test set was checked after every line optimization. As can be seen, generally good reconstructions with projection values ≳ 0.7 can be obtained by maximizing either information or variance, even in the severely undersampled regime Nspike < D. We find that reconstruction errors are comparable for both the information and variance maximization strategies, and are better than or equal to (at very low spike numbers) those of STA-based methods. 
Information maximization achieves significantly smaller errors than least-square fitting when we analyze the results of all simulations for the four different model cells and spike numbers (p < 10⁻⁴, paired t-test).\n\nFigure 2: Projection of the vector v̂max obtained by maximizing information (red filled symbols) or variance (blue open symbols) on the true relevant dimension ê1, plotted as a function of the ratio between the stimulus dimensionality D and the number of spikes Nspike, with D = 900. Simulations were carried out for model visual neurons with one relevant dimension from Fig. 1(a) and the input–output function of Eq. (1), described by threshold θ = 2.0 and noise standard deviation σ = 1.5, 1.0, 0.5, 0.25 for groups labeled A (△), B (▽), C (○), and D (□), respectively. The left panel also shows results obtained using the spike-triggered average (STA, gray) and the decorrelated STA (dSTA, black). In the right panel, we replot results for information and variance optimization together with those for the regularized decorrelated STA (RdSTA, green open symbols). All error bars show standard deviations.\n\n5 Conclusions\n\nIn this paper we compared the accuracy of a family of optimization strategies for analyzing neural responses to natural stimuli based on Rényi divergences. Finding relevant dimensions by maximizing one of the merit functions, the Rényi divergence of order 2, corresponds to fitting the linear-nonlinear model in the least-square sense to neural spike trains. The advantage of this approach over the standard least-squares fitting procedure is that it does not require the nonlinear gain function to be invertible. We derived the errors expected for relevant dimensions computed by maximizing Rényi divergences of arbitrary order in the asymptotic regime of large spike numbers. 
The smallest errors were achieved not in the case of (nonlinear) least-square fitting of the linear-nonlinear model to the neural spike trains (Rényi divergence of order 2), but with information maximization (based on the Kullback-Leibler divergence). Numeric simulations of the performance of both the information and variance maximization strategies showed that both algorithms performed well even when the number of spikes is very small. With small numbers of spikes, reconstructions based on information maximization also had slightly, but significantly, smaller errors than those of least-square fitting. This makes the problem of finding relevant dimensions, together with the problem of lossy compression [23, 3], one of the examples where information-theoretic measures are no more data limited than those derived from least squares. It remains possible, however, that other merit functions based on non-polynomial divergence measures could provide even smaller reconstruction errors than information maximization.\n\nReferences\n\n[1] L. Paninski. Convergence properties of three spike-triggered average techniques. Network: Comput. Neural Syst., 14:437–464, 2003.\n\n[2] T. Sharpee, N.C. Rust, and W. Bialek. Analyzing neural responses to natural signals: Maximally informative dimensions. Neural Computation, 16:223–250, 2004. See also physics/0212110, and a preliminary account in Advances in Neural Information Processing 15, edited by S. Becker, S. Thrun, and K. Obermayer, pp. 261–268 (MIT Press, Cambridge, 2003).\n\n[3] Peter Harremoës and Naftali Tishby. The information bottleneck revisited or how to choose a good distortion measure. Proc. of the IEEE Int. 
Symp. on Information Theory (ISIT), 2007.\n\n[4] E. de Boer and P. Kuyper. Triggered correlation. IEEE Trans. Biomed. Eng., 15:169–179, 1968.\n\n[5] I. W. Hunter and M. J. Korenberg. The identification of nonlinear biological systems: Wiener and Hammerstein cascade models. Biol. Cybern., 55:135–144, 1986.\n\n[6] R. R. de Ruyter van Steveninck and W. Bialek. Real-time performance of a movement-sensitive neuron in the blowfly visual system: coding and information transfer in short spike sequences. Proc. R. Soc. Lond. B, 265:259–265, 1988.\n\n[7] V. Z. Marmarelis. Modeling methodology for nonlinear physiological systems. Ann. Biomed. Eng., 25:239–251, 1997.\n\n[8] W. Bialek and R. R. de Ruyter van Steveninck. Features and dimensions: Motion estimation in fly vision. q-bio/0505003, 2005.\n\n[9] D. L. Ringach, G. Sapiro, and R. Shapley. A subspace reverse-correlation technique for the study of visual neurons. Vision Res., 37:2455–2464, 1997.\n\n[10] D. L. Ruderman and W. Bialek. Statistics of natural images: scaling in the woods. Phys. Rev. Lett., 73:814–817, 1994.\n\n[11] T. O. Sharpee, H. Sugihara, A. V. Kurgansky, S. P. Rebrik, M. P. Stryker, and K. D. Miller. Adaptive filtering enhances information transmission in visual cortex. Nature, 439:936–942, 2006.\n\n[12] S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of one distribution from another. J. R. Statist. Soc. B, 28:131–142, 1966.\n\n[13] I. Csiszár. Information-type measures of difference of probability distributions and indirect observations. Studia Sci. Math. Hungar., 2:299–318, 1967.\n\n[14] N. Brenner, S. P. Strong, R. Koberle, W. Bialek, and R. R. de Ruyter van Steveninck. Synergy in a neural code. Neural Computation, 12:1531–1552, 2000. See also physics/9902067.\n\n[15] F. E. Theunissen, K. Sen, and A. J. Doupe. 
Spectral-temporal receptive fields of nonlinear auditory neurons obtained using natural sounds. J. Neurosci., 20:2315–2331, 2000.\n\n[16] F. E. Theunissen, S. V. David, N. C. Singh, A. Hsu, W. E. Vinje, and J. L. Gallant. Estimating spatio-temporal receptive fields of auditory and visual neurons from their responses to natural stimuli. Network, 3:289–316, 2001.\n\n[17] K. Sen, F. E. Theunissen, and A. J. Doupe. Feature analysis of natural sounds in the songbird auditory forebrain. J. Neurophysiol., 86:1445–1458, 2001.\n\n[18] D. Smyth, B. Willmore, G. E. Baker, I. D. Thompson, and D. J. Tolhurst. The receptive-field organization of simple cells in the primary visual cortex of ferrets under natural scene stimulation. J. Neurosci., 23:4746–4759, 2003.\n\n[19] G. Felsen, J. Touryan, F. Han, and Y. Dan. Cortical sensitivity to visual features in natural scenes. PLoS Biol., 3:1819–1828, 2005.\n\n[20] D. L. Ringach, M. J. Hawken, and R. Shapley. Receptive field structure of neurons in monkey visual cortex revealed by stimulation with natural image sequences. Journal of Vision, 2:12–24, 2002.\n\n[21] N. C. Rust, O. Schwartz, J. A. Movshon, and E. P. Simoncelli. Spatiotemporal elements of macaque V1 receptive fields. Neuron, 46:945–956, 2005.\n\n[22] O. Schwartz, J. W. Pillow, N. C. Rust, and E. P. Simoncelli. Spike-triggered neural characterization. Journal of Vision, 6:484–507, 2006.\n\n[23] N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. In B. Hajek and R. S. Sreenivas, editors, Proceedings of the 37th Allerton Conference on Communication, Control and Computing, pp. 368–377. University of Illinois, 1999. See also physics/0004057.\n", "award": [], "sourceid": 334, "authors": [{"given_name": "Tatyana", "family_name": "Sharpee", "institution": null}]}