{"title": "Convergence Properties of Some Spike-Triggered Analysis Techniques", "book": "Advances in Neural Information Processing Systems", "page_first": 189, "page_last": 196, "abstract": "", "full_text": "Convergence Properties of some\n\nSpike-Triggered Analysis Techniques\n\nLiam Paninski\n\nCenter for Neural Science\n\nNew York University\nNew York, NY 10003\n\nliam@cns. nyu. edu\n\nhttp://www.cns.nyu.edu/rvliam\n\nAbstract\n\nvVe analyze the convergence properties of three spike-triggered data\nanalysis techniques. All of our results are obtained in the set(cid:173)\nting of a (possibly multidimensional) linear-nonlinear (LN) cascade\nmodel for stimulus-driven neural activity. We start by giving exact\nrate of convergence results for the common spike-triggered average\n(STA) technique. Next, we analyze a spike-triggered covariance\nmethod, variants of which have been recently exploited successfully\nby Bialek, Simoncelli, and colleagues. These first two methods suf(cid:173)\nfer from extraneous conditions on their convergence; therefore, we\nintroduce an estimator for the LN model parameters which is de(cid:173)\nsigned to be consistent under general conditions. We provide an\nalgorithm for the computation of this estimator and derive its rate\nof convergence. We close with a brief discussion of the efficiency\nof these estimators and an application to data recorded from the\nprimary motor cortex of awake, behaving primates.\n\n1\n\nIntroduction\n\nSystems-level neuroscientists have a few favorite problems, the most prominent of\nwhich is the \"what\" part of the neural coding problem: what makes a given neuron\nin a particular part of the brain fire? In more technical language, we want to know\nabout the conditional probability distributions P(spikelX = x), the probability\nthat our cell emits a spike, given that some observable signal X in the world takes\nvalue x. 
Because data is expensive, neuroscientists typically postulate a functional form for this collection of conditional distributions, and then fit experimental data to these functional models, in lieu of attempting to estimate P(spike|X = x) directly for each possible x. In this paper, we analyze one such phenomenological model whose popularity seems to be on the rise:

p(spike|x) = f(<k1, x>, <k2, x>, ..., <km, x>).    (1)

Here f is some arbitrary nonconstant, R^m-measurable, [0,1]-valued function, and {ki} are some linearly independent elements of the dual space, X', of some topological vector space X, the space of possible "input signals." Interpret f as a regular conditional distribution. Roughly, then, the neuron projects the signal x onto the m-dimensional subspace spanned by {ki}.

[...]

<K̂, k1> / (||K̂|| ||k1||)    (6)

The scalar terms λ, α, and β in (3) each depend on f, K, and p(x); λ is a constant giving the order of magnitude of convergence (usually, but not always, equal to 1/2), α gives the precise convergence rate, and β gives the asymptotic error. We will be mostly concerned with giving exact values for α and λ, and simply indicating when β is zero or positive (i.e., when K̂ is consistent in probability or not, respectively). As usual, rate-of-convergence results clarify why a given estimator works well (in the sense that only a small number of samples is needed for reliable estimates) in certain cases and poorly (sometimes not at all) in others.

We will discuss three estimators here; the first two are well-known, while the third is novel, and is consistent under much more general conditions. The first part of the paper will indicate how to derive representation (3), including the constants α, β, and λ, for these three estimators.
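The LN model (1) is straightforward to simulate, which is handy for sanity-checking the estimators analyzed below. A minimal Python sketch draws Gaussian stimuli and Bernoulli spikes; the dimension, the sigmoidal f, and its gain are hypothetical choices for illustration, not values taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance of the LN cascade (1) with m = 1:
# p(spike | x) = f(<k1, x>), with f a sigmoidal [0,1]-valued nonlinearity.
dim, N = 20, 5000
k1 = np.zeros(dim)
k1[0] = 1.0                         # true filter (unit vector)

def f(u):
    return 1.0 / (1.0 + np.exp(-3.0 * u))

X = rng.standard_normal((N, dim))   # radially symmetric stimulus draws
spikes = rng.random(N) < f(X @ k1)  # Bernoulli spike on each trial
```

Synthetic data of exactly this form can be fed to any of the three estimators discussed in the paper.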
In the final two sections, we discuss lower bounds on the convergence rates of any possible K-estimator (these kinds of bounds provide a rigorous measure of the difficulty of this estimation problem), and then give a brief illustration of the new estimator applied to data recorded in the primary motor cortex of awake, behaving monkeys.

2 Convergence rates

All three of the estimators considered here can be naturally written as "M-estimators," that is,

K̂(X_N, S_N) ≡ argmax_{V ∈ Y_m(X)} M_(X_N, S_N)(V),

for some data-dependent function M_N ≡ M_(X_N, S_N) on Y_m(X). Most of the mathematical labor in this section comes down to an application of the standard "delta method" from the theory of M-estimators [5]: typically the data-dependent (i.e., random) functions M_N converge in some suitable sense, as N → ∞, to some limit function M. The asymptotics of the M-estimator are then reduced to a study of 1) the variability of M_N around the limit M and 2) the local differential structure of M in a neighborhood of the true value of the underlying parameter K. This program can be carried out trivially for the first two estimators but is more interesting for the third (the first two require only the multivariate CLT; the third requires an infinite-dimensional CLT).

2.1 Spike-triggered averaging

The first estimator, the spike-triggered average, is classical and very intuitive: K̂_STA is defined as the sample mean of the spike-conditional stimulus distribution p(x|spike); since the spike signal is binary, this is the same as the cross-correlation between the spike and the stimulus signal. (We assume throughout, without loss of generality, that p(x) is centered, that is, E(x) = 0.)
We will also consider the following "linear regression" modification:

K̂_LR ≡ A K̂_STA,

where A is an operator chosen to "divide out" correlations in the stimulus distribution p(x) (A is typically the (pseudo-)inverse of the stimulus correlation matrix, which we will denote as σ²(p(x))). The analysis for K̂_STA and K̂_LR depends only on a straightforward application of the multivariate central limit theorem (CLT).

We begin with necessary and sufficient conditions for consistency. We assume throughout this paper that the stimulus distribution p(x) has finite second moments; this assumption seems entirely reasonable on physical grounds. Let q be a random variable with distribution given by

p(q) ≡ p(<x, k1> | spike) = f(<x, k1>) p(<x, k1>) / ∫_R f(<x, k1>) p(<x, k1>),    (7)

with f as defined in (1) and p(<x, k1>) denoting the one-dimensional projection of p(x). The expectation of this random variable exists by the finite-variance assumption on p(x). Finally, as usual, we say p(x) is radially symmetric if p(B) = p(UB) for all Borel sets B and all unitary transformations U.

Theorem 1 (β(K̂_STA)). If p(x) (resp. p(A^(1/2) x)) is radially symmetric and E(q) ≠ 0, then β(K̂_STA) = 0 (resp. β(K̂_LR) = 0). Conversely, if p(x) is radially symmetric and E(q) = 0, then β > 0, and if p(x) is not radially symmetric, then there exists an f for which β > 0.

(Note that f is not required to be smooth, or even continuous.) The above sufficiency conditions seem to be somewhat well-known; for example, most of the sufficiency statement appeared (albeit in somewhat less precise form) in [1]. On the other hand, the converse is novel, to our knowledge, and is perhaps surprisingly stringent.
The first part of the necessity statement will be obvious from the following discussion of α (and in fact appears implicitly in [1]), while the second part is a little harder, and seems to require (rather elementary) characteristic function techniques. The proof proceeds by showing that a distribution is radially symmetric iff it has the property that the conditional mean of x is zero on all planar "slices" <k, x> ∈ B, k ∈ X', B a real Borel set.

Next we have the rate of convergence:

Theorem 2 (α(K̂_STA)). Assume p(x) is symmetric normal, with standard deviation σ(p). If β(K̂_STA) = 0, then N^(1/2) (K̂_STA − K) is asymptotically normal with mean zero (considered as a distribution on the tangent plane of Y_m(X) at the true underlying value K), and

α(K̂_STA) = σ(p) √(dim X − 1) / E(q).

Thus the performance of the spike-triggered average scales directly with the dimension of the ambient space and inversely with E(q), a measure of the asymmetry of the spike-triggered distribution along k1. Note that we stated the result under the much stronger condition that p(x) is Gaussian. In this case, the form of α becomes quite simple, depending on the nonlinearity f only through E(q). The general case is proven by identical methods but results in a slightly more complicated (f-dependent) term in place of σ(p). The proof follows by applying the multivariate central limit theorem to the sample mean of random vectors drawn i.i.d. from the spike-conditional stimulus distribution, p(x|spike). The proof also supplies the asymptotic distribution of Error(K̂_STA) (a noncentral F), which might be useful for hypothesis testing.
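For intuition, the STA and its regression-corrected version can be sketched in a few lines of Python. The correlated Gaussian stimulus, sigmoidal f, and sample size below are hypothetical choices, meant only to illustrate how K_LR removes the bias that stimulus correlations induce in the raw STA:

```python
import numpy as np

rng = np.random.default_rng(1)
dim, N = 15, 50000
k1 = np.zeros(dim)
k1[0] = 1.0

# Correlated (hence not radially symmetric) Gaussian stimuli.
idx = np.arange(dim)
Sigma = 0.7 ** np.abs(idx[:, None] - idx[None, :])
X = rng.standard_normal((N, dim)) @ np.linalg.cholesky(Sigma).T

spikes = rng.random(N) < 1.0 / (1.0 + np.exp(-3.0 * (X @ k1)))

k_sta = X[spikes].mean(axis=0)            # spike-triggered average
Sigma_hat = np.cov(X, rowvar=False)       # sample stimulus covariance
k_lr = np.linalg.pinv(Sigma_hat) @ k_sta  # "divide out" the correlations

def cosine(u, v):
    return abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# The raw STA points along Sigma @ k1; the corrected estimate recovers k1.
cos_sta, cos_lr = cosine(k_sta, k1), cosine(k_lr, k1)
```

Here the pseudoinverse of the sample covariance plays the role of the operator A above.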
The details are quite easy once the mean of this distribution is identified (as in [1], under the above sufficiency conditions), and we skip them to save room for more interesting results.

One final note: in stating the above two results, we have been assuming implicitly that K is one-dimensional (since K̂_STA clearly returns a single vector, that is, a one-dimensional subspace of X). Nevertheless, the two theorems extend easily to the more general case, after Error(K̂_STA) is redefined to measure angles between m- and 1-dimensional subspaces. (Of course, now E(K̂_STA) and lim_{N→∞} K̂_STA depend strongly on the input distribution p(x), even for radially symmetric p(x); see, e.g., [3] for an analysis of a special case of this effect.)

2.2 Covariance-based methods

The next estimator was introduced in an effort to extend spike-triggered analysis to the m > 1 case (see, e.g., [3], and references therein). Where K̂_STA was based on the first moment of the spike-conditional stimulus distribution p(x|spike), K̂_CORR is based on the second moment. We define

K̂_CORR ≡ (σ̂²)^(−1) eig(Δσ̂²),

where eig(A) denotes the significantly nonzero eigenspace of the operator A, and Δσ̂² is some estimate (typically the usual sample covariance estimate) of the "difference-covariance" matrix Δσ², defined by

Δσ² ≡ σ²(p(x|spike)) − σ²(p(x)).

Again, we start with β:

Theorem 3 (β(K̂_CORR)). If p(x) is Gaussian and

Var_{p(x|spike)}(<k, x>) ≠ Var_{p(x)}(<k, x>) for all k ∈ E_K,

for some orthogonal basis E_K of K, then β(K̂_CORR) = 0. Conversely, if p(x) is Gaussian and the variance condition is not satisfied for f, then β > 0, and if p(x) is non-Gaussian, then there exists an f for which β > 0.

As before, the sufficiency is fairly well-known, while the necessity appears to be novel and relies on characteristic function arguments.
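A minimal sketch of the difference-covariance computation follows (hypothetical simulation; the stimuli are white Gaussian, so the (σ̂²)^(−1) prefactor is the identity and is omitted). The nonlinearity is symmetric, so E(q) = 0 and the STA fails, but the spike-triggered variance along k1 shrinks, and the variance condition of Theorem 3 holds:

```python
import numpy as np

rng = np.random.default_rng(2)
dim, N = 15, 50000
k1 = np.zeros(dim)
k1[0] = 1.0

def f(u):
    # Symmetric, suppressive nonlinearity: the spike-triggered mean is zero,
    # but the variance along k1 is reduced relative to the prior.
    return np.exp(-u ** 2)

X = rng.standard_normal((N, dim))
spikes = rng.random(N) < f(X @ k1)

# Difference-covariance: covariance of p(x|spike) minus covariance of p(x).
delta_cov = np.cov(X[spikes], rowvar=False) - np.cov(X, rowvar=False)

# K_CORR: the eigenvector carrying the significantly nonzero
# (here negative, i.e., suppressed-variance) eigenvalue.
evals, evecs = np.linalg.eigh(delta_cov)
k_corr = evecs[:, np.argmax(np.abs(evals))]
```

On this synthetic example k_corr aligns with k1 up to sign, even though the STA carries no information about K.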
It is perhaps surprising that the conditions on p for the consistency of this estimator are even stricter than for the spike-triggered average. The essential fact here turns out to be that a distribution is normal iff, after a suitable change of basis, the conditional variance on all planar "slices" of the distribution is constant.

We have, with Odelia Schwartz, developed a striking inconsistency example which is worth mentioning here:

Example (Inconsistency of K̂_CORR). There is a nonempty open set of nonconstant f and radially symmetric p(x) such that K̂_CORR is asymptotically orthogonal to K almost surely as N → ∞. (In fact, the f and p in this set can be taken to be infinitely differentiable.)

The basic idea is that, for nonnormal p, the spike-triggered variance of <v, x> depends on f even for v ⊥ k; we leave the details to the reader.

We can derive a similar rate of convergence for these covariance-based methods. To reduce the notational load, we state the result for m = 1 only; in this case, we can define λ_Δσ² to be the (unique and nonzero by assumption) eigenvalue of Δσ².

Theorem 4 (α(K̂_CORR)). Assume p(x) is independent normal. If β(K̂_CORR) = 0, then N^(1/2) (K̂_CORR − K) is asymptotically normal with mean zero, with α(K̂_CORR) scaling as 1/λ_Δσ².

(Again, while λ_Δσ² will not be exactly zero in practice, it can often be small enough that the asymptotic error remains prohibitively large for physiologically reasonable values of N.)
The proof proceeds by applying the multivariate central limit theorem to the covariance matrix estimator, then examining the first-order Taylor expansion of the eigenspace map at Δσ²; see the longer draft of this paper at http://www.cns.nyu.edu/~liam for the more general statement and proof.

2.3 Empirical processes techniques

We have seen that the two most common K-estimators are not consistent in general; that is, the asymptotic error β is bounded away from zero for many (nonpathological) combinations of p(x), f, and K. We now introduce a new estimator for which β = 0 under very general conditions (without, say, any symmetry or normality assumptions on p or any symmetry assumptions on f).

The basic idea is that Kx is in a sense a sufficient statistic for x (that is, x → Kx → spike forms a Markov chain). The data processing inequality suggests that we could estimate K by maximizing

M_N(V) ≡ D_φ( q_N(<V, x>, spike); q_N(<V, x>) q_N(spike) ),

where D_φ is a functional with suitable convexity properties, and q_N is some estimate of p. For example, we could let D_φ be an information divergence and q_N some kernel estimate, that is, a filtered version of the empirical measure (see [4] for an independent approach along these lines). This doesn't quite work, however, because the kernel induces an arbitrary scale; if this scale is larger than the natural scale of f and p(<V, x>) for some V but not others, our estimate will be biased away from K. Therefore, D_φ and p_N have to be asymptotically scale-free in some sense.

The simplest approach is to let the kernel width tend to zero as N becomes large; it is even possible to calculate the optimal rate of kernel shrinkage in N, depending on the smoothness of f. It also turns out to be helpful to use a bias-corrected version of M_N(V); a standard jackknife correction is sufficient to obtain an estimator which converges at the standard √N rate. We have:

Theorem 5 (β(K̂_φ)).
If p has a nonzero density with respect to Lebesgue measure, f is not constant a.e., and the kernel width goes to zero more slowly than N^(r−1) for some r > 0, then β = 0 for the kernel estimator K̂_φ.

In other words, this new estimator K̂_φ works for very general neurons f and stimulus distributions p; in particular, K̂_φ is suitable for application to natural signal data. Clearly, the condition on f is minimal; we ask only that the neuron be tuned. The condition on p is quite weak (and can be relaxed further); we are simply ensuring that we are sampling from all of X, and in particular, the part of X on which the cell is tuned.

Next we have the rate of convergence; in the following, the "approximation error" measures the difference between the true information divergence M_φ(V) and its kernel-smoothed version, defined in the obvious way.

Theorem 6 (λ and α for K̂_φ). If the approximation error is of order N^(−r), r > 1, then the jackknifed kernel or histogram versions of K̂_φ, with bandwidth N^s, −1 < s < −1/r, converge at an N^(−1/2) rate. Moreover, N^(1/2) (K̂_φ − K) is asymptotically normal, with mean zero and easily calculable α(K̂_φ).

The methods follow, e.g., example 3.2.12 of [5]: basically, a generalization of the classical theorem on the asymptotic distribution of the maximum likelihood estimator in regular parametric families. Again, see the longer draft at http://www.cns.nyu.edu/~liam for the precise definition of the approximation error and the full expression for α(K̂_φ).

We have developed an algorithm for the computation of argmax_V M_N(V), and numerical results show that K̂_φ can be competitive with spike-triggered average or covariance techniques even in cases in which β(K̂_STA) and β(K̂_CORR) are zero.
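To make the idea concrete, here is a toy two-dimensional version of the divergence objective: a plug-in histogram estimate of the information between <V, x> and the spike signal, maximized by brute-force search over directions. The bin count and angular grid are hypothetical, and the kernel-shrinkage and jackknife corrections required by Theorems 5 and 6 are omitted:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulate an LN cell with a true direction k1 at angle theta_true.
N = 20000
theta_true = 0.7
k1 = np.array([np.cos(theta_true), np.sin(theta_true)])
X = rng.standard_normal((N, 2))
spikes = rng.random(N) < 1.0 / (1.0 + np.exp(-3.0 * (X @ k1)))

def mutual_info(v, n_bins=12):
    """Plug-in (histogram) estimate of the information between <v, x>
    and the spike signal, in nats, using equal-population bins."""
    u = X @ v
    edges = np.quantile(u, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, u) - 1, 0, n_bins - 1)
    p_bin = np.bincount(idx, minlength=n_bins) / N
    p_spk = spikes.mean()
    mi = 0.0
    for s, p_s in ((spikes, p_spk), (~spikes, 1.0 - p_spk)):
        joint = np.bincount(idx[s], minlength=n_bins) / N
        nz = joint > 0
        mi += np.sum(joint[nz] * np.log(joint[nz] / (p_bin[nz] * p_s)))
    return mi

# Brute-force M-estimation: scan candidate directions V over [0, pi).
thetas = np.linspace(0.0, np.pi, 181)
scores = [mutual_info(np.array([np.cos(t), np.sin(t)])) for t in thetas]
theta_hat = thetas[int(np.argmax(scores))]
```

The argmax lands near theta_true; in higher dimensions the grid search would of course be replaced by the paper's optimization algorithm.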
We present a brief application of K̂_φ in section 4.

3 Lower bounds

Lower bounds for convergence rates provide a rigorous measure of the difficulty of a given estimation problem, or of the efficiency of a given estimator. We give a few such results below. The first lower bound is local, in the sense that we assume that the true parameter is known a priori to be in some small neighborhood of parameter space. For simplicity, assume for the moment that p(x) is radially symmetric. Recall that the Hellinger metric between any two densities is defined as (half of) the L2 distance between the square roots of the densities.

Theorem 7 (Local (Hellinger) lower bound). For simplicity, let p be standard normal. For any fixed differentiable f, uniformly bounded away from 0 and 1 and with a uniformly bounded derivative f', and any Hellinger ball F around the true parameter (f, K),

liminf_{N→∞} N^(1/2) inf_K̂ sup_F E(Error(K̂)) ≥ σ(p) ( E_p( |f'|² / (f(1−f)) ) )^(−1/2) √(dim X − 1).

The second infimum above is taken over all possible estimators K̂. The right-hand side plays the role of the inverse Fisher information in the Cramer-Rao bound and is derived using a similarly local analysis; see [2] for details.

Global bounds are more subtle. We want to prove something like:

liminf_{N→∞} a_N inf_K̂ sup_{F(ε)} E(Error(K̂)) ≥ C(ε),

where F(ε) is some large parameter set containing, say, all K and all f for which some relevant measure of tuning is greater than ε, a_N is the corresponding convergence rate, and C(ε) plays the role of α(K̂) from the previous sections. So far, our most interesting results in this direction are negative:

Theorem 8 (Information divergences are poor indices of K-difficulty).
Let F(ε) be the set of all (K, f) for which the φ-divergence ("information") between x and spike is greater than ε, that is,

D_φ( p(Kx, spike); p(spike) p(Kx) ) > ε.

Then, for ε > 0 small enough, for any putative convergence rate a_N,

liminf_{N→∞} a_N inf_K̂ sup_{F(ε)} E(Error(K̂)) = ∞.

In other words, strictly information-theoretic measures of tuning do not provide a useful index of the difficulty of the K-learning problem; the intuitive explanation of this result is that purely measure-theoretic distance functions, like φ-divergences, ignore the topological and vector space structure of the underlying probability measures, and it is exactly this structure that determines the convergence rates of any efficient K-estimator. To put it more simply, the learnability of K depends on the smoothness of f, just as we saw in the last section.

4 Application to primary motor cortex data

We have applied these new spike-triggered analysis techniques to data collected in the primary motor cortex (MI) of awake, behaving monkeys in an effort to elucidate the neural encoding of time-varying hand position signals in MI. This analysis has led to several interesting findings on the encoding properties of these neurons, with immediate applications to the design of neural prosthetic devices. Here, we have room to mention only one result: the relevant K for MI cells appears to be largely one-dimensional. In other words, the conditional firing rate of these neurons, given a specific time-varying hand path, is well captured by the following model (Fig.
1): p(spike|x) = f(<k0, x>), where x represents the two-dimensional hand position signal in a temporal neighborhood of the current time, k0 is a cell-specific affine functional, and f is a cell-independent scalar function.

Figure 1: Example f(Kx) functions, computed from two different MI cells, with rank K = 2; the x- and y-axes index <k1, x> and <k2, x>, respectively, while the color axis indicates the value of f (the conditional firing rate given Kx), in Hz. The scale on the x- and y-axes is arbitrary and has been omitted. K̂ was computed using the φ-divergence estimator, and f̂ was estimated using an adaptive kernel within the circular region shown (where sufficient data was available for reliable estimates). Note that the contours of this function are approximately linear; that is, f(Kx) ≈ f0(<k0, x>), where k0 is the vector orthogonal to the contour lines and f0 is a suitably chosen scalar function on the line.

Acknowledgements

We thank the Simoncelli lab for interesting discussions, and N. Rust and T. Sharpee for preliminary discussions of [4]. The MI experiments were done with M. Fellows, N. Hatsopoulos, and J. Donoghue. LP is supported by an HHMI predoctoral fellowship.

References

[1] Chichilnisky, E. Network 12: 199-213 (2001).
[2] Gill, R. & Levit, B. Bernoulli 1/2: 59-79 (1995).
[3] Schwartz, O., Chichilnisky, E. & Simoncelli, E. NIPS 14 (2002).
[4] Sharpee, T., Bialek, W. & Rust, N. This volume (2003).
[5] van der Vaart, A. & Wellner, J. Weak convergence and empirical processes. Springer-Verlag, New York (1996).
", "award": [], "sourceid": 2207, "authors": [{"given_name": "Liam", "family_name": "Paninski", "institution": null}]}