{"title": "Inferring Neural Firing Rates from Spike Trains Using Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 329, "page_last": 336, "abstract": "", "full_text": "Inferring Neural Firing Rates from Spike Trains Using Gaussian Processes\n\nJohn P. Cunningham1, Byron M. Yu1,2,3, Krishna V. Shenoy1,2\n1Department of Electrical Engineering,\n2Neurosciences Program, Stanford University, Stanford, CA 94305\n{jcunnin,byronyu,shenoy}@stanford.edu\n\nManeesh Sahani3\n3Gatsby Computational Neuroscience Unit, UCL\nAlexandra House, 17 Queen Square, London, WC1N 3AR, UK\nmaneesh@gatsby.ucl.ac.uk\n\nAbstract\n\nNeural spike trains present challenges to analytical efforts due to their noisy, spiking nature. Many studies of neuroscientific and neural prosthetic importance rely on a smoothed, denoised estimate of the spike train's underlying firing rate. Current techniques to find time-varying firing rates require ad hoc choices of parameters, offer no confidence intervals on their estimates, and can obscure potentially important single-trial variability. We present a new method, based on a Gaussian Process prior, for inferring probabilistically optimal estimates of firing rate functions underlying single or multiple neural spike trains. We test the performance of the method on simulated data and experimentally gathered neural spike trains, and we demonstrate improvements over conventional estimators.\n\n1 Introduction\n\nNeuronal activity, particularly in cerebral cortex, is highly variable. Even when experimental conditions are repeated closely, the same neuron may produce quite different spike trains from trial to trial. This variability may be due both to randomness in the spiking process and to differences in cognitive processing on different experimental trials. 
One common view is that a spike train is generated from a smooth underlying function of time (the firing rate) and that this function carries a significant portion of the neural information. If this is the case, questions of neuroscientific and neural prosthetic importance may require an accurate estimate of the firing rate. Unfortunately, these estimates are complicated by the fact that spike data give only a sparse observation of their underlying rate. Typically, researchers average across many trials to find a smooth estimate (averaging out spiking noise). However, averaging across many roughly similar trials can obscure important temporal features [1]. Thus, estimating the underlying rate from only one spike train (or a small number of spike trains believed to be generated from the same underlying rate) is an important but challenging problem.\n\nThe most common approach to the problem has been to collect spikes from multiple trials in a peri-stimulus-time histogram (PSTH), which is then sometimes smoothed by convolution or splines [2], [3]. Bin sizes and smoothness parameters are typically chosen ad hoc (but see [4], [5]), and the result is fundamentally a multi-trial analysis. An alternative is to convolve a single spike train with a kernel. Again, the kernel shape and time scale are frequently ad hoc. For multiple trials, researchers may average over multiple kernel-smoothed estimates. [2] gives a thorough review of classical methods.\n\nMore recently, point process likelihood methods have been adapted to spike data [6]–[8]. These methods optimize (implicitly or explicitly) the conditional intensity function λ(t | x(t), H(t)), which gives the probability of a spike in [t, t + dt) given an underlying rate function x(t) and the history of previous spikes H(t), with respect to x(t). 
In a regression setting, this rate x(t) may be learned as a function of an observed covariate, such as a sensory stimulus or limb movement. In the unsupervised setting of interest here, it is constrained only by prior expectations such as smoothness. Probabilistic methods enjoy two advantages over kernel smoothing. First, they allow explicit modelling of interactions between spikes through the history term H(t) (e.g., refractory periods). Second, as we will see, the probabilistic framework provides a principled way to share information between trials and to select smoothing parameters.\n\nIn neuroscience, most applications of point process methods use maximum likelihood estimation. In the unsupervised setting, it has been most common to optimize x(t) within the span of an arbitrary basis (such as a spline basis [3]). In other fields, a theory of generalized Cox processes has been developed, where the point process is conditionally Poisson, and x(t) is obtained by applying a link function to a draw from a random process, often a Gaussian process (GP) (e.g. [9]). In this approach, parameters of the GP, which set the scale and smoothness of x(t), can be learned by optimizing the (approximate) marginal likelihood or evidence, as in GP classification or regression. However, the link function, which ensures a nonnegative intensity, introduces possibly undesirable artifacts. For instance, an exponential link leads to a process that grows less smooth as the intensity increases.\n\nHere, we make two advances. First, we adapt the theory of GP-driven point processes to incorporate a history-dependent conditional likelihood, suitable for spike trains. Second, we formulate the problem such that nonnegativity in x(t) is achieved without a distorting link function or sacrifice of tractability. We also demonstrate the power of numerical techniques that make application of GP methods to this problem computationally tractable. 
We show that GP methods employing evidence optimization outperform both kernel smoothing and maximum-likelihood point process models.\n\n2 Gaussian Process Model For Spike Trains\n\nSpike trains can often be well modelled by gamma-interval point processes [6], [10]. We assume the underlying nonnegative firing rate x(t), t ∈ [0, T], is a draw from a GP, and then we assume that our spike train is a conditionally inhomogeneous gamma-interval process (IGIP), given x(t). The spike train is represented by a list of spike times y = {y_0, ..., y_N}. Since we will model this spike train as an IGIP1, y | x(t) is by definition a renewal process, so we can write:\n\np(y | x(t)) = ∏_{i=1}^{N} p(y_i | y_{i−1}, x(t)) · p_0(y_0 | x(t)) · p_T(T | y_N, x(t)),   (1)\n\nwhere p_0(·) is the density of the first spike occurring at y_0, and p_T(·) is the density of no spikes being observed on (y_N, T]; the density for IGIP intervals (of order γ ≥ 1) (see e.g. [6]) can be written as:\n\np(y_i | y_{i−1}, x(t)) = (γ x(y_i) / Γ(γ)) (γ ∫_{y_{i−1}}^{y_i} x(u) du)^{γ−1} exp{−γ ∫_{y_{i−1}}^{y_i} x(u) du}.   (2)\n\nThe true p_0(·) and p_T(·) under this gamma-interval spiking model are not closed form, so we simplify these distributions as intervals of an inhomogeneous Poisson process (IP). This step, which we find to sacrifice very little in terms of accuracy, helps to preserve tractability. Note also that we write the distribution in terms of the inter-spike-interval distribution p(y_i | y_{i−1}, x(t)) and not λ(t | x(t), H(t)), but the process could be considered equivalently in terms of conditional intensity.\n\nWe now discretize x(t), t ∈ [0, T], by the time resolution of the experiment (Δ, here 1 ms) to yield a series of n evenly spaced samples x = [x_1, ..., x_n]′ (with n = T/Δ). 
The events y become N + 1 time indices into x, with N much smaller than n. The discretized IGIP output process is now (ignoring terms that scale with Δ):\n\np(y | x) = ∏_{i=1}^{N} [ (γ x_{y_i} / Γ(γ)) (γ Σ_{k=y_{i−1}}^{y_i−1} x_k Δ)^{γ−1} exp{−γ Σ_{k=y_{i−1}}^{y_i−1} x_k Δ} ] · x_{y_0} exp{−Σ_{k=0}^{y_0−1} x_k Δ} · exp{−Σ_{k=y_N}^{n−1} x_k Δ},   (3)\n\nwhere the final two terms are p_0(·) and p_T(·), respectively [11].\n\n1 The IGIP is one of a class of renewal models that work well for spike data (much better than inhomogeneous Poisson; see [6], [10]). Other log-concave renewal models, such as the inhomogeneous inverse-Gaussian interval, can be chosen, and the implementation details remain unchanged.\n\nOur goal is to estimate a smoothly varying firing rate function from spike times. Loosely, instead of being restricted to only one family of functions, the GP allows all functions to be possible; the choice of kernel determines which functions are more likely, and by how much. Here we use the standard squared exponential (SE) kernel. Thus, x ∼ N(μ1, Σ), where Σ is the positive definite covariance matrix defined by\n\nΣ = {K(t_i, t_j)}_{i,j ∈ {1,...,n}} where K(t_i, t_j) = σ_f^2 exp{−(κ/2)(t_i − t_j)^2} + σ_v^2 δ_{ij}.   (4)\n\nFor notational convenience, we define the hyperparameter set θ = [μ, γ, κ, σ_f^2, σ_v^2]. Typically, the GP mean μ is set to 0. Since our intensity function is nonnegative, however, it is sensible to treat μ instead as a hyperparameter and let it be optimized to a positive value. 
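As an illustration, the SE covariance of Eq. 4 can be sketched in NumPy. This is a minimal sketch of our own; the function name, the 100-sample 1 ms grid, and the hyperparameter values are illustrative assumptions, not the authors' code:

```python
import numpy as np

def se_covariance(t, sigma_f2, kappa, sigma_v2):
    """Squared-exponential covariance of Eq. 4:
    K(t_i, t_j) = sigma_f2 * exp(-(kappa/2)(t_i - t_j)^2) + sigma_v2 * delta_ij."""
    diff = t[:, None] - t[None, :]
    return sigma_f2 * np.exp(-0.5 * kappa * diff**2) + sigma_v2 * np.eye(len(t))

# 100 samples at 1 ms resolution; hyperparameter values are illustrative
t = np.arange(100) * 0.001
Sigma = se_covariance(t, sigma_f2=np.exp(5.0), kappa=np.exp(2.0), sigma_v2=1e-6)
assert np.allclose(Sigma, Sigma.T)            # symmetric
assert np.all(np.linalg.eigvalsh(Sigma) > 0)  # positive definite
```

The small σ_v^2 term on the diagonal is what keeps Σ numerically positive definite when the SE part is very smooth.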
We note that other standard kernels, including the rational quadratic, Matern ν = 3/2, and Matern ν = 5/2, performed similarly to the SE; thus we present only the SE here. For an in-depth discussion of kernels and of GP, see [12].\n\nAs written, the model assumes only one observed spike train; it may be that we have m trials believed to be generated from the same firing rate profile. Our method naturally incorporates this case: define p({y}_1^m | x) = ∏_{i=1}^{m} p(y^(i) | x), where y^(i) denotes the ith spike train observed.2 Otherwise, the model is unchanged.\n\n3 Finding an Optimal Firing Rate Estimate\n\n3.1 Algorithmic Approach\n\nIdeally, we would calculate the posterior on firing rate p(x | y) = ∫_θ p(x | y, θ) p(θ) dθ (integrating over the hyperparameters θ), but this problem is intractable. We consider two approximations: replacing the integral by evaluation at the modal θ, and replacing the integral with a sum over a discrete grid of θ values. We first consider choosing a modal hyperparameter set (ML-II model selection, see [12]), i.e. p(x | y) ≈ q(x | y, θ*) where q(·) is some approximate posterior, and\n\nθ* = argmax_θ p(θ | y) = argmax_θ p(θ) p(y | θ) = argmax_θ p(θ) ∫_x p(y | x, θ) p(x | θ) dx.   (5)\n\n(This and the following equations hold similarly for a single observation y or multiple observations {y}_1^m, so we consider only the single observation for notational brevity.) Specific choices for the hyperprior p(θ) are discussed in Results. The integral in Eq. 5 is intractable under the distributions we are modelling, and thus we must use an approximation technique. Laplace approximation and Expectation Propagation (EP) are the most widely used techniques (see [13] for a comparison). 
The Laplace approximation fits an unnormalized Gaussian distribution to the integrand in Eq. 5. Below we show this integrand is log concave in x. This fact makes the Laplace approximation reasonable, since we know that the distribution being approximated is unimodal in x and shares log concavity with the normal distribution. Further, since we are modelling a non-zero mean GP, most of the Laplace-approximated probability mass lies in the nonnegative orthant (as is the case with the true posterior). Accordingly, we write:\n\np(y | θ) = ∫_x p(y | x, θ) p(x | θ) dx ≈ p(y | x*, θ) p(x* | θ) (2π)^{n/2} / |Λ* + Σ^{−1}|^{1/2},   (6)\n\nwhere x* is the mode of the integrand and Λ* = −∇_x^2 log p(y | x, θ) |_{x=x*}. Note that in general both Σ and Λ* (and x*, implicitly) are functions of the hyperparameters θ. Thus, Eq. 6 can be differentiated with respect to the hyperparameter set, and an iterative gradient optimization (we used conjugate gradients) can be used to find (locally) optimal hyperparameters. Algorithmic details and the gradient calculations are typical for GP; see [12]. The Laplace approximation also naturally provides confidence intervals from the approximated posterior covariance (Σ^{−1} + Λ*)^{−1}.\n\n2 Another reasonable approach would consider each trial as having a different rate function x that is a draw from a GP with a nonstationary mean function μ(t). Instead of inferring a mean rate function x*, we would learn a distribution of means. We are considering this choice for future work.\n\nWe can also consider approximate integration over θ using the Laplace approximation above. 
The Laplace approximation produces a posterior approximation q(x | y, θ) = N(x*, (Λ* + Σ^{−1})^{−1}) and a model evidence approximation q(θ | y) (Eq. 6). The approximate integrated posterior can be written as p(x | y) = E_{θ|y}[p(x | y, θ)] ≈ Σ_j q(x | y, θ_j) q(θ_j | y) for some choice of samples θ_j (which again gives confidence intervals on the estimates). Since the dimensionality of θ is small, and since we find in practice that the posterior on θ is well behaved (well peaked and unimodal), we find that a simple grid of θ_j works very well, thereby obviating MCMC or another sampling scheme. This approximate integration consistently yields better results than a modal hyperparameter set, so we will only consider approximate integration for the remainder of this report.\n\nFor the Laplace approximation at any value of θ, we require the modal estimate of firing rate x*, which is simply the MAP estimator:\n\nx* = argmax_{x ⪰ 0} p(x | y) = argmax_{x ⪰ 0} p(y | x) p(x).   (7)\n\nSolving this problem is equivalent to solving an unconstrained problem where p(x) is a truncated multivariate normal (but this is not the same as individually truncating each marginal p(x_i); see [14]). Typically a link or squashing function would be included to enforce nonnegativity in x, but this can distort the intensity space in unintended ways. We instead impose the constraint x ⪰ 0, which reduces the problem to being solved over the (convex) nonnegative orthant. 
To pose the problem as a convex program, we define f(x) = −log p(y | x) p(x):\n\nf(x) = Σ_{i=1}^{N} (−log x_{y_i} − (γ − 1) log(Σ_{k=y_{i−1}}^{y_i−1} x_k Δ)) + Σ_{k=y_0}^{y_N−1} γ x_k Δ   (8)\n\n− log x_{y_0} + Σ_{k=0}^{y_0−1} x_k Δ + Σ_{k=y_N}^{n−1} x_k Δ + (1/2)(x − μ1)^T Σ^{−1} (x − μ1) + C,   (9)\n\nwhere C represents constants with respect to x. From this form follows the Hessian\n\n∇_x^2 f(x) = Σ^{−1} + Λ where Λ = −∇_x^2 log p(y | x, θ) = B + D,   (10)\n\nwhere D = diag(x_{y_0}^{−2}, ..., 0, ..., x_{y_i}^{−2}, ..., 0, ..., x_{y_N}^{−2}) is positive semidefinite and diagonal. B is block diagonal with N blocks. Each block is rank 1 and associates its positive, nonzero eigenvalue with eigenvector [0, ..., 0, b_i^T, 0, ..., 0]^T. The remaining n − N eigenvalues are zero. Thus, B has total rank N and is positive semidefinite. Since Σ is positive definite, it follows that the Hessian is also positive definite, proving convexity. Accordingly, we can use a log barrier Newton method to efficiently solve for the global MAP estimator of firing rate x* [15].\n\nIn the case of multiple spike train observations, we need only add extra terms of negative log likelihood from the observation model. This flows through to the Hessian, where ∇_x^2 f(x) = Σ^{−1} + Λ and Λ = Λ_1 + ... + Λ_m, with Λ_i for all i ∈ {1, ..., m} defined for each observation as in Eq. 10.\n\n3.2 Computational Practicality\n\nThis method involves multiple iterative layers which require many Hessian inversions and other matrix operations (matrix-matrix products and determinants) that cost O(n^3) in run-time complexity and O(n^2) in memory, where x ∈ R^n. 
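As a concrete (and deliberately naive, O(n^3)) illustration of the MAP problem above, the following sketch minimizes a Poisson (γ = 1) simplification of f(x) from Eqs. 8-9 over the nonnegative orthant with a generic bounded solver rather than the log barrier Newton method; all names, spike indices, and hyperparameter values are our own illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from scipy.optimize import minimize

n, dt = 200, 0.001                         # 200 ms of rate samples at 1 ms
t = np.arange(n) * dt
spikes = np.array([30, 70, 90, 140, 170])  # spike time indices (illustrative)

# GP prior (Eq. 4): SE covariance with positive mean mu
mu, sigma_f2, kappa = 30.0, 100.0, 500.0
diff = t[:, None] - t[None, :]
Sigma = sigma_f2 * np.exp(-0.5 * kappa * diff**2) + 1e-4 * np.eye(n)
Sigma_inv = np.linalg.inv(Sigma)

def f(x):
    # negative log posterior: Poisson likelihood terms + GP prior quadratic
    return (-np.sum(np.log(x[spikes])) + np.sum(x) * dt
            + 0.5 * (x - mu) @ Sigma_inv @ (x - mu))

def grad(x):
    g = np.full(n, dt) + Sigma_inv @ (x - mu)
    g[spikes] -= 1.0 / x[spikes]
    return g

res = minimize(f, np.full(n, mu), jac=grad, method="L-BFGS-B",
               bounds=[(1e-6, None)] * n)
x_star = res.x                             # MAP firing rate estimate, x >= 0
assert np.all(x_star > 0)
```

Because f is convex, any local minimum found this way is the global MAP estimate; the box bounds play the role of the x ⪰ 0 constraint.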
For any significant data size, a straightforward implementation is hopelessly slow. With 1 ms time resolution (or similar), this method would be restricted to spike trains lasting less than a second, and even this problem would be burdensome. Achieving computational improvements is critical, as a naive implementation is, for all practical purposes, intractable. Techniques to improve computational performance are a subject of study in themselves and are beyond the scope of this paper. We give a brief outline in the following paragraph.\n\nIn the MAP estimation of x*, since we have analytical forms of all matrices, we avoid explicit representation of any matrix, resulting in linear storage. Hessian inversions are avoided using the matrix inversion lemma and conjugate gradients, leaving matrix-vector multiplications as the single costly operation. Multiplication of any vector by Λ can be done in linear time, since Λ is a (block-wise) vector outer product matrix. Since we have evenly spaced resolution of our data x in time indices t_i, Σ is Toeplitz; thus multiplication by Σ can be done using Fast Fourier Transform (FFT) methods [16]. These techniques allow exact MAP estimation with linear storage and nearly linear run-time performance. In practice, for example, this translates to solving MAP estimation problems of 10^3 variables in fractions of a second, with minimal memory load. For the modal hyperparameter scheme (as opposed to approximately integrating over the hyperparameters), gradients of Eq. 6 must also be calculated at each step of the model evidence optimization. 
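The Toeplitz-times-vector trick can be sketched as follows. The circulant-embedding construction is the standard one, not code from the paper, and the covariance column used for the check is illustrative:

```python
import numpy as np

def toeplitz_matvec(col, v):
    """Multiply a symmetric Toeplitz matrix (given by its first column) by v
    in O(n log n) by embedding it in a circulant and using the FFT."""
    n = len(v)
    c = np.concatenate([col, [0.0], col[:0:-1]])  # circulant of size 2n
    prod = np.fft.ifft(np.fft.fft(c) * np.fft.fft(np.concatenate([v, np.zeros(n)])))
    return prod[:n].real

# verify against the dense product for an SE covariance column (sigma_f2 = 1)
n = 256
col = np.exp(-0.5 * 400.0 * (np.arange(n) * 0.001) ** 2)  # first column of Sigma
Sigma = col[np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])]
v = np.random.default_rng(1).standard_normal(n)
assert np.allclose(toeplitz_matvec(col, v), Sigma @ v)
```

Only the first column of Σ is ever stored, which is the source of the linear-memory claim above.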
In addition to using similar techniques as in the MAP estimation, log determinants and their derivatives (associated with the Laplace approximation) can be accurately approximated by exploiting the eigenstructure of Λ.\n\nIn total, these techniques allow optimal firing rate functions of 10^3 to 10^4 variables to be estimated in seconds or minutes (on a modern workstation). These data sizes translate to seconds of spike data at 1 ms resolution, long enough for most electrophysiological trials. This algorithm achieves a reduction from a naive implementation, which would require large amounts of memory and many hours or days to complete.\n\n4 Results\n\nWe tested the methods developed here using both simulated neural data, where the true firing rate was known by construction, and real neural spike trains, where the true firing rate was estimated by a PSTH that averaged many similar trials. The real data used were recorded from macaque premotor cortex during a reaching task (see [17] for experimental method). Roughly 200 repeated trials per neuron were available for the data shown here.\n\nWe compared the IGIP-likelihood GP method (hereafter, GP IGIP) to other rate estimators (kernel smoothers, Bayesian Adaptive Regression Splines or BARS [3], and variants of the GP method) using root mean squared difference (RMS) to the true firing rate. PSTH and kernel methods approximate the mean conditional intensity λ(t) = E_{H(t)}[λ(t | x(t), H(t))]. For a renewal process, we know (by the time rescaling theorem [7], [11]) that λ(t) = x(t), and thus we can compare the GP IGIP (which finds x(t)) directly to the kernel methods. To confirm that hyperparameter optimization improves performance, we also compared GP IGIP results to maximum likelihood (ML) estimates of x(t) using fixed hyperparameters θ. 
This result is similar in spirit to previously published likelihood methods with fixed bases or smoothness parameters. To evaluate the importance of an observation model with spike history dependence (the IGIP of Eq. 3), we also compared GP IGIP to an inhomogeneous Poisson (GP IP) observation model (again with a GP prior on x(t); simply γ = 1 in Eq. 3).\n\nThe hyperparameters θ have prior distributions (p(θ) in Eq. 5). For σ_f, κ, and γ, we set log-normal priors to enforce meaningful values (i.e. finite, positive, and greater than 1 in the case of γ). Specifically, we set log(σ_f^2) ∼ N(5, 2), log(κ) ∼ N(2, 2), and log(γ − 1) ∼ N(0, 100). The variance σ_v^2 can be set arbitrarily small, since the GP IGIP method avoids explicit inversions of Σ with the matrix inversion lemma (see 3.2). For the approximate integration, we chose a grid consisting of the empirical mean rate for μ (that is, total spike count N divided by total time T) and (γ, log(σ_f^2), log(κ)) ∈ {1, 2, 4} × {4, ..., 8} × {0, ..., 7}. 
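The grid over θ and the weighted posterior average of Sec. 3.1 can be sketched as below; the log evidence values here are stand-ins for the Laplace-approximated q(θ_j | y) of Eq. 6, and all variable names are our own:

```python
import itertools
import numpy as np

# grid from the text: gamma in {1,2,4}, log(sigma_f^2) in {4,...,8}, log(kappa) in {0,...,7}
grid = list(itertools.product([1, 2, 4], range(4, 9), range(0, 8)))
assert len(grid) == 3 * 5 * 8                # 120 hyperparameter settings

# stand-in log evidence values; in the method these come from Eq. 6 at each theta_j
log_ev = np.random.default_rng(2).standard_normal(len(grid))
w = np.exp(log_ev - log_ev.max())            # subtract max for numerical stability
w /= w.sum()
# the posterior average p(x|y) ~= sum_j w_j q(x | y, theta_j) uses these weights
assert abs(w.sum() - 1.0) < 1e-12
```

Subtracting the maximum log evidence before exponentiating is the usual guard against underflow when the evidence values span many orders of magnitude.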
We found this coarse grid (or similar) produced similar results to many other very finely sampled grids.\n\nFigure 1: Sample GP firing rate estimates. (a) Data Set L20061107.214.1, 1 spike train; (b) Data Set L20061107.14.1, 4 spike trains; (c) Data Set L20061107.151.5, 8 spike trains; (d) Data Set L20061107.46.3, 1 spike train. Each panel plots firing rate (spikes/sec) against time (sec). See full description in text.\n\nThe four examples in Fig. 1 represent experimentally gathered firing rate profiles (according to the methods in [17]). In each of the plots, the empirical average firing rate of the spike trains is shown in bold red. For simulated spike trains, the spike trains were generated from each of these empirical average firing rates using an IGIP (γ = 4, comparable to fits to real neural data). For real neural data, the spike train(s) were selected as a subset of the roughly 200 experimentally recorded spike trains that were used to construct the firing rate profile. These spike trains are shown as a train of black dots, each dot indicating a spike event time (the y-axis position is not meaningful). 
This spike train or group of spike trains is the only input given to each of the fitting models. In thin green and magenta, we have two kernel-smoothed estimates of firing rates; each represents the spike trains convolved with a normal distribution of a specified standard deviation (50 and 100 ms). We also smoothed these spike trains with an adaptive kernel [18], fixed ML (as described above), BARS [3], and 150 ms kernel smoothers. We do not show these latter results in Fig. 1 for clarity of figures. These standard methods serve as a baseline against which we compare our method. In bold blue, we see x*, the result of the GP IGIP method. The light blue envelopes around the bold blue GP firing rate estimate represent the 95% confidence intervals. Bold cyan shows the GP IP method. This color scheme holds for all of Fig. 1.\n\nWe then ran all methods 100 times on each firing rate profile, using (separately) simulated and real neural spike trains. We are interested in the average performance of GP IGIP vs. other GP methods (a fixed ML or a GP IP) and vs. kernel smoothing and spline (BARS) methods. We show these results in Fig. 2. The four panels correspond to the same rate profiles shown in Fig. 1. In each panel, the top, middle, and bottom bar graphs correspond to the method run on 1, 4, and 8 spike trains, respectively. GP IGIP produces an average RMS error, which is an improvement (or, less often, a deterioration) over a competing method. Fig. 2 shows the percent improvement of the GP IGIP method vs. the competing method listed. 
Only significant results are shown (paired t-test, p < 0.05).\n\nFigure 2: Average percent RMS improvement of GP IGIP method (with model selection) vs. the method indicated in the column title (GP methods: GP IP, fixed ML; kernel smoothers: short, medium, long, adaptive; and BARS). Panels: (a) L20061107.214.1; (b) L20061107.14.1; (c) L20061107.151.5; (d) L20061107.46.3; each with 1, 4, and 8 spike trains. See full description in text.\n\nBlue improvement bars are for simulated spike trains; red improvement bars are for real neural spike trains. The general positive trend indicates improvements, suggesting the utility of this approach. Note that, in the few cases where a kernel smoother performs better (e.g. the long bandwidth kernel in panel (b), real spike trains, 4 and 8 spike trains), outperforming the GP IGIP method requires an optimal kernel choice, which cannot be judged from the data alone. In particular, the adaptive kernel method generally performed more poorly than GP IGIP. The relatively poor performance of GP IGIP vs. different techniques in panel (d) is considered in the Discussion section. The data sets here are by no means exhaustive, but they indicate how this method performs under different conditions.\n\n5 Discussion\n\nWe have demonstrated a new method that accurately estimates underlying neural firing rate functions and provides confidence intervals, given one or a few spike trains as input. 
This approach is not without complication, as the technical complexity and computational effort require special care. Estimating underlying firing rates is especially challenging due to the inherent noise in spike trains. Having only a few spike trains deprives the method of many trials over which to average out spiking noise. It is important here to remember why we care about estimates from single trials or small numbers of trials: we believe that in general the neural processing on repeated trials is not identical. Thus, we expect this signal to be difficult to find with or without trial averaging.\n\nIn this study we show both simulated and real neural spike trains. Simulated data provide a good test environment for this method, since the underlying firing rate is known, but they lack the experimental proof of real neural spike trains (where spiking does not exactly follow a gamma-interval process). For the real neural spike trains, however, we do not know the true underlying firing rate, and thus we can only make comparisons to a noisy, trial-averaged mean rate, which may or may not accurately reflect the true underlying rate of an individual spike train (due to different cognitive processing on different trials). Taken together, however, we believe the real and simulated data give good evidence of the general improvements offered by this method.\n\nPanels (a), (b), and (c) in Fig. 2 show that GP IGIP offers meaningful improvements in many cases and a small loss in performance in a few cases. Panel (d) tells a different story. 
In simulation, GP IGIP generally outperforms the other smoothers (though by considerably less than in other panels). In real neural data, however, GP IGIP performs the same as or relatively worse than other methods. This may indicate that, in the low firing rate regime, the IGIP is a poor model for real neural spiking. It may also be due to our algorithmic approximations (namely, the Laplace approximation, which allows density outside the nonnegative orthant). We will report on this question in future work.\n\nFurthermore, some neural spike trains may be inherently ill-suited to analysis. A problem with this and any other method is that of very low firing rates, as only occasional insight is given into the underlying generative process. With spike trains of only a few spikes/sec, it will be impossible for any method to find interesting structure in the firing rate. In these cases, only by averaging over many trials can this structure be seen.\n\nSeveral studies have investigated the inhomogeneous gamma and other more general models (e.g. [6], [19]), including the inhomogeneous inverse-Gaussian (IIG) interval and inhomogeneous Markov interval (IMI) processes. The methods of this paper apply immediately to any log-concave inhomogeneous renewal process in which inhomogeneity is generated by time-rescaling (this includes the IIG and several others). The IMI (and other more sophisticated models) will require some changes in implementation details; one possibility is a variational Bayes approach. Another direction for this work is to consider significant nonstationarity in the spike data. The SE kernel is standard, but it is also stationary; the method will have to compromise between areas of categorically different covariance. Nonstationary covariance is an important question in modelling and remains an area of research [20]. 
Advances in that field should inform this method as well.\n\nAcknowledgments\n\nThis work was supported by NIH-NINDS-CRCNS-R01, the Michael Flynn SGF, NSF, NDSEGF, Gatsby, CDRF, BWF, ONR, Sloan, and Whitaker. This work was conceived at the UK Spike Train Workshop, Newcastle, UK, 2006; we thank Stuart Baker for helpful discussions during that time. We thank Vikash Gilja, Stephen Ryu, and Mackenzie Risch for experimental, surgical, and animal care assistance. We thank also Araceli Navarro.\n\nReferences\n\n[1] B. Yu, A. Afshar, G. Santhanam, S. Ryu, K. Shenoy, and M. Sahani. Advances in NIPS, 17, 2005.\n[2] R. Kass, V. Ventura, and E. Brown. J. Neurophysiol, 94:8–25, 2005.\n[3] I. DiMatteo, C. Genovese, and R. Kass. Biometrika, 88:1055–1071, 2001.\n[4] H. Shimazaki and S. Shinomoto. Neural Computation, 19(6):1503–1527, 2007.\n[5] D. Endres, M. Oram, J. Schindelin, and P. Foldiak. Advances in NIPS, 20, 2008.\n[6] R. Barbieri, M. Quirk, L. Frank, M. Wilson, and E. Brown. J Neurosci Methods, 105:25–37, 2001.\n[7] E. Brown, R. Barbieri, V. Ventura, R. Kass, and L. Frank. Neural Comp, 2002.\n[8] W. Truccolo, U. Eden, M. Fellows, J. Donoghue, and E. Brown. J Neurophysiol., 93:1074–1089, 2004.\n[9] J. Moller, A. Syversveen, and R. Waagepetersen. Scandinavian J. of Stats., 1998.\n[10] K. Miura, Y. Tsubo, M. Okada, and T. Fukai. J Neurosci., 27:13802–13812, 2007.\n[11] D. Daley and D. Vere-Jones. An Introduction to the Theory of Point Processes. Springer, 2002.\n[12] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.\n[13] M. Kuss and C. Rasmussen. Journal of Machine Learning Res., 6:1679–1704, 2005.\n[14] W. Horrace. J Multivariate Analysis, 94(1):209–221, 2005.\n[15] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.\n[16] B. Silverman. Journal of Royal Stat. Soc. Series C: Applied Stat., 33, 1982.\n[17] C. Chestek, A. 
Batista, G. Santhanam, B. Yu, A. Afshar, J. Cunningham, V. Gilja, S. Ryu, M. Churchland, and K. Shenoy. J Neurosci., 27:10742–10750, 2007.\n[18] B. Richmond, L. Optican, and H. Spitzer. J. Neurophys., 64(2), 1990.\n[19] R. Kass and V. Ventura. Neural Comp, 14:5–15, 2003.\n[20] C. Paciorek and M. Schervish. Advances in NIPS, 15, 2003.", "award": [], "sourceid": 366, "authors": [{"given_name": "John", "family_name": "Cunningham", "institution": ""}, {"given_name": "Byron", "family_name": "Yu", "institution": null}, {"given_name": "Krishna", "family_name": "Shenoy", "institution": ""}, {"given_name": "Maneesh", "family_name": "Sahani", "institution": ""}]}