{"title": "Real-Time Inference for a Gamma Process Model of Neural Spiking", "book": "Advances in Neural Information Processing Systems", "page_first": 2805, "page_last": 2813, "abstract": "With simultaneous measurements from ever increasing populations of neurons, there is a growing need for sophisticated tools to recover signals from individual neurons. In electrophysiology experiments, this classically proceeds in a two-step process: (i) threshold the waveforms to detect putative spikes and (ii) cluster the waveforms into single units (neurons). We extend previous Bayesian nonparamet- ric models of neural spiking to jointly detect and cluster neurons using a Gamma process model. Importantly, we develop an online approximate inference scheme enabling real-time analysis, with performance exceeding the previous state-of-the- art. Via exploratory data analysis\u2014using data with partial ground truth as well as two novel data sets\u2014we find several features of our model collectively contribute to our improved performance including: (i) accounting for colored noise, (ii) de- tecting overlapping spikes, (iii) tracking waveform dynamics, and (iv) using mul- tiple channels. We hope to enable novel experiments simultaneously measuring many thousands of neurons and possibly adapting stimuli dynamically to probe ever deeper into the mysteries of the brain.", "full_text": "Real-Time Inference for a Gamma Process\n\nModel of Neural Spiking\n\n1David Carlson, 2Vinayak Rao, 2Joshua Vogelstein, 1Lawrence Carin\n\n1Electrical and Computer Engineering Department, Duke University\n\n2Statistics Department, Duke University\n\n{dec18,lcarin}@duke.edu, {var11,jovo}@stat.duke.edu\n\nAbstract\n\nWith simultaneous measurements from ever increasing populations of neurons,\nthere is a growing need for sophisticated tools to recover signals from individual\nneurons. 
In electrophysiology experiments, this classically proceeds in a two-step process: (i) threshold the waveforms to detect putative spikes and (ii) cluster the waveforms into single units (neurons). We extend previous Bayesian nonparametric models of neural spiking to jointly detect and cluster neurons using a Gamma process model. Importantly, we develop an online approximate inference scheme enabling real-time analysis, with performance exceeding the previous state-of-the-art. Via exploratory data analysis, using data with partial ground truth as well as two novel data sets, we find several features of our model collectively contribute to our improved performance, including: (i) accounting for colored noise, (ii) detecting overlapping spikes, (iii) tracking waveform dynamics, and (iv) using multiple channels. We hope to enable novel experiments simultaneously measuring many thousands of neurons and possibly adapting stimuli dynamically to probe ever deeper into the mysteries of the brain.

1 Introduction

The recent heightened interest in understanding the brain calls for the development of technologies that will advance our understanding of neuroscience. Crucial for this endeavor is the advancement of our ability to understand the dynamics of the brain, via the measurement of large populations of neural activity at the single neuron level. Such reverse engineering efforts benefit from real-time decoding of neural activity, to facilitate effectively adapting the probing stimuli. Regardless of the experimental apparatus used (e.g., electrodes or calcium imaging), real-time decoding of individual neuron responses requires identifying and labeling individual spikes in recordings from large populations. In other words, real-time decoding requires real-time spike sorting.

Automatic spike sorting methods are continually evolving to deal with more sophisticated experiments.
Most recently, several methods have been proposed to (i) learn the number of separable neurons on each electrode or "multi-trode" [1, 2], or (ii) operate online to resolve overlapping spikes from multiple neurons [3]. To our knowledge, no method to date is able to simultaneously address both of these challenges.

We develop a nonparametric Bayesian continuous-time generative model of population activity. Our model explains the continuous output of each neuron by a latent marked Poisson process, with the "marks" characterizing the shape of each spike. Previous efforts to address overlapping spiking often assume a fixed kernel for each waveform, but joint intracellular and extracellular recordings clearly indicate that this assumption is false (see Figure 3c). Thus, we assume that the statistics of the marks are time-varying. We use the framework of completely random measures to infer how many of a potentially infinite number of neurons (or single units) are responsible for the observed data, simultaneously characterizing the spike times and waveforms of these neurons.

We describe an intuitive discrete-time approximation to the above infinite-dimensional continuous-time stochastic process, then develop an online variational Bayesian inference algorithm for this model. Via numerical simulations, we demonstrate that our inference procedure improves over the previous state-of-the-art, even though we allow the other methods to use the entire dataset for training, whereas we learn online. Moreover, we demonstrate that we can effectively track time-varying changes in waveform, and detect overlapping spikes. Indeed, it seems that the false positive detections from our approach have first order statistics indistinguishable from the true positives, suggesting that second-order methods may be required to reduce the false positive rate (i.e., template methods may be inadequate).
Our work therefore suggests that further improvements in real-time decoding of activity may be most effective if directed at simultaneous real-time spike sorting and decoding. To facilitate such developments and support reproducible research, all code and data associated with this work are provided in the Supplementary Materials.

2 Model

Our data is a time-series of multielectrode recordings X ≡ (x_1, ..., x_T), and consists of T recordings from M channels. As in usual measurement systems, the recording times lie on a regular grid with interval length Δ, and x_t ∈ R^M for all t. Underlying these observations is a continuous-time electrical signal driven by an unknown number of neurons. Each neuron generates a continuous-time voltage trace, and the outputs of all neurons are superimposed and discretely sampled to produce the recordings X. At a high level, in §2.1 we model the continuous-time output of each neuron as a series of idealized Poisson events smoothed with appropriate kernels, while §2.2 uses the Gamma process to develop a nonparametric prior for an entire population. §2.3 then describes a discrete-time approximation based on the Bernoulli approximation to the Poisson process. For conceptual clarity, we restrict ourselves to single channel recordings until §2.4, where we describe the complete model for multichannel data.

2.1 Modeling the continuous-time output of a single neuron

There is a rich literature characterizing the spiking activity of a single neuron [4], accounting in detail for factors like non-stationarity, refractoriness and spike waveform. We however make a number of simplifying assumptions (some of which we later relax). First, we model the spiking activity of each neuron as stationary and memoryless, so that its set of spike times is distributed as a homogeneous Poisson process (PP).
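For intuition, a homogeneous Poisson spike train can be simulated by accumulating exponentially distributed inter-spike intervals. The sketch below is our own illustration (function and parameter names are ours, not from the paper), making the PP(λ_i) assumption concrete:

```python
import random

def simulate_spike_times(rate_hz, duration_s, seed=0):
    """Draw spike times from a homogeneous Poisson process T_i ~ PP(rate_hz)
    by accumulating exponentially distributed inter-spike intervals."""
    rng = random.Random(seed)
    times, t = [], 0.0
    while True:
        t += rng.expovariate(rate_hz)  # inter-spike interval with mean 1/rate_hz
        if t >= duration_s:
            return times
        times.append(t)

spikes = simulate_spike_times(rate_hz=10.0, duration_s=60.0)
# Expected spike count is rate_hz * duration_s = 600; individual draws vary around it.
```

The memoryless property of the exponential distribution is exactly what makes the resulting spike train stationary and memoryless, as the model assumes.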
We model the neurons themselves as heterogeneous, with the ith neuron having an (unknown) firing rate λ_i. Call the ordered set of spike times of the ith neuron T_i = (τ_i1, τ_i2, ...); then the time between successive elements of T_i is exponentially distributed with mean 1/λ_i. We write this as T_i ~ PP(λ_i).

The actual electrical output of a neuron is not binary; instead each spiking event is a smooth perturbation in voltage about a resting state. This perturbation forms the shape of the spike, with the spike shapes varying across neurons as well as across different spikes of the same neuron. However, each neuron has its own characteristic distribution over shapes, and we let θ*_i ∈ Θ parametrize this distribution for neuron i. Whenever this neuron emits a spike, a new shape is drawn independently from the corresponding distribution. This waveform is then offset to the time of the spike, and contributes to the voltage trace associated with that spike.

The complete recording from the neuron is the superposition of all these spike waveforms plus noise. Rather than treating the noise as white, as is common in the literature [5], we allow it to exhibit temporal correlation, recognizing that the 'noise' is in actual fact background neural activity. We model it as a realization of a Gaussian process (GP) [6], with the covariance kernel K of the GP determining the temporal structure. We use an exponential kernel, modeling the noise as Markov.

We model each spike shape as a weighted superposition of a dictionary of K basis functions d(t) ≡ (d_1(t), ..., d_K(t))^T. The dictionary elements are shared across all neurons, and each is a real-valued function of time, i.e., d_k ∈ L2. Each spike time τ_ij is associated with a random K-dimensional weight vector y*_ij ≡ (y*_ij1, ..., y*_ijK)^T, and the shape of this spike at time t is given by the weighted sum Σ_{k=1..K} y*_ijk d_k(t − τ_ij). We assume y*_ij ~ N_K(μ*_i, Σ*_i), indicating a K-dimensional Gaussian distribution with mean and covariance given by (μ*_i, Σ*_i); we let θ*_i ≡ (μ*_i, Σ*_i). Then, at any time t, the output of neuron i is x_i(t) = Σ_{j=1..|T_i|} Σ_{k=1..K} y*_ijk d_k(t − τ_ij).

The total signal received by any electrode is the superposition of the outputs of all neurons. Assume for the moment there are N neurons, and define T ≡ ∪_{i∈[N]} T_i as the (ordered) union of the spike times of all neurons. Let τ_l ∈ T indicate the time of the lth overall spike, whereas τ_ij ∈ T_i is the time of the jth spike of neuron i. This defines a pair of mappings ν and p, with τ_l = τ_{ν_l p_l}. In words, ν_l ∈ [N] is the neuron to which the lth element of T belongs, while p_l indexes this spike in the spike train T_{ν_l}. Let θ_l ≡ (μ_l, Σ_l) be the neuron parameter associated with spike l, so that θ_l = θ*_{ν_l}. Finally, define y_l ≡ (y_l1, ..., y_lK)^T ≡ y*_{ν_l p_l} as the weight vector of spike τ_l. Then, we have that

x(t) = Σ_{i∈[N]} x_i(t) = Σ_{l∈[|T|]} Σ_{k∈[K]} y_lk d_k(t − τ_l), where y_l ~ N_K(μ_l, Σ_l). (1)

From the superposition property of the Poisson process [7], the overall spiking activity T is Poisson with rate Λ = Σ_{i∈[N]} λ_i. Each event τ_l ∈ T has a pair of labels: its neuron parameter θ_l ≡ (μ_l, Σ_l), and y_l, the weight-vector characterizing the spike shape. We view these weight-vectors as the "marks" of a marked Poisson process T. From the properties of the Poisson process, we have that the marks θ_l are drawn i.i.d.
from a probability measure G(dθ) = (1/Λ) Σ_{i∈[N]} λ_i δ_{θ*_i}. With probability one, the neurons have distinct parameters, so that the mark θ_l identifies the neuron which produced spike l: G(θ_l = θ*_i) = P(ν_l = i) = λ_i/Λ. Given θ_l, y_l is distributed as in Eq. (1). The output waveform x(t) is then a linear functional of this marked Poisson process.

2.2 A nonparametric model of population activity

In practice, the number of neurons driving the recorded activity is unknown. We do not wish to bound this number a priori; moreover, we expect this number to increase as we record over longer intervals. This suggests a nonparametric Bayesian approach: allow the total number of underlying neurons to be infinite. Over any finite interval, only a finite subset of these will be active, and typically, these dominate spiking activity over that interval. This elegant and flexible modeling approach allows the data to suggest how many neurons are active, and has already proved successful in neuroscience applications [8]. We use the framework of completely random measures (CRMs) [9] to model our data. CRMs have been well studied in the Bayesian nonparametrics community, and there is a wealth of literature on theoretical properties as well as posterior computation; see e.g. [10, 11, 12]. Recalling that each neuron is characterized by a pair of parameters (λ_i, θ*_i), we map the infinite collection of pairs {(λ_i, θ*_i)} to a random measure Λ(·) on Θ: Λ(dθ) = Σ_{i=1..∞} λ_i δ_{θ*_i}.

For a CRM, the distribution over measures is induced by distributions over the infinite sequence of weights and the infinite sequence of their locations. The weights λ_i are the jumps of a Lévy process [13], and their distribution is characterized by a Lévy measure ρ(λ). The locations θ*_i are drawn i.i.d.
from a base probability measure H(θ*). As is typical, we assume these to be independent.

We set the Lévy measure to ρ(λ) = αλ^{−1} exp(−λ), resulting in a CRM called the Gamma process (ΓP) [14]. The Gamma process has the convenient property that the total rate Λ ≡ Λ(Θ) = Σ_{i=1..∞} λ_i is Gamma distributed (and thus conjugate to the Poisson process prior on T). The Gamma process is also closely connected with the Dirichlet process [15], which will prove useful later on. To complete the specification of the Gamma process, we set H(θ*) to the conjugate normal-Wishart distribution.

It is easy to directly specify the resulting continuous-time model; we provide the equations in the Supplementary Material. However, it is more convenient to represent the model using the marked Poisson process of Eq. (1). There, the overall process T is a rate-Λ Poisson process, and under a Gamma process prior, Λ is Gamma(α, 1) distributed [15]. The labels θ_l assigning events to neurons are drawn i.i.d. from a normalized Gamma process: G(dθ) = (1/Λ) Σ_{i=1..∞} λ_i δ_{θ*_i}.

G(dθ) is a random probability measure (RPM) called a normalized random measure [10]. Crucially, a normalized Gamma process is the Dirichlet process (DP) [15], so that the spike parameters θ are i.i.d. draws from a DP-distributed RPM. For spike l, the shape vector is drawn from a normal with parameters (μ_l, Σ_l): these are thus draws from a DP mixture (DPM) of Gaussians [16].

We can exploit the connection with the DP to integrate out the infinite-dimensional measure G(·) (and thus Λ(·)), and assign spikes to neurons via the so-called Chinese restaurant process (CRP) [17]. Under this scheme, the lth spike is assigned the same parameter as an earlier spike with probability proportional to the number of earlier spikes having that parameter.
It is assigned a new parameter (and thus, a new neuron is observed) with probability proportional to α. Letting C_t be the number of neurons observed until time t, and T^t_i = T_i ∩ [0, t) be the times of spikes produced by neuron i before time t, we then have for spike l at time t = τ_l:

θ_l = θ*_{ν_l}, where P(ν_l = i) ∝ |T^t_i| for i ∈ [C_t], and P(ν_l = i) ∝ α for i = C_t + 1. (2)

This marginalization property of the DP allows us to integrate out the infinite-dimensional rate vector Λ(·), and sequentially assign spikes to neurons based on the assignments of earlier spikes. This requires one last property: for the Gamma process, the RPM G(·) is independent of the total mass Λ. Consequently, the clustering of spikes (determined by G(·)) is independent of the rate Λ at which they are produced. We then have the following model:

T ~ PP(Λ), where Λ ~ Gamma(α, 1), (3a)
y_l ~ N_K(μ_l, Σ_l), where (μ_l, Σ_l) ~ CRP(α, H(·)), l ∈ [|T|], (3b)
x(t) = Σ_{l∈[|T|]} Σ_{k∈[K]} y_lk d_k(t − τ_l) + ε_t, where ε ~ GP(0, K). (3c)

2.3 A discrete-time approximation

The previous subsections modeled the continuous-time voltage output of a neural population. Our data, on the other hand, consists of recordings at a discrete set of times. While it is possible to make inferences about the continuous-time process underlying these discrete recordings, in this paper we restrict ourselves to the discrete case. The marked Poisson process characterization of Eq. (3) leads to a simple discrete-time approximation of our model.

Recall first the Bernoulli approximation to the Poisson process: a sample from a Poisson process with rate Λ can be approximated by discretizing time at a granularity Δ, and assigning each bin an event independently with probability ΛΔ (the accuracy of the approximation increasing as Δ tends to 0).
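The Bernoulli approximation above is straightforward to sketch in code. The following toy is our own illustration (not the paper's implementation): each bin of width Δ is flagged as containing an event independently with probability ΛΔ.

```python
import random

def bernoulli_approximation(rate, duration, delta, seed=0):
    """Approximate a rate-`rate` Poisson process on [0, duration] by flagging
    each bin of width `delta` as an event with probability rate * delta."""
    rng = random.Random(seed)
    n_bins = int(duration / delta)
    p = rate * delta
    assert p < 1.0, "delta must be small enough that rate * delta < 1"
    return [rng.random() < p for _ in range(n_bins)]

z = bernoulli_approximation(rate=30.0, duration=10.0, delta=1e-3)
# E[sum(z)] = rate * duration = 300; the approximation sharpens as delta -> 0.
```

As Δ shrinks, the binomial count of flagged bins converges in distribution to the Poisson count, which is the sense in which the approximation becomes exact.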
To approximate the marked Poisson process T, all that is additionally required is to assign marks θ_l and y_l to each event in the Bernoulli approximation. Following Eqs. (3b) and (3c), the θ_l's are distributed according to a Chinese restaurant process, while each y_l is drawn from a normal distribution parametrized by the corresponding θ_l. We discretize the elements of the dictionary as well, yielding discrete dictionary elements d̃_k,: = (d̃_k,1, ..., d̃_k,L)^T. These form the rows of a K × L matrix D̃ (we call its columns d̃_:,h). The shape of the jth spike is now a vector of length L, and for a weight vector y, is given by D̃^T y.

We can simplify notation a little for the discrete-time model. Let t index time-bins (so that for an observation interval of length T, t ∈ [T/Δ]). We use tildes for variables indexed by bin-position. Thus, ν̃_t and θ̃_t are the neuron and neuron parameter associated with time bin t, and ỹ_t is its weight-vector. Let the binary variable z̃_t indicate whether or not a spike is present in time bin t (recall that z̃_t ~ Bernoulli(ΛΔ)). If there is no spike associated with bin t, then we ignore the marks θ̃_t and ỹ_t. Thus the output at time t is x_t = Σ_{h∈[L]} z̃_{t−h+1} d̃^T_{:,h} ỹ_{t−h+1} + ε_t. Note that the noise ε_t is now a discrete-time Markov Gaussian process. Let a and r_t be the decay and innovation of the resulting autoregressive (AR) process, so that ε_{t+1} = a ε_t + r_t.

2.4 Correlations in time and across electrodes

So far, for simplicity, we restricted our model to recordings from a single channel. We now describe the full model we use in experiments with multichannel recordings. We let every spike affect the recordings at all channels, with the spike shape varying across channels. For spike l in channel m, call the weight-vector y^m_l.
All these vectors must be correlated as they correspond to the same spike; we do this simply by concatenating the set of vectors into a single MK-element vector y_l = (y^1_l; ...; y^M_l), and modeling this as a multivariate normal. In principle, one might expect the associated covariance matrix to possess a block structure (corresponding to the subvector associated with each channel); however, rather than building this into the model, we allow the data to inform us about any such structure.

We also relax the requirement that the parameters θ* of each neuron remain constant, and instead allow μ*, the mean of the weight-vector distribution, to evolve with time (we keep the covariance parameter Σ*_i fixed, however). Such flexibility can capture effects like changing cell characteristics or moving electrodes. Like the noise term, we model the time-evolution of this quantity as a realization of a Markov Gaussian process; again, in discrete-time, this corresponds to a simple first-order AR process. With B ∈ R^{K×K} the transition matrix, and r_t ∈ R^K independent Gaussian innovations, we have μ*_{t+1} = B μ*_t + r_t. Where we previously had a DP mixture of Gaussians, we now have a DP mixture of GPs. Each neuron is now associated with a vector-valued function θ*(·), rather than a constant. When a spike at time τ_l is assigned to neuron i, it is assigned a weight-vector y_l drawn from a Gaussian with mean μ*_i(τ_l). Algorithm 1 in the Supplementary Material summarizes the full generative mechanism for the full discrete-time model.

3 Inference

There exists a vast literature on computational approaches to posterior inference for Bayesian nonparametric models, especially so for models based on the DP. Traditional approaches are sampling-based, typically involving Markov chain Monte Carlo techniques (see e.g.
[18, 19]), and recently there has also been work on constructing deterministic approximations to the intractable posterior (e.g. [20, 21]). Our problem is complicated by two additional factors. The first is the convolutional nature of our observation process, where at each time we observe a function of the previous observations drawn from the DPMM. This is in contrast to the usual situation where one directly observes the DPMM outputs themselves. The second complication is a computational requirement: typical inference schemes are batch methods that are slow and computationally expensive. Our ultimate goal, on the other hand, is to perform inference in real time, making these approaches unsuitable. Instead, we develop an online algorithm for posterior inference. Our algorithm is inspired by the sequential update and greedy search (SUGS) algorithm of [22], though that work was concerned with the usual case of i.i.d. observations from a DPMM. We generalize SUGS to our observation process, also accounting for the time-evolution of the cluster parameters and correlated noise.

Below, we describe a single iteration of our algorithm for the case of a single electrode; generalizing to the multielectrode case is straightforward. At each time t, our algorithm maintains the set of times of the spikes it has inferred from the observations so far. It also maintains the identities of the neurons that it assigned each of these spikes to, as well as the weight vectors determining the shapes of the associated spike waveforms. We indicate these point estimates with the hat operator, so, for example, T̂^t_i is the set of estimated spike times before time t assigned to neuron i. In addition to these point estimates, the algorithm also keeps a set of posterior distributions q_it(θ*_i), where i spans the set of neurons seen so far (i.e., i ∈ [Ĉ_t]).
For each i, q_it(θ*_i) approximates the distribution over the parameters θ*_i ≡ (μ*_i, Σ*_i) of neuron i given the observations until time t.

Having identified the time and shape of spikes from earlier times, we can calculate their contribution to the recordings x^L_t ≡ (x_t, ..., x_{t+L−1})^T. Recalling that the basis functions D, and thus all spike waveforms, span L time bins, the residual at time t + t1 is then given by x̄_{t+t1} = x_{t+t1} − Σ_{h∈[L−t1]} ẑ_{t−h} D̃ ŷ_{t−h} (at time t, for t1 > 0, we define ẑ_{t+t1} = 0). We treat the residual x̄_t = (x̄_t, ..., x̄_{t+L})^T as an observation from a DP mixture model, and use this to make hard decisions about whether or not it was produced by an underlying spike, what neuron that spike belongs to (one of the earlier neurons or a new neuron), and what the shape of the associated spike waveform is. The latter is used to calculate q_{i,t+1}(θ*_i), the new distribution over neuron parameters at time t + 1. Our algorithm proceeds recursively in this manner.

For the first step, we use Bayes' rule to decide whether there is a spike underlying the residual:

P(z̃_t = 1 | x̄_t) ∝ Σ_{i∈[Ĉ_t+1]} P(x̄_t, ν_t = i | z̃_t = 1) P(z̃_t = 1). (4)

Here, P(x̄_t | ν_t = i, z̃_t = 1) = ∫_Θ P(x̄_t | θ_t) q_it(θ_t) dθ_t, while P(ν_t = i | z̃_t = 1) follows from the CRP update rule (equation (2)). P(x̄_t | θ_t) is just the normal distribution, while we restrict q_it(·) to the family of normal-Wishart distributions. We can then evaluate the integral, and then the summation of (4), to approximate P(z̃_t = 1 | x̄_t). If this exceeds a threshold of 0.5, we decide that there is a spike present at time t; otherwise, we set z̃_t = 0.
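As a caricature of this decision rule (our simplification, not the paper's code): in one dimension, with per-cluster predictive distributions standing in for the integral ∫_Θ P(x̄_t|θ_t) q_it(θ_t) dθ_t, and cluster counts supplying the CRP weights, the posterior spike probability can be computed and thresholded at 0.5. All numbers and the broad new-cluster predictive are made up for illustration.

```python
import math

def spike_posterior(x, clusters, alpha, p_spike, noise_std=1.0):
    """One-dimensional caricature of the detection rule in Eq. (4).
    `clusters` maps cluster id -> (count, predictive_mean, predictive_std).
    The CRP prior weights clusters by their counts (and a new cluster by alpha);
    the evidence under z=1 is compared against the noise-only evidence."""
    def normal_pdf(v, mu, sd):
        return math.exp(-0.5 * ((v - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

    total = sum(c for c, _, _ in clusters.values()) + alpha
    # Marginal likelihood under "spike present": sum over cluster assignments.
    like_spike = sum((c / total) * normal_pdf(x, mu, sd)
                     for c, mu, sd in clusters.values())
    like_spike += (alpha / total) * normal_pdf(x, 0.0, 5.0)  # broad new-cluster predictive
    like_noise = normal_pdf(x, 0.0, noise_std)  # "no spike": residual is just noise
    num = p_spike * like_spike
    return num / (num + (1 - p_spike) * like_noise)

clusters = {0: (40, 4.0, 0.5), 1: (10, -3.0, 0.5)}
p = spike_posterior(4.1, clusters, alpha=0.1, p_spike=0.05)
detected = p > 0.5  # the paper thresholds this posterior at 0.5
```

A residual near an established cluster's predictive mean yields a posterior near one; a residual consistent with the noise model yields a posterior near zero.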
Observe that making this decision involves marginalizing over all possible cluster assignments ν_t and all values of the weight vector y_t. Having made this decision, we collapse these posterior distributions to point estimates ν̂_t and ŷ_t, equal to their MAP values.

In the event of a spike (ẑ_t = 1), we use these point estimates to update the posterior distribution over the parameters of cluster ν̂_t, to obtain q_{i,t+1}(·) from q_{i,t}(·); this is straightforward because of conjugacy. We follow this up with an additional update step for the distributions of the means of all clusters: this is to account for the AR evolution of the cluster means. We use a variational update to keep q_{i,t+1}(·) in the normal-Wishart family. Finally, we take a stochastic gradient step to update any hyperparameters we wish to learn. We provide all details in the Supplementary Material.

4 Experiments

In the following, we refer to our algorithm as OPASS (Online gamma Process Autoregressive Spike Sorting).

Data: We used two different datasets to demonstrate the efficacy of OPASS. First, the ever popular, publicly available HC1 dataset as described in [23]. We used the dataset d533101, which consisted of an extracellular tetrode and a single intracellular electrode. The recording was made simultaneously on all electrodes and was set up such that the cell with the intracellular electrode was also recorded on the extracellular array implanted in the hippocampus of an anesthetized rat. The intracellular recording is relatively noiseless and gives nearly certain firing times of the intracellular neuron. The extracellular recording contains the spike waveforms from the intracellular neuron as well as an unknown number of additional neurons. The data is a 4-minute recording at a 10 kHz sampling rate.

The second dataset comes from novel NeuroNexus devices implanted in the rat motor cortex. The data was recorded at 32.5 kHz in freely-moving rats. The first device we consider is a set of 3 channels of data (Fig. 7a).
The neighboring electrode sites in these devices have 30 μm between electrode edges and 60 μm between electrode centers. These devices are close enough that a locally-firing neuron could appear on multiple electrode sites [2], so neighboring channels warrant joint processing. The second device has 8 channels (see Fig. 10a), but is otherwise similar to the first. We used a 15-minute segment of this data for our experiments.

For both datasets, we preprocessed with a high-pass filter at 800 Hz using a fourth-order Butterworth filter before we analyzed the time series. To define D, we used the first five principal components of all spikes detected with a threshold (three times the standard deviation of the noise above the mean) in the first five seconds. The noise standard deviation was estimated both over the first five seconds of the recording as well as the entire recording, and the estimates were nearly identical. Our results were also robust to minor variations in the choice of the number of principal components. The autoregressive parameters were estimated by using lag-1 autocorrelation on the same set of data. For the multichannel algorithms we estimate the covariance between channels and normalize by our noise variance estimate.

Each algorithm gives a clustering of the detected spikes. In this dataset, we only have a partial ground truth, so we can only verify accuracy for the neuron with the intracellular (IC) recording. We define a detected spike to be an IC spike if the IC recording has a spike within 0.5 milliseconds (ms) of the detected spike in the extracellular recording. We define the cluster with the greatest number of intracellular spikes as the "IC cluster".
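To make the preprocessing and baseline detection concrete, here is a minimal stand-in sketch (ours, not the paper's MATLAB code, and deliberately simplified): subtracting a centered moving average plays the role of the fourth-order 800 Hz Butterworth high-pass, and detection thresholds at k noise standard deviations above the mean, as the offline comparison methods do with k = 3.

```python
def highpass(x, window):
    """Crude high-pass: subtract a centered moving average from each sample.
    A stand-in for the fourth-order 800 Hz Butterworth filter in the paper."""
    n = len(x)
    half = window // 2
    prefix = [0.0]  # prefix sums give O(n) moving averages
    for v in x:
        prefix.append(prefix[-1] + v)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        out.append(x[i] - (prefix[hi] - prefix[lo]) / (hi - lo))
    return out

def threshold_detect(x, noise_std, k=3.0):
    """Indices where the signal exceeds its mean by k noise standard deviations,
    the detection rule used by the offline comparison methods."""
    mu = sum(x) / len(x)
    return [i for i, v in enumerate(x) if v > mu + k * noise_std]
```

The moving-average subtraction removes slow drift while leaving brief spike-like transients nearly intact, which is the qualitative effect the Butterworth high-pass achieves.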
We refer to these data as "partial ground truth data", because we know the ground truth spike times for one of the neurons, but not all the others.

Algorithm Comparisons: We compare a number of variants of OPASS, as well as several previously proposed methods, as described below. The vanilla version of OPASS operates on a single channel with colored noise. When using multiple channels, we prepend an "M" to obtain MOPASS. When we model the mean of the waveforms as an auto-regressive process, we append an "R" to obtain OPASS-R. We compare these variants of OPASS to Gaussian mixture models and k-means [5] with N components (GMM-N and K-N, respectively), where N indicates the number of components. We compare with a Dirichlet Process Mixture Model (DPMM) [8] as well as the Focused Mixture Model (FMM) [24], a recently proposed Bayesian generative model with state-of-the-art performance. Finally, we compare with OSORT [25], an online sorting algorithm. Only the OPASS and OSORT methods are online, as we desired to compare to the state-of-the-art batch algorithms which use all the data. Note that the OPASS algorithms learned D from the first five seconds of data, whereas all other algorithms used a dictionary learned from the entire data set.

The single-channel experiments were all run on channel 2 (the results were nearly identical for all channels). The spike detections for the offline methods used a threshold of three times the noise standard deviation [5] (unless stated otherwise), and windowed at a size L = 30. For multichannel data, we concatenated the M channels for each waveform to obtain an M × L-dimensional vector. The online algorithms were all run with weakly informative parameters. For the normal-Wishart, we used μ_0 = 0, a prior scale of 0.1, W = 10I, and ν = 1 (I is the identity matrix). The AR process corresponded to a GP with length-scale 30 seconds, and variance 0.1.
α was set to 0.1. The parameters were insensitive to minor changes. Running time in unoptimized MATLAB code for 4 minutes of data was 31 seconds for a single channel and 3 minutes for all 4 channels on a 3.2 GHz Intel Core i5 machine with 6 GB of memory (see Supplementary Fig. 11 for details).

Performance on partial ground truth data: The main empirical result of our contribution is that all variants of OPASS detect more true positives with fewer false positives than any of the other algorithms on the partial ground truth data (see Fig. 1). The only comparable result is OSORT; however, the OSORT algorithm split the IC cluster into 2 different clusters and we combined the two clusters into one by hand. Our improved sensitivity and specificity is despite the fact that OPASS is fully online, whereas all the algorithms (besides OSORT) that we compare to are batch algorithms using all data for all spikes. Note that all the comparison algorithms pre-process the data via thresholding at some constant (which we set to three standard deviations above the mean). To assess the extent to which the performance of OPASS is due to not thresholding, we implement FAKE-OPASS, which thresholds the data. Indeed, FAKE-OPASS's performance is much like that of the batch algorithms. To get uncertainty estimates, we split the data into ten random two-minute segments and repeat this analysis; the results are qualitatively similar.

One possible explanation for the relatively poor performance of the batch algorithms as compared to OPASS is a poor choice of the important, but often overlooked, threshold parameter. The right panel of Fig. 1 shows the receiver operating characteristic (ROC) curve for the k-means algorithms as well as OPASS and MOPASS (where M indicates multichannel; see below for detail). Although we
Figure 1: OPASS achieves improved sensitivity and specificity over all competing methods on partial ground truth data. (a) True positive and false positive rates for all variants of OPASS and several competing algorithms. (b) ROC curves demonstrating that OPASS outperforms all competitor algorithms, regardless of threshold (• indicates learning Λ from the data).

Figure 2: OPASS detects multiple overlapping waveforms. (Top Left) The observed voltage (solid black), MAP waveform 1 (red), MAP waveform 2 (blue), and the waveform from their sum (dashed black). (Bottom Left) Residuals from the same example snippet, showing a clear improvement in residuals.

Although we typically run OPASS without tuning parameters, the prior on Λ sets the expected number of spikes, which we can vary in a kind of "empirical Bayes" strategy.
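Scoring each detector setting against the partial ground truth requires matching detected spike times to the known intracellular spike times; sweeping the detection threshold (or the prior) then traces out one ROC curve per method. A minimal sketch of the matching step follows; the greedy strategy and the ±tol tolerance are illustrative assumptions, not values from the paper.

```python
def match_spikes(detected, ground_truth, tol=10):
    """Greedily match detected spike times (in samples) to ground-truth
    times within +/- tol samples. Returns (true positives, false
    positives, misses). tol is an illustrative tolerance."""
    gt = sorted(ground_truth)
    used = [False] * len(gt)
    tp = fp = 0
    for t in sorted(detected):
        hit = None
        for i, g in enumerate(gt):
            if not used[i] and abs(t - g) <= tol:
                hit = i
                break
        if hit is None:
            fp += 1          # detection with no nearby true spike
        else:
            used[hit] = True
            tp += 1          # detection explained by a true spike
    return tp, fp, used.count(False)
```

Each (false positive rate, true positive rate) pair from one sweep setting contributes one point to the curves compared above.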
Indeed, the OPASS curves lie fully above the batch curves for all thresholds and priors, suggesting that regardless of which threshold one chooses for pre-processing, OPASS always does better on these data than all the competitor algorithms. Moreover, OPASS is able to infer the parameter Λ at a reasonable value; the inferred Λ yields the OPASS points in the left panel of Fig. 1 and the points along the curves in the right panel. These figures also reveal that using the correlated noise model greatly improves performance.

The above analysis suggests that OPASS's ability to detect signals more reliably than thresholding contributes to its success. In the following, we provide evidence suggesting how several of OPASS's key features are fundamental to this improvement.

Overlapping Spike Detection A putative reason for the improved sensitivity and specificity of OPASS over other algorithms is its ability to detect overlapping spikes. When spikes overlap, although the result can accurately be modeled as a linear sum in voltage space, the resulting waveform often does not appear in any cluster in PC space (see [1]). However, our online approach can readily find such overlapping spikes. Fig. 2 (top left panel) shows one of 135 examples where OPASS believed that multiple waveforms were overlapping. Note that even though the waveform peaks are approximately 1 ms apart, thresholding algorithms do not pick up these spikes, because they look different in PC space.

Indeed, by virtue of estimating the presence of multiple spikes, the residual squared error between the expected voltage and the observed voltage shrinks for this snippet (bottom left). The right panel of Fig. 2 shows the density of the residual errors for all putative overlapping spikes. The mass of this density is significantly smaller than the mass of the other scenarios.
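The linear-superposition view of overlapping spikes can be made concrete: a two-spike explanation of a snippet is just the time-shifted sum of two template waveforms, and its residual sum of squares can be compared against single-spike explanations. Everything below is an illustrative sketch; the templates, the shift, and the sampling assumptions are made up, not taken from the paper.

```python
import numpy as np

def shifted_sum(w1, w2, shift, length):
    """Explain a snippet as w1 plus w2 delayed by `shift` samples."""
    out = np.zeros(length)
    out[:len(w1)] += w1
    out[shift:shift + len(w2)] += w2
    return out

def rss(snippet, explanation):
    """Residual sum of squares between observed and explained voltage."""
    return float(np.sum((snippet - explanation) ** 2))

# Two made-up templates overlapping roughly 1 ms apart
# (30 samples at an assumed 30 kHz sampling rate).
L = 60
w1 = np.exp(-np.arange(30) / 5.0)          # illustrative waveform 1
w2 = -0.8 * np.exp(-np.arange(30) / 7.0)   # illustrative waveform 2
snippet = shifted_sum(w1, w2, 30, L)       # observed overlap

one_spike = np.zeros(L)
one_spike[:30] = w1                        # single-spike explanation
```

Here the two-spike explanation reproduces the overlap exactly, while the best single-spike explanation leaves the entire second waveform in the residual, mirroring the residual comparison in Fig. 2.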
Of the 135 pairs of overlapping spikes, 37 of those spikes came from the intracellular neuron. Thus, while detecting overlapping spikes seems to help, it does not fully explain the improvements over the competitor algorithms.

Time-Varying Waveform Adaptation As has been demonstrated previously [26], the waveform shape of a neuron may change over time. The mean waveform over time for the intracellular neuron is shown in Fig. 3a. Clearly, the mean waveform is changing over time. Moreover, these changes are reflected in the principal component space (Fig. 3b). We therefore compared the posterior means and variances of OPASS with those of OPASS-R, which models the mean of the dictionary weights via an auto-regressive process. Fig. 3c shows that the auto-regressive model for the mean dictionary weights yields a time-varying posterior (top), whereas the static prior yields a constant posterior mean with increasing posterior marginal variances (bottom). More precisely, the mean of the posterior standard deviations for the time-varying prior is about half of that for the static prior's posteriors. Indeed, OPASS-R yields 11 more true detections than OPASS.

Multielectrode Array OPASS achieved a heightened sensitivity by incorporating multiple channels (see the MOPASS point in Fig. 1). We further evaluated the impact of multiple channels using a three-channel NeuroNexus shank (Supp. Fig. 7a). In Fig. 4 we show the top two most prevalent waveforms from these data across the three electrodes. Had only the third electrode been used, these two waveforms would not be distinct (as evidenced by their substantial overlap in PC space upon using only the third channel in Fig. 7b). This suggests that borrowing strength across electrodes improves detection accuracy. Supplementary Fig. 10 shows a similar plot for the eight-channel data.

Figure 3: The IC waveform changes over time, which our posterior parameters track. (a) Mean IC waveforms over time. Each colored line represents the mean of the waveform averaged over 24 seconds, with color denoting the time interval. This neuron decreases in amplitude over the period of the recording. (b) The same waveforms plotted in PC space still capture the temporal variance. (c) The mean and standard deviation of the waveforms at three time points for the auto-regressive prior on the mean waveform (top) and the static prior (bottom). While the auto-regressive prior admits adaptation to the time-varying mean, the posterior of the static prior simply increases its variance.

Figure 4: Improving OPASS by incorporating multiple channels. The top two most prevalent waveforms from the NeuroNexus dataset with three channels. Note that the left panel has a waveform that appears on both channel 2 and channel 3, whereas the waveform in the right panel only appears in channel 3. If only channel 3 were used, it would be difficult to separate these waveforms.

5 Discussion

Our improved sensitivity and specificity seem to arise from multiple sources including (i) improved detection, (ii) accounting for correlated noise, (iii) capturing overlapping spikes, (iv) tracking waveform dynamics, and (v) utilizing multiple channels.
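As a toy illustration of ingredient (iv), tracking a drifting mean with a random-walk (AR) prior reduces to a simple Kalman-style update, whereas a static prior's posterior mean is just a running average that must inflate its variance to cover the drift. This is a schematic one-dimensional stand-in for the dictionary-weight model above; the drift and noise variances are made-up values.

```python
import numpy as np

def ar_track(observations, q=0.05, r=0.1):
    """Kalman filter for a random-walk latent mean m_t = m_{t-1} + noise.
    q: drift variance per step, r: observation variance (illustrative)."""
    m, p = 0.0, 1.0
    means = []
    for y in observations:
        p += q                  # predict: uncertainty grows with drift
        k = p / (p + r)         # Kalman gain
        m += k * (y - m)        # correct toward the new observation
        p *= (1.0 - k)
        means.append(m)
    return np.array(means)

def static_posterior_mean(observations):
    """Static prior: the posterior mean is essentially the running
    average, which lags far behind a drifting signal."""
    return np.cumsum(observations) / np.arange(1, len(observations) + 1)
```

On a linearly drifting amplitude, the AR tracker stays near the current value while the running average sits near the midpoint of the whole history, which is why the static prior can only respond by widening its posterior.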
While others have developed closely related Bayesian models for clustering [8, 27], deconvolution-based techniques [1], time-varying waveforms [26], or online methods [25, 3], we are the first, to our knowledge, to incorporate all of these.

An interesting implication of our work is that our errors may be irreconcilable using merely first-order methods (that only consider the mean waveform to detect and cluster). Supp. Fig. 8a shows that the mean waveforms of the true and false positives are essentially identical, suggesting that even in the full 30-dimensional space, excluding those waveforms from the intracellular cluster would be difficult. Projecting each waveform into the first two PCs is similarly suggestive, as the missed positives do not seem to lie in the cluster of the true positives (Supp. Fig. 8b). Thus, in future work, we will explore dynamic and multiscale dictionaries [28], as well as incorporate a richer history and stimulus dependence.

Acknowledgments
This research was supported in part by the Defense Advanced Research Projects Agency (DARPA), under the HIST program managed by Dr. Jack Judy.

References
[1] J W Pillow, J Shlens, E J Chichilnisky, and E P Simoncelli. A model-based spike sorting algorithm for removing correlation artifacts in multi-neuron recordings. PLoS ONE, 8(5):1-15, 2013.
[2] J S Prentice, J Homann, K D Simmons, G Tkacik, V Balasubramanian, and P C Nelson. Fast, scalable, Bayesian spike identification for multi-electrode arrays. PLoS ONE, 6(7):e19884, January 2011.
[3] F Franke, M Natora, C Boucsein, M H J Munk, and K Obermayer. An online spike detection and spike classification algorithm capable of instantaneous resolution of overlapping spikes. Journal of Computational Neuroscience, 29(1-2):127-148, August 2010.
[4] W Gerstner and W M Kistler. Spiking Neuron Models: Single Neurons, Populations, Plasticity. Cambridge University Press, 1 edition, August 2002.
[5] M S Lewicki. A review of methods for spike sorting: the detection and classification of neural action potentials. Network: Computation in Neural Systems, 1998.
[6] C E Rasmussen and C K I Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[7] J F C Kingman. Poisson Processes, volume 3 of Oxford Studies in Probability. The Clarendon Press, Oxford University Press, New York, 1993. Oxford Science Publications.
[8] F Wood and M J Black. A non-parametric Bayesian alternative to spike sorting. Journal of Neuroscience Methods, 173:1-12, 2008.
[9] J F C Kingman. Completely random measures. Pacific Journal of Mathematics, 21(1):59-78, 1967.
[10] L F James, A Lijoi, and I Pruenster. Posterior analysis for normalized random measures with independent increments. Scandinavian Journal of Statistics, 36:76-97, 2009.
[11] N L Hjort. Nonparametric Bayes estimators based on beta processes in models for life history data. Annals of Statistics, 18(3):1259-1294, 1990.
[12] R Thibaux and M I Jordan. Hierarchical beta processes and the Indian buffet process. In Proceedings of the International Workshop on Artificial Intelligence and Statistics, volume 11, 2007.
[13] K Sato. Levy Processes and Infinitely Divisible Distributions. Cambridge University Press, 1990.
[14] D Applebaum. Levy Processes and Stochastic Calculus. Cambridge Studies in Advanced Mathematics. Cambridge University Press, 2004.
[15] T S Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209-230, 1973.
[16] A Y Lo. On a class of Bayesian nonparametric estimates: I. Density estimates. Annals of Statistics, 12(1):351-357, 1984.
[17] J Pitman. Combinatorial stochastic processes. Technical Report 621, Department of Statistics, University of California at Berkeley, 2002. Lecture notes for the St. Flour Summer School.
[18] R M Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249-265, 2000.
[19] H Ishwaran and L F James. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96(453):161-173, 2001.
[20] D M Blei and M I Jordan. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1):121-144, 2006.
[21] T P Minka and Z Ghahramani. Expectation propagation for infinite mixtures. Presented at the NIPS 2003 Workshop on Nonparametric Bayesian Methods and Infinite Models, 2003.
[22] L Wang and D B Dunson. Fast Bayesian inference in Dirichlet process mixture models. Journal of Computational & Graphical Statistics, 2009.
[23] D A Henze, Z Borhegyi, J Csicsvari, A Mamiya, K D Harris, and G Buzsaki. Intracellular features predicted by extracellular recordings in the hippocampus in vivo. Journal of Neurophysiology, 2000.
[24] D E Carlson, Q Wu, W Lian, M Zhou, C R Stoetzner, D Kipke, D Weber, J T Vogelstein, D B Dunson, and L Carin. Multichannel electrophysiological spike sorting via joint dictionary learning and mixture modeling. IEEE TBME, 2013.
[25] U Rutishauser, E M Schuman, and A N Mamelak. Online detection and sorting of extracellularly recorded action potentials in human medial temporal lobe recordings, in vivo. Journal of Neuroscience Methods, 2006.
[26] A Calabrese and L Paninski. Kalman filter mixture model for spike sorting of non-stationary data. Journal of Neuroscience Methods, 196(1):159-169, 2011.
[27] J Gasthaus, F D Wood, D Gorur, and Y W Teh. Dependent Dirichlet process spike sorting. Advances in Neural Information Processing Systems, 21:497-504, 2009.
[28] G Chen, M Iwen, S Chin, and M Maggioni. A fast multiscale framework for data in high-dimensions: Measure estimation, anomaly detection, and compressive measurements.
In VCIP, 2012 IEEE, 2012.