{"title": "Temporal alignment and latent Gaussian process factor inference in population spike trains", "book": "Advances in Neural Information Processing Systems", "page_first": 10445, "page_last": 10455, "abstract": "We introduce a novel scalable approach to identifying common latent structure in neural population spike-trains, which allows for variability both in the trajectory and in the rate of progression of the underlying computation. Our approach is based on shared latent Gaussian processes (GPs) which are combined linearly, as in the Gaussian Process Factor Analysis (GPFA) algorithm. We extend GPFA to handle unbinned spike-train data by incorporating a continuous time point-process likelihood model, achieving scalability with a sparse variational approximation. Shared variability is separated into terms that express condition dependence, as well as trial-to-trial variation in trajectories. Finally, we introduce a nested GP formulation to capture variability in the rate of evolution along the trajectory. We show that the new method learns to recover latent trajectories in synthetic data, and can accurately identify the trial-to-trial timing of movement-related parameters from motor cortical data without any supervision.", "full_text": "Temporal alignment and latent Gaussian process\n\nfactor inference in population spike trains\n\nLea Duncker & Maneesh Sahani\n\nGatsby Computational Neuroscience Unit\n\n{duncker,maneesh}@gatsby.ucl.ac.uk\n\nUniversity College London\n\nLondon, W1T 4JG\n\nAbstract\n\nWe introduce a novel scalable approach to identifying common latent structure in\nneural population spike-trains, which allows for variability both in the trajectory\nand in the rate of progression of the underlying computation. Our approach is\nbased on shared latent Gaussian processes (GPs) which are combined linearly, as\nin the Gaussian Process Factor Analysis (GPFA) algorithm. 
We extend GPFA to\nhandle unbinned spike-train data by incorporating a continuous time point-process\nlikelihood model, achieving scalability with a sparse variational approximation.\nShared variability is separated into terms that express condition dependence, as\nwell as trial-to-trial variation in trajectories. Finally, we introduce a nested GP\nformulation to capture variability in the rate of evolution along the trajectory. We\nshow that the new method learns to recover latent trajectories in synthetic data, and\ncan accurately identify the trial-to-trial timing of movement-related parameters\nfrom motor cortical data without any supervision.\n\n1\n\nIntroduction\n\nMany computations in the brain are thought to be implemented by the dynamical evolution of activity\ndistributed across large neural populations. As simultaneous recordings of population activity have\nbecome more common, the need has grown for analytic methods that can identify these dynamical\ncomputational variables from the data. Where the computation is tightly coupled to an externally\nmeasurable covariate \u2014 a stimulus or a movement, perhaps \u2014 such identi\ufb01cation is a simple matter\nof exploiting linear or non-linear regression to describe a population encoding or decoding model.\nHowever, when aspects of the computation are likely to re\ufb02ect internal mental states, which may vary\nsubstantially even when external covariates remain constant, the relevant variables must be uncovered\nfrom neural data alone, most typically using a latent variable model [1\u20136].\nHowever, most such methods fail to account properly for at least one key form of dynamical variability\n\u2014 trial-to-trial differences in the timing of the computation. Such differences may be re\ufb02ected in\nbehavioural variability, for instance in reaction times or movement durations [7], or in varying\nrelationships between external events and neural \ufb01ring, for instance in sensory onset latencies [8]. 
In some cases, manual alignment to salient external events or behavioural time-course may be used to reduce temporal misalignment [8, 9]. However, just as with variability in the trajectories themselves, temporal variations in purely internal states must ultimately be identified from neural data alone [10].

A particularly challenging problem is to build models that capture the variability both in the latent trajectories underlying spiking population activity, and in the time-course with which such trajectories are followed. Temporal misalignment in trajectories might be confused for variability in the trajectory itself, while genuine variability in that trajectory makes alignment more difficult. Indeed, previous work has considered these problems separately. Algorithms like dynamic time warping (DTW) [11] treat the time-series as observed and try to estimate an optimal alignment of each series. DTW and related approaches may suffer in settings where observed data are noisy, though recent work has begun to explore more robust [12-15] or probabilistic [16-18] alternatives for time-series alignment. Furthermore, these approaches generally assume a Gaussian noise model and often make assumptions that preclude extension to non-conjugate observation models. Only recently have studies considered pairwise alignment of univariate [19-21] or multivariate point-processes [22]. Overall, simultaneous inference and temporal alignment of latent trajectories from multivariate point-process observations, such as a set of spike-times of simultaneously recorded neurons, is a relatively unexplored area of research.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

In this work, we develop a novel method to jointly infer shared latent structure directly from the spiking activity of neural populations, allowing for (and identifying) trial-to-trial variations in the timing of the latent processes. We make two main contributions.

First, we extend Gaussian Process Factor Analysis (GPFA) [2], an algorithm that has been successfully applied to extracting time-varying low-dimensional latent structure from binned neural population activity on single trials, so that it directly models the point-process structure of spiking activity. We do so using a sparse variational approach, which both simplifies the adoption of a non-conjugate point-process likelihood and significantly improves on the scalability of the GPFA algorithm.

Second, we combine the sparse variational GPFA model with a second Gaussian process in a nested model architecture. We exploit shared latent structure across trials, or groups of trials, in order to disentangle variation in timing from variation in trajectories. This extension allows us to infer latent time-warping functions that align trials in a purely unsupervised manner, that is, using population spiking activity alone with no behavioural covariates. We apply our method to simulated data and show that we can more accurately recover firing rate estimates than related methods.
Using neural data from macaque monkeys performing a variable-delay centre-out reaching task, we demonstrate that the inferred alignment is behaviourally meaningful and predicts reaction times with high accuracy.

2 Gaussian Process Factor Analysis

Gaussian Process Factor Analysis (GPFA) is a method for inferring latent structure and single-trial trajectories in latent space that influence the firing of a population of neurons [2]. Temporal correlations in the high-dimensional neural population are modelled via a lower number of shared latent processes $x_k(\cdot)$, $k = 1, \dots, K$, which linearly relate to a high-dimensional signal $\boldsymbol{h}(\cdot) \in \mathbb{R}^N$ in neural space. Thus, inter-trial variability in neural firing that is shared across the population is modelled via the evolution of the latent processes on a given trial.

In the GPFA model, a Gaussian process (GP) prior is placed over each latent process $x_k(\cdot)$, specified via a mean function $\mu_k(\cdot)$ and a covariance function, or kernel, $\kappa_k(\cdot,\cdot)$. The extent and nature of the temporal correlations is specified by $\kappa_k(\cdot,\cdot)$ and governed via hyperparameters.

The classic GPFA model in [2] considers regularly sampled observations of $\boldsymbol{h}(t)$ that are corrupted by axis-aligned Gaussian noise. Recent work has aimed to extend GPFA to a Poisson observation model [23]. Here, $\boldsymbol{h}(t)$ is related to a piecewise-constant firing rate of a Poisson process whose counting process is observed as spike-counts falling into time bins of a fixed width.

We can summarise the GPFA generative model for observations $y_n^{(r)}(t_i)$ of neuron $n$ on trial $r$ in a general way by writing

$$x_k^{(r)}(\cdot) \sim \mathcal{GP}(\mu_k(\cdot), \kappa_k(\cdot,\cdot)) \quad \text{for } k = 1, \dots, K$$
$$h_n^{(r)}(\cdot) = \sum_{k=1}^K c_{n,k}\, x_k^{(r)}(\cdot) + d_n \quad \text{for } n = 1, \dots, N$$
$$y_n^{(r)}(t_i) \sim p\!\left(y_n^{(r)}(t_i) \,\middle|\, h_n^{(r)}(t_i)\right) \quad \text{for } i = 1, \dots, T \qquad (1)$$

The $c_{n,k}$ are weights for each latent and neuron that define a subspace mapping from low-dimensional latent space to high-dimensional neural space, and $d_n$ is a constant offset.

The widespread use of GPFA may be restricted by its poor scaling with time: since time is evaluated on an evenly spaced grid with $T$ points, GPFA requires building and inverting a $T \times T$ covariance matrix, leading to $\mathcal{O}(T^3)$ complexity. The intractability of performing exact inference in GPFA models with non-conjugate likelihoods adds further complexity on top of this [23, 24]. In the next section, we will outline how to improve the scalability of GPFA, irrespective of the choice of observation model, via the use of inducing points.

3 Sparse Variational Gaussian Process Factor Analysis (svGPFA)

The framework of sparse variational GP approximations [25] has helped to overcome difficulties associated with the scalability of GP methods to large sample sizes. It has since been applied to diverse problems in GP inference, contributing to improvements in the scalability of GP methods to large datasets and complex, potentially non-conjugate, applications [26-30].

A sparse variational extension of the GPFA model can be obtained by extending the work on additive signal decomposition in [26]. The main idea is to augment the model in (1) by introducing inducing points $\boldsymbol{u}_k$ for each latent process $k = 1, \dots, K$. The inducing points $\boldsymbol{u}_k$ represent function evaluations of the $k$th latent GP at $M_k$ input locations $\boldsymbol{z}_k$.
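To make the role of the inducing points concrete, here is a minimal numerical sketch of evaluating a latent GP through its inducing points at arbitrary, unbinned times. It assumes an exponentiated-quadratic kernel, and the inducing values and lengthscale are purely illustrative, not fitted quantities from the paper:

```python
import numpy as np

def rbf(a, b, ell=0.1):
    # exponentiated-quadratic kernel between two sets of scalar inputs
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

# M_k inducing locations z_k summarise the k-th latent GP on a trial
z = np.linspace(0.0, 1.0, 8)              # M_k = 8 inducing locations
u = np.sin(2 * np.pi * z)                 # illustrative inducing-point values
Kzz = rbf(z, z) + 1e-6 * np.eye(len(z))   # Gram matrix, with jitter

# conditional mean at arbitrary continuous times t:
#   mu(t) = k(t, z) Kzz^{-1} u
# the cost is cubic only in M_k, independent of the number of
# evaluation times (or, later, of the number of observed spikes)
t = np.linspace(0.0, 1.0, 500)
mu = rbf(t, z) @ np.linalg.solve(Kzz, u)
```

At the inducing locations themselves the conditional mean reproduces the inducing values; between them, it interpolates according to the kernel.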
A joint prior over the process $x_k(\cdot)$ and its inducing points can hence be written as

$$p(\boldsymbol{u}_k \mid \boldsymbol{z}_k) = \mathcal{N}\!\left(\boldsymbol{u}_k \,\middle|\, \boldsymbol{0},\, K^{(k)}_{zz}\right) \qquad p(x_k(\cdot) \mid \boldsymbol{u}_k) = \mathcal{GP}(\tilde{\mu}_k(\cdot), \tilde{\kappa}_k(\cdot,\cdot)) \qquad (2)$$

where the GP prior over $x_k(\cdot)$ is now conditioned on the inducing points, with conditional mean and covariance function

$$\tilde{\mu}_k(t) = \boldsymbol{\kappa}_k(t, \boldsymbol{z}_k)\, K^{(k)\,-1}_{zz} \boldsymbol{u}_k \qquad \tilde{\kappa}_k(t, t') = \kappa_k(t, t') - \boldsymbol{\kappa}_k(t, \boldsymbol{z}_k)\, K^{(k)\,-1}_{zz} \boldsymbol{\kappa}_k(\boldsymbol{z}_k, t') \qquad (3)$$

Here, $\boldsymbol{\kappa}_k(\cdot, \boldsymbol{z}_k)$ is a vector-valued function taking a single input argument and consisting of evaluations of the covariance function $\kappa_k(\cdot,\cdot)$ at the input and the inducing point locations $\boldsymbol{z}_k$; $K^{(k)}_{zz}$ is the Gram matrix of $\kappa_k(\cdot,\cdot)$ evaluated at the inducing point locations.

We follow [26] in choosing a factorised variational approximation for posterior inference of the form $q(\boldsymbol{u}_{1:K}, x_{1:K}) = \prod_{k=1}^K p(x_k \mid \boldsymbol{u}_k)\, q(\boldsymbol{u}_k)$, with Gaussian $q(\boldsymbol{u}_k) = \mathcal{N}(\boldsymbol{u}_k \mid \boldsymbol{m}_k, S_k)$. This choice of posterior approximation makes it possible to derive a variational lower bound to the marginal log-likelihood over the observed data $Y = \{\boldsymbol{y}^{(r)}_1, \dots, \boldsymbol{y}^{(r)}_N\}_{r=1}^R$ of the form

$$\log p(Y) \geq \sum_{r=1}^R \sum_{n=1}^N \mathbb{E}_{q(h^{(r)}_n)}\!\left[\log p(\boldsymbol{y}^{(r)}_n \mid h^{(r)}_n)\right] - \sum_{r=1}^R \sum_{k=1}^K \mathrm{KL}\!\left[q(\boldsymbol{u}^{(r)}_k) \,\big\|\, p(\boldsymbol{u}^{(r)}_k)\right] \stackrel{\text{def}}{=} \mathcal{F} \qquad (4)$$

where $q(h^{(r)}_n)$ is the variational distribution over the affine transformation of the latents for the $n$th neuron, obtained from $q(x) = \int p(x \mid \boldsymbol{u})\, q(\boldsymbol{u})\, d\boldsymbol{u}$. $q(h^{(r)}_n)$ is a GP with additive structure. Its mean function $\nu^{(r)}_n(t)$ and covariance function $\sigma^{(r)}_n(t, t')$ are given by

$$\nu^{(r)}_n(t) = \sum_k c_{n,k}\, \boldsymbol{\kappa}_k(t, \boldsymbol{z}_k)\, K^{(k)\,-1}_{zz} \boldsymbol{m}^{(r)}_k + d_n$$
$$\sigma^{(r)}_n(t, t') = \sum_k c^2_{n,k} \left( \kappa_k(t, t') + \boldsymbol{\kappa}_k(t, \boldsymbol{z}_k) \left[ K^{(k)\,-1}_{zz} S^{(r)}_k K^{(k)\,-1}_{zz} - K^{(k)\,-1}_{zz} \right] \boldsymbol{\kappa}_k(\boldsymbol{z}_k, t') \right) \qquad (5)$$

The cost of evaluating this bound on the likelihood now scales linearly in the number of time points $T$, with cubic scaling only in the number of inducing points. Maximising the lower bound $\mathcal{F}$ in (4) allows for variational learning of the parameters in $q(\boldsymbol{u}_k)$, the kernel hyperparameters, the inducing point locations, and the model parameters describing the affine transformation from latents to $h_n$s.

3.1 A continuous-time point-process observation model

The form of the variational lower bound in (4) makes it clear that including different observation models only requires taking a Gaussian expectation of the respective log-likelihood. Importantly, the inference approach is essentially decoupled from the locations of the observed data. This crucial consequence of the inducing point approach makes it possible to move away from gridded, binned data and fully exploit the power of GPs in continuous time.

Previous work has used sparse variational GP approximations to infer the intensity of a univariate GP-modulated Poisson process [27]. Here, we extend this to the multivariate case by combining the svGPFA model with a point-process likelihood.
To do this, we relate the affine transformation of the latent processes for the $n$th neuron, $h_n(\cdot)$, to the non-negative intensity function of a point-process, $\lambda_n(\cdot)$, via a static non-linearity $g : \mathbb{R} \rightarrow \mathbb{R}^+$. Thus, the spike times of neuron $n$ on trial $r$, $\boldsymbol{t}^{(r)}_n = \{t_1, \dots, t_{\Phi(n,r)}\}$, are modelled as

$$p(\boldsymbol{t}^{(r)}_n \mid \lambda^{(r)}_n) = \exp\!\left(-\int_0^{T_r} \lambda^{(r)}_n(t)\, dt\right) \prod_{i=1}^{\Phi(n,r)} \lambda^{(r)}_n(t_i) \qquad (6)$$

where $\lambda_n(t) = g(h_n(t))$, $T_r$ is the duration of the $r$th trial, and $\Phi(n, r)$ is the total spike-count of neuron $n$ on trial $r$. The expected log-likelihood term in (4) can be evaluated as

$$\mathbb{E}_{q(h^{(r)}_n)}\!\left[\log p(\boldsymbol{t}^{(r)}_n \mid h^{(r)}_n)\right] = -\int_0^{T_r} \mathbb{E}_{q(h^{(r)}_n)}\!\left[g(h^{(r)}_n(t))\right] dt + \sum_{i=1}^{\Phi(n,r)} \mathbb{E}_{q(h^{(r)}_n)}\!\left[\log g(h^{(r)}_n(t_i))\right] \qquad (7)$$

The resulting expected log-likelihood in (7) still contains an integral over the expected rate function of the neuron, which cannot be computed analytically. However, the integral is one-dimensional and can be computed numerically using efficient quadrature rules [31, 32].

The svGPFA model with point-process likelihood already fully addresses the two major limitations of the classic GPFA approach outlined in section 2: firstly, it improves the scalability of the algorithm via the use of inducing points, scaling cubically only in the number of inducing points per latent and linearly in the total number of spiking events. Secondly, our approach appropriately models neural spike trains as observations of a point-process.
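As an illustration of that quadrature step, the sketch below approximates the integral term with Gauss-Legendre quadrature. It takes $g = \exp$, for which the inner Gaussian expectation is available in closed form as $\exp(\nu + \sigma^2/2)$; the posterior mean and variance functions are made-up stand-ins for the variational posterior, not quantities from a fitted model:

```python
import numpy as np

# illustrative stand-ins for the variational posterior over h_n(t)
nu = lambda t: np.sin(2 * np.pi * t)        # posterior mean of h_n(t)
sig2 = lambda t: 0.1 * np.ones_like(t)      # posterior variance of h_n(t)

T = 1.0                                     # trial duration in seconds
nodes, weights = np.polynomial.legendre.leggauss(80)  # 80 abscissas
t = 0.5 * T * (nodes + 1.0)                 # map [-1, 1] onto [0, T]
w = 0.5 * T * weights

# for g = exp, E_q[g(h(t))] = exp(nu(t) + sig2(t)/2)  (log-normal mean);
# the remaining 1-D integral over the trial is handled by quadrature
expected_rate = np.exp(nu(t) + 0.5 * sig2(t))
integral = w @ expected_rate                # ~ int_0^T E_q[exp(h(t))] dt
```

Because the integrand is smooth, a modest number of abscissas already gives an essentially exact value for this one-dimensional integral.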
The model also provides the basis for further extensions addressing temporal alignment across trials, which will be the focus of the following section.

4 Temporal alignment and latent factor inference using Gaussian processes

The svGPFA model we have developed in section 3 aims to extract different latent trajectories on each trial. It does not explicitly model any structure that is shared across trials, and each trial's variable time-course is simply captured via inter-trial differences in latents. In this section, we will extend our basic svGPFA model in order to disentangle inter-trial variations in time-course from variations in the latent trajectories themselves. To achieve this, we explicitly model latent structure that is shared across trials or subsets of trials, as well as structure that is specific to each trial, and make use of a nested GP architecture with time-warping functions. In this way, shared, neurally-defined latent structure provides an anchor for the temporal alignment across trials. We extend the previous inducing point approach in order to arrive at a sparse variational inference algorithm in this setting.

4.1 A generative model for population spike times with grouped trial structure and variable time courses

We introduce latent processes that are shared across all trials, across subsets of trials that share the same experimental condition, or that are specific to each individual trial. We model each of these as draws from a GP prior, with $K$ latent processes, $L$ experimental conditions, and $R$ trials:

$$\text{shared:} \quad \alpha_k(\cdot) \sim \mathcal{GP}(\mu^\alpha_k(\cdot), \kappa^\alpha_k(\cdot,\cdot)) \quad \text{for } k = 1, \dots, K$$
$$\text{condition-specific:} \quad \beta^{(\ell)}_k(\cdot) \sim \mathcal{GP}(\mu^\beta_k(\cdot), \kappa^\beta_k(\cdot,\cdot)) \quad \text{for } \ell = 1, \dots, L$$
$$\text{trial-specific:} \quad \gamma^{(r)}_k(\cdot) \sim \mathcal{GP}(\mu^\gamma_k(\cdot), \kappa^\gamma_k(\cdot,\cdot)) \quad \text{for } r = 1, \dots, R \qquad (8)$$

Allowing each of these latents to evolve in potentially separate subspaces, we define the linear mapping from low-dimensional latent to high-dimensional neural space to be of the form

$$h^{(r)}_n(\cdot) = \sum_{k=1}^K \left( c^\alpha_{n,k}\, \alpha_k(\cdot) + c^\beta_{n,k}\, \beta^{(\ell(r))}_k(\cdot) + c^\gamma_{n,k}\, \gamma^{(r)}_k(\cdot) \right) + d_n \qquad (9)$$

Figure 1: (a) generative model architecture. (b) the augmented model with inducing points.

where $\ell(r)$ is the condition of trial $r$. When these latents are evaluated on a canonical time-axis, all trials evolve according to the same time-course. To incorporate inter-trial variability in time-course into our model, we include a second GP that acts to warp time differently on each trial:

$$\text{time-warping:} \quad \tau^{(r)}(\cdot) \sim \mathcal{GP}(\mu^\tau(\cdot), \kappa^\tau(\cdot,\cdot)) \quad \text{for } r = 1, \dots, R \qquad (10)$$

As before, we can relate $h_n(\cdot)$ to the neuron's firing rate via a static non-linearity $g(\cdot)$, and use a point-process observation model for spike times. Thus, the firing rate of neuron $n$ on trial $r$ is described as $\lambda^{(r)}_n(t) = g(h^{(r)}_n(\tau^{(r)}(t)))$. This generative model is summarised in Figure 1(a).

4.2 Sparse variational inference using inducing points

The generative model introduced in section 4.1 is a two-layer GP, where one layer has additive structure. Previous work in the GP literature has successfully applied inducing point approaches for variational inference in nested GP architectures [33, 34] (albeit without explicit additive decompositions within layers), and we can take a similar approach here.

We introduce sets of inducing points for each of the latent processes $\alpha_k(\cdot)$, $\beta^{(\ell)}_k(\cdot)$, $\gamma^{(r)}_k(\cdot)$, and $\tau^{(r)}(\cdot)$, which we will denote as $\boldsymbol{u}^\alpha_k$, $\boldsymbol{u}^{\beta,(\ell)}_k$, $\boldsymbol{u}^{\gamma,(r)}_k$, and $\boldsymbol{u}^{\tau,(r)}$, respectively. The augmented model using inducing points is summarised in Figure 1(b). For each of the latent processes $\zeta \in \{\alpha, \beta, \gamma\}$ we choose a factorised approximating distribution of the form $q(\{\zeta_k, \boldsymbol{u}^\zeta_k\}_{k=1}^K \mid \tau) = \prod_{k=1}^K p(\zeta_k \mid \boldsymbol{u}^\zeta_k, \tau)\, q(\boldsymbol{u}^\zeta_k)$, with $q(\boldsymbol{u}^\zeta_k) = \mathcal{N}(\boldsymbol{u}^\zeta_k \mid \boldsymbol{m}^\zeta_k, S^\zeta_k)$. For the time-warping process, we choose $q(\tau, \boldsymbol{u}^\tau) = p(\tau \mid \boldsymbol{u}^\tau)\, q(\boldsymbol{u}^\tau)$, with $q(\boldsymbol{u}^\tau) = \mathcal{N}(\boldsymbol{u}^\tau \mid \boldsymbol{m}^\tau, S^\tau)$.

Using this approximation, the variational lower bound to the marginal log-likelihood becomes

$$\mathcal{F} = \sum_{r,n} \mathbb{E}_{q(h^{(r)}_n)}\!\left[\log p(\boldsymbol{y}^{(r)}_n \mid h^{(r)}_n)\right] - \sum_k \mathrm{KL}\!\left[q(\boldsymbol{u}^\alpha_k) \,\big\|\, p(\boldsymbol{u}^\alpha_k)\right] - \sum_{\ell,k} \mathrm{KL}\!\left[q(\boldsymbol{u}^{\beta,(\ell)}_k) \,\big\|\, p(\boldsymbol{u}^{\beta,(\ell)}_k)\right] - \sum_{r,k} \mathrm{KL}\!\left[q(\boldsymbol{u}^{\gamma,(r)}_k) \,\big\|\, p(\boldsymbol{u}^{\gamma,(r)}_k)\right] - \sum_r \mathrm{KL}\!\left[q(\boldsymbol{u}^{\tau,(r)}) \,\big\|\, p(\boldsymbol{u}^{\tau,(r)})\right] \qquad (11)$$

The mean and covariance function of the variational GP $q(h^{(r)}_n)$ are given by

$$\nu^{(r)}_n(t) = \sum_{\zeta,k} c^\zeta_{n,k}\, \Psi^{\zeta,(r)}_{k,1}(t, \boldsymbol{z}^\zeta_k)\, K^{\zeta,(k)\,-1}_{zz} \boldsymbol{m}^{\zeta,(r)}_k + d_n$$
$$\sigma^{(r)}_n(t, t) = \sum_{\zeta,k} \left(c^\zeta_{n,k}\right)^2 \left( \Psi^{\zeta,(r)}_{k,0}(t) + \mathrm{Tr}\!\left[ \left( K^{\zeta,(k)\,-1}_{zz} S^{\zeta,(r)}_k K^{\zeta,(k)\,-1}_{zz} - K^{\zeta,(k)\,-1}_{zz} \right) \Psi^{\zeta,(r)}_{k,2}(t, \boldsymbol{z}^\zeta_k) \right] \right) \qquad (12)$$

The $\Psi^{\zeta,(r)}_{k,i}(t, \boldsymbol{z}^\zeta_k)$ are the $\Psi$-statistics [33, 35] of the kernel covariance functions

$$\Psi^{\zeta,(r)}_{k,0}(t) = \mathbb{E}_{q(\tau^{(r)})}\!\left[ \kappa^\zeta_k(\tau^{(r)}(t), \tau^{(r)}(t)) \right]$$
$$\Psi^{\zeta,(r)}_{k,1}(t, \boldsymbol{z}^\zeta_k) = \mathbb{E}_{q(\tau^{(r)})}\!\left[ \boldsymbol{\kappa}^\zeta_k(\tau^{(r)}(t), \boldsymbol{z}^\zeta_k) \right]$$
$$\Psi^{\zeta,(r)}_{k,2}(t, \boldsymbol{z}^\zeta_k) = \mathbb{E}_{q(\tau^{(r)})}\!\left[ \boldsymbol{\kappa}^\zeta_k(\boldsymbol{z}^\zeta_k, \tau^{(r)}(t))\, \boldsymbol{\kappa}^\zeta_k(\tau^{(r)}(t), \boldsymbol{z}^\zeta_k) \right] \qquad (13)$$

The $\Psi$-statistics can be evaluated analytically for common kernel choices such as the linear, exponentiated quadratic, or cosine kernel. For other kernel choices they can be computed using e.g. Gaussian quadrature.
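For instance, for an exponentiated-quadratic kernel the expectation defining $\Psi_1$ has a closed form, and a Gauss-Hermite quadrature fallback (the route suggested above for other kernels) matches it. All numerical values in this sketch are illustrative:

```python
import numpy as np

# Psi_1(t, z) = E_{q(tau(t))}[ kappa(tau(t), z) ] for an RBF kernel
# kappa(a, b) = exp(-(a - b)^2 / (2 ell^2)) and Gaussian q(tau(t)) = N(m, s2)
ell, m, s2, z = 0.2, 0.5, 0.01, 0.4   # illustrative values

# closed form: smoothing the kernel by the warp uncertainty widens it
psi1_exact = ell / np.sqrt(ell**2 + s2) * np.exp(-0.5 * (m - z)**2 / (ell**2 + s2))

# Gauss-Hermite quadrature fallback: E[f(tau)] with tau = m + sqrt(2 s2) x
x, w = np.polynomial.hermite.hermgauss(30)
tau = m + np.sqrt(2.0 * s2) * x
psi1_quad = (w / np.sqrt(np.pi)) @ np.exp(-0.5 * (tau - z)**2 / ell**2)
```

The quadrature route generalises to kernels without an analytic $\Psi$-statistic, at the cost of one small quadrature sum per evaluation.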
Finally, the variational distribution over the time-warping functions, $q(\tau^{(r)}) = \int p(\tau^{(r)} \mid \boldsymbol{u}^{\tau,(r)})\, q(\boldsymbol{u}^{\tau,(r)})\, d\boldsymbol{u}^{\tau,(r)}$, is also a GP, with mean and covariance function

$$\nu^{\tau,(r)}(t) = \boldsymbol{\kappa}^\tau(t, \boldsymbol{z}^\tau)\, K^{\tau\,-1}_{zz} \boldsymbol{m}^{\tau,(r)}$$
$$\sigma^{\tau,(r)}(t, t') = \kappa^\tau(t, t') + \boldsymbol{\kappa}^\tau(t, \boldsymbol{z}^\tau) \left( K^{\tau\,-1}_{zz} S^{\tau,(r)} K^{\tau\,-1}_{zz} - K^{\tau\,-1}_{zz} \right) \boldsymbol{\kappa}^\tau(\boldsymbol{z}^\tau, t') \qquad (14)$$

The variational lower bound in (11) can thus be evaluated tractably and optimised with respect to all parameters in the model. While the decomposition into shared, condition-specific and trial-specific latents and the addition of the time-warping layer increases the total number of inducing points that require optimisation, the impact on the time-complexity of the algorithm is minimal: it remains linear in the total number of spikes across trials and neurons, and only cubic in the number of inducing points per individual latent process.

5 Results

5.1 Synthetic data

We first generate synthetic data for 100 neurons on 15 trials that are each one second in duration. The neural activity is driven by one shared and one condition-specific latent process. We omit the trial-specific latent process here, such that inter-trial variability is solely due to the variable time-course of each trial and independent Poisson variability in the spiking of each neuron.
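A rough generative sketch in the spirit of this simulation, with one shared ramping latent, a per-trial linear time-warp, an exponential non-linearity, and fine-bin Poisson sampling as a point-process approximation. All functions and constants here are illustrative rather than the exact ones used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = lambda s: 1.0 / (1.0 + np.exp(-10 * (s - 0.5)))  # ramping shared latent
warp = lambda t, speed: np.clip(speed * t, 0.0, 1.0)      # monotone toy time-warp

dt, T, n_neurons = 1e-3, 1.0, 100
t = np.arange(0.0, T, dt)
c = rng.normal(0.5, 0.1, size=n_neurons)                  # loading weights
d = np.log(20.0)                                          # ~20 Hz baseline

spikes = []
for speed in (0.8, 1.25):                                 # a slow and a fast trial
    # latent evaluated on warped time, mixed and pushed through exp(.)
    rate = np.exp(np.outer(c, alpha(warp(t, speed))) + d) # (neurons, time), in Hz
    counts = rng.poisson(rate * dt)                       # fine-bin Poisson draw
    spikes.append([t[counts[n] > 0] for n in range(n_neurons)])
```

The two trials traverse the same latent trajectory at different speeds, so their spike patterns differ in timing but not in underlying shape, which is exactly the structure the time-warping layer is designed to recover.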
The generative procedure is illustrated in Figure 2 and aims to provide a toy simulation of decision-making, where trial-to-trial differences in the decision-making process are modelled via the time-warping.

Figure 2: Synthetic data example: the shared latent reflects a gating signal; conditions are preferred and anti-preferred for each neuron. Inter-trial differences in the decision-making process are modelled via time warping. Warped latents are shown for two example trials per condition. The warped latents are mixed linearly and passed through an exponential non-linearity to generate firing rates, which drive a Poisson process. Example spike rasters are superimposed over the population firing rates for two example trials of different conditions.

We first investigate how our sparse variational inference approach compares to the Poisson GPFA approach proposed in [23] (vLGP), using published code¹ which includes some further numerical approximations for speed. Figure 3(a) shows that the svGPFA methods achieve an improved approximation at lower or comparable runtime cost. We next fit the time-warped method (tw-pp-svGPFA) with 5 inducing points for the shared and condition-specific latents, and 8 inducing points for the time-warping latent. We use Gauss-Legendre quadrature with 80 abscissas for evaluating the log-normaliser term in (7). We compare tw-pp-svGPFA with time-warped PCA [10] (twPCA)², vLGP, a latent linear dynamical systems model with Poisson observations (PLDS) [5], and our point-process (pp-) and Poisson (P-) svGPFA models without time-warping, but with all other settings chosen equivalently. For the discrete-time methods, we bin the spike trains at different resolutions before applying either method to the data.

¹https://github.com/catniplab/vlgp

Figure 3: (a) runtime in seconds vs. approximation error of the inference approach (under generative parameters and matched kernel hyperparameters) of pp-svGPFA (solid) with different numbers of inducing points (nZ) and vLGP (dashed). Approximation error is computed as the mean KL-divergence between the respective marginal distributions q(hn(t)) and those of a full Gaussian variational approximation over latent processes. (b) comparison of RMSE of inferred firing rates across different methods and bin widths. K = 2 for all methods. (c) relative runtime comparisons across svGPFA methods.

Figure 3(b) shows the RMSE in inferred firing rates of each method. The svGPFA variants achieve the best posterior estimate of firing rates throughout. The svGPFA models without time-warping and vLGP have to fit the temporal variability via inter-trial differences in the latent processes. The PLDS model captures the discrete-time evolution of the latent using a linear dynamical operator and an additive noise process, often referred to as the innovations. It is this innovations process that allows for inter-trial variability in this case. Thus, neither of these competing approaches offers a dissociation of the variations in time-course from other sources of inter-trial variability. In the case of the PLDS, discrepancies between the inferred latent path and a canonical path as predicted via the learnt dynamical operator could in principle be used to estimate an alignment across trials.
However,\ninnovations noise is typically modelled via a stationary parameter which restricts the capacity of such\nan approach to capture more complex variations in timing. The twPCA approach does not assume\na noise model for the observations, which limits its ability to recover \ufb01ring rates and time-warped\nstructure underlying variable spike trains. This is especially true as spiking is relatively sparse in our\nexample. Figure 3(c) compares the relative runtime between the different svGPFA models.\n\n5.2 Variable-delay centre-out reaching task\n\nWe apply tw-pp-svGPFA to simultaneously recorded spike trains of 105 neurons in dorsal premotor\nand primary motor cortex of a macaque monkey (Macaca mulatta) during a variable-delay centre-out\nreaching task [2].\nIn this task, the animal is presented with one of multiple possible target locations, but has to wait for a\nvariable duration before it receives the instruction to perform an arm-reach to the target. Beyond these\ndesigned sources of inter-trial variability, there are also variations in reaction time and movement\nkinematics, as well as extra variability in neural \ufb01ring.\nWe \ufb01t our model to repeated trials from 5 different target directions (12 trials per condition). We\nuse 20 inducing points per latent, 10 inducing points for the time-warping functions, and 100\nGauss-Legendre quadrature abscissas to evaluate the log-normaliser term in (7).\n\n2https://github.com/ganguli-lab/twpca\n\n7\n\n-20246log(runtime)2610nZ = 5nZ = 10nZ = 20nZ = 50nZ = 100vLGPn.a.151020bin width in ms252025RMSEP-svGPFApp-svGPFAtw-pp-svGPFAvLGPPLDStwPCAn.a.151020bin width in ms158relative runtimeP-svGPFApp-svGPFAtw-pp-svGPFA\fFigure 4: Leave-one-neuron-out (LONO) cross-validation across\nlatent dimensionality on 15 held-out trials (3 per condition). Results\nare shown for models with an exponential static non-linearity, with\nand without the time-warping layer. 
The time-warped models achieve\na higher predictive log-likelihood than their non-warped analogues.\n\nWe perform model comparison across latent dimensionality by leave-one-neuron-out cross-validation\non held-out data [2]. To demonstrate the bene\ufb01t of time-warping, we compare the time-warped\nmodels to their svGPFA analogues without a time-warping layer. Figure 4 shows that time-warping\nresults in a substantial increase in predictive log-likelihood on held-out data.\nFigure 5(a) shows that the inferred time-warping functions automatically align the latent trajectories\nto the onset of movement (MO). The model did not have access to this information and the alignment\nis purely based on structure present in the neural population activity. Furthermore, the time passed\nin warped time since the go-cue (GC) is highly predictive of reaction times, as is shown in Figure\n5(b). We achieve an R2 value of 0.90, which is substantially higher than previously reported values\non the same dataset ([36] report R2 = 0.52 using a switching LDS model). Figure 5(c) shows a\nreduction in the standard deviation of the times when the animal\u2019s absolute hand position crosses a\nthreshold value. This reduction is not simply due to an overall compression of time, since the slope\nof the time-warping functions during the movement period was 1.0572 on average across trials. This\ndemonstrates that the inferred time-warping provides an improved alignment of the behaviourally\nmeasured arm reach compared to MO or GC alignment, which are common ways of aligning data in\nreaching experiments [37].\n\n(a)\n\n(b)\n\n(c)\n\ny\n\nx\n\nFigure 5: (a) Inferred time-warping functions and behavioural events. \u03c4 (t) automatically aligns trials\nto MO. Inset histogram compares the distribution of MO times in measured, and in warped time. 
(b) Leave-one-out cross-validated predictions of the reaction times, computed by determining a threshold value for MO in warped time from all but one trial, and taking the time it takes to pass this threshold from the GC on the held-out trial. (c) Within-condition standard deviation (std.) in the times at which the absolute hand position crosses a threshold under different methods of alignment. Inset illustrates target directions and the location of the threshold (85% of average reach distance). Grey bars show the average std. across target directions.

6 Conclusion

In this work, we have introduced a scalable sparse variational extension of GPFA, allowing one to extract latent structure directly from the unbinned spike-time observations of simultaneously recorded neurons. We further extended GPFA using an explicit separation of shared variability into condition-dependent and trial-specific components within a nested GP architecture. This allowed us to separate trial-to-trial variations in trajectories from variations in time-course. We arrived at a scalable algorithm for posterior inference using inducing points.

Using synthetic data, we showed that our svGPFA methods can more accurately recover firing rate estimates than related methods. Using neural data, we demonstrated that our time-warped method infers a behaviourally meaningful alignment in a completely unsupervised fashion, using only information present in the neural population spike times.
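The leave-one-out reaction-time prediction described in the Figure 5(b) caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the inferred warping functions are monotonic and evaluated on a common experimental time grid, and all names (predict_reaction_times, tau, t_grid, gc_times, mo_times) are hypothetical.

```python
import numpy as np

def predict_reaction_times(tau, t_grid, gc_times, mo_times):
    """Leave-one-out reaction-time prediction from inferred time-warping.

    tau      : (n_trials, n_grid) monotonic warping functions tau_i(t) on t_grid
    gc_times : (n_trials,) go-cue (GC) times in experimental time
    mo_times : (n_trials,) movement-onset (MO) times in experimental time
    """
    n_trials = len(gc_times)
    # warped time at movement onset, tau_i(t_MO), for every trial
    warped_mo = np.array([np.interp(mo_times[i], t_grid, tau[i])
                          for i in range(n_trials)])
    preds = np.empty(n_trials)
    for i in range(n_trials):
        # threshold: average warped MO time over all but the held-out trial
        thresh = np.mean(np.delete(warped_mo, i))
        # invert the monotonic warp to find when the held-out trial
        # passes this threshold in experimental time
        t_cross = np.interp(thresh, tau[i], t_grid)
        preds[i] = t_cross - gc_times[i]  # predicted reaction time from GC
    return preds
```

With identity warps, each trial's prediction reduces to the mean MO time of the remaining trials, which makes the threshold logic easy to check on toy data.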
We showed that the inferred time-warping is highly predictive of reaction times and provides an improved temporal alignment compared to manually aligning to behaviourally defined time-points.

While we have focused on temporal variability in this work, we note that the inference scheme using inducing points is more generally applicable to a variety of other nested GP models of interest in neuroscience, such as the GP-LVM [35, 38]. Similarly, further extensions of our model with time-warping to deeper GP hierarchies – for instance using a non-linear mapping from latents to observations as in the GP-LVM – can be incorporated straightforwardly.

The problem of latent input alignment in the context of the GP-LVM has also recently been considered in [39], where maximum a posteriori inference in a Gaussian model is used to recover a single true underlying input sequence from multiple observations that are generated from different time-warping functions. The sparse variational approach we have presented here could hence be used to generalise this approach to account for further differences across sequences, and to extend it to non-Gaussian settings.

Furthermore, our approach can also be applied to multiple sets of observed data with potentially different likelihoods in a canonical correlation analysis extension of GPFA. This could be useful for inferring latent structure from multiple recorded signals. For instance, one could combine the analysis of LFP or calcium imaging data with that of spike trains.
Thus, with slight modifications to the model architecture, an analogous inference approach to the one presented here is applicable in a variety of contexts, including manifold learning, combinations of modalities, and beyond.

Acknowledgements

We would like to thank Vincent Adam for early contributions to this project, Gergö Bohner for helpful discussions, and the Shenoy laboratory at Stanford University for sharing the centre-out reaching dataset. This work was funded by the Simons Foundation (SCGB 323228, 543039; MS) and the Gatsby Charitable Foundation.

References

[1] MM Churchland, BM Yu, M Sahani, and KV Shenoy. Techniques for extracting single-trial activity patterns from large-scale neural recordings. Current Opinion in Neurobiology, 17(5):609–618, 2007.

[2] BM Yu, JP Cunningham, G Santhanam, SI Ryu, KV Shenoy, and M Sahani. Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity. Journal of Neurophysiology, 102(1):614–635, 2009.

[3] Y Gao, EW Archer, L Paninski, and JP Cunningham. Linear dynamical neural population models through nonlinear embeddings. In Advances in Neural Information Processing Systems, pp. 163–171, 2016.

[4] C Pandarinath, DJ O'Shea, J Collins, R Jozefowicz, SD Stavisky, JC Kao, EM Trautmann, MT Kaufman, SI Ryu, LR Hochberg, JM Henderson, KV Shenoy, LF Abbott, and D Sussillo. Inferring single-trial neural population dynamics using sequential auto-encoders. Nature Methods, 15(10):805–815, 2018.

[5] JH Macke, L Buesing, and M Sahani. Estimating state and parameters in state space models of spike trains. Advanced State Space Methods for Neural and Clinical Data, p. 137, 2015.

[6] JP Cunningham and BM Yu. Dimensionality reduction for large-scale neural recordings. Nature Neuroscience, 17(11):1500–1509, 2014.

[7] A Afshar, G Santhanam, BM Yu, SI Ryu, M Sahani, and KV Shenoy.
Single-trial neural correlates of arm movement preparation. Neuron, 71(3):555–564, 2011.

[8] S Kollmorgen and RH Hahnloser. Dynamic alignment models for neural coding. PLoS Computational Biology, 10(3):e1003508, 2014.

[9] PN Lawlor, MG Perich, LE Miller, and KP Kording. Linear-nonlinear-time-warp-Poisson models of neural activity. Journal of Computational Neuroscience, 2018.

[10] B Poole, A Williams, N Maheswaranathan, BM Yu, G Santhanam, S Ryu, SA Baccus, K Shenoy, and S Ganguli. Time-warped PCA: simultaneous alignment and dimensionality reduction of neural data. In Frontiers in Neuroscience. Computational and Systems Neuroscience (COSYNE), Salt Lake City, UT, 2017.

[11] H Sakoe and S Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1):43–49, 1978.

[12] EJ Keogh and MJ Pazzani. Derivative dynamic time warping. In Proceedings of the 2001 SIAM International Conference on Data Mining, pp. 1–11. SIAM, 2001.

[13] E Hsu, K Pulli, and J Popović. Style translation for human motion. In ACM Transactions on Graphics (TOG), vol. 24, pp. 1082–1089. ACM, 2005.

[14] M Cuturi and M Blondel. Soft-DTW: a differentiable loss function for time-series. In Proceedings of the 34th International Conference on Machine Learning, pp. 894–903, 2017.

[15] F Zhou and F De la Torre. Generalized canonical time warping. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):279–294, 2016.

[16] M Lázaro-Gredilla. Bayesian warped Gaussian processes. In F Pereira, CJC Burges, L Bottou, and KQ Weinberger, eds., Advances in Neural Information Processing Systems 25, pp. 1619–1627. Curran Associates, Inc., 2012.

[17] J Snoek, K Swersky, R Zemel, and R Adams. Input warping for Bayesian optimization of non-stationary functions.
In Proceedings of the 31st International Conference on Machine Learning, pp. 1674–1682, 2014.

[18] M Kaiser, C Otte, T Runkler, and CH Ek. Bayesian alignments of warped multi-output Gaussian processes. arXiv preprint arXiv:1710.02766, 2017.

[19] A Arribas-Gil and HG Müller. Pairwise dynamic time warping for event data. Computational Statistics & Data Analysis, 69:255–268, 2014.

[20] VM Panaretos, Y Zemel, et al. Amplitude and phase variation of point processes. The Annals of Statistics, 44(2):771–812, 2016.

[21] W Wu and A Srivastava. Estimating summary statistics in the spike-train space. Journal of Computational Neuroscience, 34(3):391–410, 2013.

[22] Y Zemel and VM Panaretos. Fréchet means and Procrustes analysis in Wasserstein space. arXiv preprint arXiv:1701.06876, 2017.

[23] Y Zhao and IM Park. Variational latent Gaussian process for recovering single-trial dynamics from population spike trains. Neural Computation, 29(5):1293–1316, 2017.

[24] M Opper and C Archambeau. The variational Gaussian approximation revisited. Neural Computation, 21(3):786–792, 2009.

[25] MK Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, pp. 567–574, 2009.

[26] V Adam, J Hensman, and M Sahani. Scalable transformed additive signal decomposition by non-conjugate Gaussian process inference. In Proceedings of the 26th International Workshop on Machine Learning for Signal Processing (MLSP), 2016.

[27] C Lloyd, T Gunter, MA Osborne, and SJ Roberts. Variational inference for Gaussian process modulated Poisson processes. In Proceedings of the 32nd International Conference on Machine Learning, 2015.

[28] J Hensman, N Fusi, and ND Lawrence. Gaussian processes for big data. In Conference on Uncertainty in Artificial Intelligence, pp. 282–290.
auai.org, 2013.

[29] J Hensman, AGdG Matthews, and Z Ghahramani. Scalable variational Gaussian process classification. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015.

[30] AD Saul, J Hensman, A Vehtari, and ND Lawrence. Chained Gaussian processes. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pp. 1431–1440, 2016.

[31] G Mena and L Paninski. On quadrature methods for refractory point process likelihoods. Neural Computation, 26(12):2790–2797, 2014.

[32] P Abbott. Tricks of the trade: Legendre-Gauss quadrature. Mathematica Journal, 9(4):689–691, 2005.

[33] A Damianou and N Lawrence. Deep Gaussian processes. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics, pp. 207–215, 2013.

[34] H Salimbeni and M Deisenroth. Doubly stochastic variational inference for deep Gaussian processes. In Advances in Neural Information Processing Systems, pp. 4591–4602, 2017.

[35] MK Titsias and ND Lawrence. Bayesian Gaussian process latent variable model. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, pp. 844–851, 2010.

[36] B Petreska, BM Yu, JP Cunningham, G Santhanam, SI Ryu, KV Shenoy, and M Sahani. Dynamical segmentation of single trials from population neural data. In Advances in Neural Information Processing Systems, pp. 756–764, 2011.

[37] MM Churchland, JP Cunningham, MT Kaufman, JD Foster, P Nuyujukian, SI Ryu, and KV Shenoy. Neural population dynamics during reaching. Nature, 487(7405):51, 2012.

[38] A Wu, NG Roy, S Keeley, and JW Pillow. Gaussian process based nonlinear latent structure discovery in multivariate spike train data. In Advances in Neural Information Processing Systems, pp. 3499–3508, 2017.

[39] I Kazlauskaite, CH Ek, and ND Campbell.
Gaussian process latent variable alignment learning. arXiv preprint arXiv:1803.02603, 2018.