{"title": "Spectral learning of linear dynamics from generalised-linear observations with application to neural population data", "book": "Advances in Neural Information Processing Systems", "page_first": 1682, "page_last": 1690, "abstract": "Latent linear dynamical systems with generalised-linear observation models arise in a variety of applications, for example when modelling the spiking activity of populations of neurons. Here, we show how spectral learning methods for linear systems with Gaussian observations (usually called subspace identification in this context) can be extended to estimate the parameters of dynamical system models observed through non-Gaussian noise models. We use this approach to obtain estimates of parameters for a dynamical model of neural population data, where the observed spike-counts are Poisson-distributed with log-rates determined by the latent dynamical process, possibly driven by external inputs. We show that the extended system identification algorithm is consistent and accurately recovers the correct parameters on large simulated data sets with much smaller computational cost than approximate expectation-maximisation (EM) due to the non-iterative nature of subspace identification. Even on smaller data sets, it provides an effective initialization for EM, leading to more robust performance and faster convergence. These benefits are shown to extend to real neural data.", "full_text": "Spectral learning of linear dynamics from\n\ngeneralised-linear observations\n\nwith application to neural population data\n\nLars Buesing\u2217, Jakob H. Macke\u2217,\u2020 , Maneesh Sahani\n\nGatsby Computational Neuroscience Unit\nUniversity College London, London, UK\n\n{lars, jakob, maneesh}@gatsby.ucl.ac.uk\n\nAbstract\n\nLatent linear dynamical systems with generalised-linear observation models arise\nin a variety of applications, for instance when modelling the spiking activ-\nity of populations of neurons. 
Here, we show how spectral learning methods (usually called subspace identification in this context) for linear systems with linear-Gaussian observations can be extended to estimate the parameters of a generalised-linear dynamical system model despite a non-linear and non-Gaussian observation process. We use this approach to obtain estimates of parameters for a dynamical model of neural population data, where the observed spike-counts are Poisson-distributed with log-rates determined by the latent dynamical process, possibly driven by external inputs. We show that the extended subspace identification algorithm is consistent and accurately recovers the correct parameters on large simulated data sets with a single calculation, avoiding the costly iterative computation of approximate expectation-maximisation (EM). Even on smaller data sets, it provides an effective initialisation for EM, avoiding local optima and speeding convergence. These benefits are shown to extend to real neural data.

1 Introduction

Latent linear dynamical system (LDS) models, also known as Kalman-filter models or linear-Gaussian state-space models, provide an important framework for modelling shared temporal structure in multivariate time series. If the observation process is linear with additive Gaussian noise, then there are many established options for parameter learning. Inference of the dynamical state in such a model can be performed exactly by Kalman smoothing [1] and so the expectation-maximisation (EM) algorithm may be used to find a local maximum of the likelihood [2]. 
An alternative is the spectral approach known as subspace identification (SSID) in the engineering literature [3, 4, 5]. This is a method-of-moments-based estimation process, which, like other spectral methods, provides estimators that are non-iterative, consistent and do not suffer from the problems of multiple optima that dog maximum-likelihood (ML) learning in practice. However, they are not as statistically efficient as the true (global) ML estimator. Thus, a combined approach often produces the best results, with the SSID-based parameter estimates being used to initialise the EM iterations.

∗ These authors contributed equally. † Current Affiliation: Max Planck Institute for Biological Cybernetics and Bernstein Center for Computational Neuroscience Tübingen

Many real-world data sets, however, are not well described by a linear-Gaussian output process. Of particular interest to us here are multiple neural spike-trains measured simultaneously by arrays of electrodes [6, 7], which are best treated either as multivariate point-processes or, after binning, as a time series of vectors of small integers. In either case the event rates must be positive, precluding a linear mapping from the Gaussian latent process, and the noise distribution cannot accurately be modelled as normal. Similar point-process or count data may arise in many other settings, such as seismology or text modelling. More generally, we are interested in the broad class of generalised-linear output models (defined by analogy to the generalised-linear regression model [8]), where the expected value of an observation is given by a monotonic function of the latent Gaussian process, with an arbitrary (most frequently exponential-family) distribution of observations about this mean. For such models exact inference, and therefore exact EM, is not possible. 
Instead, approximate ML learning relies on either Monte-Carlo or deterministic approximations to the posterior. Such methods may be computationally intensive, suffer from varying degrees of approximation error, and are subject to the same concerns about multiple likelihood optima as in the linear-Gaussian case.2 Thus, a consistent spectral method is likely to be of particular value for such models. In this paper we show how the SSID approach may be extended to yield consistent estimators for generalised-linear-output LDS (gl-LDS) models. In experiments with simulated and real neural data, we show that these estimators may be better than those provided by approximate EM when given sufficient data. Even when data are few, the approach provides a valuable initialisation to approximate EM.

2 Theory

We briefly review the Ho-Kalman SSID algorithm [10] for linear-Gaussian LDS models, before extending it to the gl-LDS case. Using this framework, we derive and then evaluate an algorithm to fit models of Poisson-distributed count data with log-rates generated by an LDS.

2.1 SSID for LDS models with linear-Gaussian observations

Let q-dimensional observations yt, t ∈ {1, . . . , T} depend on a p-dimensional latent state xt, described by a linear first-order auto-regressive process with Gaussian initial distribution and Gaussian innovations:

x1 ∼ N(x0, Q0),   xt+1 | xt ∼ N(Axt, Q),   zt = Cxt + d,   yt | zt ∼ N(zt, R).   (1)

Here, x0 and Q0 are the mean and covariance of the initial state and Q is the covariance of the innovations. The dynamics matrix A models the temporal dependence of the process x. The variable zt of dimension q is defined as an affine function of the latent state xt, parametrised by the loading matrix C and the mean parameter d. Given zt, observations are independently distributed around this value with covariance R. 
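For concreteness, model (1) can be simulated directly. The following is a minimal numpy sketch of ours (not code from the paper); all parameter values are arbitrary illustrations, and the initial covariance is set to the stationary covariance of the dynamics so that the simulated chain is stationary:

```python
import numpy as np

def sample_lds(A, C, d, Q, R, x0, Q0, T, rng):
    """Draw one trial from the linear-Gaussian LDS of equation (1)."""
    p, q = A.shape[0], C.shape[0]
    x = np.zeros((T, p))
    y = np.zeros((T, q))
    for t in range(T):
        if t == 0:
            x[0] = rng.multivariate_normal(x0, Q0)        # x1 ~ N(x0, Q0)
        else:
            x[t] = rng.multivariate_normal(A @ x[t - 1], Q)  # innovations ~ N(0, Q)
        z = C @ x[t] + d                                  # affine read-out z_t
        y[t] = rng.multivariate_normal(z, R)              # observation noise ~ N(0, R)
    return x, y

rng = np.random.default_rng(0)
p, q, T = 2, 5, 1000
A = 0.9 * np.eye(p)                # stable dynamics: spectral radius 0.9 < 1
Q = 0.1 * np.eye(p)
Pi = Q / (1.0 - 0.9 ** 2)          # stationary covariance, solving Pi = A Pi A' + Q
C = rng.standard_normal((q, p))
x, y = sample_lds(A, C, np.zeros(q), Q, 0.2 * np.eye(q), np.zeros(p), Pi, T, rng)
```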
Furthermore let Π := lim_{t→∞} Cov[xt] denote the covariance of the stationary marginal distribution if the system is stable (i.e. if the spectral radius of A is < 1). Provided the generative model is stationary (i.e., x0 = 0 and Q0 = Π), SSID algorithms yield consistent estimates of the parameters A, C, Q, R, d without iteration. We adopt an approach to SSID based on the Ho-Kalman method [10, 4]. This algorithm takes as input the empirical estimate of the so-called "future-past Hankel matrix" H, which is defined as the cross-covariance between the time-lagged vectors y+t (the "future") and y−t (the "past") of the observed data:

H := Cov[y+t, y−t],   y+t := (yt⊤, . . . , yt+k−1⊤)⊤,   y−t := (yt−1⊤, . . . , yt−k⊤)⊤.

The parameter k is called the Hankel size and has to be chosen so that k ≥ p. The key to SSID is that H (which is independent of t as stationarity is assumed) has rank equal to the dimensionality p of the linear dynamical state. Indeed, it is straightforward to show that the Hankel matrix can be decomposed in terms of the model parameters A, C, Π:

H = [C⊤ (CA)⊤ . . . (CA^{k−1})⊤]⊤ · [AΠC⊤ A²ΠC⊤ . . . A^kΠC⊤].   (2)

The SSID algorithm first takes the singular value decomposition (SVD) of the empirical estimate Ĥ of H to recover a two-part factorisation as in (2) given a user-defined latent dimensionality p (a suitable p may be estimated by inspection of the singular value spectrum of Ĥ). 
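The Hankel construction and its rank-p factorisation can be sketched as follows. This is an illustrative implementation of ours, not the paper's code: it recovers A via the standard shift-invariance of the extended observability matrix and reads C off its first block row, omitting the regression and Riccati steps for the noise covariances:

```python
import numpy as np

def future_past_hankel(Y, k):
    """Empirical future-past Hankel matrix H = Cov[y+_t, y-_t] from one
    long trial Y of shape (T, q), with Hankel size k."""
    T, q = Y.shape
    Yc = Y - Y.mean(axis=0)
    ts = np.arange(k, T - k + 1)                          # times with a full past and future
    fut = np.hstack([Yc[ts + i] for i in range(k)])       # rows stack y_t ... y_{t+k-1}
    past = np.hstack([Yc[ts - 1 - i] for i in range(k)])  # rows stack y_{t-1} ... y_{t-k}
    return fut.T @ past / len(ts)

def ho_kalman(H, p, q):
    """Rank-p factorisation of H and shift-invariance estimates of A and C
    (recovered up to an invertible similarity transform)."""
    U, s, Vt = np.linalg.svd(H)
    O = U[:, :p] * np.sqrt(s[:p])      # extended observability matrix, shape (kq, p)
    C = O[:q]                          # first block row gives C
    # shift invariance: O[q:] = O[:-q] @ A, solved in the least-squares sense
    A = np.linalg.lstsq(O[:-q], O[q:], rcond=None)[0]
    return A, C, s
```

On a noise-free Hankel matrix built from equation (2), the eigenvalues of the recovered A match those of the true dynamics matrix exactly, since similarity transforms leave the spectrum unchanged.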
From this low-rank approximation to Ĥ, the model parameters A and C as well as the covariances Q and R can be found by linear regression and by solving an algebraic Riccati equation; d is given simply by the empirical mean of the data. However, this specific procedure works only for linear systems with Gaussian observations and innovations, and not for models which feature non-linear transformations or non-Gaussian observation models. Indeed, we find that linear SSID methods can yield poor results when applied directly to count-process data. Although SSID techniques have been developed for observations that are Gaussian-distributed around a mean that is a nonlinear function of the latent state [5], we are unaware of SSID methods that address arbitrary observation models.

2A recent paper [9] has argued that the log-likelihood of a model with Poisson count observations is concave; however, the result therein showed only a necessary condition for concavity of the expected joint log-likelihood optimised in the M-step.

2.2 SSID for gl-LDS models by moment conversion

Consider now the gl-LDS in which the Gaussian observation process of model (1) is replaced by the following more general observation model. We assume yt,i ⊥ yt,j | zt; i.e. observation dimensions are independent given zt. Further, let yt,i | zt be arbitrarily distributed around a (known) monotonic element-wise nonlinear mapping f(·) such that E[yt | zt] = f(zt). Following the theory of generalised linear modelling, we also assume that the variance of the observation distribution is a (known) function V(·) of its mean.3

Our extension to SSID for such models is based on the following idea. The variables z1, . . . , zT are jointly normal, so in principle we can apply standard SSID algorithms to z. 
Although z is unobserved, we can use the fact that the observation model dictates a computable relationship between the moments of y and those of z. This allows us to determine the future-past Hankel matrix of z from the moments of y, which can then be fed into standard SSID algorithms. Consider the covariance matrix Cov[y±] of the combined 2kq-dimensional future-past vector y±, which is defined by stacking y+ and y− (here and henceforth we drop the subscripts t as unnecessary given the assumed stationarity of the process). Denote the mean and covariance matrix of the normal distribution of z± (defined analogously to y±) by µ and Σ. We then have

E[y±i] = Ez[f(z±i)] =: α(µi, Σii),   (3)
E[(y±i)²] = Ez[Ey|z[(y±i)²]] = Ez[f(z±i)² + V(f(z±i))] =: β(µi, Σii).   (4)

The functions α(·) and β(·) are given by Gaussian integrals with mean µi and variance Σii over the functions f(·) and f²(·) + V(f(·)), respectively. For off-diagonal second moments we have (i ≠ j):

E[y±i y±j] = Ez[Ey|z[y±i] · Ey|z[y±j]] = Ez[f(z±i) f(z±j)] =: γ(µi, Σii, µj, Σjj, Σij).   (5)

Equations (3)-(5) are a system of 4kq + kq(2kq − 1) non-linear equations in the 4kq + kq(2kq − 1) unknowns µ, Σ (with symmetric Σ = Σ⊤). The equations above can be solved efficiently by separately solving one 2-dimensional system (equations 3-4) for each pair of unknowns µi, Σii, ∀i ∈ {1, . . . , kq}. Once the µi and Σii are known, equation (5) reduces to a 1-dimensional nonlinear equation for Σij for each pair of indices (i < j). 
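The element-wise solves just described can be sketched numerically for a generic pair (f, V), approximating the Gaussian integrals by Gauss-Hermite quadrature and using standard scipy root finders. This is an illustrative implementation of ours (function names, parametrisations and node counts are our own choices), shown here with the Poisson case f = exp, V(m) = m:

```python
import numpy as np
from scipy.optimize import brentq, fsolve

# Probabilists' Gauss-Hermite rule: E[g(Z)] ~ sum_i w_i g(x_i) for Z ~ N(0, 1)
gh_x, gh_w = np.polynomial.hermite_e.hermegauss(60)
gh_w = gh_w / gh_w.sum()

def gauss_mean(g, mu, sig2):
    """E[g(z)] for z ~ N(mu, sig2), by quadrature."""
    return gh_w @ g(mu + np.sqrt(sig2) * gh_x)

def solve_marginal(m, S_ii, f, V):
    """Solve eqs (3)-(4) for (mu_i, Sigma_ii) given E[y_i] = m, Var[y_i] = S_ii."""
    def eqs(v):
        mu, log_s = v
        s = np.exp(log_s)          # log-parametrise to keep Sigma_ii positive
        a = gauss_mean(f, mu, s)                               # eq. (3)
        b = gauss_mean(lambda z: f(z) ** 2 + V(f(z)), mu, s)   # eq. (4)
        return [a - m, b - (S_ii + m ** 2)]
    mu, log_s = fsolve(eqs, [np.log(m), np.log(0.1)])
    return mu, np.exp(log_s)

def solve_cross(S_ij, mu_i, s_i, mu_j, s_j, f):
    """Given the marginals, solve eq. (5) for Sigma_ij by a 1-D root find
    on the latent correlation rho."""
    m_i = gauss_mean(f, mu_i, s_i)
    m_j = gauss_mean(f, mu_j, s_j)
    x, y = gh_x[:, None], gh_x[None, :]
    def resid(rho):
        zi = mu_i + np.sqrt(s_i) * x
        zj = mu_j + np.sqrt(s_j) * (rho * x + np.sqrt(1.0 - rho ** 2) * y)
        gamma = gh_w @ (f(zi) * f(zj)) @ gh_w   # 2-D quadrature for E[f(z_i) f(z_j)]
        return gamma - (S_ij + m_i * m_j)
    rho = brentq(resid, -0.99, 0.99)
    return rho * np.sqrt(s_i * s_j)
```

For the Poisson case these numerical solutions can be checked against the closed form given in Section 2.3.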
The upper-right block of the covariance matrix Σ then provides an estimate of the future-past Hankel matrix Cov[z+, z−] which can be decomposed as in standard Ho-Kalman SSID.

2.3 SSID for Poisson dynamical systems (PLDSID)

We now consider in greater detail a special case of the gl-LDS model, which is of particular interest in neuroscience applications. The observations in this model are (when conditioned on the latent state) Poisson-distributed with a mean that is exponential in the output of the dynamical system,

yt,i | zt,i ∼ Poisson[exp(zt,i)].

We call this model, which is a special case of a Log-Gaussian Cox Process [11], a Poisson Linear Dynamical System (PLDS). PLDS and close variants have recently been applied for modelling multi-electrode recordings [12, 13, 14, 15]. In these applications, yt,i models the spike-count of neuron i in time-bin t and its log-firing-rate (which we will refer to as the "input to neuron i") is given by zt,i. Estimation of the model parameters Θ = (A, C, Q, x0, Q0, d) often depends on approximate likelihood maximisation, using EM with an approximate E-step [16, 9]. The exponential nonlinearity ensures that the posterior distribution p(x1,...,T | y1,...,T, Θ) is a log-concave function of x1,...,T [17], making its mode easy to find and justifying unimodal approximations (such as that of Laplace). 
However, the typical data likelihood is nonetheless multimodal and the approximations may introduce bias in estimation [18].

3Our method readily generalises to models in which each dimension i has different nonlinearities fi and Vi.

[Figure 1: panels A-F; axes show normalised singular values against singular value index, and eigenvalue-spectrum difference and subspace angle against log10 number of training trials.]

Figure 1: Moment conversion uncovers low-rank structure in artificial data. A) Time-lagged covariance matrix Cov[yt+1, yt] and the singular value (SV) spectrum of the full Hankel matrix H = Cov[y+, y−] computed from the observed count data (artificial data set I). The spectrum decays gradually. B) Same as A) but after moment conversion. The transformed Hankel matrix now exhibits a clear cut-off in the spectrum, indicative of low underlying rank. C) Same as A) and B) but computed from the (ground truth) log-rates z, illustrating the true low-rank structure in the data. D) Summed absolute difference of the eigenvalue spectra of the ground truth dynamics matrix A and the one identified by PLDSID. The difference decreases with increasing data set size, indicating that PLDSID estimates are consistent. E) Same as D) but for the angle between the subspaces spanned by the loading matrix of the ground truth and estimated models. 
F) SV spectrum of the Hankel matrix of multi-electrode data before (left) and after (right) moment conversion.

Under the PLDS model, the equations (3)-(5) can be solved analytically (see also [19] and the supplementary material for details):

µi = 2 log(mi) − (1/2) log(Sii + mi² − mi),   (6)
Σii = log(Sii + mi² − mi) − log(mi²),   (7)
Σij = log(Sij + mimj) − log(mimj),   (8)

where mi and Sij denote the empirical estimates of E[y±i] and Cov[y±i, y±j], respectively. One can see that the above equations do not have solutions if any one of the terms in the logarithms is non-positive, which may happen with finitely sampled moments or a misspecified model. We therefore scale the matrix S (by left and right multiplication with the same diagonal matrix) such that all Fano factors that are initially smaller than 1 are set to a given threshold (in simulations we used 1 + 10⁻²). This procedure ensures that there exists a unique solution (µ, Σ) to the moment conversion (6)-(8). It is still the case that the resulting matrix Σ might not be positive semidefinite [20], but this can be rectified by finding its eigendecomposition, thresholding the eigenvalues (EVs) and then reconstructing Σ.

For sufficiently large data sets generated from a "true" PLDS model, observed Fano factors will be greater than one with high probability. In such cases, the moment conversion asymptotically yields the unique correct moments µ and Σ of the Gaussian log-rates z. Assuming stationarity, the Ho-Kalman SSID yields consistent estimates of A, C, Q, d given the true µ and Σ. Hence, the proposed two-stage method yields consistent estimates of the parameters A, C, Q, d of a stationary PLDS. 
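A compact sketch of this Poisson moment conversion, including the Fano-factor rescaling and eigenvalue thresholding just described, can be written as follows (an illustrative numpy implementation of ours; the numerical thresholds are our own choices):

```python
import numpy as np

def poisson_moment_conversion(m, S, min_fano=1.01):
    """Convert empirical count moments (mean m, covariance S) of y± into
    Gaussian moments (mu, Sigma) of z± under the PLDS model, eqs (6)-(8)."""
    m = np.asarray(m, dtype=float)
    S = np.array(S, dtype=float)
    # Rescale S (left and right multiplication by the same diagonal matrix)
    # so that every Fano factor S_ii / m_i initially below 1 is raised to
    # min_fano; otherwise eqs (6)-(8) have no solution.
    fano = np.diag(S) / m
    d = np.ones_like(m)
    low = fano < 1.0
    d[low] = np.sqrt(min_fano * m[low] / np.diag(S)[low])
    S = S * np.outer(d, d)
    v = np.diag(S) + m ** 2 - m                    # equals exp(2 mu_i + 2 Sigma_ii)
    mu = 2.0 * np.log(m) - 0.5 * np.log(v)         # eq. (6)
    Sigma = np.log(S + np.outer(m, m)) - np.log(np.outer(m, m))   # eq. (8)
    np.fill_diagonal(Sigma, np.log(v) - np.log(m ** 2))           # eq. (7)
    # Rectify Sigma by clipping its eigenvalues from below and reconstructing
    w, V = np.linalg.eigh((Sigma + Sigma.T) / 2.0)
    Sigma = (V * np.maximum(w, 1e-8)) @ V.T
    return mu, Sigma
```

On exact Poisson-log-normal moments the conversion recovers the underlying Gaussian moments to machine precision; with finitely sampled moments, the rescaling and clipping steps act as the safeguards described above.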
In the remainder, we call this algorithm PLDSID.

It is often of interest to model the conditional distribution of the observables y given some external, observed covariate or "input" u. In neuroscience, for instance, u might be a sensory stimulus influencing retinal [14] or other sensory spiking activity. Fortunately, provided that the external inputs are Gaussian-distributed and perturb the dynamics linearly, PLDSID can be extended to identify the parameters of this augmented model. Let ut denote the r-dimensional observed external input at time t, and assume that u1, . . . , uT are jointly normal and influence the latent state of the dynamical process linearly and instantaneously (through a p × r matrix B):

xt+1 | xt, ut ∼ N(Axt + But, Q).

The dynamical state xt is then observed through a generalised-linear process as before, and we define future-past vectors for all relevant time series. In this case, the N4SID algorithm [3] can perform subspace identification based on the joint covariance of u± and z±. Although this covariance is not observed directly in the gl-LDS case, our assumptions make u± and z± jointly normal and so we can use moment transformation again to estimate the required covariance from the observed covariance of u± and y±. For the Poisson model with exponential nonlinearity, this transformation remains closed-form, and in combination with N4SID yields consistent estimates of the PLDS parameters and the input-coupling matrix B.
Further details are provided in the supplementary material.

3 Results

We investigated the properties of the proposed PLDSID algorithm in numerical experiments, using both artificial data and multi-electrode recordings of neural activity.

3.1 PLDSID infers the correct parameters given sufficiently large synthetic data sets

We used three artificial data sets to evaluate our algorithm, each consisting of 200 time-series ("trials"), with each trial being of length T = 100 time steps. Time-series were generated by sampling from a stationary ground truth PLDS with p = 10 latent and q = 25 observed dimensions. Count averages across time-bins and neurons ranged from 0.15 to 0.2, corresponding to 15–20 Hz if the time-step size dt is taken to be 10 ms (the binning used for the multi-electrode recordings, see below). The dynamics matrices A had eigenvalues corresponding to auto-correlation time constants ranging from < 1 time step (data set III), through 3 dt (data set I) to 20 dt (data set II). The loading matrices C were generated from a matrix with orthonormal columns and by a subsequent scaling with 12.5 (data set I) or 5 (data sets II and III). This resulted in instantaneous correlations that were comparable to (average absolute correlation coefficient data set I: c̄ = 2 · 10⁻²) or smaller than (data sets II, III: c̄ = 3.5 · 10⁻³) those observed in the cortical multi-electrode recordings used below (c̄ = 2.2 · 10⁻²). 
Hence, all our artificial data sets either roughly match (data sets I, II) or substantially underestimate (data set III) the correlation-structure of typical cortical multi-cell recordings. Additionally, we generated a data set for identifying PLDS models with external input by driving the ground truth PLDS of data set II with a 3-dimensional Gaussian AR(1) process ut; the coupling matrix B was generated such that But had the same covariance as the innovations Q. A Hankel size k = 10 was used for all experiments with artificial data.

We first illustrate the moment conversion defined by equations (6)-(8) on artificial data set I. Fig. 1A shows the time-lagged cross-covariance Cov[yt+1, yt] as well as the singular value (SV) spectrum of the full future-past Hankel matrix H = Cov[y+, y−] (normalised such that the largest SV is 1), both estimated from 200 trials, with a Hankel size of k = 10. The raw spectrum gradually decays towards small values but does not show a clear low-rank structure of the future-past Hankel matrix H. In contrast, Fig. 1B shows the output of the moment transformation yielding an approximation of the cross-covariance Cov[zt+1, zt] of the underlying inputs. Further, the SV spectrum of the full, transformed future-past Hankel matrix Cov[z+, z−] is shown. The latter is dominated by only a few SVs, whose number matches the dimension of the ground truth system p = 10, clearly indicating a low-rank structure. On this synthetic data set, we also have access to the underlying inputs. One can see that the transformed Hankel matrix (Fig. 1B) as well as its SV spectrum are close to the ones computed from the underlying inputs shown in Fig. 1C.

We also evaluated the accuracy of the parameters identified by PLDSID as a function of the training set size. Fig. 
1D shows the difference between the spectra (i.e., the summed absolute differences between sorted eigenvalues) of the identified and the ground truth dynamics matrix A. The spectrum of A is an important characteristic of the model, as it determines the time-constants of the underlying dynamics. It can be seen that the difference between the spectra decreases with increasing data set size (Fig. 1D), indicating that our method asymptotically identifies the correct dynamics. Furthermore, Fig. 1E shows the subspace-angle between the true loading matrix C and the one estimated by PLDSID. As for the dynamics spectrum, the identified loading matrix approaches the true one for increasing training set size.

4Again, simply applying SSID to the log of the observed counts does not work as most counts are 0.

[Figure 2: panels A-F; y-axes show cosmoothing performance, x-axes the number of EM iterations.]

Figure 2: PLDSID is a good initialiser for EM. Cosmoothing performance on the training set as a function of the number of EM iterations for different initialisers on various data sets. A) Artificial data set consisting of 200 trials and 25 observed dimensions. EM initialised by PLDSID converges faster and achieves higher training performance than EM initialised with FA, Gaussian SSID or random parameter values. B) Same as A) but for data with lower instantaneous correlations and longer auto-correlation. EM does not improve the performance of PLDSID on this data set. C) Same as A) but for data with negligible temporal correlations and low instantaneous correlations. For this weakly structured data set, PLDSID-EM does not work well. D) 100 trials of multi-electrode recordings with 86 observed dimensions (spike-sorted units). E) Same as D) but of data set size 500 trials, and only using the 40 most active units. F) Same as D) but for 863 trials with all 86 units.

Next, we investigated the usefulness of PLDSID as an initialiser for EM. We compared it to 3 different methods, namely initialisation with random parameters (with 20-50 restarts), factor analysis (FA) and Gaussian SSID. The quality of these initialisers was assessed by monitoring performance of the identified parameters as a function of EM iterations after initialisation. Good initial parameter values yield fast convergence of EM in few iterations to high performance values, whereas poor initialisations are characterised by slow convergence and termination of EM in poor local maxima (or, potentially, shallow regions of the likelihood). Fast convergence of EM is an important issue when dealing with large data sets, as EM iterations become computationally expensive (see below). We monitor performance by evaluating the so-called cosmoothing performance on the training data, a measure for cross-prediction performance described elsewhere in detail [21, 15]. 
This measure yielded more reliable and robust results than computing the likelihood, as the latter cannot be computed exactly and approximations can be numerically unreliable. We evaluated performance on the training set, as we were interested in comparing fitting-performance of the algorithms for the same model, and not the generalisation error of the model itself.

Fig. 2A to C show the results of this comparison on three different artificial data sets. On data set I (Fig. 2A), which was designed to have short auto-correlation time constants but pronounced instantaneous correlations between the observed dimensions, PLDSID initialisation leads to superior performance compared to competing methods. For the same number of EM iterations (which is a good proxy of invested computation time, see below), it resulted in better co-smoothing performance. Furthermore, the PLDSID+EM parameters converge to a better local optimum than those initialised by the other methods. Hence, on this data set, our initialisation yields both shorter computation time and better final results. The second artificial data set featured smaller instantaneous correlations between dimensions but longer auto-correlation time constants. As can be seen in Fig. 2B, the PLDSID initialisation here yields parameters which are not further improved by EM iterations, whereas EM with other initialisations becomes stuck in poor local solutions.

By contrast, we found PLDSID not to yield useful parameter values on data sets which do not have temporal correlations (Fig. 2C), and only very small instantaneous correlations across neurons (average instantaneous absolute-correlation c̄ = 3.5 · 10⁻³). For this particular data set, PLDSID and Gaussian SSID both yielded poor parameters compared to factor analysis. 
In general, we observed that PLDSID compares favourably to the other initialisation methods on all data sets we investigated, as long as the data exhibit shared variability across dimensions and time, and it was observed to work particularly well when correlations were substantial. Fig. 3 shows results for identification of a PLDS model driven by external inputs. The proposed PLDSID method identifies better PLDS parameters, including the coupling matrix B, than alternative methods. Notably, identifying the parameters with the PLDSID-variant that ignores external input (and setting the initial value B = 0 for EM) clearly results in suboptimal parameters.

3.2 Expectation Maximisation initialised by PLDSID identifies better models on neural data

We move now to examine the value of PLDSID in providing initial parameter values for subsequent EM iterations on multi-electrode recordings of neural activity. Such data sets are challenging for statistical modelling as they are high-dimensional (on the order of 10² observed dimensions), sparse (on the order of 10 Hz of spiking activity) and show shared variability across time and dimensions. The experimental setup, acquisition and preprocessing of the data set are documented in detail elsewhere [22]. Briefly, spiking activity was acquired from a 96-channel silicon electrode array (Blackrock, Salt Lake City, UT) implanted in motor areas of the cortex of a rhesus macaque performing a delayed center-out reach task. For the analysis presented in this paper, we used data from a single recording session consisting in total of 863 trials, each truncated to be of length 1 s, with 86 distinct single and multi-units identified by spike sorting. 
The data had an average firing rate of 10.7 Hz and was binned at 10 ms, which resulted in 9.9% of bins containing at least one spike. First, we investigated the SV spectrum of the future-past Hankel matrix computed either from the count-observations of the data, or from the inferred underlying inputs (using Hankel size k = 30 and all trials, see Fig. 1F). While we did not observe a marked difference between the two spectra, both spectra indicate that the data can be well described using a small number of singular values. Based on these spectra, we used a latent dimensionality of p = 10 for subsequent simulations.

Next, we compared PLDSID to FA and Gaussian SSID initialisations for EM on two different subsets as well as the whole multi-electrode recording data set. Fig. 2D shows the performance of EM with the different initialisations using a training set of modest size (100 trials, Hankel size k = 10). PLDSID provides the most appropriate initialisation for EM, allowing it to converge rapidly to better parameter values than are found starting from either the FA or SSID estimates. This effect was still more pronounced for a larger training set of 500 trials, but including only the 40 most active neurons from the original data (Fig. 2E, Hankel size k = 30). We also applied all of the methods to the complete data set consisting of 863 trials with all 86 observed neurons (Hankel size k = 30). The results plotted in Fig. 2F indicate that again PLDSID provided the most useful initialisation for EM. Interestingly, on this data set EM with random initialisations eventually identifies parameters with performances comparable to PLDSID+EM. However, random initialisation leads to slow convergence and thus requires substantial computation, as described below. Gaussian SSID yielded poor values for parameters on all data sets, leading EM to terminate in poor local optima after only a few iterations. 
We note that, because of the use of the Laplace approximation during inference (as well as our non-likelihood performance measure), EM is not guaranteed to increase performance at each iteration and, in practice, sometimes terminated after only a few iterations.

Figure 3: Identification of PLDS models with external inputs. Same as Fig. 2B, but for an artificial data set generated by sampling from a PLDS with external input. Using the variant of PLDSID which also identifies the coupling matrix B yields the best parameters. In contrast, using the PLDSID variant which does not estimate B (B is initialised at 0) yields parameters of the same quality as the alternative methods.

3.3 PLDSID improves training time by orders of magnitude compared to conventional EM

The computational time needed to identify PLDS parameters can be an important issue in practice. For example, when using a PLDS model as part of an algorithm for brain-machine interfacing [12], the parameters must be identified during an experimental session. For multi-electrode recording data of commonly encountered size, and using our implementation of EM, inference of parameters under these time constraints would be infeasible. Thus, an ideal parameter initialisation method will not only improve the robustness of the identified parameters, but also reduce the computational time needed for EM convergence. Clearly, the computer time needed depends on the implementation, the hardware, and the properties and size of the data. We used an EM algorithm with a global-Laplace approximation in the E-step [23, 15] and a conjugate-gradient-based optimisation method in the M-step, implemented in Matlab.
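The global-Laplace E-step rests on a standard construction: find the MAP estimate of the latent variables by Newton's method, then approximate the posterior by a Gaussian centred at the mode with covariance given by the inverse negative Hessian of the log-posterior. A minimal single-time-step sketch for Poisson observations with exponential link (the sizes and prior are hypothetical; the actual E-step operates jointly on whole latent trajectories):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5

# Hypothetical Gaussian prior over a latent vector z, observed through
# independent Poisson counts y_i with rates exp(z_i).
mu0 = np.zeros(d)
Sigma0 = 0.5 * np.eye(d)
P0 = np.linalg.inv(Sigma0)                 # prior precision
z_true = rng.multivariate_normal(mu0, Sigma0)
y = rng.poisson(np.exp(z_true))

# Newton's method on the (concave) log-posterior log p(z | y).
z = mu0.copy()
for _ in range(100):
    grad = y - np.exp(z) - P0 @ (z - mu0)  # gradient of log-posterior
    H = -np.diag(np.exp(z)) - P0           # Hessian (negative definite)
    step = np.linalg.solve(H, grad)
    z = z - step                           # Newton update
    if np.max(np.abs(step)) < 1e-12:
        break

z_map = z
# Laplace approximation: covariance = inverse negative Hessian at the mode.
Sigma_lap = np.linalg.inv(P0 + np.diag(np.exp(z_map)))
```

Because the Poisson log-likelihood with exponential link is concave in z, this inner optimisation has a unique optimum, which is one reason the Laplace approximation is attractive here.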
Alternative methods based on variational approximations or MCMC sampling have been reported to be more costly than Laplace-EM [13, 24]. For all of the data sets used above, a single EM iteration in our implementation was substantially more costly than the entire parameter initialisation by PLDSID (Fig. 2D: factor 6.4, Fig. 2E: factor 4.0, Fig. 2F: factor 1.4). In addition, EM started from a random initialisation still yielded worse performance than with PLDSID initialisation even after 50 iterations (see Fig. 2). Thus, even by a conservative estimate, PLDSID initialisation reduces the computational cost by at least a factor of 50 compared to random initialisation. Both PLDSID and EM have a computational time complexity proportional to the size NT of the data set (where N is the number of trials and T is the trial length). However, in PLDSID only the cost O(NTpq^2) of calculating the Hankel matrix scales with the data set size (assuming k is of order p). In our experiments, this simple covariance calculation was much cheaper than the moment conversion with cost O(pq^2) or the SVD with cost O(p^3 q^3), both of which are independent of the data set size NT. In contrast, each iteration of EM requires at least O(NT(p^3 + pq)) time. The computational advantage of PLDSID is therefore expected to be especially great for large data sets, which is also the regime where its performance benefit is most pronounced.

4 Discussion

We investigated parameter estimation for linear-Gaussian state-space models with generalised-linear observations and presented a method for parameter identification in such models which builds on the extensive subspace-identification literature for fully Gaussian state-space models. In numerical experiments we studied a special case of the proposed algorithm (PLDSID) for linear state-space models with conditionally Poisson-distributed observations.
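For concreteness, the moment conversion at the heart of PLDSID has a closed form in the Poisson case with exponential link (cf. the log-Gaussian Cox moment relations of [11, 19, 20]): with z ~ N(mu, Sigma) and y_i | z ~ Poisson(exp(z_i)), one has E[y_i] = exp(mu_i + Sigma_ii/2), E[y_i y_j] = E[y_i] E[y_j] exp(Sigma_ij) for i != j, and E[y_i (y_i - 1)] = exp(2 mu_i + 2 Sigma_ii), which can be solved for mu and Sigma. A numerical sketch on simulated moments (all sizes and parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 3, 1_000_000

# Ground-truth Gaussian parameters of the log-rates (illustrative values).
mu = np.array([-0.5, 0.0, 0.3])
Sigma = np.array([[0.30, 0.10, 0.05],
                  [0.10, 0.25, 0.08],
                  [0.05, 0.08, 0.20]])

# Simulate Poisson counts whose log-rates are jointly Gaussian.
z = rng.multivariate_normal(mu, Sigma, size=n)
y = rng.poisson(np.exp(z))

# Empirical Poisson moments.
m1 = y.mean(axis=0)                # E[y_i]
m2 = (y.T @ y) / n                 # E[y_i y_j]
fact = (y * (y - 1)).mean(axis=0)  # E[y_i (y_i - 1)], a factorial moment

# Invert the closed-form relations to recover the Gaussian moments.
Sigma_hat = np.log(m2 / np.outer(m1, m1))          # off-diagonal entries
np.fill_diagonal(Sigma_hat, np.log(fact / m1**2))  # diagonal, from factorial moment
mu_hat = np.log(m1) - np.diag(Sigma_hat) / 2
```

Roughly speaking, PLDSID applies this conversion entry-wise to the empirical future-past moments of the counts, yielding Gaussian-model covariances on which standard subspace identification can then operate.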
We showed that PLDSID yields consistent estimates of the model parameters without requiring iterative computation. Although this method generally makes less efficient use of the available training data than maximum-likelihood methods do, we found that in practice it sometimes outperformed likelihood hill-climbing by EM from random initial conditions (presumably due to optimisation difficulties). Even when this was not the case, EM initialised with the results of PLDSID converged in fewer iterations, and to a better parameter estimate, than when it was initialised randomly or by other methods, an effect seen on multiple artificial and multi-electrode recording data sets. As the practical computational difficulties of parameter estimation with EM in this model (slow convergence and shallow optima) are substantial, our algorithm facilitates the use of linear state-space models with non-Gaussian observations in practice.
While proven here in the Poisson case, the underlying moment-transformation algorithm is flexible and can be applied to a wide range of gl-LDS models. Of particular interest for neural data might be a dynamical system model which precisely reproduces the marginal distribution of integer observations for each observed dimension (by using a 'Discretised Gaussian' [20] as the observation model). By contrast, the need for tractability in sampling or in deterministic approximations for inference often limits the range of models in which EM is practical.

Acknowledgements Supported by the Gatsby Charitable Foundation; an EU Marie Curie Fellowship to JHM (hosted by MS); DARPA REPAIR N66001-10-C-2010 and NIH CRCNS R01-NS054283 to MS; as well as the Bernstein Center Tübingen funded by the German Ministry of Education and Research (BMBF; FKZ: 01GQ1002). We would like to thank Krishna V.
Shenoy and members of his laboratory for many useful discussions as well as for generously sharing their data with us.

References
[1] R. E. Kalman and R. S. Bucy. New results in linear filtering and prediction theory. Trans. Am. Soc. Mech. Eng., Series D, Journal of Basic Engineering, 83:95–108, 1961.
[2] Z. Ghahramani and G. E. Hinton. Parameter estimation for linear dynamical systems. Technical Report CRG-TR-96-2, University of Toronto, 1996.
[3] P. Van Overschee and B. De Moor. N4SID: Subspace algorithms for the identification of combined deterministic-stochastic systems. Automatica, 30(1):75–93, 1994.
[4] T. Katayama. Subspace Methods for System Identification. Springer Verlag, 2005.
[5] H. Palanthandalam-Madapusi, S. Lacy, J. Hoagg, and D. Bernstein. Subspace-based identification for linear and nonlinear systems. In Proceedings of the American Control Conference, pp. 2320–2334, 2005.
[6] E. N. Brown, R. E. Kass, and P. P. Mitra. Multiple neural spike train data analysis: state-of-the-art and future challenges. Nat Neurosci, 7(5):456–461, 2004.
[7] M. M. Churchland, B. M. Yu, M. Sahani, and K. V. Shenoy. Techniques for extracting single-trial activity patterns from large-scale neural recordings. Curr Opin Neurobiol, 17(5):609–618, 2007.
[8] P. McCullagh and J. Nelder. Generalized Linear Models. Chapman and Hall, London, 1989.
[9] K. Yuan and M. Niranjan. Estimating a state-space model from point process observations: a note on convergence. Neural Comput, 22(8):1993–2001, 2010.
[10] B. L. Ho and R. E. Kalman. Effective construction of linear state-variable models from input/output functions. Regelungstechnik, 14(12):545–548, 1966.
[11] J. Møller, A. Syversveen, and R. Waagepetersen. Log Gaussian Cox processes. Scand J Stat, 25(3):451–482, 1998.
[12] V. Lawhern, W. Wu, N. Hatsopoulos, and L. Paninski.
Population decoding of motor cortical activity using a generalized linear model with hidden states. J Neurosci Methods, 189(2):267–280, 2010.
[13] A. Z. Mangion, K. Yuan, V. Kadirkamanathan, M. Niranjan, and G. Sanguinetti. Online variational inference for state-space models with point-process observations. Neural Comput, 23(8):1967–1999, 2011.
[14] M. Vidne, Y. Ahmadian, J. Shlens, J. Pillow, J. Kulkarni, A. Litke, E. Chichilnisky, E. Simoncelli, and L. Paninski. Modeling the impact of common noise inputs on the network activity of retinal ganglion cells. J Comput Neurosci, 2011.
[15] J. H. Macke, L. Büsing, J. P. Cunningham, B. M. Yu, K. V. Shenoy, and M. Sahani. Empirical models of spiking in neural populations. In Advances in Neural Information Processing Systems, vol. 24. Curran Associates, Inc., 2012.
[16] J. Kulkarni and L. Paninski. Common-input models for multiple neural spike-train data. Network, 18(4):375–407, 2007.
[17] L. Paninski. Maximum likelihood estimation of cascade point-process neural encoding models. Network, 15(4):243–262, 2004.
[18] R. E. Turner and M. Sahani. Two problems with variational expectation maximisation for time-series models. In D. Barber, A. T. Cemgil, and S. Chiappa, eds., Inference and Learning in Dynamic Models. Cambridge University Press, 2011.
[19] M. Krumin and S. Shoham. Generation of spike trains with controlled auto- and cross-correlation functions. Neural Comput, pp. 1–23, 2009.
[20] J. Macke, P. Berens, A. Ecker, A. Tolias, and M. Bethge. Generating spike trains with specified correlation coefficients. Neural Comput, 21(2):397–423, 2009.
[21] B. M. Yu, J. P. Cunningham, G. Santhanam, S. I. Ryu, K. V. Shenoy, and M. Sahani. Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity. J Neurophysiol, 102(1):614–635, 2009.
[22] M. M. Churchland, B. M. Yu, S. Ryu, G.
Santhanam, and K. V. Shenoy. Neural variability in premotor cortex provides a signature of motor preparation. J Neurosci, 26(14):3697–3712, 2006.
[23] L. Paninski, Y. Ahmadian, D. Ferreira, S. Koyama, K. Rahnama Rad, M. Vidne, J. Vogelstein, and W. Wu. A new look at state-space models for neural data. J Comput Neurosci, 29:107–126, 2010.
[24] K. Yuan, M. Girolami, and M. Niranjan. Markov chain Monte Carlo methods for state-space models with point process observations. Neural Comput, 24(6):1462–1486, 2012.