{"title": "Variational Gaussian Process State-Space Models", "book": "Advances in Neural Information Processing Systems", "page_first": 3680, "page_last": 3688, "abstract": "State-space models have been successfully used for more than fifty years in different areas of science and engineering. We present a procedure for efficient variational Bayesian learning of nonlinear state-space models based on sparse Gaussian processes. The result of learning is a tractable posterior over nonlinear dynamical systems. In comparison to conventional parametric models, we offer the possibility to straightforwardly trade off model capacity and computational cost whilst avoiding overfitting. Our main algorithm uses a hybrid inference approach combining variational Bayes and sequential Monte Carlo. We also present stochastic variational inference and online learning approaches for fast learning with long time series.", "full_text": "Variational Gaussian Process State-Space Models\n\nRoger Frigola, Yutian Chen and Carl E. Rasmussen\n\nDepartment of Engineering\nUniversity of Cambridge\n\n{rf342,yc373,cer54}@cam.ac.uk\n\nAbstract\n\nState-space models have been successfully used for more than \ufb01fty years in differ-\nent areas of science and engineering. We present a procedure for ef\ufb01cient varia-\ntional Bayesian learning of nonlinear state-space models based on sparse Gaussian\nprocesses. The result of learning is a tractable posterior over nonlinear dynamical\nsystems. In comparison to conventional parametric models, we offer the possi-\nbility to straightforwardly trade off model capacity and computational cost whilst\navoiding over\ufb01tting. Our main algorithm uses a hybrid inference approach com-\nbining variational Bayes and sequential Monte Carlo. 
We also present stochastic variational inference and online learning approaches for fast learning with long time series.

1 Introduction

State-space models (SSMs) are a widely used class of models that have found success in applications as diverse as robotics, ecology, finance and neuroscience (see, e.g., Brown et al. [3]). State-space models generalize other popular time series models such as linear and nonlinear auto-regressive models: (N)ARX, (N)ARMA, (G)ARCH, etc. [21].

In this article we focus on Bayesian learning of nonparametric nonlinear state-space models. In particular, we use sparse Gaussian processes (GPs) [19] as a convenient method to encode general assumptions about the dynamical system such as continuity or smoothness. In contrast to conventional parametric methods, we allow the user to easily trade off model capacity and computation time. Moreover, we present a variational training procedure that allows very complex models to be learned without risk of overfitting.

Our variational formulation leads to a tractable approximate posterior over nonlinear dynamical systems. This approximate posterior can be used to compute fast probabilistic predictions of future trajectories of the dynamical system. The computational complexity of our learning approach is linear in the length of the time series. This is possible thanks to the use of variational sparse GPs [22], which lead to a smoothing problem for the latent state trajectory in a simpler auxiliary dynamical system. Smoothing in this auxiliary system can be carried out with any conventional technique (e.g. sequential Monte Carlo).
In addition, we present a stochastic variational inference procedure [10] to accelerate learning for long time series and we also present an online learning scheme.

This work is useful in situations where: 1) it is important to know how uncertain future predictions are, 2) there is not enough knowledge about the underlying nonlinear dynamical system to create a principled parametric model, and 3) it is necessary to have an explicit model that can be used to simulate the dynamical system into the future. These conditions arise often in engineering and finance. For instance, consider an autonomous aircraft adapting its flight control when carrying a large external load of unknown weight and aerodynamic characteristics. A model of the nonlinear dynamics of the new system can be very useful in order to automatically adapt the control strategy. When few data points are available, there is high uncertainty about the dynamics. In this situation, a model that quantifies its uncertainty can be used to synthesize control laws that avoid the risks of overconfidence.

The problem of learning flexible models of nonlinear dynamical systems has been tackled from multiple perspectives. Ghahramani and Roweis [9] presented a maximum likelihood approach to learn nonlinear SSMs based on radial basis functions. This work was later extended by using a parameterized Gaussian process point of view and developing tailored filtering algorithms [6, 7, 23]. Approximate Bayesian learning has also been developed for parameterized nonlinear SSMs [5, 24]. Wang et al. [25] modeled the nonlinear functions in SSMs using Gaussian processes (GP-SSMs) and found a MAP estimate of the latent variables and hyperparameters. Their approach preserved the nonparametric properties of Gaussian processes.
Despite using MAP learning over state trajectories, overfitting was not an issue since it was applied in a dimensionality reduction context where the latent space of the SSM was much smaller than the observation space. In a similar vein, [4, 12] presented a hierarchical Gaussian process model that could model linear dynamics and nonlinear mappings from latent states to observations. More recently, Frigola et al. [8] learned GP-SSMs in a fully Bayesian manner by employing particle MCMC methods to sample from the smoothing distribution. However, their approach led to predictions with a computational cost proportional to the length of the time series.

In the rest of this article, we present an approach to variational Bayesian learning of flexible nonlinear state-space models which leads to a simple representation of the posterior over nonlinear dynamical systems and results in predictions having a low computational complexity.

2 Gaussian Process State-Space Models

We consider discrete-time nonlinear state-space models built with deterministic functions and additive noise:

x_{t+1} = f(x_t) + v_t,    (1a)
y_t = g(x_t) + e_t.    (1b)

The dynamics of the system are defined by the state transition function f(x_t) and independent additive noise v_t (process noise). The states x_t ∈ R^D are latent variables such that all future variables are conditionally independent of the past given the present state. Observations y_t ∈ R^E are linked to the state via another deterministic function g(x_t) and independent additive noise e_t (observation noise). State-space models are stochastic dynamical processes that are useful to model time series y ≜ {y_1, ..., y_T}. The deterministic functions in (1) can also take external known inputs (such as control signals) as an argument but, for conciseness, we will omit those in our notation.

A traditional approach to learn f and g is to restrict them to a family of parametric functions.
This is particularly appropriate when the dynamical system is very well understood, e.g. the orbital mechanics of a spacecraft. However, in many applications it is difficult to specify a class of parametric models that can provide both the ability to model complex functions and resistance to overfitting thanks to an easy-to-specify prior or regularizer. Gaussian processes do have these properties: they can represent functions of arbitrary complexity and provide a straightforward way to specify assumptions about those unknown functions, e.g. smoothness. In the light of this, it is natural to place Gaussian process priors over both f and g [25]. However, the extreme flexibility of the two Gaussian processes leads to severe nonidentifiability and strong correlations between the posteriors of the two unknown functions. In the rest of this paper we will focus on a model with a GP prior over the transition function and a parametric likelihood. However, our variational formulation can also be applied to the double GP case (see supplementary material).

A probabilistic state-space model with a Gaussian process prior over the transition function and a parametric likelihood is specified by

f(x) ~ GP(m_f(x), k_f(x, x')),    (2a)
x_0 ~ p(x_0),    (2b)
x_t | f_t ~ N(x_t | f_t, Q),    (2c)
y_t | x_t ~ p(y_t | x_t, θ_y),    (2d)

where we have used f_t ≜ f(x_{t-1}). Since f(x) ∈ R^D, we use the convention that the covariance function k_f returns a D × D matrix. We group all hyperparameters into θ ≜ {θ_f, θ_y, Q}. Note that

Figure 1: State trajectories from four 2-state nonlinear dynamical systems sampled from a GP-SSM prior with fixed hyperparameters. The same prior generates systems with qualitatively different behaviors, e.g.
the leftmost panel shows behavior similar to that of a non-oscillatory linear system whereas the rightmost panel appears to have arisen from a limit cycle in a nonlinear system.

we are not restricting the likelihood (2d) to any particular form. The joint distribution of a GP-SSM is

p(y, x, f) = p(x_0) ∏_{t=1}^T p(y_t | x_t) p(x_t | f_t) p(f_t | f_{1:t-1}, x_{0:t-1}),    (3)

where we use the convention f_{1:0} = ∅ and omit the conditioning on θ in the notation. The GP on the transition function induces a distribution over the latent function values with the form of a GP predictive:

p(f_t | f_{1:t-1}, x_{0:t-1}) = N( m_f(x_{t-1}) + K_{t-1,0:t-2} K_{0:t-2,0:t-2}^{-1} (f_{1:t-1} - m_f(x_{0:t-2})),  K_{t-1,t-1} - K_{t-1,0:t-2} K_{0:t-2,0:t-2}^{-1} K_{t-1,0:t-2}^T ),    (4)

where the subindices of the kernel matrices indicate the arguments to the covariance function necessary to build each matrix, e.g. K_{t-1,0:t-2} = [k_f(x_{t-1}, x_0) ... k_f(x_{t-1}, x_{t-2})]. When t = 1, the distribution is that of a GP marginal, p(f_1 | x_0) = N(m_f(x_0), k_f(x_0, x_0)).

Equation (3) provides a sequential procedure to sample state trajectories and observations. GP-SSMs are doubly stochastic models in the sense that one could, at least notionally, first sample a state transition dynamics function from eq. (2a) and then, conditioned on that function, sample the state trajectory and observations.

GP-SSMs are a very rich prior over nonlinear dynamical systems. In Fig. 1 we illustrate this concept by showing state trajectories sampled from a GP-SSM with fixed hyperparameters. The dynamical systems associated with each of these trajectories are qualitatively very different from each other.
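The sequential factorization in eq. (3) is straightforward to turn into code. Below is a minimal sketch (our own illustration, not from the paper) that samples a one-dimensional state trajectory from a GP-SSM prior with a squared-exponential kernel, zero mean function and additive process noise; all function names and parameter values are illustrative.

```python
import numpy as np

def sq_exp_kernel(a, b, ell=1.0, sf=1.0):
    """Squared-exponential covariance between two 1-D input arrays."""
    a = np.atleast_1d(np.asarray(a, dtype=float))
    b = np.atleast_1d(np.asarray(b, dtype=float))
    return sf**2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

def sample_gpssm_prior(T, q=0.05, x0=0.0, jitter=1e-8, seed=0):
    """Sample x_{0:T} from a 1-D GP-SSM prior by sequentially drawing
    f_t | f_{1:t-1}, x_{0:t-1} from the GP predictive and then x_t = f_t + v_t."""
    rng = np.random.default_rng(seed)
    traj = [x0]   # states x_0, ..., x_t
    F = []        # function values f_1, ..., f_t, where f_t = f(x_{t-1})
    for t in range(1, T + 1):
        x_new = np.array([traj[-1]])             # new GP input x_{t-1}
        if F:
            X_old = np.array(traj[:-1])          # previous inputs x_{0:t-2}
            K = sq_exp_kernel(X_old, X_old) + jitter * np.eye(len(X_old))
            k = sq_exp_kernel(x_new, X_old)
            mean = (k @ np.linalg.solve(K, np.array(F))).item()
            var = (sq_exp_kernel(x_new, x_new) - k @ np.linalg.solve(K, k.T)).item()
        else:
            mean, var = 0.0, sq_exp_kernel(x_new, x_new).item()
        f_t = mean + np.sqrt(max(var, 0.0)) * rng.standard_normal()
        F.append(f_t)
        traj.append(f_t + np.sqrt(q) * rng.standard_normal())  # x_t = f_t + v_t
    return np.array(traj)
```

Each step conditions the GP on all previously sampled input-output pairs, so the cost of naive prior sampling grows rapidly with T; different random seeds yield the qualitatively different behaviors illustrated in Fig. 1.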
For instance, the leftmost panel shows the dynamics of an almost linear non-oscillatory system whereas the rightmost panel corresponds to a limit cycle in a nonlinear system. Our goal in this paper is to use this prior over dynamical systems and obtain a tractable approximation to the posterior over dynamical systems given the data.

3 Variational Inference in GP-SSMs

Since the GP-SSM is a nonparametric model, in order to define a posterior distribution over f(x) and make probabilistic predictions it is necessary to first find the smoothing distribution p(x_{0:T} | y_{1:T}). Frigola et al. [8] obtained samples from the smoothing distribution that could be used to define a predictive density via Monte Carlo integration. This approach is expensive since it requires averaging over L state trajectory samples of length T. In this section we present an alternative approach that aims to find a tractable distribution over the state transition function that is independent of the length of the time series. We achieve this by using variational sparse GP techniques [22].

3.1 Augmenting the Model with Inducing Variables

As a first step to perform variational inference in a GP-SSM, we augment the model with M inducing points u ≜ {u_i}_{i=1}^M. Those inducing points are jointly Gaussian with the latent function values. In the case of a GP-SSM, the joint probability density becomes

p(y, x, f, u) = p(x, f | u) p(u) ∏_{t=1}^T p(y_t | x_t),    (5)

where

p(u) = N(u | 0, K_{u,u}),    (6a)
p(x, f | u) = p(x_0) ∏_{t=1}^T p(f_t | f_{1:t-1}, x_{0:t-1}, u) p(x_t | f_t),    (6b)

∏_{t=1}^T p(f_t | f_{1:t-1}, x_{0:t-1}, u) = N( f_{1:T} | K_{0:T-1,u} K_{u,u}^{-1} u,  K_{0:T-1} - K_{0:T-1,u} K_{u,u}^{-1} K_{0:T-1,u}^T ).    (6c)

Kernel matrices relating to the inducing points depend on a set of inducing inputs {z_i}_{i=1}^M in such a way that K_{u,u} is an MD × MD matrix formed with blocks k_f(z_i, z_j) having size D × D. For brevity, we use a zero mean function and we omit conditioning on the inducing inputs in the notation.

3.2 Evidence Lower Bound of an Augmented GP-SSM

Variational inference [1] is a popular method for approximate Bayesian inference based on making assumptions about the posterior over latent variables that lead to a tractable lower bound on the evidence of the model (sometimes referred to as the ELBO). Maximizing this lower bound is equivalent to minimizing the Kullback-Leibler divergence between the approximate posterior and the exact one. Following standard variational inference methodology [1], we obtain the evidence lower bound of a GP-SSM augmented with inducing points:

log p(y | θ) ≥ ∫_{x,f,u} q(x, f, u) log [ p(u) p(x_0) ∏_{t=1}^T p(f_t | f_{1:t-1}, x_{0:t-1}, u) p(y_t | x_t) p(x_t | f_t) / q(x, f, u) ].    (7)

In order to achieve tractability, we use a variational distribution that factorizes as

q(x, f, u) = q(u) q(x) ∏_{t=1}^T p(f_t | f_{1:t-1}, x_{0:t-1}, u),    (8)

where q(u) and q(x) can take any form but the terms relating to f are taken to match those of the prior (3). As a consequence, the difficult p(f_t | f_{1:t-1}, x_{0:t-1}, u) terms inside the log cancel out and lead to the following lower bound:

L(q(u), q(x), θ) = -KL(q(u) ‖ p(u)) + H(q(x)) + ∫_x q(x) log p(x_0) + ∑_{t=1}^T [ ∫_x q(x) log p(y_t | x_t) + ∫_{x,u} q(x) q(u) Φ(x_t, x_{t-1}, u) ],    (9)

where KL denotes the Kullback-Leibler divergence, H the entropy, and Φ(x_t, x_{t-1}, u) ≜ ∫_{f_t} p(f_t | x_{t-1}, u) log p(x_t | f_t). The integral with respect to f_t can be solved analytically: Φ(x_t, x_{t-1}, u) = -(1/2) tr(Q^{-1} B_{t-1}) + log N(x_t | A_{t-1} u, Q), where A_{t-1} = K_{t-1,u} K_{u,u}^{-1} and B_{t-1} = K_{t-1,t-1} - K_{t-1,u} K_{u,u}^{-1} K_{u,t-1}.

As in other variational sparse GP methods, the choice of variational distribution (8) gives the ability to precisely learn the latent function at the locations of the inducing inputs. Away from those locations, the posterior takes the form of the prior conditioned on the inducing variables. By increasing the number of inducing variables, the ELBO can only become tighter [22]. This offers a straightforward trade-off between model capacity and computation cost without increasing the risk of overfitting.

3.3 Optimal Variational Distribution for u

The optimal distribution q*(u) can be found by setting to zero the functional derivative of the evidence lower bound with respect to q(u):

q*(u) ∝ p(u) ∏_{t=1}^T exp{ ⟨log N(x_t | A_{t-1} u, Q)⟩_{q(x)} },    (10)

where ⟨·⟩_{q(x)} denotes an expectation with respect to q(x). The optimal variational distribution q*(u) is, conveniently, a multivariate Gaussian distribution.
If, for simplicity of notation, we restrict ourselves to D = 1, the natural parameters of the optimal distribution are

η_1 = Q^{-1} ∑_{t=1}^T ⟨A_{t-1}^T x_t⟩_{q(x_t, x_{t-1})},    η_2 = -(1/2) ( K_{uu}^{-1} + Q^{-1} ∑_{t=1}^T ⟨A_{t-1}^T A_{t-1}⟩_{q(x_{t-1})} ).    (11)

The mean and covariance matrix of q*(u), denoted μ and Σ respectively, can be computed as μ = Σ η_1 and Σ = (-2 η_2)^{-1}. Note that the optimal q(u) depends on the sufficient statistics Ψ_1 = ∑_{t=1}^T ⟨K_{t-1,u}^T x_t⟩_{q(x_t, x_{t-1})} and Ψ_2 = ∑_{t=1}^T ⟨K_{t-1,u}^T K_{t-1,u}⟩_{q(x_{t-1})}.

3.4 Optimal Variational Distribution for x

In an analogous way as for q*(u), we can obtain the optimal form of q(x):

q*(x) ∝ p(x_0) ∏_{t=1}^T p(y_t | x_t) exp{ -(1/2) tr( Q^{-1} (B_{t-1} + A_{t-1} Σ A_{t-1}^T) ) } N(x_t | A_{t-1} μ, Q),    (12)

where we have used q(u) = N(u | μ, Σ).

The optimal distribution q*(x) is equivalent to the smoothing distribution of an auxiliary parametric state-space model. The auxiliary model is simpler than the original one in (3) since the latent states factorize with a Markovian structure. Equation (12) can be interpreted as a nonlinear state-space model with a Gaussian state transition density, N(x_t | A_{t-1} μ, Q), and a likelihood augmented with an additional term: exp{ -(1/2) tr( Q^{-1} (B_{t-1} + A_{t-1} Σ A_{t-1}^T) ) }.

Smoothing in nonlinear Markovian state-space models is a standard problem in the context of time series modeling. There are various existing strategies to find the smoothing distribution which could be used depending on the characteristics of each particular problem [20].
For instance, in a mildly nonlinear system with Gaussian noise, an extended Kalman smoother can have very good performance. On the other hand, problems with severe nonlinearities and/or non-Gaussian likelihoods can lead to heavily multimodal smoothing distributions that are better represented using particle methods. We note that the application of sequential Monte Carlo (SMC) is particularly straightforward in the present auxiliary model.

3.5 Optimizing the Evidence Lower Bound

Algorithm 1 presents a procedure to maximize the evidence lower bound by alternately sampling from the smoothing distribution and taking steps both in θ and in the natural parameters of q*(u). We propose a hybrid variational-sampling approach whereby approximate samples from q*(x) are obtained with a sequential Monte Carlo smoother. However, as discussed in Section 3.4, depending on the characteristics of the dynamical system, other smoothing methods could be more appropriate [20]. As an alternative to smoothing on the auxiliary dynamical system in (12), one could force a q(x) from a particular family of distributions and optimize the evidence lower bound with respect to its variational parameters. For instance, we could posit a Gaussian q(x) with a sparsity pattern in the covariance matrix assuming zero covariance between non-neighboring states and maximize the ELBO with respect to the variational parameters.

We use stochastic gradient descent [10] to maximize the ELBO (where we have plugged in the optimal q*(u) [22]) by using its gradient with respect to the hyperparameters. Both quantities are stochastic in our hybrid approach due to variance introduced by the sampling of q*(x). In fact, vanilla sequential Monte Carlo methods will result in biased estimators of the gradient and the parameters of q*(u). However, in our experiments this has not been an issue.
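As a concrete illustration of the q*(u) update, the following sketch (our own code, not the authors'; one-dimensional state, a generic kernel passed in as a function, and the expectations of eq. (11) replaced by averages over smoothing particles) computes the natural parameters and the corresponding mean and covariance:

```python
import numpy as np

def optimal_qu(particles, z, kernel, Q, jitter=1e-8):
    """Monte Carlo estimate of the optimal q(u) = N(mu, Sigma) for D = 1.

    particles: (L, T+1) array of state trajectories x_{0:T} sampled from q(x).
    z:         (M,) inducing inputs; kernel(a, b) returns the covariance matrix.
    """
    L, _ = particles.shape
    M = len(z)
    Kuu_inv = np.linalg.inv(kernel(z, z) + jitter * np.eye(M))
    eta1 = np.zeros(M)
    S = np.zeros((M, M))                     # accumulates sum_t <A_{t-1}^T A_{t-1}>
    for l in range(L):
        x = particles[l]
        A = kernel(x[:-1], z) @ Kuu_inv      # (T, M); row t is A_{t-1}
        eta1 += (A.T @ x[1:]) / L            # particle average of sum_t A^T x_t
        S += (A.T @ A) / L
    eta1 /= Q                                # eta_1 of eq. (11)
    eta2 = -0.5 * (Kuu_inv + S / Q)          # eta_2 of eq. (11)
    Sigma = np.linalg.inv(-2.0 * eta2)
    mu = Sigma @ eta1
    return eta1, eta2, mu, Sigma
```

Because trajectories are joint samples of (x_{t-1}, x_t), the particle averages estimate the joint expectations ⟨·⟩_{q(x_t, x_{t-1})} directly, and -2η_2 is positive definite by construction, so Σ is a valid covariance.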
Techniques such as particle MCMC would be a viable alternative to conventional sequential Monte Carlo [13].

Algorithm 1 Variational learning of GP-SSMs with particle smoothing. Batch mode (i.e. non-SVI) is the particular case where the mini-batch is the whole dataset.
Require: Observations y_{1:T}. Initial values for θ, η_1 and η_2. Schedules for ρ and λ. i = 1.
repeat
    y_{τ:τ'} ← SAMPLEMINIBATCH(y_{1:T})
    {x_{τ:τ'}^l}_{l=1}^L ← GETSAMPLESOPTIMALQX(y_{τ:τ'}, θ, η_1, η_2)    ▷ sample from eq. (12)
    ∇_θ L ← GETTHETAGRADIENT({x_{τ:τ'}^l}_{l=1}^L, θ)    ▷ supp. material
    η_1*, η_2* ← GETOPTIMALQU({x_{τ:τ'}^l}_{l=1}^L, θ)    ▷ eq. (11) or (14)
    η_1 ← η_1 + ρ_i (η_1* − η_1)
    η_2 ← η_2 + ρ_i (η_2* − η_2)
    θ ← θ + λ_i ∇_θ L
    i ← i + 1
until ELBO convergence

3.6 Making Predictions

One of the most appealing properties of our variational approach to learning GP-SSMs is that the approximate predictive distribution of the state transition function can be cheaply computed:

p(f* | x*, y) = ∫_{x,u} p(f* | x*, x, u) p(x | u, y) p(u | y) ≈ ∫_{x,u} p(f* | x*, u) p(x | u, y) q(u)
             = ∫_u p(f* | x*, u) q(u) = N(f* | A* μ, B* + A* Σ A*^T).    (13)

The derivation in eq. (13) contains two approximations: 1) predictions at new test points are considered to depend only on the inducing variables, and 2) the posterior distribution over u is approximated by a variational distribution.

After pre-computations, the cost of each prediction is O(M) for the mean and O(M^2) for the variance.
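The prediction step (13) reduces to standard sparse-GP algebra. A minimal sketch (our own notation and names; one-dimensional case, with a kernel function passed in and the variational parameters μ, Σ obtained during training):

```python
import numpy as np

def predict_f(x_star, z, kernel, mu, Sigma, jitter=1e-8):
    """Approximate predictive p(f* | x*, y) ≈ N(A* mu, B* + A* Sigma A*^T), eq. (13).

    x_star: (N,) test states; z: (M,) inducing inputs;
    mu, Sigma: mean and covariance of the variational q(u).
    """
    M = len(z)
    Kuu_inv = np.linalg.inv(kernel(z, z) + jitter * np.eye(M))  # one-off O(M^3) pre-computation
    Ks = kernel(x_star, z)                                      # (N, M)
    A = Ks @ Kuu_inv                                            # A*
    B = kernel(x_star, x_star) - Ks @ Kuu_inv @ Ks.T            # B*
    mean = A @ mu
    cov = B + A @ Sigma @ A.T
    return mean, cov
```

Once K_{uu}^{-1} is cached, each new test point only touches M-dimensional quantities, which is where the O(M) mean and O(M^2) variance costs quoted above come from.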
This contrasts with the O(TL) and O(T^2 L) complexity of approaches based on sampling from the smoothing distribution, where p(f* | x*, y) = ∫_x p(f* | x*, x) p(x | y) is approximated with L samples from p(x | y) [8]. The variational approach condenses the learning of the latent function on the inducing points u and does not explicitly need the smoothing distribution p(x | y) to make predictions.

4 Stochastic Variational Inference

Stochastic variational inference (SVI) [10] can be readily applied using our evidence lower bound. When the observed time series is long, it can be expensive to compute q*(u) or the gradient of L with respect to the hyperparameters and inducing inputs. Since both q*(u) and ∂L/∂{θ, z_{1:M}} depend linearly on q(x) via sufficient statistics that contain a summation over all elements in the state trajectory, we can obtain unbiased estimates of these sufficient statistics by using one or multiple segments of the sequence that are sampled uniformly at random. However, obtaining q(x) also requires a time complexity of O(T). Yet, in practice, q(x) can be approximated by running the smoothing algorithm locally around those segments. This can be justified by the fact that, in a time series context, the smoothing distribution at a particular time is not largely affected by measurements that are far into the past or the future [20]. The natural parameters of q*(u) can be estimated by using a portion of the time series of length S:

η_1 = Q^{-1} (T/S) ∑_{t=τ}^{τ'} ⟨A_{t-1}^T x_t⟩_{q(x_t, x_{t-1})},    η_2 = -(1/2) ( K_{uu}^{-1} + Q^{-1} (T/S) ∑_{t=τ}^{τ'} ⟨A_{t-1}^T A_{t-1}⟩_{q(x_{t-1})} ).    (14)

Table 1: Experimental evaluation of the 1D nonlinear system. Unless otherwise stated, training times are reported for a dataset with T = 500 and test times are given for a test set with 10^5 data points. All pre-computations independent of test data are performed before timing the "test time". Predictive log likelihoods are the average over the full test set. *Our PMCMC code did not use fast updates-downdates of the Cholesky factors during training; this does not affect test times.

Method                        | Test RMSE | log p(x_{t+1}^test | x_t^test, y_{0:T}^tr) | Train time | Test time
Variational GP-SSM            | 1.15      | -1.61 | 2.14 min | 0.14 s
Var. GP-SSM (SVI, T = 10^4)   | 1.07      | -1.47 | 4.12 min | 0.14 s
PMCMC GP-SSM [8]              | 1.12      | -1.57 | 547 min* | 421 s
GP-NARX [17]                  | 1.46      | -1.90 | 0.22 min | 3.85 s
GP-NARX + FITC [17, 18]       | 1.47      | -1.90 | 0.17 min | 0.23 s
Linear (N4SID, [16])          | 2.35      | -2.30 | 0.01 min | 0.11 s

5 Online Learning

Our variational approach to learning GP-SSMs also leads naturally to an online learning implementation. This is of particular interest in the context of dynamical systems as it is often the case that data arrive in a sequential manner, e.g. a robot learning the dynamics of different objects by interacting with them. Online learning in a Bayesian setting consists in the sequential application of Bayes' rule whereby the posterior after observing data up to time t becomes the prior at time t + 1 [2, 15]. In our case, this involves replacing the prior p(u) = N(u | 0, K_{u,u}) by the approximate posterior N(u | μ, Σ) obtained in the previous step.
The expressions for the update of the natural parameters of q*(u) with a new mini-batch y_{τ:τ'} are

η'_1 = η_1 + Q^{-1} ∑_{t=τ}^{τ'} ⟨A_{t-1}^T x_t⟩_{q(x_t, x_{t-1})},    η'_2 = η_2 - (1/2) Q^{-1} ∑_{t=τ}^{τ'} ⟨A_{t-1}^T A_{t-1}⟩_{q(x_{t-1})}.    (15)

6 Experiments

The goal of this section is to showcase the ability of variational GP-SSMs to perform approximate Bayesian learning of nonlinear dynamical systems. In particular, we want to demonstrate: 1) the ability to learn the inherent nonlinear dynamics of a system, 2) the application in cases where the latent states have higher dimensionality than the observations, and 3) the use of non-Gaussian likelihoods.

6.1 1D Nonlinear System

We apply the variational learning procedure presented above to the one-dimensional nonlinear system described by p(x_{t+1} | x_t) = N(f(x_t), 1) and p(y_t | x_t) = N(x_t, 1), where the transition function is f(x_t) = x_t + 1 if x_t < 4 and f(x_t) = -4x_t + 21 if x_t ≥ 4. Its pronounced kink makes it challenging to learn. Our goal is to find a posterior distribution over this function using a GP-SSM with a Matérn covariance function. To solve the expectations with respect to the approximate smoothing distribution q(x) we use a bootstrap particle fixed-lag smoother with 1000 particles and a lag of 10.

In Table 1 we compare our method (Variational GP-SSM) against the PMCMC sampling procedure from [8], taking 100 samples with 10 burn-in samples. As in [8], the sampling exhibited very good mixing with 20 particles. We also compare to an auto-regressive model of order 5 based on Gaussian process regression [17] with a Matérn ARD covariance function, with and without the FITC approximation. Finally, we use a linear subspace identification method (N4SID, [16]) as a baseline for comparison.
The PMCMC training offers the best test performance of all methods using 500 training points, at the cost of substantial train and test time. However, if more data is available (T = 10^4), the stochastic variational inference procedure can be very attractive since it improves test performance while having a test time that is independent of the training set size. The reported SVI performance has been obtained with mini-batches of 100 time-steps.

6.2 Neural Spike Train Recordings

We now turn to the use of SSMs to learn a simple model of neural activity in the rat hippocampus. We use data in neuron cluster 1 (the most active) from experiment ec013.717 in [14]. In some regions of the time series, the action potential spikes show a clear pattern where periods of rapid spiking are followed by periods of very little spiking. We wish to model this behaviour as an autonomous nonlinear dynamical system (i.e. one not driven by external inputs). Many parametric models of nonlinear neuron dynamics have been proposed [11], but our goal here is to learn a model from data without using any biological insight.

Figure 2: From left to right: 1) part of the observed spike count data, 2) sample from the corresponding smoothing distribution, 3) predictive distribution of spike counts obtained by simulating the posterior dynamics from an initial state, and 4) corresponding latent states.

Figure 3: Contour plots of the state transition function x^(2)_{t+1} = f(x^(1)_t, x^(2)_t), and trajectories in state space. Left: mean posterior function and trajectory from the smoothing distribution. Other three panels: transition functions sampled from the posterior and trajectories simulated conditioned on the corresponding sample. Those simulated trajectories start inside the limit cycle and are naturally attracted towards it. Note how function samples are very similar in the region of the limit cycle.
We use a GP-SSM with a structure such that it is the discrete-time analog of a second-order nonlinear ordinary differential equation: two states, one of which is the derivative of the other. The observations are spike counts in temporal bins of 0.01 second width. We use a Poisson likelihood relating the spike counts to the second latent state: y_t | x_t ~ Poisson(exp(α x^(2)_t + β)).

We find a posterior distribution for the state transition function using our variational GP-SSM approach. Smoothing is done with a fixed-lag particle smoother, and training until convergence takes approximately 50 iterations of Algorithm 1. Figure 2 shows a part of the raw data together with an approximate sample from the smoothing distribution during the same time interval. In addition, we show the distribution over predictions made by chaining 1-step-ahead predictions. To make those predictions we have switched off process noise (Q = 0) to show more clearly the effect of uncertainty in the state transition function. Note how the frequency of roughly 6 Hz present in the data is well captured. Figure 3 shows how the limit cycle corresponding to a nonlinear dynamical system has been captured (see caption for details).

7 Discussion and Future Work

We have derived a tractable variational formulation to learn GP-SSMs: an important class of models of nonlinear dynamical systems that is particularly suited to applications where a principled parametric model of the dynamics is not available. Our approach makes it possible to learn very expressive models without risk of overfitting. In contrast to previous approaches [4, 12, 25], we have demonstrated the ability to learn a nonlinear state transition function in a latent space of greater dimensionality than the observation space.
More crucially, our approach yields a tractable posterior over nonlinear systems that, as opposed to those based on sampling from the smoothing distribution [8], results in a computation time for the predictions that does not depend on the length of the time series.

Given the interesting capabilities of variational GP-SSMs, we believe that future work is warranted. In particular, we want to focus on structured variational distributions q(x) that could eliminate the need to solve the smoothing problem in the auxiliary dynamical system at the cost of having more variational parameters to optimize. On a more theoretical side, we would like to better characterize GP-SSM priors in terms of their dynamical-system properties: stability, equilibria, limit cycles, etc.

References

[1] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[2] Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, and Michael Jordan. Streaming variational Bayes. In Advances in Neural Information Processing Systems 26, pages 1727-1735. Curran Associates, Inc., 2013.
[3] Emery N. Brown, Loren M. Frank, Dengda Tang, Michael C. Quirk, and Matthew A. Wilson. A statistical paradigm for neural spike train decoding applied to position prediction from ensemble firing patterns of rat hippocampal place cells. The Journal of Neuroscience, 18(18):7411-7425, 1998.
[4] Andreas C. Damianou, Michalis Titsias, and Neil D. Lawrence. Variational Gaussian process dynamical systems. In Advances in Neural Information Processing Systems 24, pages 2510-2518. 2011.
[5] J. Daunizeau, K. J. Friston, and S. J. Kiebel. Variational Bayesian identification and prediction of stochastic nonlinear dynamic causal models.
Physica D: Nonlinear Phenomena, 238(21):2089–2118, 2009.
[6] M. P. Deisenroth and S. Mohamed. Expectation propagation in Gaussian process dynamical systems. In Advances in Neural Information Processing Systems (NIPS) 25, pages 2618–2626. 2012.
[7] M. P. Deisenroth, R. D. Turner, M. F. Huber, U. D. Hanebeck, and C. E. Rasmussen. Robust filtering and smoothing with Gaussian processes. IEEE Transactions on Automatic Control, 57(7):1865–1871, 2012.
[8] Roger Frigola, Fredrik Lindsten, Thomas B. Schön, and Carl E. Rasmussen. Bayesian inference and learning in Gaussian process state-space models with particle MCMC. In Advances in Neural Information Processing Systems (NIPS) 26. 2013.
[9] Z. Ghahramani and S. Roweis. Learning nonlinear dynamical systems using an EM algorithm. In Advances in Neural Information Processing Systems (NIPS) 11. MIT Press, 1999.
[10] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
[11] Eugene M. Izhikevich. Neural excitability, spiking and bursting. International Journal of Bifurcation and Chaos, 10(06):1171–1266, 2000.
[12] Neil D. Lawrence and Andrew J. Moore. The hierarchical Gaussian process latent variable model. In Zoubin Ghahramani, editor, Proceedings of the 24th International Conference on Machine Learning (ICML), 2007.
[13] Fredrik Lindsten and Thomas B. Schön. Backward simulation methods for Monte Carlo statistical inference. Foundations and Trends in Machine Learning, 6(1):1–143, 2013.
[14] K. Mizuseki, A. Sirota, E. Pastalkova, K. Diba, and G. Buzsáki. Multiple single unit recordings from different rat hippocampal and entorhinal regions while the animals were performing multiple behavioral tasks. CRCNS.org. http://dx.doi.org/10.6080/K09G5JRZ, 2013.
[15] Manfred Opper. A Bayesian approach to on-line learning.
In David Saad, editor, On-Line Learning in Neural Networks. Cambridge University Press, 1998.
[16] P. Van Overschee and B. De Moor. Subspace Identification for Linear Systems: Theory, Implementation, Applications. Kluwer Academic Publishers, 1996.
[17] J. Quiñonero-Candela, A. Girard, J. Larsen, and C. E. Rasmussen. Propagation of uncertainty in Bayesian kernel models - application to multiple-step ahead forecasting. In Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), volume 2, pages II-701–704, April 2003.
[18] J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959, 2005.
[19] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[20] S. Särkkä. Bayesian Filtering and Smoothing. Cambridge University Press, 2013.
[21] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications. Springer, 3rd edition, 2011.
[22] Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS), 2009.
[23] R. Turner, M. P. Deisenroth, and C. E. Rasmussen. State-space inference and learning with Gaussian processes. In Yee Whye Teh and Mike Titterington, editors, 13th International Conference on Artificial Intelligence and Statistics, volume 9 of W&CP, pages 868–875, Chia Laguna, Sardinia, Italy, 2010.
[24] Harri Valpola and Juha Karhunen. An unsupervised ensemble learning method for nonlinear dynamic state-space models. Neural Computation, 14(11):2647–2692, 2002.
[25] J. M. Wang, D. J. Fleet, and A. Hertzmann. Gaussian process dynamical models.
In Advances in Neural Information Processing Systems (NIPS) 18, pages 1441–1448. MIT Press, Cambridge, MA, 2006.