{"title": "Approximate inference in latent Gaussian-Markov models from continuous time observations", "book": "Advances in Neural Information Processing Systems", "page_first": 971, "page_last": 979, "abstract": "We propose an approximate inference algorithm for continuous time Gaussian-Markov process models with both discrete and continuous time likelihoods. We show that the continuous time limit of the expectation propagation algorithm exists and results in a hybrid fixed point iteration consisting of (1) expectation propagation updates for the discrete time terms and (2) variational updates for the continuous time term. We introduce corrections methods that improve on the marginals of the approximation. This approach extends the classical Kalman-Bucy smoothing procedure to non-Gaussian observations, enabling continuous-time inference in a variety of models, including spiking neuronal models (state-space models with point process observations) and box likelihood models. Experimental results on real and simulated data demonstrate high distributional accuracy and significant computational savings compared to discrete-time approaches in a neural application.", "full_text": "Approximate inference in latent Gaussian-Markov\n\nmodels from continuous time observations\n\nBotond Cseke1\n\nManfred Opper2\n\nGuido Sanguinetti1\n\n1School of Informatics\n\nUniversity of Edinburgh, U.K.\n\n{bcseke,gsanguin}@inf.ed.ac.uk\n\n2Computer Science\nTU Berlin, Germany\n\nmanfred.opper@tu-berlin.de\n\nAbstract\n\nWe propose an approximate inference algorithm for continuous time Gaussian Markov\nprocess models with both discrete and continuous time likelihoods. We show that the\ncontinuous time limit of the expectation propagation algorithm exists and results in a\nhybrid \ufb01xed point iteration consisting of (1) expectation propagation updates for discrete\ntime terms and (2) variational updates for the continuous time term. 
We introduce post-inference correction methods that improve on the marginals of the approximation. This approach extends the classical Kalman-Bucy smoothing procedure to non-Gaussian observations, enabling continuous-time inference in a variety of models, including spiking neuronal models (state-space models with point process observations) and box likelihood models. Experimental results on real and simulated data demonstrate high distributional accuracy and significant computational savings compared to discrete-time approaches in a neural application.

1 Introduction

Continuous time stochastic processes provide a flexible and popular framework for data modelling in a broad spectrum of scientific and engineering disciplines. Their intrinsically non-parametric, infinite-dimensional nature also makes them a challenging field for the development of efficient inference algorithms. Recent years have seen several such algorithms being proposed for a variety of models [Opper and Sanguinetti, 2008, Opper et al., 2010, Rao and Teh, 2012]. Most inference work has focused on the scenario where observations are available at a finite set of time points; however, modern technologies are making effectively continuous time observations increasingly common: for example, high speed imaging technologies now enable the acquisition of biological data at around 100Hz for extended periods of time. Other scenarios give intrinsically continuous time observations: for example, sensors monitoring the transit of a particle through a barrier provide continuous time data on the particle's position. To the best of our knowledge, this problem has not been addressed in the statistical machine learning community.

In this paper, we propose an expectation-propagation (EP)-type algorithm [Opper and Winther, 2000, Minka, 2001] for latent diffusion processes observed in either discrete or continuous time. 
We derive fixed-point update equations by considering the continuous time limit of the parallel EP algorithm [e.g. Opper and Winther, 2005, Cseke and Heskes, 2011b]: these fixed point updates naturally become differential equations in the continuous time limit. Remarkably, we show that, in the presence of continuous time observations, the update equations for the EP algorithm reduce to updates for a variational Gaussian approximation [Archambeau et al., 2007]. We also generalise to the continuous-time limit the EP correction scheme of Cseke and Heskes [2011b], which enables us to capture some of the non-Gaussian behaviour of the time marginals.

2 Models and methods

We consider dynamical systems described by multivariate stochastic differential equations (SDEs) of Ornstein-Uhlenbeck (OU) type over the [0, 1] time interval

dx_t = (A_t x_t + c_t) dt + B_t^{1/2} dW_t,   (1)

where {W_t}_t is the standard Wiener process [Gardiner, 2002] and A_t, B_t and c_t are time dependent matrix and vector valued functions respectively, with B_t being positive definite for all t in [0, 1]. Even though the process does not possess a formulation through density functions (with respect to the Lebesgue measure), in order to be able to symbolically represent and manipulate the variables of the process in the Bayesian formalism, we will use the proxy p_0({x_t}) to denote their distribution.

The process can be observed (noisily) both at discrete time points and over continuous time intervals; we will partition the observations into y^d_{t_i}, t_i in T_d and y^c_t, t in [0, 1] accordingly. We assume that the likelihood function admits the general formulation

p({y^d_{t_i}}_i, {y^c_t} | {x_t}) ∝ ∏_{t_i in T_d} p(y^d_{t_i} | x_{t_i}) × exp( −∫_0^1 dt V(t, y^c_t, x_t) ).   (2)

We refer to p(y^d_{t_i} | x_{t_i}) and V(t, y^c_t, x_t) as the discrete time likelihood terms and the continuous time loss function, respectively. 
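As a concrete illustration (a sketch of our own, not part of the paper's algorithm), the prior process (1) can be simulated with a simple Euler-Maruyama scheme; the scalar coefficients below mimic the toy prior used later in Section 3.1:

```python
import numpy as np

def euler_maruyama(A, c, B_sqrt, x0, n_steps=1000, seed=0):
    """Simulate dx_t = (A(t) x_t + c(t)) dt + B(t)^{1/2} dW_t on [0, 1]
    with the Euler-Maruyama scheme; A, c, B_sqrt are callables returning
    the (possibly time dependent) coefficients at time t."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    d = len(x0)
    x = np.empty((n_steps + 1, d))
    x[0] = x0
    for k in range(n_steps):
        t = k * dt
        dW = rng.normal(scale=np.sqrt(dt), size=d)  # Wiener increment
        x[k + 1] = x[k] + (A(t) @ x[k] + c(t)) * dt + B_sqrt(t) @ dW
    return x

# 1-D example with A_t = -1, c_t = 4*pi*cos(4*pi*t), B_t = 4 (so B^{1/2} = 2),
# mimicking the prior of the soft-box experiment in Section 3.1.
path = euler_maruyama(lambda t: np.array([[-1.0]]),
                      lambda t: np.array([4 * np.pi * np.cos(4 * np.pi * t)]),
                      lambda t: np.array([[2.0]]),
                      x0=np.zeros(1))
```
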
We notice that, using Girsanov's theorem and Ito's lemma, non-linear diffusion equations with constant (diagonal) diffusion matrix can be re-written in the form (1)-(2), provided the drift can be obtained as the gradient of a potential function [e.g. Øksendal, 2010].

Our aim is to propose approximate inference methods to compute the marginals p(x_t | {y^d_{t_i}}_i, {y^c_t}) of the posterior distribution

p({x_t}_t | {y^d_{t_i}}_i, {y^c_t}) ∝ p({y^d_{t_i}}_i, {y^c_t} | {x_t}) × p_0({x_t}).

2.1 Exact inference in Gaussian models

We start from the exact case of Gaussian observations and a quadratic loss function. The linearity of equation (1) implies that the marginal distributions of the process at every time point are Gaussian (assuming Gaussian initial conditions). The time evolution of the marginal mean m_t and covariance V_t is governed by the pair of differential equations [Gardiner, 2002]

d/dt m_t = A_t m_t + c_t   and   d/dt V_t = A_t V_t + V_t A_t^T + B_t.   (3)

In the case of Gaussian observations and a quadratic loss function V(t, y^c_t, x_t) = const. − x_t^T h^c_t + (1/2) x_t^T Q^c_t x_t, these equations, together with their backward analogues, enable an exact recursive inference algorithm, known as the Kalman-Bucy smoother [e.g. Särkkä, 2006]. This algorithm arises because we can recast the loss function as an auxiliary (observation) process

dy^c_t = x_t dt + R_t^{1/2} dW_t,   (4)

where R_t^{-1} dy^c_t / dt = h^c_t and R_t^{-1} = Q^c_t. This follows by the Gaussianity of the observation process and the fundamental property of Ito's calculus dW_t^2 = I dt.

The Kalman-Bucy algorithm computes the posterior marginal means and covariances by solving the differential equations in a forward-backward fashion. 
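For the prior alone, the moment equations (3) are straightforward to integrate numerically; a minimal sketch of our own, using a plain Euler scheme:

```python
import numpy as np

def propagate_moments(A, c, B, m0, V0, n_steps=1000):
    """Euler-integrate the mean/covariance ODEs (3),
    dm/dt = A_t m + c_t and dV/dt = A_t V + V A_t^T + B_t, on [0, 1]."""
    dt = 1.0 / n_steps
    m, V = m0.astype(float).copy(), V0.astype(float).copy()
    for k in range(n_steps):
        t = k * dt
        At, ct, Bt = A(t), c(t), B(t)
        m = m + dt * (At @ m + ct)
        V = V + dt * (At @ V + V @ At.T + Bt)
    return m, V

# Time-homogeneous 1-D check: dm/dt = -m and dV/dt = -2V + 4, whose exact
# solutions at t = 1 are m(1) = e^{-1} and V(1) = 2 + 2 e^{-2}.
m1, V1 = propagate_moments(lambda t: np.array([[-1.0]]),
                           lambda t: np.zeros(1),
                           lambda t: np.array([[4.0]]),
                           m0=np.array([1.0]), V0=np.array([[4.0]]))
```

The Euler endpoint agrees with the closed-form solution to a few decimal places at this step size.
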
These can be combined with classical Kalman filtering to account for discrete-time observations. The exact form of the equations as well as the variational derivation of the Kalman-Bucy problem are given in Section B of the Supplementary Material.

2.2 Approximate inference

In this section we use an Euler discretisation of the prior and the continuous time likelihood to turn our model into a multivariate latent Gaussian model. We review the EP algorithm for such models and then show that the updates of the EP algorithm remain well defined when taking the limit ∆t → 0. The resulting approximate posterior process is again an OU process and we compute its parameters. Finally, we show how the corrections to the marginals proposed in Cseke and Heskes [2011b] can be extended to the continuous time case.

2.2.1 Euler discretisation

Let T = {t_1 = 0, t_2, ..., t_{K-1}, t_K = 1} be a discretisation of the [0, 1] interval and let the matrix x = [x_{t_1}, ..., x_{t_K}] represent the process {x_t}_t using the discretisation given by T. Without loss of generality we can assume that T_d ⊂ T. We follow the Euler-Maruyama approach and approximate p_0({x_t}) by [1]

p_0(x) = N(x_{t_1}; m_0, V_0) ∏_k N(x_{t_{k+1}}; x_{t_k} + (A_{t_k} x_{t_k} + c_{t_k}) ∆t_k, ∆t_k B_{t_k})

and in a similar fashion we approximate the continuous time likelihood by

p(y^c | x) ∝ exp( −∑_k ∆t_k V(t_k, y^c_{t_k}, x_{t_k}) ),

where y^c is the matrix y^c = [y^c_{t_1}, ..., y^c_{t_K}]. Consequently, we approximate our model by the latent Gaussian model

p({y^d_{t_i}}_i, y^c, x) = p_0(x) × ∏_i p(y^d_{t_i} | x_{t_i}) × ∏_k exp( −∆t_k V(t_k, y^c_{t_k}, x_{t_k}) ),

where we remark that the prior p_0 has a block-diagonal precision structure. To simplify notation, in the following we use the aliases φ^d_i(x_{t_i}) = p(y^d_{t_i} | x_{t_i}) and φ^c_k(x_{t_k}; ∆t_k) = exp(−∆t_k V(t_k, y^c_{t_k}, x_{t_k})).

[1] We remark that one could also integrate the OU process between time steps, yielding an exact finite dimensional marginal of the prior. In the limit, however, both procedures are equivalent.

2.2.2 Inference using expectation propagation

Expectation propagation [Opper and Winther, 2000, Minka, 2001] is a well known algorithm that provides good approximations of the posterior marginals in latent Gaussian models. We use here the parallel EP approach [e.g. Cseke and Heskes, 2011b]; similar continuous time limiting arguments can be made for the original (sequential) EP approach. The algorithm approximates the posterior p(x | {y^d_{t_i}}_i, y^c) by a Gaussian

q_0(x) ∝ p_0(x) ∏_i φ̃^d_i(x_{t_i}) ∏_k φ̃^c_k(x_{t_k}; ∆t_k),

where φ̃^d_i and φ̃^c_k are Gaussian functions. When applied to our model, the algorithm proceeds by performing the fixed point iteration

[φ̃^d_i(x_{t_i})]^{new} ∝ [ Collapse(φ^d_i(x_{t_i}) φ̃^d_i(x_{t_i})^{-1} q_0(x_{t_i}); N) / q_0(x_{t_i}) ] × φ̃^d_i(x_{t_i})   for all t_i in T_d,   (5)

[φ̃^c_k(x_{t_k}; ∆t_k)]^{new} ∝ [ Collapse(φ^c_k(x_{t_k}; ∆t_k) φ̃^c_k(x_{t_k}; ∆t_k)^{-1} q_0(x_{t_k}); N) / q_0(x_{t_k}) ] × φ̃^c_k(x_{t_k}; ∆t_k)   for all t_k in T,   (6)

where Collapse(p(z); N) = argmin_{q in N} D[p(z) || q(z)] denotes the projection of the density p(z) into the Gaussian family denoted by N. In other words, Collapse(p(z); N) is the Gaussian density that matches the first and second moments of p(z). Readers familiar with the classical formulation of EP [Minka, 2001] will recognise in equation (5) the so-called term updates, where φ̃^d_i(x_{t_i})^{-1} q_0(x_{t_i}) is the cavity distribution and φ^d_i(x_{t_i}) φ̃^d_i(x_{t_i})^{-1} q_0(x_{t_i}) the tilted distribution. Equations (5)-(6) imply that at any fixed point of the iteration we have q(x_{t_i}) = Collapse(φ^d_i(x_{t_i}) φ̃^d_i(x_{t_i})^{-1} q_0(x_{t_i}); N) and q(x_{t_k}) = Collapse(φ^c_k(x_{t_k}; ∆t_k) φ̃^c_k(x_{t_k}; ∆t_k)^{-1} q_0(x_{t_k}); N). The algorithm can also be derived and justified as a constrained optimisation problem in a Gibbs free energy formulation [Heskes et al., 2005]; this alternative approach can also be shown to extend to the continuous time limit (see Section A.2 of the Supplementary Material) and provides a useful tool for approximate evidence calculations.

Equation (5) does not depend on the time discretisation, and hence provides a valid update equation also when working directly with the continuous time process. On the other hand, the quantities in equation (6) depend explicitly on ∆t_k, and it is necessary to ensure that they remain well defined (and computable) in the continuous time limit. In order to derive the limiting behaviour of (6) we introduce the following notation: (i) we use f(z) = (z, −zz^T/2) to denote the sufficient statistic of a multivariate Gaussian; (ii) we use λ^d_{t_i} = (h^d_{t_i}, Q^d_{t_i}) as the canonical parameters corresponding to the Gaussian function φ̃^d_i(x_{t_i}) ∝ exp{λ^d_{t_i} · f(x_{t_i})} [2]; (iii) we use λ^c_{t_k} = (h^c_{t_k}, Q^c_{t_k}) as the canonical parameters corresponding to the Gaussian function φ̃^c_k(x_{t_k}) ∝ exp{∆t_k λ^c_{t_k} · f(x_{t_k})}; and finally, (iv) we use Collapse(p(z); f) to denote the canonical parameters corresponding to the density Collapse(p(z); N).

[2] We use "·" as a scalar product for general (concatenated) vector objects, for example, x · y = x^T y when x, y in R^n.
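To make the Collapse operation concrete, here is a brute-force one-dimensional moment-matching projection (an illustrative sketch of our own; the paper itself works with Gaussian canonical parameters rather than grids):

```python
import numpy as np

def collapse_1d(log_tilted, grid):
    """Project an unnormalised 1-D density exp(log_tilted(x)) onto a
    Gaussian by matching its first two moments, i.e. Collapse(. ; N),
    evaluated here by brute-force summation on a fine grid."""
    logw = log_tilted(grid)
    w = np.exp(logw - logw.max())   # unnormalised weights, stabilised
    w /= w.sum()                    # grid spacing cancels on normalising
    mean = np.sum(grid * w)
    var = np.sum((grid - mean) ** 2 * w)
    return mean, var

# Sanity check: collapsing an (unnormalised) Gaussian with mean 1 and
# variance 4 recovers its own moments.
grid = np.linspace(-15.0, 17.0, 32001)
mean, var = collapse_1d(lambda x: -0.5 * (x - 1.0) ** 2 / 4.0, grid)
```

For a genuinely non-Gaussian tilted distribution (e.g. a Gaussian times a box indicator), the same routine returns the moment-matched Gaussian that EP propagates.
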
By using this notation we can rewrite (6) as

[λ^c_{t_k}]^{new} = λ^c_{t_k} + (1/∆t_k) [ Collapse(q^c(x_{t_k}); f) − Collapse(q_0(x_{t_k}); f) ]   (7)

with

q^c(x_{t_k}) ∝ exp( −∆t_k [ V(t_k, x_{t_k}) + λ^c_{t_k} · f(x_{t_k}) ] ) q_0(x_{t_k}).   (8)

The approximating density can then be written as

q_0(x) ∝ p_0(x) × exp( ∑_i λ^d_{t_i} · f(x_{t_i}) + ∑_k ∆t_k λ^c_{t_k} · f(x_{t_k}) ).   (9)

By direct Taylor expansion of Collapse(q^c(x_{t_k}); f) one can show that the update equation (7) remains finite when we take the limit ∆t_k → 0. A slightly more general perspective however affords greater insight into the algorithm, as shown below.

2.2.3 Continuous time limit of the update equations

Let µ_{t_k} = Collapse(q_0(x_{t_k}); f) and denote by Z(∆t_k, µ_{t_k}) and Z(µ_{t_k}) the normalisation constants of q^c(x_{t_k}) and q_0(x_{t_k}) respectively. The notation emphasises that q^c(x_{t_k}) differs from q_0(x_{t_k}) by a term dependent on the granularity of the discretisation ∆t_k. We exploit the well known fact that the derivatives with respect to the canonical parameters of the log normalisation constant of a distribution within the exponential family give the moment parameters of the distribution. From the definition of q^c(x_{t_k}) in equation (8) we then have that its first two moments can be computed as ∂_{µ_{t_k}} log Z(∆t_k, µ_{t_k}). The Collapse operation in (7) can then be rewritten as

Collapse(q^c(x_{t_k}); f) = Ψ( ∂_{µ_{t_k}} log Z(∆t_k, µ_{t_k}) ),   (10)

where Ψ is the function transforming the moment parameters of a Gaussian into its (canonical) parameters. We now assume ∆t_k to be small and expand Z(∆t_k, µ_{t_k}) to first order in ∆t_k. 
By using the property that lim_{α→0+} ⟨g(z)^α⟩_p^{1/α} = exp(⟨log g⟩_p) for any distribution p(z) and g(z) > 0, one can write

lim_{∆t_k→0} (1/∆t_k) [ log Z(∆t_k, µ_{t_k}) − log Z(µ_{t_k}) ]
  = log lim_{∆t_k→0} ⟨ exp{ −∆t_k [ V(t_k, x_{t_k}) + λ^c_{t_k} · f(x_{t_k}) ] } ⟩_{q_0(x_{t_k})}^{1/∆t_k}
  = −⟨ V(t_k, x_{t_k}) + λ^c_{t_k} · f(x_{t_k}) ⟩_{q_0(x_{t_k})}
  = −⟨ V(t_k, x_{t_k}) ⟩_{q_0(x_{t_k})} − Ψ^{-1}(µ_{t_k}) · λ^c_{t_k},   (11)

where we exploited the fact that ⟨f(x_{t_k})⟩_{q_0(x_{t_k})} = Ψ^{-1}(µ_{t_k}) are the moments of the q_0(x_{t_k}) distribution. We can now exploit the fact that ∆t_k is small and linearise the nonlinear map Ψ about the moments of q_0(x_{t_k}) to obtain a first order approximation to equation (10) as

Collapse(q^c(x_{t_k}); f) ≈ µ_{t_k} − ∆t_k λ^c_{t_k} − ∆t_k J_Ψ(µ_{t_k}) ∂_{µ_{t_k}} ⟨ V(t_k, x_{t_k}) ⟩_{q_0(x_{t_k})},   (12)

where J_Ψ(µ_{t_k}) denotes the Jacobian matrix of the map Ψ evaluated at the moments of q_0(x_{t_k}). The second term on the r.h.s. of equation (12) follows from the obvious identity ∂_{µ_{t_k}} Ψ(Ψ^{-1}(µ_{t_k})) = I. By substituting (12) into (7), we take the limit ∆t_k → 0 and obtain the update equations

[λ^c_t]^{new} = −J_Ψ(µ_t) ∂_{µ_t} ⟨ V(t, x_t) ⟩_{q_0(x_t)}   for all t in [0, 1].   (13)

Notice that the updating of λ^c_t is somewhat hidden in equation (13); the "old" parameters are in fact contained in the parameters µ_t. Since λ^c_t corresponds to the canonical parameters of a multivariate Gaussian, we can use the representation λ^c_t = (h^c_t, Q^c_t) and, after some algebra on the moment-canonical transformation of Gaussians, we write the fixed point iteration as

[h^c_t]^{new} = −∂_{m_t} ⟨ V(t, x_t) ⟩_{q_0(x_t)} + 2 ∂_{V_t} ⟨ V(t, x_t) ⟩_{q_0(x_t)} m_t   and   [Q^c_t]^{new} = 2 ∂_{V_t} ⟨ V(t, x_t) ⟩_{q_0(x_t)},   (14)

where m_t and V_t are the marginal means and covariances of q_0 in the ∆t_k → 0 limit. Algorithmically, computing the marginal moments and covariances of the discretised Gaussian q_0(x) in (9) can be done by solving a sparse linear system and doing partial matrix inversion using the Cholesky factorisation and the Takahashi equations as in Cseke and Heskes [2011b]. This corresponds to a junction tree algorithm on a (block) chain graph [Davis, 2006] which, in the continuous time limit, can be reduced to a set of differential equations due to the chain structure of the graph. Alternatively, one can notice that, in the continuous time limit, the structure of q_0(x) in equation (9) defines a posterior process for an OU process p_0({x_t}) observed at discrete times with Gaussian noise (corresponding to the terms φ̃^d_i(x_{t_i}) with canonical parameters λ^d_{t_i}) and with a quadratic continuous time loss, which is computed using equation (14). The moments can therefore be computed using the Kalman-Bucy algorithm; details of the algorithm are given in Section B.1 of the Supplementary Material. The derivation above illustrates another interesting characteristic of working with continuous-time likelihoods. 
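The variational update for the continuous time term can be made concrete for a simple one-dimensional loss; the following is our own sketch, using the closed form ⟨x^4⟩ = m^4 + 6m^2 v + 3v^2 under a Gaussian with mean m and variance v, and the identity [Q]^{new} = 2 ∂_v ⟨V⟩ = ⟨V''⟩ (Price's theorem):

```python
def variational_update_quartic(m, v):
    """One step of the continuous-time variational update for the 1-D
    loss V(x) = x^4 under a Gaussian marginal with mean m, variance v.
    Closed form: <x^4> = m^4 + 6 m^2 v + 3 v^2."""
    dV_dm = 4 * m ** 3 + 12 * m * v   # d<V>/dm
    dV_dv = 6 * m ** 2 + 6 * v        # d<V>/dv
    h_new = -dV_dm + 2 * dV_dv * m    # natural parameter update for h^c
    Q_new = 2 * dV_dv                 # precision update, equals <V''>
    return h_new, Q_new

# At m = 0, v = 1 the update gives h = 0 and Q = <12 x^2> = 12.
h_new, Q_new = variational_update_quartic(0.0, 1.0)
```

A polynomial loss keeps the expectations analytic; for a general loss the two Gaussian expectations would be evaluated by quadrature or Monte Carlo.
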
Readers familiar with fractional free energies and the power EP algorithm may notice that the time lag ∆t_k plays a similar role to the fractional or power parameter α. It is a well known property that in the α → 0 limit the algorithm and the free energy collapse to their variational counterparts [e.g. Wiegerinck and Heskes, 2003, Cseke and Heskes, 2011a]; intuitively, the collapse to variational updates and the existence of the limit are related to this property.

Overall, we arrive at a hybrid algorithm in which: (i) the canonical parameters (h^d_{t_i}, Q^d_{t_i}) corresponding to the discrete time terms are updated by the usual EP updates in (5); (ii) the canonical parameters (h^c_t, Q^c_t) corresponding to the continuous loss function V(t, x_t) are updated by the variational updates in (14); (iii) the marginal moment parameters of q_0(x_t) are computed by the forward-backward differential equations referred to in Section 2.1. We can use either parallel or forward-backward type scheduling. A more detailed description of the inference algorithm is given in Section C of the Supplementary Material. The algorithm performs well in the comfort zone of EP, that is, with log-concave discrete likelihood terms and convex loss. Non-convergence can occur in the case of multimodal likelihoods and loss functions, and alternative options to optimise the free energy then have to be explored [e.g. Heskes et al., 2005, Archambeau et al., 2007].

2.2.4 Parameters of the approximating OU process

The fixed point iteration scheme computes only the marginal means and covariances of q_0({x_t}) and does not provide a parametric OU process as an approximation. 
However, such a process can be computed by finding the parameters of an OU process that matches q_0 in the moment matching Kullback-Leibler divergence. That is, if q*({x_t}) minimises D[q_0({x_t}) || q*({x_t})], then the parameters of q* are given by

A*_t = A_t − B_t [V^{bw}_t]^{-1},   c*_t = c_t + B_t [V^{bw}_t]^{-1} m^{bw}_t,   and   B*_t = B_t,   (15)

where m^{bw}_t and V^{bw}_t are computed by the backward Kalman-Bucy filtering equations. The computations are somewhat lengthy; a full derivation can be found in Section B.3 of the Supplementary Material.

2.2.5 Corrections to the marginals

In this section we extend the factorised correction method for multivariate latent Gaussian models introduced in Cseke and Heskes [2011b] to continuous time observations. Other correction schemes [e.g. Opper et al., 2009] can in principle also be applied. We start again from the discretised representation and then take the ∆t_k → 0 limit. To begin with, we focus on the corrections from the continuous time observation process. By removing the Gaussian terms (with canonical parameters λ^c_{t_k}) from the approximate posterior and replacing them with the exact likelihood, we can rewrite the exact discretised posterior as

p(x) ∝ q_0(x) × exp( −∑_k ∆t_k [ V(t_k, x_{t_k}) + λ^c_{t_k} · f(x_{t_k}) ] ).

The exact posterior marginal at time t_j is thus given by

p(x_{t_j}) ∝ q_0(x_{t_j}) × exp( −∆t_j [ V(t_j, x_{t_j}) + λ^c_{t_j} · f(x_{t_j}) ] ) × c_T(x_{t_j})

with

c_T(x_{t_j}) = ∫ dx_{\t_j} q_0(x_{\t_j} | x_{t_j}) × exp( −∑_{k≠j} ∆t_k [ V(t_k, x_{t_k}) + λ^c_{t_k} · f(x_{t_k}) ] ),

where the subscript \t_j indicates the whole vector with the j-th entry removed. 
By approximating the joint conditional q_0(x_{\t_j} | x_{t_j}) with a product of its marginals and taking the ∆t_k → 0 limit, we obtain

c(x_t) ≈ exp( −∫_0^1 ds ⟨ V(s, x_s) + λ^c_s · f(x_s) ⟩_{q_0(x_s | x_t)} ).

When combining the continuous part and the factorised discrete time corrections, by adding the discrete time terms to the formalism above, we arrive at the corrected approximate marginal

p̃(x_t) ∝ q_0(x_t) exp( −∫_0^1 ds ⟨ V(s, x_s) + λ^c_s · f(x_s) ⟩_{q_0(x_s | x_t)} ) × ∏_i ⟨ p(y^d_{t_i} | x_{t_i}) / exp{λ^d_{t_i} · f(x_{t_i})} ⟩_{q_0(x_{t_i} | x_t)}.

For any fixed t one can compute the correlations in linear time by using the parametric form of the approximation in (15). The evaluations for a fixed x_t are also linear in time.

Figure 1: Inference results for the toy model in Section 3.1. [Left panel: prior and posterior marginal means and standard deviations. Right panel: marginal distributions at t = 0.3351, comparing sampling at ∆t = 10^-3, the variational Gaussian approximation, and its correction.] The continuous time potential is defined as V(t, x_t) = (2x_t)^8 I_{[1/2,2/3]}(t) and we assume two hard box discrete likelihood terms I_{[-0.25,0.25]}(x_{t_1}) and I_{[-0.25,0.25]}(x_{t_2}) placed at t_1 = 1/3 and t_2 = 2/3. The prior is defined by the parameters a_t = -1, c_t = 4π cos(4πt) and b_t = 4. The left panel shows the prior's and the posterior approximation's marginal means and standard deviations. The right panel shows the marginal approximations at t = 0.3351, a region where we expect the corrections to be strongly influenced by both types of likelihoods. 
Samples were generated using the lag ∆t = 10^-3; the approximate inference was run using RK4 at ∆t = 10^-4.

3 Experiments

3.1 Inference in a (soft) box

The first example we consider is a mixed discrete-continuous time inference problem under box and soft box likelihood observations respectively. We consider a diffusing particle on the line under an OU prior process of the form

dx_t = (−a x_t + c_t) dt + √b dW_t

with a = −1, c_t = 4π cos(4πt) and b = 4. The likelihood model is given by the loss function V(t, x_t) = (2x_t)^8 for all t in [1/2, 2/3] and 0 otherwise, effectively confining the process to a narrow strip near zero (a soft box). This likelihood is therefore an approximation to physically realistic situations where particles diffuse in a confined environment. The box has hard gates: two discrete time likelihoods given by the indicator functions I_{[-0.25,0.25]}(x_{t_1}) and I_{[-0.25,0.25]}(x_{t_2}) placed at t_1 = 1/3 and t_2 = 2/3, that is, T_d = {1/3, 2/3}. The left panel in Figure 1 shows the prior and approximate posterior processes (mean ± one standard deviation) in pink and cyan respectively: the confinement of the process to the box is in clear evidence, as well as the narrowing of the confidence intervals corresponding to the two discrete time observations. The right panel in Figure 1 shows the marginal approximations at a time point shortly after the "gate" to the box; these are: (i) sampling (grey), (ii) the Gaussian EP approximation (blue line), and (iii) its corrected version (red line). The time point was chosen because we expect the strongest non-Gaussian effects to be felt near the discrete likelihoods; the corrected distribution does indeed show strong skewness. To benchmark the method, we compare it to MCMC sampling obtained by using slice sampling [Murray et al., 2010] on the discretised model with ∆t = 10^-3. We emphasise that this is an approximation to the model, hence the benchmark is not a true gold standard; however, we are not aware of sampling schemes that would be able to perform inference under the exact continuous time likelihood. The histogram in Figure 1 was generated from a sample of size 10^5 following a burn in of 10^4. The Gaussian EP approach gives a very good reconstruction of the first two moments of the distribution. The corrected EP approximation is very close to the MCMC results.

3.2 Log Gaussian Cox processes

Another family of models where one encounters continuous time likelihoods is point processes; these processes find wide application in a number of disciplines, from neuroscience [Smith and Brown, 2003] to conflict modelling [Zammit-Mangion et al., 2012]. We assume a multivariate log Gaussian Cox process model [Kingman, 1992]: this is defined by a d-variate Ornstein-Uhlenbeck process {x_t}_t on the [0, 1] interval. Conditioned on {x_t}_t we have d Poisson point processes with intensities given by λ^i_t = e^{µ_i + x^i_t} for all i = 1, ..., d and t in [0, 1]. The likelihood of this point process model is formed by both discrete time (point probability) and continuous time (void probability) terms and can be written as

log ∏_i p(Y_i | {x^i_t}_t) = ∑_i ( −e^{µ_i} ∫_0^1 dt e^{x^i_t} + |Y_i| µ_i + ∑_{t_k in Y_i} x^i_{t_k} ),

where Y_i denotes the set of observed event times corresponding to {x^i_t}_t. Clearly, the discrete time observations in this model are (degenerate) Gaussians; therefore, one may opt to start with an OU process with a translated drift. However, for consistency reasons, we treat them as discrete time observations.

In this example we chose d = 4 and A = [−2, 1, 0, 1; 1, −2, 1, 0; 0, 1, −2, 1; 1, 0, 1, −2], thus coupling the various processes. We chose c^i_t = 4iπ cos(2iπt), B = 4I and µ_i = 0. We generate a sample path {x̃_t}_t, draw observations Y_i based on {x̃^i_t}_t and perform inference.

Figure 2: A toy example for the point process model in Section 3.2. The prior is defined by A = [−2, 1, 0, 1; 1, −2, 1, 0; 0, 1, −2, 1; 1, 0, 1, −2], c^i_t = 4iπ cos(2iπt), B = 4I. We use µ_i = 0. The prior means and standard deviations, the sampled process path, and the sampled events are shown in the left panel, while the posterior approximations are shown in the right panel.

The results are shown in Figure 2, with four colours distinguishing the four processes. The left panel shows the prior processes (mean ± standard deviation), sample paths and (bottom row) the sampled points (i.e. the data). The right panel shows the corresponding approximate posterior processes. The results reflect the general pattern characteristic of fitting point process data: in regions with a substantial number of events the sampled path can be inferred with great accuracy (accurate mean, low standard deviation), whereas in regions with no or only a few events the fit reverts to a skewed/shifted prior path, as the void probability dominates.

3.3 Point process modelling of neural spike trains

In a third example we consider continuous time point process inference for spike time recordings from a population of neurons. This type of data is frequently modelled using (discrete time) state-space models with point process observations (SSPP) [Smith and Brown, 2003, Zammit Mangion et al., 2011, Macke et al., 2011]; parameter estimation in such models can reveal biologically relevant facts about the neurons' electrophysiology which are not apparent from the spike trains themselves. 
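The log Gaussian Cox likelihood of Section 3.2, with its discrete log-intensity terms plus the continuous void-probability integral, can be evaluated on a grid; a univariate sketch of our own, using a left Riemann sum for the integral:

```python
import numpy as np

def lgcp_loglik(x, times, events, mu=0.0):
    """Log-likelihood of a univariate log Gaussian Cox process with
    intensity lambda_t = exp(mu + x_t): the sum of log-intensities at
    the event times minus the integrated intensity (void probability
    term). `x` is the latent path on the regular grid `times`."""
    dt = times[1] - times[0]
    log_lam = mu + x
    void = -np.sum(np.exp(log_lam[:-1])) * dt   # -int_0^1 e^{mu + x_t} dt
    idx = np.searchsorted(times, events)        # grid index of each event
    return void + np.sum(log_lam[idx])

# Constant path x = 0 gives lambda = 1: the void term is -1 and each of
# the three events contributes log(1) = 0.
times = np.linspace(0.0, 1.0, 1001)
ll = lgcp_loglik(np.zeros_like(times), times, np.array([0.25, 0.5, 0.75]))
```
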
We consider a dataset from Di Lorenzo and Victor [2003], available at www.neurodatabase.org, consisting of recordings of the spiking patterns of taste response cells in Sprague-Dawley rats during presentation of different taste stimuli. The recordings are 10 s each at a resolution of 10^-3 s, and four different taste stimuli, (i) NaCl, (ii) Quinine HCl, (iii) HCl, and (iv) Sucrose, are presented to the subjects for the duration of the first 5 s of the 10 s recording window. We modelled the spike train recordings by univariate log Gaussian Cox process models (see Section 3.2) with homogeneous OU priors, that is, A_t, c_t and B_t were considered constant. We use the variational EM algorithm (the discrete time likelihoods are Gaussian) to learn the prior parameters A, c and µ and the initial conditions for each individual recording. We scaled the 10 s window into the unit interval [0, 1] and used a 10^-4 resolution.

Figure 3: Inference results on data from cell 9 from the dataset in Section 3.3. The top-left, bottom-left and centre panels show the intensity fit, event count and the Q-Q plot corresponding to one of the recordings, whereas the right panel ("The fitted (c, µ) parameters") shows the learned c and µ parameters for all spike trains in cell 9, with the stimuli NaCl, Quinine HCl, HCl and Sucrose distinguished in the legend.

Figure 3 shows example results of this procedure. The right panel shows an emergent pattern of stimulus based clustering of µ and c, as in Zammit Mangion et al. [2011]. We observe that discrete-time approaches such as [Smith and Brown, 2003, Zammit Mangion et al., 2011] are usually forced to adopt a very fine time discretisation by the requirement that at most one spike happens during one time step. This leads to significant computational resources being invested in regions with few spikes. 
Our continuous time approach, on the other hand, handles uneven observations naturally.

4 Conclusion

Inference methodologies for continuous time stochastic processes are a subject of intense research, in both fundamental and applied settings. This paper contributes a novel approach which allows inference from both discrete time and continuous time observations. Our results show that the method is effective in accurately reconstructing marginal posterior distributions, and can be deployed effectively on real world problems. Furthermore, it has recently been shown [Kappen et al., 2012] that optimal control problems can be recast in inference terms: in many cases, the relevant inference problem is of the same type as the one considered here, hence this methodology could in principle also be used in control problems. The method is based on the parallel EP formulation of Cseke and Heskes [2011b]; interestingly, we show that the EP updates from continuous time observations collapse to variational updates [Archambeau et al., 2007]. Algorithmically, our approach results in efficient forward-backward updates, compared to the gradient ascent algorithm of Archambeau et al. [2007]. Furthermore, the EP perspective allows us to compute corrections to the Gaussian marginals; in our experiments, these turned out to be highly accurate.

Our modelling framework assumes a latent linear diffusion process; however, as mentioned before, some non-linear diffusion processes are equivalent to posterior processes for OU processes observed in continuous time [Øksendal, 2010]. Our approach can hence also be viewed as a method for accurate marginal computations in (a class of) nonlinear diffusion processes observed with noise. In general, all non-linear diffusion processes can be recast in a form similar to the one considered here; the important difference, though, is that the continuous time likelihood is in general an Ito integral, not a regular integral. 
In the future, it would be interesting to explore the extension of this approach to general non-linear diffusion processes, as well as to discrete and hybrid stochastic processes [Rao and Teh, 2012, Ocone et al., 2013].

Acknowledgements

B.Cs. is funded by BBSRC under grant BB/I004777/1. M.O. gratefully acknowledges support from EU grant FP7-ICT-270327 (Complacs). G.S. acknowledges support from the ERC under grant MLCS-306999.

References

C. Archambeau, D. Cornford, M. Opper, and J. Shawe-Taylor. Gaussian process approximations of stochastic differential equations. Journal of Machine Learning Research - Proceedings Track, 1:1-16, 2007.

B. Cseke and T. Heskes. Properties of Bethe free energies and message passing in Gaussian models. Journal of Artificial Intelligence Research, 41:1-24, 2011a.

B. Cseke and T. Heskes. Approximate marginals in latent Gaussian models. Journal of Machine Learning Research, 12:417-457, 2011b.

T. A. Davis. Direct Methods for Sparse Linear Systems (Fundamentals of Algorithms 2). Society for Industrial and Applied Mathematics, Philadelphia, 2006.

P. M. Di Lorenzo and J. D. Victor. Taste response variability and temporal coding in the nucleus of the solitary tract of the rat. Journal of Neurophysiology, 90:1418-1431, 2003.

C. W. Gardiner. Handbook of Stochastic Methods: for Physics, Chemistry and the Natural Sciences. Springer Series in Synergetics, 13. Springer, 2002.

T. Heskes, M. Opper, W. Wiegerinck, O. Winther, and O. Zoeter. Approximate inference techniques with expectation constraints. Journal of Statistical Mechanics: Theory and Experiment, 2005.

H. J. Kappen, V. Gómez, and M. Opper. Optimal control as a graphical model inference problem. Machine Learning, 87(2):159-182, 2012.

J. F. C. Kingman. Poisson Processes. Oxford Statistical Science Series. Oxford University Press, New York, 1992.

S. L. Lauritzen. Graphical Models.
Oxford Statistical Science Series. Oxford University Press, New York, 1996.

J. H. Macke, L. Buesing, J. P. Cunningham, B. M. Yu, K. V. Shenoy, and M. Sahani. Empirical models of spiking in neural populations. In Advances in Neural Information Processing Systems 24, pages 1350-1358, 2011.

T. P. Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, MIT, 2001.

I. Murray, R. P. Adams, and D. J. C. MacKay. Elliptical slice sampling. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, pages 541-548, 2010.

A. Ocone, A. J. Millar, and G. Sanguinetti. Hybrid regulatory models: a statistically tractable approach to model regulatory network dynamics. Bioinformatics, 29(7):910-916, 2013.

B. Øksendal. Stochastic Differential Equations. Universitext. Springer, 2010.

M. Opper and G. Sanguinetti. Variational inference for Markov jump processes. In Advances in Neural Information Processing Systems 20, 2008.

M. Opper and O. Winther. Gaussian processes for classification: Mean-field algorithms. Neural Computation, 12(11):2655-2684, 2000.

M. Opper and O. Winther. Expectation consistent approximate inference. Journal of Machine Learning Research, 6:2177-2204, 2005.

M. Opper, U. Paquet, and O. Winther. Improving on Expectation Propagation. In Advances in Neural Information Processing Systems 21, pages 1241-1248. MIT Press, Cambridge, MA, 2009.

M. Opper, A. Ruttor, and G. Sanguinetti. Approximate inference in continuous time Gaussian-Jump processes. In Advances in Neural Information Processing Systems 23, pages 1831-1839, 2010.

V. Rao and Y. W. Teh. MCMC for continuous-time discrete-state systems. In Advances in Neural Information Processing Systems 25, pages 710-718, 2012.

S. Särkkä. Recursive Bayesian Inference on Stochastic Differential Equations.
PhD thesis, Helsinki University of Technology, 2006.

A. C. Smith and E. N. Brown. Estimating a state-space model from point process observations. Neural Computation, 15(5):965-991, 2003.

W. Wiegerinck and T. Heskes. Fractional Belief Propagation. In Advances in Neural Information Processing Systems 15, pages 438-445, Cambridge, MA, 2003. The MIT Press.

J. S. Yedidia, W. T. Freeman, and Y. Weiss. Generalized belief propagation. In Advances in Neural Information Processing Systems 12, pages 689-695, Cambridge, MA, 2000. The MIT Press.

A. Zammit Mangion, K. Yuan, V. Kadirkamanathan, M. Niranjan, and G. Sanguinetti. Online variational inference for state-space models with point-process observations. Neural Computation, 23(8):1967-1999, 2011.

A. Zammit-Mangion, M. Dewar, V. Kadirkamanathan, and G. Sanguinetti. Point process modelling of the Afghan war diary. Proceedings of the National Academy of Sciences, 2012. doi: 10.1073/pnas.1203177109.