{"title": "Expectation Propagation in Gaussian Process Dynamical Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 2609, "page_last": 2617, "abstract": "Rich and complex time-series data, such as those generated from engineering systems, financial markets, videos or neural recordings are now a common feature of modern data analysis. Explaining the phenomena underlying these diverse data sets requires flexible and accurate models. In this paper, we promote Gaussian process dynamical systems as a rich model class appropriate for such analysis. In particular, we present a message passing algorithm for approximate inference in GPDSs based on expectation propagation. By phrasing inference as a general message passing problem, we iterate forward-backward smoothing. We obtain more accurate posterior distributions over latent structures, resulting in improved predictive performance compared to state-of-the-art GPDS smoothers, which are special cases of our general iterative message passing algorithm. Hence, we provide a unifying approach within which to contextualize message passing in GPDSs.", "full_text": "Expectation Propagation in\n\nGaussian Process Dynamical Systems\n\nMarc Peter Deisenroth\u2217\n\nDepartment of Computer Science\n\nTechnische Universität Darmstadt, Germany\n\nShakir Mohamed\u2217\n\nDepartment of Computer Science\n\nUniversity of British Columbia, Canada\n\nAbstract\n\nRich and complex time-series data, such as those generated from engineering systems, financial markets, videos, or neural recordings are now a common feature of modern data analysis. Explaining the phenomena underlying these diverse data sets requires flexible and accurate models. In this paper, we promote Gaussian process dynamical systems as a rich model class that is appropriate for such an analysis. 
We present a new approximate message-passing algorithm for Bayesian\nstate estimation and inference in Gaussian process dynamical systems, a non-\nparametric probabilistic generalization of commonly used state-space models. We\nderive our message-passing algorithm using Expectation Propagation and provide\na unifying perspective on message passing in general state-space models. We\nshow that existing Gaussian \ufb01lters and smoothers appear as special cases within\nour inference framework, and that these existing approaches can be improved upon\nusing iterated message passing. Using both synthetic and real-world data, we\ndemonstrate that iterated message passing can improve inference in a wide range\nof tasks in Bayesian state estimation, thus leading to improved predictions and\nmore effective decision making.\n\n1\n\nIntroduction\n\nThe Kalman \ufb01lter and its extensions [1], such as the extended and unscented Kalman \ufb01lters [7],\nare principled statistical models that have been widely used for some of the most challenging and\nmission-critical applications in automatic control, robotics, machine learning, and economics. In-\ndeed, wherever complex time-series are found, Kalman \ufb01lters have been successfully applied for\nBayesian state estimation. However, in practice, time series often have an unknown dynamical\nstructure, and they are high dimensional and noisy, violating many of the assumptions made in es-\ntablished approaches for state estimation. In this paper, we look beyond traditional linear dynamical\nsystems and advance the state-of the-art in state estimation by developing novel inference algorithms\nfor the class of nonlinear Gaussian process dynamical systems (GPDS).\nGPDSs are non-parametric generalizations of state-space models that allow for inference in time\nseries, using Gaussian process (GP) probability distributions over nonlinear transition and measure-\nment dynamics. 
GPDSs are thus able to capture complex dynamical structure with few assumptions,\nmaking them of broad interest. This interest has sparked the development of general approaches for\n\ufb01ltering and smoothing in GPDSs, such as [8, 3, 5]. In this paper, we further develop inference\nalgorithms for GPDSs and make the following contributions: (1) We develop an iterative local mes-\nsage passing framework for GPDSs based on Expectation Propagation (EP) [11, 10], which allows\nfor re\ufb01nement of the posterior distribution and, hence, improved inference. (2) We show that the\ngeneral message-passing framework recovers the EP updates for existing dynamical systems as a\nspecial case and expose the implicit modeling assumptions made in these models. We show that EP\nin GPDSs encapsulates all GPDS forward-backward smoothers [5] as a special case and transforms\nthem into iterative algorithms yielding more accurate inference.\n\n* Authors contributed equally.\n\n1\n\n\f2 Gaussian Process Dynamical Systems\n\nGaussian process dynamical systems are a general class of discrete-time state-space models with\n\nvt \u223c N (0, R) ,\n\ng \u223c GP g ,\n\nh \u223c GP h ,\n\nf (x\u2217) = k\u2217\u2217 \u2212 k\n\nxt = h(xt\u22121) + wt , wt \u223c N (0, Q) ,\nzt = g(xt) + vt ,\n\n(1)\n(2)\nwhere t = 1, . . . , T . Here, x \u2208 RD is a latent state that evolves over time, and z \u2208 RE, E \u2265 D,\nare measurements. We assume i.i.d. additive Gaussian system noise w and measurement noise v.\nThe central feature of this model class is that both the measurement function g and the transition\nfunction h are not explicitly known or parametrically speci\ufb01ed, but instead described by probability\ndistributions over these functions. 
The function distributions are non-parametric Gaussian processes\n(GPs), and we write h \u223c GP h and g \u223c GP g, respectively.\nA GP is a probability distribution p(f ) over functions f that is speci\ufb01ed by a mean function \u00b5f\nand a covariance function kf [15]. Consider a set of training inputs X = [x1, . . . , xn](cid:62) and\ncorresponding training targets y = [y1, . . . yn](cid:62), yi = f (xi) + w, w \u223c N (0, \u03c32\nw). The poste-\nrior predictive distribution at a test input x\u2217 is Gaussian distributed N (y\u2217 | \u00b5f (x\u2217), \u03c32\nf (x\u2217)) with\n(cid:62)\n(cid:62)\n\u2217 K\u22121k\u2217, where k\u2217 = kf (X, x\u2217),\n\u2217 K\u22121y and variance \u03c32\nmean \u00b5f (x\u2217) = k\nk\u2217\u2217 = kf (x\u2217, x\u2217), and K is the kernel matrix.\nSince the GP is a non-parametric model, its use in GPDSs is desirable as it results in fewer restrictive\nmodel assumptions, compared to dynamical systems based on parametric function approximators for\nthe transition and measurement functions (1)\u2013(2). In this paper, we assume that the GP models are\ntrained, i.e., the training inputs and corresponding targets as well as the GP hyperparameters are\nknown. For both GP h and GP g in the GPDS, we used zero prior mean functions. As covariance\nfunctions kh and kg we use squared- exponential covariance functions with automatic relevance\ndetermination plus a noise covariance function to account for the noise in (1)\u2013(2).\nExisting work for learning GPDSs includes the Gaussian process dynamical model (GPDM) [20],\nwhich tackles the challenging task of analyzing human motion in (high-dimensional) video se-\nquences. More recently, variational [2] and EM-based [19] approaches for learning GPDS were pro-\nposed. 
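The posterior predictive equations above can be made concrete with a minimal NumPy sketch (our own illustration, not code from the paper): it computes the predictive mean k_*^T K^{-1} y and variance k_** - k_*^T K^{-1} k_* for a squared-exponential kernel. The function names and the fixed hyperparameters `ell`, `sf2`, and `sn2` are illustrative assumptions.

```python
import numpy as np

def se_kernel(A, B, ell=1.0, sf2=1.0):
    # Squared-exponential covariance k(x, x') = sf2 * exp(-||x - x'||^2 / (2 ell^2)).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sf2 * np.exp(-0.5 * d2 / ell ** 2)

def gp_predict(X, y, x_star, sn2=0.01):
    # Posterior predictive N(mu_f(x*), s2_f(x*)) at a single test input:
    #   mu_f = k_*^T K^{-1} y,   s2_f = k_** - k_*^T K^{-1} k_*,
    # with the noise variance sn2 added to the diagonal of the kernel matrix K.
    K = se_kernel(X, X) + sn2 * np.eye(len(X))
    k_star = se_kernel(X, x_star[None, :])[:, 0]
    mu = k_star @ np.linalg.solve(K, y)
    s2 = se_kernel(x_star[None, :], x_star[None, :])[0, 0] - k_star @ np.linalg.solve(K, k_star)
    return mu, s2
```

Far from the training inputs, `k_star` vanishes and the predictive variance reverts to the prior variance, which is the posterior model uncertainty that the inference schemes below must propagate.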
Exact Bayesian inference, i.e., \ufb01ltering and smoothing, in GPDSs is analytically intractable\nbecause of the dependency of the states and measurements on previous states through the nonlinear-\nity of the GP. We thus make use of approximations to infer the posterior distributions p(xt|Z) over\nlatent states xt, t = 1, . . . , T , given a set of observations Z = z1:T . Existing approximate inference\napproaches for \ufb01ltering and forward-backward smoothing are based on either linearization, particle\nrepresentations, or moment matching as approximation strategies [8, 3, 5].\nA principled incorporation of the posterior GP model uncertainty into inference in GPDSs is neces-\nsary, but introduces additional uncertainty. In tracking problems where the location of an object is\nnot directly observed, this additional source of uncertainty can eventually lead to losing track of the\nlatent state. In this paper, we address this problem and propose approximate message passing based\non EP for more accurate inference. We will show that forward-backward smoothing in GPDSs [5]\nbene\ufb01ts from the iterative re\ufb01nement scheme of EP, leading to more accurate posterior distributions\nover the latent state and, hence, to more informative predictions and improved decision making.\n\n3 Bayesian State Estimation using Expectation Propagation\n\nproduct of factors fi(xt), i.e., p(xt|Z) =(cid:81)\nalgorithm in which p(xt|Z) is approximated by a distribution q(xt) = (cid:81)\n\nExpectation Propagation [10, 11] is a widely-used deterministic algorithm for approximate Bayesian\ninference that has been shown to be highly accurate in many problems, including sparse regression\nmodels [17], GP classi\ufb01cation [9], and inference in dynamical systems [13, 6, 18]. EP is derived\nusing a factor-graph, in which the distribution over the latent state p(xt|Z) is represented as the\ni fi(xt). 
EP then speci\ufb01es an iterative message passing\ni qi(xt), using approx-\nimate messages qi(xt). In EP, q and the messages qi are members of the exponential family, and\nq is determined such that the the KL-divergence KL(p||q) is minimized. EP is provably robust for\nlog-concave messages [17] and invariant under invertible variable transformations [16]. In practice,\nEP has been shown to be more accurate than competing approximate inference methods [9, 17].\nIn the context of the dynamical system (1)\u2013(2), we consider factor graphs of the form of Fig. 1 with\nthree types of messages: forward, backward, and measurement messages, denoted by the symbols\n\n2\n\n\fFigure 1: Factor graph (left) and fully factored graph (right) of a general dynamical system.\n\nAlgorithm 1 Gaussian EP for Dynamical Systems\n1: Init: Set all factors qi to N (0,\u221eI); Set q(x1) = p(x1) and marginals q(xt(cid:54)=1) = N (0, 1010I)\n2: repeat\n3:\n4:\n5:\n\nCompute cavity distribution q\\i(xt) = q(xt)/qi(xt) = N (xt | \u00b5\\i, \u03a3\\i) with\n\nfor all factors qi(xt), where i = (cid:66), (cid:77), (cid:67) do\n\nfor t = 1 to T do\n\n\u03a3\\i = (\u03a3\u22121\n\nt \u00b5t \u2212 \u03a3\u22121\nDetermine moments of fi(xt)q\\i(xt), e.g., via the derivatives of\n\n\u00b5\\i = \u03a3\\i(\u03a3\u22121\n\nt \u2212 \u03a3\u22121\n\n)\u22121 ,\n\ni\n\ni \u00b5i)\n\n(3)\n\n(4)\n\n(5)\n(6)\n(7)\n\n6:\n\n7:\n\nlog Zi(\u00b5\\i, \u03a3\\i) = log \u222b fi(xt)q\\i(xt)dxt\n\nUpdate the posterior q(xt) \u221d N (xt | \u00b5t, \u03a3t) and the approximate factor qi(xt):\n\n\u00b5t = \u00b5\\i + \u03a3\\i\u2207(cid:62)\nm ,\n\u2207m := d log Zi/d\u00b5\\i ,\nqi(xt) = q(xt)/q\\i(xt)\n\n\u03a3t = \u03a3\\i \u2212 \u03a3\\i(\u2207(cid:62)\n\u2207s := d log Zi/d\u03a3\\i\n\nm\u2207m \u2212 2\u2207s)\u03a3\\i\n\nend for\n\n8:\n9:\n10: until Convergence or maximum number of iterations exceeded\n\nend for\n\n(cid:66), (cid:67), (cid:77), respectively. 
For EP inference, we assume a fully-factored graph, using which we compute\nthe marginal posterior distributions p(x1|Z), . . . , p(xT|Z), rather than the full joint distribution\np(X|Z) = p(x1, . . . , xT|Z). Both the states xt and measurements zt are continuous variables and\nthe messages qi are unnormalized Gaussians, i.e., qi(xt) = siN (xt | \u00b5i, \u03a3i)\n\n3.1\n\nImplicit Linearizations Require Explicit Consideration\n\nAlg. 1 describes the main steps of Gaussian EP for dynamical systems. For each node xt in the\nfully-factored factor graph in Fig. 1, EP computes three messages: a forward, backward, and mea-\nsurement message, denoted by q(cid:66)(xt), q(cid:67)(xt), and q(cid:77)(xt), respectively. The EP algorithm updates\nthe marginal q(xt) and the messages qi(xt) in three steps. First, the cavity distribution q\\i(xt) is\ncomputed (step 5 in Alg. 1) by removing qi(xt) from the marginal q(xt). Second, in the projec-\ntion step, the moments of fi(xt)q\\i(xt) are computed (step 6), where fi is the true factor. In the\nexponential family, the required moments can be computed using the derivatives of the log-partition\nfunction (normalizing constant) log Zi of fi(xt)q\\i(xt) [10, 11, 12]. Third, the moments of the\nmarginal q(xt) are set to the moments of fi(xt)q\\i(xt), and the message qi(xt) is updated (step 7).\nWe apply this procedure repeatedly to all latent states xt, t = 1, . . . , T , until convergence.\nEP does not directly \ufb01t a Gaussian approximation qi to the non-Gaussian factor fi. Instead, EP\ndetermines the moments of qi in the context of the cavity distribution such that qi = proj[fiq\\i]/q\\i,\nwhere proj[\u00b7] is the projection operator, returning the moments of its argument.\nTo update the posterior q(xt) and the messages qi(xt), EP computes the log-partition function log Zi\nin (4) to complete the projection step. 
However, for nonlinear transition and measurement models\n\n3\n\nqB(xt)xtqM(xt)qC(xt+1)qM(xt+1)p(xt+1|xt)xt+1qB(xt)xtqM(xt)xt+1qC(xt+1)qM(xt+1)qB(xt+1)qC(xt)\fin (1)\u2013(2), computing Zi involves solving integrals of the form\n\n(cid:90)\n\n(cid:90)\n\np(a) =\n\np(a|xt)p(xt)dxt =\n\nN (a| m(xt), S(xt))N (xt | b, B)dxt ,\n\n(8)\n\nwhere a = zt for the measurement message, or a = xt+1 for the forward and backward mes-\nsages. In nonlinear dynamical systems m(xt) is a nonlinear measurement or transition function.\nIn GPDSs, m(xt) and S(xt) are the corresponding predictive GP means and covariances, respec-\ntively, which are nonlinearly related to xt. Because of the nonlinear dependencies between a and xt,\nsolving (8) is analytically intractable. We propose to approximate p(a) by a Gaussian distribution\nN (a| \u02dc\u00b5, \u02dc\u03a3). This Gaussian approximation is only correct for a linear relationship a = J xt, where\nJ is independent of xt. Hence, the Gaussian approximation is an implicit linearization of the func-\ntional relationship between a and xt, effectively linearizing either the transition or the measurement\nmodels.\nWhen computing EP updates using the derivatives \u2207m and \u2207s according to (5), it is crucial to\nexplicitly account for the implicit linearization assumption in the derivatives\u2014otherwise, the EP\nupdates are inconsistent. For example, in the measurement and the backward message, we directly\napproximate the partition functions Zi, i \u2208 {(cid:77), (cid:67)} by Gaussians \u02dcZi(a) = N ( \u02dc\u00b5i, \u02dc\u03a3\n). 
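The implicit linearization described above can be sketched for the integral in (8) under the assumption that m(.) is linearized around the cavity mean b, with a finite-difference Jacobian (function and argument names are our own, not the paper's):

```python
import numpy as np

def gauss_approx_linearized(m, S, b, B, eps=1e-6):
    # Gaussian approximation to p(a) = \int N(a | m(x), S(x)) N(x | b, B) dx
    # under the implicit linear model a = J x:
    #   mu_tilde    = m(b)
    #   Sigma_tilde = J B J^T + S(b),  with J = dm/dx evaluated at x = b
    # (central finite differences stand in for an analytic Jacobian).
    b = np.atleast_1d(b).astype(float)
    mb = np.atleast_1d(m(b))
    J = np.zeros((mb.size, b.size))
    for d in range(b.size):
        e = np.zeros(b.size)
        e[d] = eps
        J[:, d] = (np.atleast_1d(m(b + e)) - np.atleast_1d(m(b - e))) / (2 * eps)
    return mb, J @ B @ J.T + S(b), J
```

For a genuinely linear m the approximation is exact, which is precisely why the derivatives in the EP updates must treat a = J x as the effective model.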
The consis-\ntent derivatives d(log \u02dcZi)/d\u00b5\\i and d(log \u02dcZi)/d\u03a3\\i of \u02dcZi with respect to the mean and covariance\nof the cavity distribution q are obtained by applying the chain rule, such that\n\ni\n\n\u2202( \u02dc\u00b5i)(cid:62)\n\n\u2207m = d log \u02dcZi\n\u2207s = d log \u02dcZi\n\u2202\u00b5\\i = J(cid:62)\n\ni(cid:17) \u2202 \u02dc\u03a3i\n\u2208 R1\u00d7D ,\n\u2202\u03a3\\i \u2208 RD\u00d7D ,\n\n(cid:16) \u2202 log \u02dcZi\n\u2202\u00b5\\i = (a \u2212 \u02dc\u00b5i)(cid:62)( \u02dc\u03a3\n\u2202 \u02dc\u03a3i\n\u2202\u03a3\\i = 1\n\u2202\u03a3\\i = J I4J(cid:62)\n\u2202 \u02dc\u03a3i\n\n)\u22121J(cid:62)\n\u2202 \u02dc\u00b5i \u2212 \u02dc\u03a3\n\u2208 RE\u00d7E\u00d7D\u00d7D ,\n\nd\u00b5\\i = \u2202 log \u02dcZi\nd\u03a3\\i = \u2202 log \u02dcZi\n\u2202 \u02dc\u03a3i\n\u2208 RE\u00d7D ,\n(11)\nwhere I4 \u2208 RD\u00d7D\u00d7D\u00d7D is an identity tensor. Note that with the implicit linear model a = J xt,\nthe derivatives \u2202 \u02dc\u00b5i/\u2202\u03a3\\i and \u2202 \u02dc\u03a3\n/\u2202\u00b5\\i vanish. Although we approximate Zi by a Gaussian \u02dcZi,\ni, which also\nwe are still free to choose a method of computing its mean \u02dc\u00b5i and covariance matrix \u02dc\u03a3\nin\ufb02uences the computation of J = \u2202( \u02dc\u00b5i)/\u2202\u00b5\\i. However, even if \u02dc\u00b5i and \u02dc\u03a3\ni are general functions\n/\u2202\u03a3\\i must equal the corresponding partial\nof \u00b5\\i and \u03a3\\i, the derivatives \u2202 \u02dc\u00b5i/\u2202\u00b5\\i and \u2202 \u02dc\u03a3\nderivatives in (11), and \u2202 \u02dc\u00b5i/\u2202\u03a3\\i and \u2202 \u02dc\u03a3\n/\u2202\u00b5\\i must be set to 0. 
Hence, the implicit linearization\nexpressed by the Gaussian approximation \u02dcZi must be explicitly taken into account in the derivatives\nto guarantee consistent EP updates.\n\n(9)\n\n(10)\n\n\u2202 log \u02dcZi\n\n\u2202 \u02dc\u00b5i\n\n\u2202 \u02dc\u00b5i\n\n\u2202 \u02dc\u00b5i\n\n2\n\ni\n\ni\n\ni\n\ni\n\n3.2 Messages in Gaussian Process Dynamical Systems\n\nWe now describe each of the messages needed for inference in GPDSs, and outline the approxima-\ntions required to compute the partition function in (4). Updating a message requires a projection\nto compute the moments of the new posterior marginal q(xt), followed by a Gaussian division to\nupdate the message itself. For the projection step, we compute approximate partition functions\n\\i\n\\i\n\u02dcZi, where i \u2208 {(cid:77), (cid:66), (cid:67)}. Using the derivatives d log \u02dcZi/d\u00b5\nt and d log \u02dcZi/d\u03a3\nt , we update the\nmarginal q(xt), see (5).\n\nMeasurement Message For the measurement message in a GPDS, the partition function is\n\n\\(cid:77)\nZ(cid:77)(\u00b5\nt\n\n\\(cid:77)\nt ) =\n\n\\(cid:77)\nf(cid:77)(xt)N (xt | \u00b5\nt\n\n\\(cid:77)\nt )dxt ,\n\n, \u03a3\n\n(12)\n\nf(cid:77)(xt)q\\(cid:77)(xt)dxt \u221d\n\n, \u03a3\nf(cid:77)(xt) = p(zt|xt) = N (zt | \u00b5g(xt), \u03a3g(xt)),\n\n(13)\nwhere f(cid:77) is the true measurement factor, and \u00b5g(xt) and \u03a3g(xt) are the predictive mean and co-\nvariance of the measurement GP GP g. In (12), we made it explicit that Z(cid:77) depends on the moments\n\\(cid:77)\nof the cavity distribution q\\(cid:77)(xt). The integral in (12) is of the form (8), but is\n\u00b5\nt\nintractable since solving it corresponds to a GP prediction at uncertain inputs [14], resulting in non-\nGaussian predictive distributions. 
However, the mean and covariance of a Gaussian approximation\n\u02dcZ(cid:77) to Z(cid:77) can be computed analytically: either using exact moment matching [14, 3], or approxi-\nmately by expected linearization of the posterior GP [8]; details are given in [4]. The moments of\n\nand \u03a3\n\n\\(cid:77)\nt\n\n(cid:90)\n\n(cid:90)\n\n4\n\n\f\\(cid:77)\n\\(cid:77)\n\u02dcZ(cid:77) are also functions of the mean \u00b5\nt of the cavity distribution. By taking the\nt\nlinearization assumption of the Gaussian approximation into account explicitly (here, we implicitly\nlinearize GP g) when computing the derivatives, the EP updates remain consistent, see Sec. 3.1.\nBackward Message To update the backward message q(cid:67)(xt), we require the partition function\n\nand variance \u03a3\n\n(cid:90)\n\n(cid:90)\n(cid:90)\nf(cid:67)(xt)q\\(cid:67)(xt)dxt \u221d\np(xt+1|xt)q\\(cid:66)(xt+1)dxt+1 =\n\n) =\n\n(cid:90)\n\n\\(cid:67)\nZ(cid:67)(\u00b5\nt\n\n\\(cid:67)\n, \u03a3\nt\n\nf(cid:67)(xt) =\n\n\\(cid:67)\nt\n\n\\(cid:67)\n, \u03a3\nt\n\n)dxt ,\n\nf(cid:67)(xt)N (xt | \u00b5\nN (xt+1 | \u00b5h(xt), \u03a3h(xt))q\\(cid:66)(xt+1)dxt+1 .\n(cid:90)\n\n(14)\n\n(15)\n\n(16)\n\nHere, the true factor f(cid:67)(xt) in (15) takes into account the coupling between xt and xt+1, which\nwas lost in assuming the full factorization in Fig. 1. The predictive mean and covariance of GP h are\ndenoted \u00b5h(xt) and \u03a3h(xt), respectively. 
Using (15) in (14) and reordering the integration yields\n\n(cid:90)\n\n\\(cid:67)\nZ(cid:67)(\u00b5\nt\n\n\\(cid:67)\n, \u03a3\nt\n\n) \u221d\n\nq\\(cid:66)(xt+1)\n\np(xt+1|xt)q\\(cid:67)(xt)dxtdxt+1 .\n\n\\(cid:67)\n\n\\(cid:67)\n\n, \u02dc\u03a3\n\n+ \u03a3\n\n(cid:90)\n\nand \u03a3\n\n, \u02dc\u03a3\n\\(cid:67)\nt\n\n\\(cid:67)\nare functions of \u00b5\nt\n\\(cid:67)\n\nForward Message Similarly, for the forward message, the projection step involves computing the\npartition function\n\nWe approximate the inner integral in (16), which is of the form (8), by N (xt+1 | \u02dc\u00b5\\(cid:67)\n) by\nmoment matching [14], for instance. Note that \u02dc\u00b5\\(cid:67)\nand \u02dc\u03a3\n. This\nGaussian approximation implicitly linearizes GP h. Now, (16) can be computed analytically, and\n\\(cid:66)\n\\(cid:66)\nt+1 | \u02dc\u00b5\\(cid:67)\nwe obtain a Gaussian approximation \u02dcZ(cid:67) = N (\u00b5\nt+1) of Z(cid:67) that allows us to\nupdate the moments of q(xt) and the message q(cid:67)(xt).\n(cid:90)\n(cid:90)\n\nf(cid:66)(xt)N (xt | \u00b5\nN (xt | \u00b5f (xt\u22121), \u03a3f (xt\u22121))q\\(cid:67)(xt\u22121)dxt\u22121 ,\nwhere the true factor f(cid:66)(xt) takes into account the coupling between xt\u22121 and xt, see Fig. 1. Here,\nthe true factor f(cid:66)(xt) is of the form (8). We propose to approximate f(cid:66)(xt) directly by a Gaussian\n(cid:66)\nq(cid:66)(xt) \u221d N ( \u02dc\u00b5\n). This approximation implicitly linearizes GP h. We obtain the updated\nposterior q(xt) by Gaussian multiplication, i.e., q(xt) \u221d q(cid:66)(xt)q\\(cid:66)(xt). 
With this approximation\nwe do not update the forward message in context, i.e., the true factor f(cid:66)(xt) is directly approximated\ninstead of the product f(cid:66)(xt)q\\(cid:66)(xt), which can result in suboptimal approximation.\n\np(xt|xt\u22121)q\\(cid:67)(xt\u22121)dxt\u22121 =\n\nf(cid:66)(xt)q\\(cid:66)(xt)dxt =\n\n\\(cid:66)\nZ(cid:66)(\u00b5\nt\n\nf(cid:66)(xt) =\n\n\\(cid:66)\n, \u03a3\nt\n\n\\(cid:66)\n, \u03a3\nt\n\n(cid:66)\n, \u02dc\u03a3\n\n)dxt,\n\n\\(cid:66)\nt\n\n(cid:90)\n\n(17)\n\n) =\n\n3.3 EP Updates for General Gaussian Smoothers\n\nWe can interpret the EP computations in the context of classical Gaussian \ufb01ltering and smooth-\ning [1]. During the forward sweep, the marginal q(xt) = q\\(cid:67)(xt) corresponds to the \ufb01lter dis-\ntribution p(xt|z1:t). Moreover, the cavity distribution q\\(cid:77)(xt) corresponds to the time update\np(xt|z1:t\u22121). In the backward sweep, the marginal q(xt) is the smoothing distribution p(xt|Z),\nincorporating the measurements of the entire time series. The mean and covariance of \u02dcZ(cid:67) can be\ninterpreted as the mean and covariance of the time update p(xt+1|z1:t).\nUpdating the moments of the posterior q(xt) via the derivatives of the log-partition function recovers\nexactly the standard Gaussian EP updates in dynamical systems described by Qi and Minka [13].\nFor example, when incorporating an updated measurement message, the moments in (5) can also be\nt \u2212 K\u03a3zx\\(cid:77)\n\\(cid:77)\n\\(cid:77)\nz ) and \u03a3t = \u03a3\nwritten as \u00b5t = \u00b5\n=\n\\(cid:77)\n(\u03a3\\(cid:77)\nz = E[g(xt)] and \u03a3\\(cid:77)\nz )\u22121. Here, \u00b5\nz = cov[g(xt)], where\ncov[x\nxt \u223c q\\(cid:77)(xt). 
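The measurement update just quoted can be written as generic Gaussian conditioning (a sketch of our own; `C_xz` stands for the cross-covariance cov[x_t, g(x_t)] and `S_z` for the predictive measurement covariance, both under the cavity distribution):

```python
import numpy as np

def measurement_update(mu, Sigma, z, mu_z, S_z, C_xz):
    # Gaussian conditioning with gain K = C_xz S_z^{-1}:
    #   mu_t    = mu + K (z - mu_z)
    #   Sigma_t = Sigma - K C_xz^T
    K = C_xz @ np.linalg.inv(S_z)
    return mu + K @ (z - mu_z), Sigma - K @ C_xz.T
```

As a linear-Gaussian sanity check, for x ~ N(0, 1), z = x plus noise of variance 1, and an observation z = 1, this recovers the textbook posterior N(0.5, 0.5).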
Similarly, the updated moments of q(xt) with a new backward message via (5)\n\\(cid:67)\n\\(cid:67)\n\\(cid:67)\n(cid:67)\nt+1)L(cid:62),\ncorrespond to the updates [13] \u00b5t = \u00b5\nt+1) and \u03a3t = \u03a3\nt + L(\u03a3t+1\u2212 \u03a3\nt + L(\u00b5t+1\u2212 \u00b5\n\\(cid:67)\n\\(cid:67)\n\\(cid:67)\nt+1)\u22121. Here, we de\ufb01ned \u00b5\nwhere L = cov[x\nt+1 = cov[h(xt)],\n, x\nt\nwhere xt \u223c q\\(cid:67)(xt).\n\n, respectively, where \u03a3xz\\(cid:77)\n\n\\(cid:67)\nt+1 = E[h(xt)] and \u03a3\n\n\\(cid:77)\nt + K(zt \u2212 \u00b5\n\n] and K = \u03a3xz\\(cid:77)\n\n\\(cid:67)\nt+1](\u03a3\n\n\\(cid:77)\nt\n\n\\(cid:77)\nt\n\n, z\n\nt\n\nt\n\nt\n\n5\n\n\fTable 1: Performance comparison on the synthetic data set. Lower values are better.\n\nNLLx\nMAEx\nNLLz\n\nEKS\n\nEP-EKS\n\nGPEKS\n\n\u22122.04 \u00b1 0.07\n0.03 \u00b1 2.0 \u00d7 10\u22123\n\u22120.69 \u00b1 0.11\n\n\u22122.17 \u00b1 0.04\n0.03 \u00b1 2.0 \u00d7 10\u22123\n\u22120.73 \u00b1 0.11\n\n\u22121.67 \u00b1 0.22\n0.04 \u00b1 4.6 \u00d7 10\u22122\n\u22120.75 \u00b1 0.08\n\nEP-GPEKS\n\u22121.87 \u00b1 0.14\n0.04 \u00b1 4.6 \u00d7 10\u22122\n\u22120.81 \u00b1 0.07\n\nGPADS\n\n+ 1.67 \u00b1 0.37\n1.79 \u00b1 0.21\n1.93 \u00b1 0.28\n\nEP-GPADS\n\u22121.91 \u00b1 0.10\n0.04 \u00b1 4 \u00d7 10\u22123\n\u22120.77 \u00b1 0.07\n\nThe iterative message-passing algorithm in Alg. 1 provides an EP-based generalization and a uni-\nfying view of existing approaches for smoothing in dynamical systems, e.g., (Extended/Unscented/\nCubature) Kalman smoothing and the corresponding GPDS smoothers [5]. Computing the messages\nvia the derivatives of the approximate log-partition functions log \u02dcZi recovers not only standard EP\nupdates in dynamical systems [13], but also the standard Kalman smoothing updates [1].\nUsing any prediction method (e.g., unscented transformation, linearization), we can compute Gaus-\nsian approximations of (8). 
This in\ufb02uences the computation of log \u02dcZi and its derivatives with respect\nto the moments of the cavity distribution, see (9)\u2013(10). Hence, our message-passing formulation is\nalso general as it includes all conceivable Gaussian \ufb01lters/smoothers in (GP)DSs, solely depending\non the prediction technique used.\n\n4 Experimental Results\n\nWe evaluated our proposed EP-based message passing algorithm on three data sets: a synthetic data\nset, a low-dimensional simulated mechanical system with control inputs, and a high-dimensional\nmotion-capture data set. We compared to existing state-of-the-art forward-backward smoothers in\nGPDSs, speci\ufb01cally the GPEKS [8], which is based on the expected linearization of the GP models,\nand the GPADS [5], which uses moment-matching. We refer to our EP generalizations of these\nmethods as EP-GPEKS and EP-GPADS.\nIn all our experiments, we evaluated the inference methods using test sequences of measurements\nZ = [z1, . . . , zT ]. We report the negative log-likelihood of predicted measurements using the\nobserved test sequence (NLLz). Whenever available, we also compared the inferred posterior dis-\ntribution q(X) \u2248 p(X|Z) of the latent states with the underlying ground truth using the average\nnegative log-likelihood (NLLx) and Mean Absolute Errors (MAEx). We terminated EP after 100\niterations or when the average norms of the differences of the means and covariances of q(X) in\ntwo subsequent EP iterations were smaller than 10\u22126.\n\n4.1 Synthetic Data\n\nzt = 4 sin(xt) + v ,\n\nv \u223c N (0, 0.12) .\n\nWe considered the nonlinear dynamical system\nxt+1 = 4 sin(xt) + w , w \u223c N (0, 0.12) ,\n\nWe used p(x1) = N (0, 1) as a prior on the initial latent state. We assumed access to the latent state\nand trained the dynamics and measurement GPs using 30 randomly generated points, resulting in\na model with a substantial amount of posterior model uncertainty. 
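The synthetic benchmark of Sec. 4.1 can be simulated in a few lines (our own sketch; `simulate` is an assumed name, and the noise terms use standard deviation 0.1, i.e., variance 0.1^2):

```python
import numpy as np

def simulate(T, rng, q=0.1, r=0.1):
    # Synthetic GPDS of Sec. 4.1:
    #   x_{t+1} = 4 sin(x_t) + w,  w ~ N(0, 0.1^2)
    #   z_t     = 4 sin(x_t) + v,  v ~ N(0, 0.1^2)
    # with prior p(x_1) = N(0, 1); q and r are noise standard deviations.
    x = np.empty(T)
    z = np.empty(T)
    x[0] = rng.normal(0.0, 1.0)
    for t in range(T):
        z[t] = 4.0 * np.sin(x[t]) + rng.normal(0.0, r)
        if t + 1 < T:
            x[t + 1] = 4.0 * np.sin(x[t]) + rng.normal(0.0, q)
    return x, z
```

Trajectories generated this way (with states observed) can serve as the training data for the dynamics and measurement GPs described above.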
The length of the test trajectory\nused was T = 20 time steps.\nTab. 1 reports the quality of the inferred posterior distributions of the latent state trajectories using the\naverage NLLx, MAEx, and NLLz (with standard errors), averaged over 10 independent scenarios.\nFor this dataset, we also compared to the Extended Kalman Smoother (EKS) and an EP-iterated EKS\n(EP-EKS). Both inference methods make use of the known transition and measurement mappings\nh and g, respectively. Iterated forward-backward smoothing with EP (EP-EKS, EP-GPEKS, EP-\nGPADS) improved the smoothing posteriors using a single sweep only (EKS, GPEKS, GPADS).\nThe GPADS performed poorly across all our evaluation criteria for two reasons: First, the GPs were\ntrained using few data points, resulting in posterior distributions with a high degree of uncertainty.\nSecond, predictive variances using moment-matching are generally conservative and increased the\nuncertainty even further. This uncertainty caused the GPADS to quickly lose track of the period of\nthe state, as shown in Fig. 2(a). By iterating forward-backward smoothing using EP (EP-GPADS),\nthe posteriors p(xt|Z) were iteratively re\ufb01ned, and the latent state could be followed closely as\nindicated by both the small blue error bars in Fig. 2(a) and all performance measures in Tab. 1. EP\nsmoothing typically required a small number of iterations for the inferred posterior distribution to\nclosely track the true state, Fig. 2(b). On average, EP required fewer than 10 iterations to converge\nto a good solution in which the mean of the latent-state posterior closely matched the ground truth.\n\n6\n\n\f(a) Example trajectory distributions with 95% con-\n\ufb01dence bounds.\n\n(b) Average NLLx as a function of the EP iteration\nwith twice the standard error.\n\nFigure 2: (a) Posterior latent state distributions using EP-GPADS (blue) and the GPADS (gray). The\nground truth is shown in red (dashed). 
The GPADS quickly loses track of the period of the state, as revealed by the large posterior uncertainty. EP with moment matching (EP-GPADS) in the GPDS iteratively refines the GPADS posterior and can closely follow the true latent state trajectory. (b) Average NLLx per data point in latent space with standard errors of the posterior state distributions computed by the GPADS and the EP-GPADS as a function of EP iterations.\n\n4.2 Pendulum Tracking\n\nWe considered a pendulum tracking problem to demonstrate GPDS inference in multidimensional settings, as well as the ability to handle control inputs. The state x of the system is given by the angle \u03c6, measured from the upright position, and the angular velocity. The pendulum used has a mass of 1 kg and a length of 1 m, and random torques u \u2208 [\u22122, 2] Nm were applied for a duration of 200 ms (zero-order-hold control). The system noise covariance was set to \u03a3w = diag(0.3^2, 0.1^2). The state was measured indirectly by two bearing sensors with coordinates (x1, y1) = (\u22122, 0) and (x2, y2) = (\u22120.5, \u22120.5), respectively, according to z = [z1, z2]^T + v, v \u223c N(0, diag(0.1^2, 0.05^2)), with zi = arctan((sin \u03c6 \u2212 yi)/(cos \u03c6 \u2212 xi)), i = 1, 2. We trained the GP models using 4 randomly generated trajectories of length T = 20 time steps, starting from an initial state distribution p(x1) = N(0, diag(\u03c0^2/16^2, 0.5^2)) around the upright position. For testing, we generated 12 random trajectories starting from p(x1).\nTab. 
2 summarizes the performance of the various inference methods. Generally, the (EP-)GPADS performed better than the (EP-)GPEKS across all performance measures. This indicates that the (EP-)GPEKS suffered from overconfident posteriors compared to the (EP-)GPADS, which is especially pronounced in the degrading NLLx values with increasing EP iterations and in the relatively high standard errors. In about 20% of the test cases, the inference methods based on explicit linearization of the posterior mean function (GPEKS and EP-GPEKS) ran into the numerical problems typical of linearizations [5], i.e., overconfident posterior distributions. We excluded these runs from the results in Tab. 2. The inference algorithms based on moment matching (GPADS and EP-GPADS) were numerically stable, as their predictions are typically more coherent due to the conservative approximations of moment matching.\n\nTable 2: Performance comparison on the pendulum-swing data. Lower values are better.\n\n          NLLx           MAEx          NLLz\nGPEKS     \u22120.35 \u00b1 0.39   0.30 \u00b1 0.02   \u22122.41 \u00b1 0.047\nEP-GPEKS  \u22120.33 \u00b1 0.44   0.31 \u00b1 0.02   \u22122.39 \u00b1 0.038\nGPADS     \u22120.80 \u00b1 0.06   0.30 \u00b1 0.02   \u22122.37 \u00b1 0.042\nEP-GPADS  \u22120.85 \u00b1 0.05   0.29 \u00b1 0.02   \u22122.40 \u00b1 0.037\n\n4.3 Motion Capture Data\n\nWe considered motion capture data (from http://mocap.cs.cmu.edu/, subject 64) containing 10 trials of golf swings recorded at 120 Hz, which we subsampled to 20 Hz. After removing observation dimensions with no variability, we were left with observations zt \u2208 R56, which were then whitened as a pre-processing step. For trials 1\u20137 (403 data points), we used the GPDM [20] to learn MAP estimates of the latent states xt \u2208 R3. These estimated latent states and their corresponding observations are used to train the GP models GP f and GP g. 
Figure 3: Latent space posterior distribution (95% confidence ellipsoids) of a test trajectory of the golf-swing motion capture data. The further the ellipsoids are separated, the faster the movement.

Trials 8–10 were used as test data without ground truth labels. The GPDM [20] focuses on learning a GPDS; we are interested in good approximate inference in these models.
Fig. 3 shows the latent-state posterior distribution of a single test sequence (trial 10) obtained from the EP-GPADS. The most significant prediction errors in observed space occurred in the region corresponding to the yellow/red ellipsoids, which is a low-dimensional embedding of the motion when the golf player hits the ball, i.e., the periods of high acceleration (poses 3–5).
Tab. 3 summarizes the results of inference on the golf data set in all test trials: iterating forward-backward smoothing by means of EP improved the inferred posterior distributions over the latent states. The posterior distributions in latent space inferred by the EP-GPEKS were tighter than the ones inferred by the EP-GPADS. The NLLz values suffered a bit from this overconfidence, but the predictive performance of the EP-GPADS and the EP-GPEKS was similar. Generally, inference was more difficult in areas with fast movements (poses 3–5 in Fig. 3), where training data were sparse.
The computational demand of the two inference methods for GPDSs we presented is vastly different.
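The NLL measures reported in Tab. 2 and Tab. 3 can be made concrete with a short sketch. This is our reading of the metric, namely the average negative log-likelihood of the ground-truth points under the inferred Gaussian posterior marginals; the function and variable names are assumptions, not the authors' code.

```python
import numpy as np

# Sketch of the NLL performance measure: average negative log-likelihood
# of ground-truth points under inferred Gaussian posterior marginals
# (our reading of NLLx/NLLz; names are our own assumptions).
def avg_gaussian_nll(truth, means, variances):
    truth, means, variances = map(np.asarray, (truth, means, variances))
    nll = 0.5 * (np.log(2.0 * np.pi * variances)
                 + (truth - means) ** 2 / variances)
    return nll.mean()

# An overconfident posterior (variance too small relative to its error)
# is penalized much more heavily than a conservative one, which is why
# degrading NLL values indicate overconfidence.
overconfident = avg_gaussian_nll([1.0], [0.0], [0.01])  # tight but wrong
conservative = avg_gaussian_nll([1.0], [0.0], [1.0])    # broad
```

Under this measure, the overconfident posterior above scores far worse than the conservative one, mirroring the degrading NLL values reported for the (EP-)GPEKS.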
High-dimensional approximate inference in the motion capture example using moment matching (EP-GPADS) was about two orders of magnitude slower than approximate inference based on linearization of the posterior GP mean (EP-GPEKS): for updating the posterior and the messages for a single time slice, the EP-GPEKS required less than 0.5 s, whereas the EP-GPADS took about 20 s. Hence, numerical stability and more coherent posterior inference with the EP-GPADS trade off against computational demands.

Table 3: Average inference performance (NLLz, motion capture data set). Lower values are better.

Test trial   GPEKS   EP-GPEKS   GPADS   EP-GPADS
Trial 8      14.09   14.20      13.82   14.28
Trial 9      14.84   15.63      14.71   15.19
Trial 10     25.42   26.68      25.73   25.64

5 Conclusion

We have presented an approximate message passing algorithm based on EP for improved inference and Bayesian state estimation in GP dynamical systems. Our message-passing formulation generalizes current inference methods in GPDSs to iterative forward-backward smoothing. This generalization allows for improved predictions and comprises existing methods for inference in the wider theory of dynamical systems as special cases. Our new inference approach makes the full power of the GPDS model available for the study of complex time-series data. Future work includes investigating alternatives to linearization and moment matching when computing messages, and the more general problem of learning in Gaussian process dynamical systems.

Acknowledgements

We thank Zhikun Wang for helping with the motion capture experiment and Jan Peters for valuable discussions. The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007–2013) under grant agreement #270327 (ComPLACS) and the Canadian Institute for Advanced Research (CIFAR).

References
[1] B. D. O. Anderson and J. B. Moore. Optimal Filtering.
Dover Publications, 2005.
[2] A. Damianou, M. K. Titsias, and N. D. Lawrence. Variational Gaussian Process Dynamical Systems. In Advances in Neural Information Processing Systems, 2011.
[3] M. P. Deisenroth, M. F. Huber, and U. D. Hanebeck. Analytic Moment-based Gaussian Process Filtering. In Proceedings of the 26th International Conference on Machine Learning, pages 225–232. Omnipress, 2009.
[4] M. P. Deisenroth and S. Mohamed. Expectation Propagation in Gaussian Process Dynamical Systems: Extended Version, 2012. http://arxiv.org/abs/1207.2940.
[5] M. P. Deisenroth, R. Turner, M. Huber, U. D. Hanebeck, and C. E. Rasmussen. Robust Filtering and Smoothing with Gaussian Processes. IEEE Transactions on Automatic Control, 2012.
[6] T. Heskes and O. Zoeter. Expectation Propagation for Approximate Inference in Dynamic Bayesian Networks. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence, pages 216–233, 2002.
[7] S. J. Julier and J. K. Uhlmann. Unscented Filtering and Nonlinear Estimation. Proceedings of the IEEE, 92(3):401–422, March 2004.
[8] J. Ko and D. Fox. GP-BayesFilters: Bayesian Filtering using Gaussian Process Prediction and Observation Models. Autonomous Robots, 27(1):75–90, 2009.
[9] M. Kuss and C. E. Rasmussen. Assessing Approximate Inference for Binary Gaussian Process Classification. Journal of Machine Learning Research, 6:1679–1704, 2005.
[10] T. P. Minka. Expectation Propagation for Approximate Bayesian Inference. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, pages 362–369. Morgan Kaufmann Publishers, 2001.
[11] T. P. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Massachusetts Institute of Technology, 2001.
[12] T. P. Minka. EP: A Quick Reference. 2008.
[13] Y. Qi and T. Minka. Expectation Propagation for Signal Detection in Flat-Fading Channels. In
Proceedings of the IEEE International Symposium on Information Theory, 2003.
[14] J. Quiñonero-Candela, A. Girard, J. Larsen, and C. E. Rasmussen. Propagation of Uncertainty in Bayesian Kernel Models: Application to Multiple-Step Ahead Forecasting. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 701–704, 2003.
[15] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
[16] M. W. Seeger. Expectation Propagation for Exponential Families. Technical report, University of California Berkeley, 2005.
[17] M. W. Seeger. Bayesian Inference and Optimal Design for the Sparse Linear Model. Journal of Machine Learning Research, 9:759–813, 2008.
[18] M. Toussaint and C. Goerick. From Motor Learning to Interaction Learning in Robotics, chapter A Bayesian View on Motor Control and Planning, pages 227–252. Springer-Verlag, 2010.
[19] R. Turner, M. P. Deisenroth, and C. E. Rasmussen. State-Space Inference and Learning with Gaussian Processes. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume JMLR: W&CP 9, pages 868–875, 2010.
[20] J. M. Wang, D. J. Fleet, and A. Hertzmann. Gaussian Process Dynamical Models for Human Motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):283–298, 2008.