{"title": "Variational Gaussian Process Dynamical Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 2510, "page_last": 2518, "abstract": "High dimensional time series are endemic in applications of machine learning such as robotics (sensor data), computational biology (gene expression data), vision (video sequences) and graphics (motion capture data). Practical nonlinear probabilistic approaches to this data are required. In this paper we introduce the variational Gaussian process dynamical system. Our work builds on recent variational approximations for Gaussian process latent variable models to allow for nonlinear dimensionality reduction simultaneously with learning a dynamical prior in the latent space. The approach also allows for the appropriate dimensionality of the latent space to be automatically determined. We demonstrate the model on a human motion capture data set and a series of high resolution video sequences.", "full_text": "Variational Gaussian Process Dynamical Systems\n\nAndreas C. Damianou\u2217\n\nDepartment of Computer Science\n\nUniversity of Shef\ufb01eld, UK\n\nandreas.damianou@sheffield.ac.uk\n\nMichalis K. Titsias\n\nSchool of Computer Science\nUniversity of Manchester, UK\n\nmtitsias@gmail.com\n\nNeil D. Lawrence\u2217\n\nDepartment of Computer Science\n\nUniversity of Shef\ufb01eld, UK\n\nN.Lawrence@dcs.shef.ac.uk\n\nAbstract\n\nHigh dimensional time series are endemic in applications of machine learning such as robotics\n(sensor data), computational biology (gene expression data), vision (video sequences) and\ngraphics (motion capture data). Practical nonlinear probabilistic approaches to this data are\nrequired. In this paper we introduce the variational Gaussian process dynamical system. Our\nwork builds on recent variational approximations for Gaussian process latent variable models\nto allow for nonlinear dimensionality reduction simultaneously with learning a dynamical\nprior in the latent space. 
The approach also allows for the appropriate dimensionality of the\nlatent space to be automatically determined. We demonstrate the model on a human motion\ncapture data set and a series of high resolution video sequences.\n\n1\n\nIntroduction\n\nNonlinear probabilistic modeling of high dimensional time series data is a key challenge for the machine learn-\ning community. A standard approach is to simultaneously apply a nonlinear dimensionality reduction to the\ndata whilst governing the latent space with a nonlinear temporal prior. The key dif\ufb01culty for such approaches is\nthat analytic marginalization of the latent space is typically intractable. Markov chain Monte Carlo approaches\ncan also be problematic as latent trajectories are strongly correlated making ef\ufb01cient sampling a challenge. One\npromising approach to these time series has been to extend the Gaussian process latent variable model [1, 2]\nwith a dynamical prior for the latent space and seek a maximum a posteriori (MAP) solution for the latent\npoints [3, 4, 5]. Ko and Fox [6] further extend these models for fully Bayesian \ufb01ltering in a robotics setting. We\nrefer to this class of dynamical models based on the GP-LVM as Gaussian process dynamical systems (GPDS).\nHowever, the use of a MAP approximation for training these models presents key problems. Firstly, since the\nlatent variables are not marginalised, the parameters of the dynamical prior cannot be optimized without the\nrisk of over\ufb01tting. Further, the dimensionality of the latent space cannot be determined by the model: adding\nfurther dimensions always increases the likelihood of the data. In this paper we build on recent developments\nin variational approximations for Gaussian processes [7, 8] to introduce a variational Gaussian process dynami-\ncal system (VGPDS) where latent variables are approximately marginalized through optimization of a rigorous\nlower bound on the marginal likelihood. 
As well as providing a principled approach to handling uncertainty in the latent space, this allows both the parameters of the latent dynamical process and the dimensionality of the latent space to be determined. The approximation enables the application of our model to time series containing millions of dimensions and thousands of time points. We illustrate this by modeling human motion capture data and high dimensional video sequences.

∗Also at the Sheffield Institute for Translational Neuroscience, University of Sheffield, UK.

2 The Model

Assume a multivariate time series dataset {y_n, t_n}, n = 1, ..., N, where y_n ∈ R^D is a data vector observed at time t_n ∈ R_+. We are especially interested in cases where each y_n is a high dimensional vector and, therefore, we assume that there exists a low dimensional manifold that governs the generation of the data. Specifically, a temporal latent function x(t) ∈ R^Q (with Q ≪ D) governs an intermediate hidden layer when generating the data, and the dth feature of the data vector y_n is then produced from x_n = x(t_n) according to

    y_nd = f_d(x_n) + ε_nd,    ε_nd ∼ N(0, β⁻¹),    (1)

where f_d(x) is a latent mapping from the low dimensional space to the dth dimension of the observation space and β is the inverse variance of the white Gaussian noise. We do not want to make strong assumptions about the functional form of the latent functions (x, f).¹ Instead we would like to infer them in a fully Bayesian non-parametric fashion using Gaussian processes [9]. Therefore, we assume that x is a multivariate Gaussian process indexed by time t and f is a different multivariate Gaussian process indexed by x, and we write

    x_q(t) ∼ GP(0, k_x(t_i, t_j)),    q = 1, ..., Q,    (2)
    f_d(x) ∼ GP(0, k_f(x_i, x_j)),    d = 1, ..., D.    (3)

Here, the individual components of the latent function x are taken to be independent sample paths drawn from a Gaussian process with covariance function k_x(t_i, t_j). Similarly, the components of f are independent draws from a Gaussian process with covariance function k_f(x_i, x_j). These covariance functions, parametrized by parameters θ_x and θ_f respectively, play very distinct roles in the model. More precisely, k_x determines the properties of each temporal latent function x_q(t). For instance, the use of an Ornstein-Uhlenbeck covariance function yields a Gauss-Markov process for x_q(t), while the squared-exponential covariance function gives rise to very smooth and non-Markovian processes. In our experiments, we will focus on the squared exponential covariance function (RBF), the Matérn 3/2, which is only once differentiable, and a periodic covariance function [9, 10], which can be used when data exhibit strong periodicity. These covariance functions take the form:

    k_x(rbf)(t_i, t_j) = σ²_rbf exp( −(t_i − t_j)² / (2 l_t²) ),
    k_x(mat)(t_i, t_j) = σ²_mat (1 + √3 |t_i − t_j| / l_t) exp( −√3 |t_i − t_j| / l_t ),
    k_x(per)(t_i, t_j) = σ²_per exp( −(1/2) sin²( (2π/T)(t_i − t_j) ) / l_t ).    (4)

The covariance function k_f determines the properties of the latent mapping f that maps each low dimensional variable x_n to the observed vector y_n. We wish this mapping to be non-linear but smooth, and thus a suitable choice is the squared exponential covariance function

    k_f(x_i, x_j) = σ²_ard exp( −(1/2) Σ_{q=1}^Q w_q (x_{i,q} − x_{j,q})² ),    (5)

which assumes a different scale w_q for each latent dimension. This, as in the variational Bayesian formulation of the GP-LVM [8], enables an automatic relevance determination (ARD) procedure, i.e. it allows Bayesian training to "switch off" unnecessary dimensions by driving the values of the corresponding scales to zero.

The matrix Y ∈ R^{N×D} will collectively denote all observed data so that its nth row corresponds to the data point y_n. Similarly, the matrix F ∈ R^{N×D} will denote the mapping latent variables, i.e. f_nd = f_d(x_n), associated with observations Y from (1). Analogously, X ∈ R^{N×Q} will store all low dimensional latent variables x_nq = x_q(t_n). Further, we will refer to columns of these matrices by the vectors y_d, f_d, x_q ∈ R^N. Given the latent variables we assume independence over the data features, and given time we assume independence over latent dimensions, to give

    p(Y, F, X|t) = p(Y|F) p(F|X) p(X|t) = [ Π_{d=1}^D p(y_d|f_d) p(f_d|X) ] Π_{q=1}^Q p(x_q|t),    (6)

where t ∈ R^N and p(y_d|f_d) is a Gaussian likelihood function term defined from (1). Further, p(f_d|X) is a marginal GP prior such that

    p(f_d|X) = N(f_d|0, K_NN),    (7)

¹To simplify our notation, we often write x instead of x(t) and f instead of f(x). Later we also use a similar convention for the covariance functions, often writing them as k_f and k_x.

where K_NN = k_f(X, X) is the covariance matrix defined by the covariance function k_f and, similarly, p(x_q|t) is the marginal GP prior associated with the temporal function x_q(t),

    p(x_q|t) = N(x_q|0, K_t),    (8)

where K_t = k_x(t, t) is the covariance matrix obtained by evaluating the covariance function k_x on the observed times t.

Bayesian inference using the above model poses a huge computational challenge as, for instance, marginalization of the variables X, which appear non-linearly inside the covariance matrix K_NN, is troublesome. Practical approaches that have been considered until now (e.g. [5, 3]) marginalise out only F and seek a MAP solution for X.
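As a concrete illustration, the covariance functions (4)–(5) and the two-stage generative process (1)–(3) can be sketched in a few lines of NumPy. This is a minimal sketch with arbitrary example hyperparameters and sizes, not the paper's implementation:

```python
import numpy as np

def k_rbf(t, s, var=1.0, ell=1.0):
    """Squared exponential (RBF) covariance, first entry of eq. (4)."""
    r = t[:, None] - s[None, :]
    return var * np.exp(-r**2 / (2 * ell**2))

def k_matern32(t, s, var=1.0, ell=1.0):
    """Matern 3/2 covariance of eq. (4): once-differentiable sample paths."""
    a = np.sqrt(3) * np.abs(t[:, None] - s[None, :]) / ell
    return var * (1 + a) * np.exp(-a)

def k_ard(X, Z, var=1.0, w=None):
    """ARD squared exponential, eq. (5): one scale w_q per latent dimension."""
    w = np.ones(X.shape[1]) if w is None else w
    d2 = ((X[:, None, :] - Z[None, :, :])**2 * w).sum(-1)
    return var * np.exp(-0.5 * d2)

rng = np.random.default_rng(0)
N, Q, D, beta = 50, 2, 5, 100.0   # arbitrary example sizes and noise precision
t = np.linspace(0.0, 10.0, N)

# Sample each temporal latent function x_q(t) from the GP prior, eq. (2).
Kt = k_rbf(t, t) + 1e-6 * np.eye(N)            # jitter for numerical stability
X = rng.multivariate_normal(np.zeros(N), Kt, size=Q).T     # N x Q

# Sample each mapping f_d(x) from the GP prior, eq. (3), then add noise, eq. (1).
Knn = k_ard(X, X) + 1e-6 * np.eye(N)
F = rng.multivariate_normal(np.zeros(N), Knn, size=D).T    # N x D
Y = F + rng.normal(scale=beta**-0.5, size=(N, D))
```

Sampling X first and then Y given X mirrors the two-layer structure of the model: smooth latent trajectories over time, mapped nonlinearly into the high dimensional observation space.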
In the next section we describe how efficient variational approximations can be applied to marginalize X by extending the framework of [8].

2.1 Variational Bayesian training

The key difficulty with the Bayesian approach is propagating the prior density p(X|t) through the nonlinear mapping. This mapping gives the model its expressive power, but simultaneously renders the associated marginal likelihood,

    p(Y|t) = ∫ p(Y|F) p(F|X) p(X|t) dX dF,    (9)

intractable. We now invoke the variational Bayesian methodology to approximate the integral. Following a standard procedure [11], we introduce a variational distribution q(Θ) and compute Jensen's lower bound F_v on the logarithm of (9),

    F_v(q, θ) = ∫ q(Θ) log [ p(Y|F) p(F|X) p(X|t) / q(Θ) ] dX dF,    (10)

where θ denotes the model's parameters. However, the above form of the lower bound is problematic because X (in the GP term p(F|X)) appears non-linearly inside the covariance matrix K_NN, making the integration over X difficult. As shown in [8], this intractability is removed by applying the "data augmentation" principle. More precisely, we augment the joint probability model in (6) by including M extra samples of the GP latent mapping f, known as inducing points, so that u_m ∈ R^D is such a sample. The inducing points are evaluated at a set of pseudo-inputs X̃ ∈ R^{M×Q}. The augmented joint probability density takes the form

    p(Y, F, U, X, X̃|t) = [ Π_{d=1}^D p(y_d|f_d) p(f_d|u_d, X) p(u_d|X̃) ] p(X|t),    (11)

where p(u_d|X̃) is a zero-mean Gaussian with a covariance matrix K_MM constructed using the same function as for the GP prior (7). By dropping X̃ from our expressions, we write the augmented GP prior analytically (see [9]) as

    p(f_d|u_d, X) = N( f_d | K_NM K_MM⁻¹ u_d, K_NN − K_NM K_MM⁻¹ K_MN ).    (12)

A key result in [8] is that a tractable lower bound (computed analogously to (10)) can be obtained through the variational density

    q(Θ) = q(F, U, X) = q(F|U, X) q(U) q(X) = [ Π_{d=1}^D p(f_d|u_d, X) q(u_d) ] q(X),    (13)

where q(X) = Π_{q=1}^Q N(x_q|μ_q, S_q) and q(u_d) is an arbitrary variational distribution. Titsias and Lawrence [8] assume full independence for q(X) and the variational covariances are diagonal matrices. Here, in contrast, the posterior over the latent variables will have strong correlations, so S_q is taken to be an N × N full covariance matrix. Optimization of the variational lower bound provides an approximation to the true posterior p(X|Y) by q(X). In the augmented probability model, the "difficult" term p(F|X) appearing in (10) is now replaced with (12) and, eventually, it cancels out with the first factor of the variational distribution (13), so that F can be marginalised out analytically. Given the above, and after breaking the logarithm in (10), we obtain the final form of the lower bound (see supplementary material for more details)

    F_v(q, θ) = F̂_v − KL( q(X) ‖ p(X|t) ),    (14)

with F̂_v = ∫ q(X) log p(Y|F) p(F|X) dX dF. Both terms in (14) are now tractable. Note that the first of the above terms involves the data while the second one only involves the prior. All the information regarding data point correlations is captured in the KL term, and the connection with the observations comes through the variational distribution. Therefore, the first term in (14) has the same analytical solution as the one derived in [8]. Equation (14) can be maximized by using gradient-based methods².
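The KL term in (14) is a divergence between two multivariate Gaussians, KL(N(μ_q, S_q) ‖ N(0, K_t)), summed over latent dimensions, and therefore has a well-known closed form. A minimal sketch, with an arbitrary example kernel matrix rather than anything from the paper's code:

```python
import numpy as np

def kl_gaussians(mu, S, K):
    """KL( N(mu, S) || N(0, K) ) for one latent dimension q.

    Closed form: 0.5 * [ tr(K^-1 S) + mu^T K^-1 mu - N + log|K| - log|S| ].
    """
    N = len(mu)
    trace_term = np.trace(np.linalg.solve(K, S))
    maha = mu @ np.linalg.solve(K, mu)
    _, logdet_K = np.linalg.slogdet(K)
    _, logdet_S = np.linalg.slogdet(S)
    return 0.5 * (trace_term + maha - N + logdet_K - logdet_S)

# Example prior covariance K_t from an RBF kernel over 10 time points.
t = np.linspace(0.0, 1.0, 10)
Kt = np.exp(-0.5 * (t[:, None] - t[None, :])**2 / 0.1) + 1e-6 * np.eye(10)
```

Setting q(x_q) equal to the prior, `kl_gaussians(np.zeros(10), Kt, Kt)`, returns (numerically) zero, while any other choice of (μ_q, S_q) yields a positive penalty that pulls the variational posterior toward the temporal prior.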
However, not factorizing q(X) across data points yields O(N²) variational parameters to optimize. This issue is addressed in the next section.

2.2 Reparametrization and Optimization

The optimization involves the model parameters θ = (β, θ_f, θ_x), the variational parameters {μ_q, S_q}, q = 1, ..., Q, of q(X), and the inducing points³ X̃.

Optimization of the variational parameters appears challenging, due to their large number and the correlations between them. However, by reparametrizing our O(N²) variational parameters according to the framework described in [12], we can obtain a set of O(N) less correlated variational parameters. Specifically, we first take the derivatives of the variational bound (14) w.r.t. S_q and μ_q and set them to zero, to find the stationary points

    S_q = ( K_t⁻¹ + Λ_q )⁻¹  and  μ_q = K_t μ̄_q,    (15)

where Λ_q = −2 ∂F̂_v(q, θ)/∂S_q is an N × N diagonal, positive matrix and μ̄_q = ∂F̂_v/∂μ_q is an N-dimensional vector. The above stationary conditions tell us that, since S_q depends on a diagonal matrix Λ_q, we can reparametrize it using only the N-dimensional diagonal of that matrix, denoted by λ_q. Then, we can optimise the 2(Q × N) parameters (λ_q, μ̄_q) and obtain the original parameters using (15).

2.3 Learning from Multiple Sequences

Our objective is to model multivariate time series. A given data set may consist of a group of independent observed sequences, each with a different length (e.g. in human motion capture data, several walks from a subject). Let, for example, the dataset be a group of S independent sequences (Y⁽¹⁾, ..., Y⁽ˢ⁾). We would like our model to capture the underlying commonality of these data. We handle this by allowing a different temporal latent function for each of the independent sequences, so that X⁽ˢ⁾ is the set of latent variables corresponding to sequence s. These sets are a priori assumed to be independent, since they correspond to separate sequences, i.e. p(X⁽¹⁾, X⁽²⁾, ..., X⁽ˢ⁾) = Π_{s=1}^S p(X⁽ˢ⁾), where we dropped the conditioning on time for simplicity. This factorisation leads to a block-diagonal structure for the time covariance matrix K_t, where each block corresponds to one sequence. In this setting, each block of observations Y⁽ˢ⁾ is generated from its corresponding X⁽ˢ⁾ according to Y⁽ˢ⁾ = F⁽ˢ⁾ + ε, where the latent function which governs this mapping is shared across all sequences and ε is Gaussian noise.

3 Predictions

Our algorithm models the temporal evolution of a dynamical system. It should be capable of generating completely new sequences or reconstructing missing observations from partially observed data. For generating a novel sequence given training data, the model requires a time vector t_* as input and computes a density p(Y_*|Y, t, t_*). For reconstruction of partially observed data, the time-stamp information is additionally accompanied by a partially observed sequence Y_*^p ∈ R^{N_*×D_p} from the whole Y_* = (Y_*^p, Y_*^m), where p and m are set indices indicating the present (i.e. observed) and missing dimensions of Y_* respectively, so that p ∪ m = {1, ..., D}. We reconstruct the missing dimensions by computing the Bayesian predictive distribution p(Y_*^m|Y_*^p, Y, t_*, t). The predictive densities can also be used as estimators for tasks like generative Bayesian classification.
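The block-diagonal construction of K_t described in section 2.3 can be sketched as follows; this is an illustrative construction with an arbitrary RBF kernel, not the paper's implementation:

```python
import numpy as np

def k_rbf(t, s, var=1.0, ell=1.0):
    """Squared exponential covariance over time."""
    r = t[:, None] - s[None, :]
    return var * np.exp(-r**2 / (2 * ell**2))

def block_diag_Kt(time_vectors, kern=k_rbf):
    """A priori independent sequences: zero cross-covariance between blocks."""
    sizes = [len(t) for t in time_vectors]
    Kt = np.zeros((sum(sizes), sum(sizes)))
    start = 0
    for t in time_vectors:
        n = len(t)
        Kt[start:start + n, start:start + n] = kern(t, t)
        start += n
    return Kt

# Two independent sequences of different lengths share one covariance matrix.
Kt = block_diag_Kt([np.linspace(0, 1, 4), np.linspace(0, 2, 3)])
```

The off-diagonal blocks are exactly zero, so latent trajectories are correlated within a sequence but independent across sequences, while the mapping f (and hence θ_f) remains shared.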
Whilst time-stamp information is always provided, in the next section we drop the dependence on it to avoid notational clutter.

²See the supplementary material for a more detailed derivation of (14) and for the equations for the gradients.
³We will use the term "variational parameters" to refer only to the parameters of q(X), although the inducing points are also variational parameters.

3.1 Predictions Given Only the Test Time Points

To approximate the predictive density, we will need to introduce the underlying latent function values F_* ∈ R^{N_*×D} (the noise-free version of Y_*) and the latent variables X_* ∈ R^{N_*×Q}. We write the predictive density as

    p(Y_*|Y) = ∫ p(Y_*, F_*, X_*|Y) dF_* dX_* = ∫ p(Y_*|F_*) p(F_*|X_*, Y) p(X_*|Y) dF_* dX_*.    (16)

The term p(F_*|X_*, Y) is approximated by the variational distribution

    q(F_*|X_*) = Π_{d∈D} ∫ p(f_{*,d}|u_d, X_*) q(u_d) du_d = Π_{d∈D} q(f_{*,d}|X_*),    (17)

where q(f_{*,d}|X_*) is a Gaussian that can be computed analytically, since in our variational framework the optimal setting for q(u_d) is also found to be a Gaussian (see the supplementary material for complete forms). As for the term p(X_*|Y) in eq. (16), it is approximated by a Gaussian variational distribution q(X_*),

    q(X_*) = Π_{q=1}^Q q(x_{*,q}) = Π_{q=1}^Q ∫ p(x_{*,q}|x_q) q(x_q) dx_q = Π_{q=1}^Q ⟨p(x_{*,q}|x_q)⟩_{q(x_q)},    (18)

where p(x_{*,q}|x_q) is a Gaussian found from the conditional GP prior (see [9]) and q(X) is also Gaussian.
We can thus work out analytically the mean and variance for (18), which turn out to be:

    μ_{x_{*,q}} = K_{*N} μ̄_q,    (19)
    var(x_{*,q}) = K_{**} − K_{*N} ( K_t + Λ_q⁻¹ )⁻¹ K_{N*},    (20)

where K_{*N} = k_x(t_*, t), K_{N*} = K_{*N}ᵀ and K_{**} = k_x(t_*, t_*). Notice that these equations have exactly the same form as found in standard GP regression problems. Once we have analytic forms for the posteriors in (16), the predictive density is approximated as

    p(Y_*|Y) = ∫ p(Y_*|F_*) q(F_*|X_*) q(X_*) dF_* dX_* = ∫ p(Y_*|F_*) ⟨q(F_*|X_*)⟩_{q(X_*)} dF_*,    (21)

which is a non-Gaussian integral that cannot be computed analytically. However, following the same argument as in [9, 13], we can calculate analytically its mean and covariance:

    E(F_*) = Bᵀ Ψ_1^*,    (22)
    Cov(F_*) = Bᵀ ( Ψ_2^* − Ψ_1^* (Ψ_1^*)ᵀ ) B + Ψ_0^* I − Tr[ ( K_MM⁻¹ − (K_MM + βΨ_2)⁻¹ ) Ψ_2^* ] I,    (23)

where B = β (K_MM + βΨ_2)⁻¹ Ψ_1ᵀ Y, Ψ_0^* = ⟨k_f(X_*, X_*)⟩, Ψ_1^* = ⟨K_{M*}⟩ and Ψ_2^* = ⟨K_{M*} K_{*M}⟩. All expectations are taken w.r.t. q(X_*) and can be calculated analytically, while K_{M*} denotes the cross-covariance matrix between the training inducing inputs X̃ and X_*. The Ψ quantities are calculated analytically (see the supplementary material). Finally, since Y_* is just a noisy version of F_*, the mean and covariance of (21) are computed as: E(Y_*) = E(F_*) and Cov(Y_*) = Cov(F_*) + β⁻¹ I_{N_*}.

3.2 Predictions Given the Test Time Points and Partially Observed Outputs

The expression for the predictive density p(Y_*^m|Y_*^p, Y) is similar to (16),

    p(Y_*^m|Y_*^p, Y) = ∫ p(Y_*^m|F_*^m) p(F_*^m|X_*, Y_*^p, Y) p(X_*|Y_*^p, Y) dF_*^m dX_*,    (24)

and is analytically intractable. To obtain an approximation, we firstly need to apply variational inference and approximate p(X_*|Y_*^p, Y) with a Gaussian distribution. This requires the optimisation of a new variational lower bound that accounts for the contribution of the partially observed data Y_*^p. This lower bound approximates the true marginal likelihood p(Y_*^p, Y) and has exactly analogous form to the lower bound computed only on the training data Y. Moreover, the variational optimisation requires the definition of a variational distribution q(X_*, X), which needs to be optimised and is fully correlated across X and X_*. After the optimisation, the approximation to the true posterior p(X_*|Y_*^p, Y) is given by the marginal q(X_*). A much faster but less accurate method would be to decouple the test from the training latent variables by imposing the factorisation q(X_*, X) = q(X) q(X_*). This is not used, however, in our current implementation.

4 Handling Very High Dimensional Datasets

Our variational framework avoids the typical cubic complexity of Gaussian processes, allowing relatively large training sets (thousands of time points, N). Further, the model scales only linearly with the number of dimensions D.
Specifically, the number of dimensions only matters when performing calculations involving the data matrix Y. In the final form of the lower bound (and consequently in all of the derived quantities, such as gradients) this matrix only appears in the form Y Yᵀ, which can be precomputed. This means that, when N ≪ D, we can calculate Y Yᵀ only once and then substitute Y with the SVD (or Cholesky decomposition) of Y Yᵀ. In this way, we can work with an N × N instead of an N × D matrix. Practically speaking, this allows us to work with data sets involving millions of features. In our experiments we model directly the pixels of HD quality video, exploiting this trick.

5 Experiments

We consider two different types of high dimensional time series: a human motion capture data set consisting of different walks, and high resolution video sequences. The experiments are intended to explore the various properties of the model and to evaluate its performance in different tasks (prediction, reconstruction, generation of data). Matlab source code for repeating the following experiments and links to the video files are available online from http://staffwww.dcs.shef.ac.uk/people/N.Lawrence/vargplvm/.

5.1 Human Motion Capture Data

We followed [14, 15] in considering motion capture data of walks and runs taken from subject 35 in the CMU motion capture database. We treated each motion as an independent sequence. The data set was constructed and preprocessed as described in [15]. This results in 2,613 separate 59-dimensional frames split into 31 training sequences with an average length of 84 frames each. The model is jointly trained, as explained in section 2.3, on both walks and runs, i.e. the algorithm learns a common latent space for these motions.
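The Y Yᵀ substitution described in section 4 can be sketched as follows. Any N × N matrix Ŷ with Ŷ Ŷᵀ = Y Yᵀ can stand in for Y wherever the bound touches the data; this illustrative sketch (not the paper's Matlab implementation) uses an eigendecomposition, but an SVD or Cholesky factor of Y Yᵀ serves the same purpose:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 20, 10000                 # few time points, very many features
Y = rng.standard_normal((N, D))  # stand-in for e.g. raw video pixels

# Precompute the N x N matrix Y Y^T once ...
YYt = Y @ Y.T

# ... and replace Y by an N x N factor with the same outer product.
eigval, eigvec = np.linalg.eigh(YYt)
eigval = np.clip(eigval, 0.0, None)   # guard against tiny negative round-off
Y_hat = eigvec * np.sqrt(eigval)      # N x N, satisfies Y_hat @ Y_hat.T == YYt
```

All subsequent computations that would have cost O(ND) per evaluation now touch only the small N × N factor, which is what makes training on millions of pixels per frame feasible.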
At test time we investigate the ability of the model to reconstruct test data from a previously unseen sequence, given partial information for the test targets. This is tested once by providing only the dimensions which correspond to the body of the subject and once by providing those that correspond to the legs. We compare with results in [15], which used MAP approximations for the dynamical models, and against nearest neighbour. We can also indirectly compare with the binary latent variable model (BLV) of [14], which used a slightly different data preprocessing. We assess the performance using the cumulative error per joint in the scaled space defined in [14] and by the root mean square error in the angle space suggested by [15]. Our model was initialized with nine latent dimensions. We performed two runs, once using the Matérn covariance function for the dynamical prior and once using the RBF. From table 1 we see that the variational Gaussian process dynamical system considerably outperforms the other approaches. The appropriate latent space dimensionality for the data was automatically inferred by our models. The model which employed an RBF covariance to govern the dynamics retained four dimensions, whereas the model that used the Matérn kept only three. The other latent dimensions were completely switched off by the ARD parameters. The best performance for the legs and the body reconstruction was achieved by the VGPDS model that used the Matérn and the RBF covariance function respectively.

5.2 Modeling Raw High Dimensional Video Sequences

For our second set of experiments we considered video sequences. Such sequences are typically preprocessed before modeling to extract informative features and reduce the dimensionality of the problem. Here we work directly with the raw pixel values to demonstrate the ability of the VGPDS to model data with a vast number of features.
This also allows us to directly sample video from the learned model. Firstly, we used the model to reconstruct partially observed frames from test video sequences.⁴ For the first video discussed here we gave as partial information approximately 50% of the pixels, while for the other two we gave approximately 40% of the pixels in each frame. The mean squared error per pixel was measured to compare with the k-nearest neighbour (NN) method, for k ∈ (1, ..., 5) (we only present the error achieved for the best choice of k in each case).

⁴'Missa' dataset: cipr.rpi.edu. 'Ocean': cogfilms.com. 'Dog': fitfurlife.com. See details in the supplementary material. The logo appearing in the 'dog' images in the experiments that follow has been added with post-processing.

Table 1: Errors obtained for the motion capture dataset considering nearest neighbour in the angle space (NN) and in the scaled space (NN sc.), GPLVM, BLV and VGPDS. CL / CB are the leg and body datasets as preprocessed in [14], L and B the corresponding datasets from [15]. SC corresponds to the error in the scaled space, as in Taylor et al., while RA is the error in the angle space. The best error per column is in bold.

    Data                  CL     CB     L      L      B      B
    Error Type            SC     SC     SC     RA     SC     RA
    BLV                   11.7   8.8    -      -      -      -
    NN sc.                22.2   20.5   -      -      -      -
    GPLVM (Q = 3)         -      -      11.4   3.40   16.9   2.49
    GPLVM (Q = 4)         -      -      9.7    3.38   20.7   2.72
    GPLVM (Q = 5)         -      -      13.4   4.25   23.4   2.78
    NN sc.                -      -      13.5   4.44   20.8   2.62
    NN                    -      -      14.0   4.11   30.9   3.20
    VGPDS (RBF)           -      -      8.19   3.57   10.73  1.90
    VGPDS (Matérn 3/2)    -      -      6.99   2.88   14.22  2.23

The datasets considered are the following: firstly, the 'Missa' dataset, a standard benchmark used in image processing. This is a 103,680-dimensional video showing a woman talking for 150 frames.
The data is challenging as there are translations in the pixel space. We also considered an HD video of dimensionality 9 × 10⁵, which shows an artificially created scene of ocean waves, as well as a 230,400-dimensional video showing a dog running for 60 frames. The latter is approximately periodic in nature, containing several paces from the dog. For the first two videos we used the Matérn and RBF covariance functions respectively to model the dynamics, and interpolated to reconstruct blocks of frames chosen from the whole sequence. For the 'dog' dataset we constructed a compound kernel k_x = k_x(rbf) + k_x(periodic), where the RBF term is employed to capture any divergence from the approximately periodic pattern. We then used our model to reconstruct the last 7 frames, extrapolating beyond the original video. As can be seen in table 2, our method outperformed NN in all cases. The results are also demonstrated visually in figure 1, and the reconstructed videos are available in the supplementary material.

Table 2: The mean squared error per pixel for VGPDS and NN for the three datasets (measured only in the missing inputs). The number of latent dimensions selected by our model is in parenthesis.

             Missa           Ocean          Dog
    VGPDS    2.52 (Q = 12)   9.36 (Q = 9)   4.01 (Q = 6)
    NN       2.63            9.53           4.15

As can be seen in figure 1, VGPDS predicts pixels which are smoothly connected with the observed part of the image, whereas the NN method cannot fit the predicted pixels into the overall context.

As a second task, we used our generative model to create new samples and generate a new video sequence. This is most effective for the 'dog' video, as the training examples were approximately periodic in nature. The model was trained on 60 frames (time-stamps [t1, t60]) and we generated new frames which correspond to the next 40 time points in the future.
The only input given for this generation of future frames was the time-stamp vector, [t61, t100]. The results show a smooth transition from training to test and amongst the test video frames. The resulting video of the dog continuing to run is sharp and of high quality. This experiment demonstrates the ability of the model to reconstruct massively high dimensional images without blurring. Frames from the result are shown in figure 2. The full video is available in the supplementary material.

Figure 1: (a) and (c) demonstrate the reconstruction achieved by VGPDS and NN respectively for the most challenging frame (b) of the 'missa' video, i.e. when translation occurs. (d) shows another example of the reconstruction achieved by VGPDS given the partially observed image. (e) (VGPDS) and (f) (NN) depict the reconstruction achieved for a frame of the 'ocean' dataset. Finally, we demonstrate the ability of the model to automatically select the latent dimensionality by showing the initial lengthscales (fig: (g)) of the ARD covariance function and the values obtained after training (fig: (h)) on the 'dog' data set.

Figure 2: The last frame of the training video (a) is smoothly followed by the first frame (b) of the generated video. A subsequent generated frame can be seen in (c).

6 Discussion and Future Work

We have introduced a fully Bayesian approach for modeling dynamical systems through probabilistic nonlinear dimensionality reduction. Marginalizing the latent space and reconstructing data using Gaussian processes results in a very generic model for capturing complex, non-linear correlations, even in very high dimensional data, without having to perform any data preprocessing or exhaustive search for defining the model's structure and parameters.

Our method's effectiveness has been demonstrated in two tasks: firstly, in modeling human motion capture data and, secondly, in reconstructing and generating raw, very high dimensional video sequences. A promising future direction would be to enhance our formulation with domain-specific knowledge encoded, for example, in more sophisticated covariance functions or in the way that data are preprocessed. Thus, we can obtain application-oriented methods to be used for tasks in areas such as robotics, computer vision and finance.

Acknowledgments

Research was partially supported by the University of Sheffield Moody endowment fund and the Greek State Scholarships Foundation (IKY). We also thank Colin Litster and "Fit Fur Life" for allowing us to use their video files as datasets. Finally, we thank the reviewers for their insightful comments.

References

[1] N. D. Lawrence, "Probabilistic non-linear principal component analysis with Gaussian process latent variable models," Journal of Machine Learning Research, vol. 6, pp. 1783-1816, 2005.

[2] N. D. Lawrence, "Gaussian process latent variable models for visualisation of high dimensional data," in Advances in Neural Information Processing Systems, pp. 329-336, MIT Press, 2004.

[3] J. M. Wang, D. J. Fleet, and A. Hertzmann, "Gaussian process dynamical models," in NIPS, pp. 1441-1448, MIT Press, 2006.

[4] J. M. Wang, D. J. Fleet, and A. Hertzmann, "Gaussian process dynamical models for human motion," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, pp. 
283\u2013298, Feb. 2008.\n\n[5] N. D. Lawrence, \u201cHierarchical Gaussian process latent variable models,\u201d in Proceedings of the Interna-\n\ntional Conference in Machine Learning, pp. 481\u2013488, Omnipress, 2007.\n\n[6] J. Ko and D. Fox, \u201cGP-BayesFilters: Bayesian \ufb01ltering using Gaussian process prediction and observation\n\nmodels,\u201d Auton. Robots, vol. 27, pp. 75\u201390, July 2009.\n\n[7] M. K. Titsias, \u201cVariational learning of inducing variables in sparse Gaussian processes,\u201d in Proceedings of\nthe Twelfth International Conference on Arti\ufb01cial Intelligence and Statistics, vol. 5, pp. 567\u2013574, JMLR\nW&CP, 2009.\n\n[8] M. K. Titsias and N. D. Lawrence, \u201cBayesian Gaussian process latent variable model,\u201d in Proceedings\nof the Thirteenth International Conference on Arti\ufb01cial Intelligence and Statistics, pp. 844\u2013851, JMLR\nW&CP 9, 2010.\n\n[9] C. E. Rasmussen and C. Williams, Gaussian Processes for Machine Learning. MIT Press, 2006.\n[10] D. J. C. MacKay, \u201cIntroduction to Gaussian processes,\u201d in Neural Networks and Machine Learning (C. M.\n\nBishop, ed.), NATO ASI Series, pp. 133\u2013166, Kluwer Academic Press, 1998.\n\n[11] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Springer,\n\n1st ed. 2006. corr. 2nd printing ed., Oct. 2007.\n\n[12] M. Opper and C. Archambeau, \u201cThe variational Gaussian approximation revisited,\u201d Neural Computation,\n\nvol. 21, no. 3, pp. 786\u2013792, 2009.\n\n[13] A. Girard, C. E. Rasmussen, J. Qui\u02dcnonero-Candela, and R. Murray-Smith, \u201cGaussian process priors with\nuncertain inputs - application to multiple-step ahead time series forecasting,\u201d in Neural Information Pro-\ncessing Systems, 2003.\n\n[14] G. W. Taylor, G. E. Hinton, and S. Roweis, \u201cModeling human motion using binary latent variables,\u201d in\n\nAdvances in Neural Information Processing Systems, vol. 19, MIT Press, 2007.\n\n[15] N. 
D. Lawrence, \u201cLearning for larger datasets with the Gaussian process latent variable model,\u201d in Pro-\nceedings of the Eleventh International Conference on Arti\ufb01cial Intelligence and Statistics, pp. 243\u2013250,\nOmnipress, 2007.\n\n9\n\n\f", "award": [], "sourceid": 1354, "authors": [{"given_name": "Andreas", "family_name": "Damianou", "institution": ""}, {"given_name": "Michalis", "family_name": "Titsias", "institution": null}, {"given_name": "Neil", "family_name": "Lawrence", "institution": ""}]}
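The automatic selection of latent dimensionality discussed above rests on the ARD covariance function: each dimension has its own lengthscale, and dimensions whose lengthscales grow large during training contribute almost nothing to the covariance and are effectively switched off. A minimal sketch of this mechanism, using scikit-learn's standard (non-variational) GP regression purely for illustration rather than the VGPDS framework itself:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Toy data: the target depends only on the first input dimension;
# the second dimension is pure noise and should be "switched off".
rng = np.random.RandomState(0)
X = rng.uniform(0.0, 5.0, size=(40, 2))
y = np.sin(X[:, 0]) + 0.01 * rng.randn(40)

# Anisotropic (ARD) RBF kernel: one lengthscale per input dimension.
kernel = ConstantKernel(1.0) * RBF(length_scale=[1.0, 1.0],
                                   length_scale_bounds=(1e-2, 1e3))
gp = GaussianProcessRegressor(kernel=kernel, alpha=1e-4,
                              n_restarts_optimizer=3, random_state=0)
gp.fit(X, y)

# After marginal-likelihood optimization, the irrelevant dimension's
# lengthscale is driven much larger than the relevant one's.
lengthscales = gp.kernel_.k2.length_scale
print(lengthscales)
```

In VGPDS the same effect operates on the latent coordinates, so the number of lengthscales that remain small after training (figure 1 (h)) indicates the effective dimensionality of the latent space.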