{"title": "Gaussian Process Dynamical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1441, "page_last": 1448, "abstract": "", "full_text": "Gaussian Process Dynamical Models\n\nJack M. Wang, David J. Fleet, Aaron Hertzmann\n\nDepartment of Computer Science\n\n{jmwang,hertzman}@dgp.toronto.edu, fleet@cs.toronto.edu\n\nUniversity of Toronto, Toronto, ON M5S 3G4\n\nAbstract\n\nThis paper introduces Gaussian Process Dynamical Models (GPDM) for\nnonlinear time series analysis. A GPDM comprises a low-dimensional\nlatent space with associated dynamics, and a map from the latent space\nto an observation space. We marginalize out the model parameters in\nclosed-form, using Gaussian Process (GP) priors for both the dynamics\nand the observation mappings. This results in a nonparametric model\nfor dynamical systems that accounts for uncertainty in the model. We\ndemonstrate the approach on human motion capture data in which each\npose is 62-dimensional. Despite the use of small data sets, the GPDM\nlearns an effective representation of the nonlinear dynamics in these\nspaces. Webpage: http://www.dgp.toronto.edu/\n\njmwang/gpdm/\n\n\u223c\n\n1 Introduction\nA central dif\ufb01culty in modeling time-series data is in determining a model that can capture\nthe nonlinearities of the data without over\ufb01tting. Linear autoregressive models require\nrelatively few parameters and allow closed-form analysis, but can only model a limited\nrange of systems. In contrast, existing nonlinear models can model complex dynamics, but\nmay require large training sets to learn accurate MAP models.\n\nIn this paper we investigate learning nonlinear dynamical models for high-dimensional\ndatasets. We take a Bayesian approach to modeling dynamics, averaging over dynamics\nparameters rather than estimating them. 
Inspired by the fact that averaging over nonlinear regression models leads to a Gaussian Process (GP) model, we show that integrating over parameters in nonlinear dynamical systems can also be performed in closed form. The resulting Gaussian Process Dynamical Model (GPDM) is fully defined by a set of low-dimensional representations of the training data, with both dynamics and observation mappings learned from GP regression. As a natural consequence of GP regression, the GPDM removes the need to select many parameters associated with function approximators while retaining the expressiveness of nonlinear dynamics and observation.\n\nOur work is motivated by modeling human motion for video-based people tracking and data-driven animation. Bayesian people tracking requires dynamical models in the form of transition densities in order to specify prediction distributions over new poses at each time instant (e.g., [11, 14]); similarly, data-driven computer animation requires prior distributions over poses and motion (e.g., [1, 4, 6]). An individual human pose is typically parameterized with more than 60 parameters. Despite the large state space, the space of activity-specific human poses and motions has a much smaller intrinsic dimensionality; in our experiments with walking and golf swings, 3 dimensions often suffice.\n\nFigure 1: Time-series graphical models. (a) Nonlinear latent-variable model for time series. (Hyperparameters $\\bar{\\alpha}$ and $\\bar{\\beta}$ are not shown.) (b) GPDM model. Because the mapping parameters A and B have been marginalized over, all latent coordinates $X = [x_1, ..., x_N]^T$ are jointly correlated, as are all poses $Y = [y_1, ..., y_N]^T$.\n\nOur work builds on the extensive literature in nonlinear time-series analysis, of which we mention a few examples. 
Two main themes are the use of switching linear models (e.g., [11]), and nonlinear transition functions, such as those represented by Radial Basis Functions [2]. Both approaches require sufficient training data to learn the parameters of the switching or basis functions. Determining the appropriate number of basis functions is also difficult. In Kernel Dynamical Modeling [12], linear dynamics are kernelized to model nonlinear systems, but a density function over the data is not produced.\n\nSupervised learning with GP regression has been used to model dynamics for a variety of applications [3, 7, 13]. These methods model dynamics directly in observation space, which is impractical for the high dimensionality of motion capture data. Our approach is most directly inspired by the unsupervised Gaussian Process Latent Variable Model (GPLVM) [5], which models the joint distribution of the observed data and their corresponding representation in a low-dimensional latent space. This distribution can then be used as a prior for inference from new measurements. However, the GPLVM is not a dynamical model; it assumes that data are generated independently. Accordingly, it does not respect temporal continuity of the data, nor does it model the dynamics in the latent space. Here we augment the GPLVM with a latent dynamical model. The result is a Bayesian generalization of subspace dynamical models to nonlinear latent mappings and dynamics.\n\n2 Gaussian Process Dynamics\n\nThe Gaussian Process Dynamical Model (GPDM) comprises a mapping from a latent space to the data space, and a dynamical model in the latent space (Figure 1). These mappings are typically nonlinear. 
The GPDM is obtained by marginalizing out the parameters of the two mappings, and optimizing the latent coordinates of training data.\n\nMore precisely, our goal is to model the probability density of a sequence of vector-valued states $y_1, ..., y_t, ..., y_N$, with discrete-time index $t$ and $y_t \\in \\mathbb{R}^D$. As a basic model, consider a latent-variable mapping with first-order Markov dynamics:\n\n$x_t = f(x_{t-1}; A) + n_{x,t}$   (1)\n$y_t = g(x_t; B) + n_{y,t}$   (2)\n\nHere, $x_t \\in \\mathbb{R}^d$ denotes the d-dimensional latent coordinates at time t, $n_{x,t}$ and $n_{y,t}$ are zero-mean, white Gaussian noise processes, and f and g are (nonlinear) mappings parameterized by A and B, respectively. Figure 1(a) depicts the graphical model.\n\nWhile linear mappings have been used extensively in auto-regressive models, here we consider the nonlinear case, for which f and g are linear combinations of basis functions:\n\n$f(x; A) = \\sum_i a_i \\phi_i(x)$   (3)\n$g(x; B) = \\sum_j b_j \\psi_j(x)$   (4)\n\nfor weights $A = [a_1, a_2, ...]$ and $B = [b_1, b_2, ...]$, and basis functions $\\phi_i$ and $\\psi_j$. In order to fit the parameters of this model to training data, one must select an appropriate number of basis functions, and one must ensure that there is enough data to constrain the shape of each basis function. Ensuring both of these conditions can be very difficult in practice.\n\nHowever, from a Bayesian perspective, the specific forms of f and g, including the numbers of basis functions, are incidental, and should therefore be marginalized out. With an isotropic Gaussian prior on the columns of B, marginalizing over g can be done in closed form [8, 10] to yield\n\n$p(Y \\mid X, \\bar{\\beta}) = \\frac{|W|^N}{\\sqrt{(2\\pi)^{ND} |K_Y|^D}} \\exp\\left( -\\frac{1}{2} \\mathrm{tr}\\left( K_Y^{-1} Y W^2 Y^T \\right) \\right)$,   (5)\n\nwhere $Y = [y_1, ..., y_N]^T$, $K_Y$ is a kernel matrix, and $\\bar{\\beta} = \\{\\beta_1, \\beta_2, ..., W\\}$ comprises the kernel hyperparameters. The elements of the kernel matrix are defined by a kernel function, $(K_Y)_{i,j} = k_Y(x_i, x_j)$. For the latent mapping, $X \\to Y$, we currently use the RBF kernel\n\n$k_Y(x, x') = \\beta_1 \\exp\\left( -\\frac{\\beta_2}{2} \\|x - x'\\|^2 \\right) + \\beta_3^{-1} \\delta_{x,x'}$.   (6)\n\nAs in the SGPLVM [4], we use a scaling matrix $W \\equiv \\mathrm{diag}(w_1, ..., w_D)$ to account for different variances in the different data dimensions. This is equivalent to a GP with kernel function $k(x, x')/w_m^2$ for dimension m. Hyperparameter $\\beta_1$ represents the overall scale of the output function, while $\\beta_2$ corresponds to the inverse width of the RBFs. The variance of the noise term $n_{y,t}$ is given by $\\beta_3^{-1}$.\n\nThe dynamic mapping on the latent coordinates X is conceptually similar, but more subtle.[1] As above, we form the joint probability density over the latent coordinates and the dynamics weights A in (3). We then marginalize over the weights A, i.e.,\n\n$p(X \\mid \\bar{\\alpha}) = \\int p(X, A \\mid \\bar{\\alpha}) \\, dA = \\int p(X \\mid A, \\bar{\\alpha}) \\, p(A \\mid \\bar{\\alpha}) \\, dA$.   (7)\n\nIncorporating the Markov property (Eqn. (1)) gives:\n\n$p(X \\mid \\bar{\\alpha}) = p(x_1) \\int \\prod_{t=2}^{N} p(x_t \\mid x_{t-1}, A, \\bar{\\alpha}) \\, p(A \\mid \\bar{\\alpha}) \\, dA$,   (8)\n\nwhere $\\bar{\\alpha}$ is a vector of kernel hyperparameters. Assuming an isotropic Gaussian prior on the columns of A, it can be shown that this expression simplifies to:\n\n$p(X \\mid \\bar{\\alpha}) = \\frac{p(x_1)}{\\sqrt{(2\\pi)^{(N-1)d} |K_X|^d}} \\exp\\left( -\\frac{1}{2} \\mathrm{tr}\\left( K_X^{-1} X_{out} X_{out}^T \\right) \\right)$,   (9)\n\nwhere $X_{out} = [x_2, ..., x_N]^T$, $K_X$ is the $(N-1) \\times (N-1)$ kernel matrix constructed from $\\{x_1, ..., x_{N-1}\\}$, and $x_1$ is assumed to have an isotropic Gaussian prior.\n\nWe model dynamics using both the RBF kernel of the form of Eqn. (6), as well as the following \"linear + RBF\" kernel:\n\n$k_X(x, x') = \\alpha_1 \\exp\\left( -\\frac{\\alpha_2}{2} \\|x - x'\\|^2 \\right) + \\alpha_3 x^T x' + \\alpha_4^{-1} \\delta_{x,x'}$.   (10)\n\nThis kernel corresponds to representing the dynamics mapping f as the sum of a linear term and RBF terms. The inclusion of the linear term is motivated by the fact that linear dynamical models, such as first- or second-order autoregressive models, are useful for many systems. Hyperparameters $\\alpha_1, \\alpha_2$ represent the output scale and the inverse width of the RBF terms, and $\\alpha_3$ represents the output scale of the linear term. Together, they control the relative weighting between the terms, while $\\alpha_4^{-1}$ represents the variance of the noise term $n_{x,t}$.\n\nIt should be noted that, due to the nonlinear dynamical mapping in (3), the joint distribution of the latent coordinates is not Gaussian. Moreover, while the density over the initial state may be Gaussian, it will not remain Gaussian once propagated through the dynamics. One can also see this in (9), since the $x_t$ variables occur inside the kernel matrix as well as outside of it, so the log likelihood is not quadratic in $x_t$.\n\nFinally, we also place priors on the hyperparameters ($p(\\bar{\\alpha}) \\propto \\prod_i \\alpha_i^{-1}$ and $p(\\bar{\\beta}) \\propto \\prod_i \\beta_i^{-1}$) to discourage overfitting. Together, the priors, the latent mapping, and the dynamics define a generative model for time-series observations (Figure 1(b)):\n\n$p(X, Y, \\bar{\\alpha}, \\bar{\\beta}) = p(Y \\mid X, \\bar{\\beta}) \\, p(X \\mid \\bar{\\alpha}) \\, p(\\bar{\\alpha}) \\, p(\\bar{\\beta})$.   (11)\n\n[1] Conceptually, we would like to model each pair $(x_t, x_{t+1})$ as a training pair for regression with f. However, we cannot simply substitute them directly into the GP model of Eqn. (5), as this leads to the nonsensical expression $p(x_2, ..., x_N \\mid x_1, ..., x_{N-1})$.\n\nMultiple sequences. This model extends naturally to multiple sequences $Y_1, ..., Y_M$. Each sequence has associated latent coordinates $X_1, ..., X_M$ within a shared latent space. For the latent mapping g we can conceptually concatenate all sequences within the GP likelihood (Eqn. (5)). A similar concatenation applies for the dynamics, but omitting the first frame of each sequence from $X_{out}$, and omitting the final frame of each sequence from the kernel matrix $K_X$. The same structure applies whether we are learning from multiple sequences, or learning from one sequence and inferring another. That is, if we learn from a sequence $Y_1$, and then infer the latent coordinates for a new sequence $Y_2$, then the joint likelihood entails full kernel matrices $K_X$ and $K_Y$ formed from both sequences.\n\nHigher-order features. The GPDM can be extended to model higher-order Markov chains, and to model velocity and acceleration in inputs and outputs. For example, a second-order dynamical model,\n\n$x_t = f(x_{t-1}, x_{t-2}; A) + n_{x,t}$,   (12)\n\nmay be used to explicitly model the dependence of the prediction on two past frames (or on velocity). 
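The marginal likelihoods above (Eqns. (5) and (9)) are standard GP quantities that can be evaluated directly. The following Python sketch is ours, not the authors' code: it wires in the RBF kernel of Eqn. (6) together with the scaling matrix W, and all function and variable names are our own.

```python
import numpy as np

def rbf_kernel(X1, X2, beta1, beta2, beta3_inv):
    # Eqn. (6): k(x, x') = beta1 * exp(-beta2/2 * ||x - x'||^2) + beta3^{-1} * delta_{x,x'}
    sq = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    K = beta1 * np.exp(-0.5 * beta2 * sq)
    if X1 is X2:  # the noise term contributes only to the diagonal of a square kernel matrix
        K = K + beta3_inv * np.eye(len(X1))
    return K

def log_p_Y_given_X(Y, X, W, beta1, beta2, beta3_inv):
    # Log of Eqn. (5): ln p(Y | X, beta) for N poses of dimension D,
    # with each output dimension m scaled by w_m.
    N, D = Y.shape
    KY = rbf_kernel(X, X, beta1, beta2, beta3_inv)
    _, logdetK = np.linalg.slogdet(KY)
    YW = Y * W[None, :]
    quad = np.trace(np.linalg.solve(KY, YW @ YW.T))
    return (N * np.sum(np.log(W)) - 0.5 * N * D * np.log(2.0 * np.pi)
            - 0.5 * D * logdetK - 0.5 * quad)
```

The dynamics density of Eqn. (9) has the same structure, with $K_X$ built from $x_1, ..., x_{N-1}$ and $X_{out}$ taking the place of the scaled observations.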
In the GPDM framework, the equivalent model entails defining the kernel function as a function of the current and previous time-steps:\n\n$k_X([x_t, x_{t-1}], [x_\\tau, x_{\\tau-1}]) = \\alpha_1 \\exp\\left( -\\frac{\\alpha_2}{2} \\|x_t - x_\\tau\\|^2 - \\frac{\\alpha_3}{2} \\|x_{t-1} - x_{\\tau-1}\\|^2 \\right) + \\alpha_4 x_t^T x_\\tau + \\alpha_5 x_{t-1}^T x_{\\tau-1} + \\alpha_6^{-1} \\delta_{t,\\tau}$.   (13)\n\nSimilarly, the dynamics can be formulated to predict velocity:\n\n$v_{t-1} = f(x_{t-1}; A) + n_{x,t}$.   (14)\n\nVelocity prediction may be more appropriate for modeling smooth motion trajectories. Using Euler integration with time-step $\\Delta t$, we have $x_t = x_{t-1} + v_{t-1} \\Delta t$. The dynamics likelihood $p(X \\mid \\bar{\\alpha})$ can then be written by redefining $X_{out} = [x_2 - x_1, ..., x_N - x_{N-1}]^T / \\Delta t$ in Eqn. (9). In this paper, we use a fixed time-step of $\\Delta t = 1$. This is analogous to using $x_{t-1}$ as a \"mean function.\" Higher-order features can also be fused together with position information to reduce the Gaussian process prediction variance [15, 9].\n\n3 Properties of the GPDM and Algorithms\n\nLearning the GPDM from measurements Y entails minimizing the negative log-posterior\n\n$L = -\\ln p(X, \\bar{\\alpha}, \\bar{\\beta} \\mid Y)$   (15)\n$= \\frac{d}{2} \\ln|K_X| + \\frac{1}{2} \\mathrm{tr}\\left( K_X^{-1} X_{out} X_{out}^T \\right) + \\sum_j \\ln \\alpha_j - N \\ln|W| + \\frac{D}{2} \\ln|K_Y| + \\frac{1}{2} \\mathrm{tr}\\left( K_Y^{-1} Y W^2 Y^T \\right) + \\sum_j \\ln \\beta_j$,   (16)\n\nup to an additive constant. We minimize L with respect to X, $\\bar{\\alpha}$, and $\\bar{\\beta}$ numerically.\n\nFigure 2 shows a GPDM 3D latent space learned from human motion capture data comprising three walk cycles. Each pose was defined by 56 Euler angles for joints, 3 global (torso) pose angles, and 3 global (torso) translational velocities. For learning, the data was mean-subtracted, and the latent coordinates were initialized with PCA. Finally, a GPDM is learned by minimizing L in (16). We used 3D latent spaces for all experiments shown here. Using 2D latent spaces leads to intersecting latent trajectories. This causes large \"jumps\" to appear in the model, leading to unreliable dynamics.\n\nFor comparison, Fig. 2(a) shows a 3D SGPLVM learned from walking data. Note that the latent trajectories are not smooth; there are numerous cases where consecutive poses in the walking sequence are relatively far apart in the latent space. By contrast, Fig. 2(b) shows that the GPDM produces a much smoother configuration of latent positions. Here the GPDM arranges the latent positions roughly in the shape of a saddle.\n\nFigure 2(c) shows a volume visualization of the inverse reconstruction variance, i.e., $-2 \\ln \\sigma_{y \\mid x, X, Y, \\bar{\\beta}}$. This shows the confidence with which the model reconstructs a pose from latent positions x. In effect, the GPDM models a high-probability \"tube\" around the data. To illustrate the dynamical process, Fig. 2(d) shows 25 fair samples from the latent dynamics of the GPDM. All samples are conditioned on the same initial state, $x_0$, and each has a length of 60 time steps. As noted above, because we marginalize over the weights of the dynamic mapping, A, the distribution over a pose sequence cannot be factored into a sequence of low-order Markov transitions (Fig. 1(a)). Hence, we draw fair samples $\\tilde{X}^{(j)}_{1:60} \\sim p(\\tilde{X}_{1:60} \\mid x_0, X, Y, \\bar{\\alpha})$ using hybrid Monte Carlo [8]. The resulting trajectories (Fig. 2(d)) are smooth and similar to the training motions.\n\n3.1 Mean Prediction Sequences\n\nFor both 3D people tracking and computer animation, it is desirable to generate new motions efficiently. 
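Before turning to mean-prediction, the objective of Eqn. (16) can be made concrete. The sketch below is our own illustration, not the authors' code: it assumes RBF kernels (Eqn. (6) form) for both mappings, treats the third hyperparameter of each kernel as the inverse noise variance, and omits the gradients and numerical optimizer used to minimize L.

```python
import numpy as np

def rbf(A, B, scale, inv_width, inv_noise):
    # RBF kernel of the Eqn. (6) form: scale * exp(-inv_width/2 * ||a - b||^2),
    # plus inv_noise^{-1} on the diagonal for the process noise.
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return scale * np.exp(-0.5 * inv_width * sq) + (1.0 / inv_noise) * np.eye(len(A))

def neg_log_posterior(X, Y, W, alpha, beta):
    # Eqn. (16), up to an additive constant; alpha and beta are
    # (scale, inv_width, inv_noise) triples for dynamics and latent mapping.
    N, D = Y.shape
    d = X.shape[1]
    KX = rbf(X[:-1], X[:-1], *alpha)   # (N-1)x(N-1), built from x_1 ... x_{N-1}
    KY = rbf(X, X, *beta)              # N x N, over all latent positions
    Xout = X[1:]                       # x_2 ... x_N
    YW = Y * W[None, :]                # scale each output dimension by w_m
    return (0.5 * d * np.linalg.slogdet(KX)[1]
            + 0.5 * np.trace(np.linalg.solve(KX, Xout @ Xout.T))
            + np.sum(np.log(alpha))
            - N * np.sum(np.log(W))
            + 0.5 * D * np.linalg.slogdet(KY)[1]
            + 0.5 * np.trace(np.linalg.solve(KY, YW @ YW.T))
            + np.sum(np.log(beta)))
```

In the paper, L is minimized jointly over X, $\bar{\alpha}$, and $\bar{\beta}$; here a call simply evaluates the objective for given values.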
Here we consider a simple online method for generating a new motion, called mean-prediction, which avoids the relatively expensive Monte Carlo sampling used above. In mean-prediction, we consider the next timestep $\\tilde{x}_t$ conditioned on $\\tilde{x}_{t-1}$ from the Gaussian prediction [8]:\n\n$\\tilde{x}_t \\sim N\\left( \\mu_X(\\tilde{x}_{t-1}); \\, \\sigma^2_X(\\tilde{x}_{t-1}) I \\right)$,   (17)\n\n$\\mu_X(x) = X_{out}^T K_X^{-1} k_X(x)$, $\\quad \\sigma^2_X(x) = k_X(x, x) - k_X(x)^T K_X^{-1} k_X(x)$,   (18)\n\nwhere $k_X(x)$ is a vector containing $k_X(x, x_i)$ in the i-th entry and $x_i$ is the i-th training vector. In particular, we set the latent position at each time-step to be the most likely (mean) point given the previous step: $\\tilde{x}_t = \\mu_X(\\tilde{x}_{t-1})$. In this way we ignore the process noise that one might normally add. We find that this mean-prediction often generates motions that are more like the fair samples shown in Fig. 2(d) than if random process noise had been added at each time step (as in (1)). Similarly, new poses are given by $\\tilde{y}_t = \\mu_Y(\\tilde{x}_t)$.\n\nFigure 2: Models learned from a walking sequence of 2.5 gait cycles. The latent positions learned with a GPLVM (a) and a GPDM (b) are shown in blue. Vectors depict the temporal sequence. (c) Negative log variance for reconstruction shows regions of latent space that are reconstructed with high confidence. (d) Random trajectories drawn from the model using HMC (green), and their mean (red). (e) A GPDM of walk data learned with RBF+linear kernel dynamics. The simulation (red) was started far from the training data, and then optimized (green). The poses were reconstructed from points on the optimized trajectory.\n\nDepending on the dataset and the choice of kernels, long sequences generated by sampling or mean-prediction can diverge from the data. On our data sets, mean-prediction trajectories from the GPDM with an RBF or linear+RBF kernel for dynamics usually produce sequences that roughly follow the training data (e.g., see the red curves in Figure 3). This usually means producing closed limit cycles with walking data. We also found that mean-prediction motions are often very close to the mean obtained from the HMC sampler; by initializing HMC with mean-prediction, we find that the sampler reaches equilibrium in a small number of iterations. Compared to the RBF kernels, mean-prediction motions generated from GPDMs with the linear kernel often deviate from the original data (e.g., see Figure 3a), and lead to over-smoothed animation.\n\nFigure 3: (a) Two GPDMs and mean predictions. The first is that from the previous figure. The second was learned with a linear kernel. (b) The GPDM model was learned from 3 swings of a golf club, using a 2nd-order RBF kernel for dynamics. The two plots show 2D orthogonal projections of the 3D latent space.\n\nFigure 3(b) shows a 3D GPDM learned from three swings of a golf club. The learning aligns the sequences and nicely accounts for variations in speed during the club trajectory.\n\n3.2 Optimization\n\nWhile mean-prediction is efficient, there is nothing in the algorithm that prevents trajectories from drifting away from the training data. Thus, it is sometimes desirable to optimize a particular motion under the GPDM, which often reduces drift of the mean-prediction motions.\n\nFigure 4: GPDM from walk sequence with missing data learned with (a) an RBF+linear kernel for dynamics, and (b) a linear kernel for dynamics. Blue curves depict original data. Green curves are the reconstructed, missing data.\n\nTo optimize a new sequence, we first select a starting point $\\tilde{x}_1$ and a number of time-steps. 
The likelihood $p(\\tilde{X} \\mid X, \\bar{\\alpha})$ of the new sequence $\\tilde{X}$ is then optimized directly (holding the previously learned latent positions, X, and hyperparameters, $\\bar{\\alpha}$, fixed). To see why optimization generates motion close to the training data, note that the variance of pose $\\tilde{x}_{t+1}$ is determined by $\\sigma^2_X(\\tilde{x}_t)$, which will be lower when $\\tilde{x}_t$ is nearer the training data. Consequently, the likelihood of $\\tilde{x}_{t+1}$ can be increased by moving $\\tilde{x}_t$ closer to the training data. This generalizes the preference of the SGPLVM for poses similar to the examples [4], and is a natural consequence of the Bayesian approach. As an example, Fig. 2(e) shows an optimized walk sequence initialized from the mean-prediction.\n\n3.3 Forecasting\n\nWe performed a simple experiment to compare the predictive power of the GPDM to a linear dynamical system, implemented as a GPDM with a linear kernel in the latent space and an RBF latent mapping. We trained each model on the first 130 frames of the 60Hz walking sequence (corresponding to 2 cycles), and tested on the remaining 23 frames. From each test frame, mean-prediction was used to predict the pose 8 frames ahead, and then the RMS pose error was computed against ground truth. The test was repeated using mean-prediction and optimization for three kernels, all based on first-order predictions as in (1):\n\n                    Linear   RBF     Linear+RBF\nmean-prediction     59.69    48.72   36.74\noptimization        58.32    45.89   31.97\n\nDue to the nonlinear nature of the walking dynamics in latent space, the RBF and Linear+RBF kernels outperform the linear kernel. Moreover, optimization (initialized by mean-prediction) improves the result in all cases, for reasons explained above.\n\n3.4 Missing Data\n\nThe GPDM model can also handle incomplete data (a common problem with human motion capture sequences). The GPDM is learned by minimizing L (Eqn. 
(16)), but with the terms corresponding to missing poses $y_t$ removed. The latent coordinates for missing data are initialized by cubic spline interpolation from the 3D PCA initialization of the observations.\n\nWhile this produces good results for short missing segments (e.g., 10-15 frames of the 157-frame walk sequence used in Fig. 2), it fails on long missing segments. The problem lies with the difficulty in initializing the missing latent positions sufficiently close to the training data. To solve the problem, we first learn a model with a subsampled data sequence. Reducing sampling density effectively increases uncertainty in the reconstruction process, so that the probability density over the latent space falls off more smoothly from the data. We then restart the learning with the entire data set, but with the kernel hyperparameters fixed. In doing so, the dynamics terms in the objective function exert more influence over the latent coordinates of the training data, and a smooth model is learned.\n\nWith 50 missing frames of the 157-frame walk sequence, this optimization produces models (Fig. 4) that are much smoother than those in Fig. 2. The linear kernel is able to pull the latent coordinates onto a cylinder (Fig. 4b), and thereby provides an accurate dynamical model. Both models shown in Fig. 4 produce estimates of the missing poses that are visually indistinguishable from the ground truth.\n\n4 Discussion and Extensions\n\nOne of the main strengths of the GPDM model is the ability to generalize well from small datasets. Conversely, performance is a major issue in applying GP methods to larger datasets. Previous approaches prune uninformative vectors from the training data [5]. This is not straightforward when learning a GPDM, however, because each timestep is highly correlated with the steps before and after it. 
For example, if we hold $x_t$ fixed during optimization, then it is unlikely that the optimizer will make much adjustment to $x_{t+1}$ or $x_{t-1}$. The use of higher-order features provides a possible solution to this problem. Specifically, consider a dynamical model of the form $v_t = f(x_{t-1}, v_{t-1})$. Since adjacent time-steps are related only by the velocity $v_t \\approx (x_t - x_{t-1})/\\Delta t$, we can handle irregularly-sampled datapoints by adjusting the timestep $\\Delta t$, possibly using a different $\\Delta t$ at each step.\n\nA number of further extensions to the GPDM model are possible. It would be straightforward to include a control signal $u_t$ in the dynamics $f(x_t, u_t)$. It would also be interesting to explore uncertainty in latent variable estimation (e.g., see [3]). Our use of maximum likelihood latent coordinates is motivated by Lawrence's observation that model uncertainty and latent coordinate uncertainty are interchangeable when learning PCA [5]. However, in some applications, uncertainty about latent coordinates may be highly structured (e.g., due to depth ambiguities in motion tracking).\n\nAcknowledgements This work made use of Neil Lawrence's publicly-available GPLVM code, the CMU mocap database (mocap.cs.cmu.edu), and Joe Conti's volume visualization code from mathworks.com. This research was supported by NSERC and CIAR.\n\nReferences\n\n[1] M. Brand and A. Hertzmann. Style machines. Proc. SIGGRAPH, pp. 183-192, July 2000.\n\n[2] Z. Ghahramani and S. T. Roweis. Learning nonlinear dynamical systems using an EM algorithm. Proc. NIPS 11, pp. 431-437, 1999.\n\n[3] A. Girard, C. E. Rasmussen, J. G. Candela, and R. Murray-Smith. Gaussian process priors with uncertain inputs - application to multiple-step ahead time series forecasting. Proc. NIPS 15, pp. 529-536, 2003.\n\n[4] K. Grochow, S. L. Martin, A. Hertzmann, and Z. Popović. Style-based inverse kinematics. ACM Trans. Graphics, 23(3):522-531, Aug. 
2004.\n\n[5] N. D. Lawrence. Gaussian process latent variable models for visualisation of high dimensional data. Proc. NIPS 16, 2004.\n\n[6] J. Lee, J. Chai, P. S. A. Reitsma, J. K. Hodgins, and N. S. Pollard. Interactive control of avatars animated with human motion data. ACM Trans. Graphics, 21(3):491-500, July 2002.\n\n[7] W. E. Leithead, E. Solak, and D. J. Leith. Direct identification of nonlinear structure using Gaussian process prior models. Proc. European Control Conference, 2003.\n\n[8] D. MacKay. Information Theory, Inference, and Learning Algorithms. 2003.\n\n[9] R. Murray-Smith and B. A. Pearlmutter. Transformations of Gaussian process priors. Technical Report, Department of Computer Science, Glasgow University, 2003.\n\n[10] R. M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag, 1996.\n\n[11] V. Pavlović, J. M. Rehg, and J. MacCormick. Learning switching linear models of human motion. Proc. NIPS 13, pp. 981-987, 2001.\n\n[12] L. Ralaivola and F. d'Alché-Buc. Dynamical modeling with kernels for nonlinear time series prediction. Proc. NIPS 16, 2004.\n\n[13] C. E. Rasmussen and M. Kuss. Gaussian processes in reinforcement learning. Proc. NIPS 16, 2004.\n\n[14] H. Sidenbladh, M. J. Black, and D. J. Fleet. Stochastic tracking of 3D human figures using 2D motion. Proc. ECCV, volume 2, pp. 702-718, 2000.\n\n[15] E. Solak, R. Murray-Smith, W. Leithead, D. Leith, and C. E. Rasmussen. Derivative observations in Gaussian process models of dynamic systems. Proc. NIPS 15, pp. 1033-1040, 2003.\n", "award": [], "sourceid": 2783, "authors": [{"given_name": "Jack", "family_name": "Wang", "institution": null}, {"given_name": "Aaron", "family_name": "Hertzmann", "institution": null}, {"given_name": "David", "family_name": "Fleet", "institution": "University of Toronto"}]}