{"title": "Learning Probabilistic Non-Linear Latent Variable Models for Tracking Complex Activities", "book": "Advances in Neural Information Processing Systems", "page_first": 1359, "page_last": 1367, "abstract": "A common approach for handling the complexity and inherent ambiguities of 3D human pose estimation is to use pose priors learned from training data. Existing approaches, however, are either too simplistic (linear), too complex to learn, or can only learn latent spaces from \"simple data\", i.e., single activities such as walking or running. In this paper, we present an efficient stochastic gradient descent algorithm that is able to learn probabilistic non-linear latent spaces composed of multiple activities. Furthermore, we derive an incremental algorithm for the online setting which can update the latent space without extensive relearning. We demonstrate the effectiveness of our approach on the task of monocular and multi-view tracking and show that our approach outperforms the state-of-the-art.", "full_text": "Learning Probabilistic Non-Linear Latent Variable Models for Tracking Complex Activities\n\nAngela Yao\u2217, Juergen Gall, Luc Van Gool\nETH Zurich\n\nRaquel Urtasun\nTTI Chicago\n\n{yaoa, gall, vangool}@vision.ee.ethz.ch, rurtasun@ttic.edu\n\nAbstract\n\nA common approach for handling the complexity and inherent ambiguities of 3D\nhuman pose estimation is to use pose priors learned from training data. Existing\napproaches, however, are either too simplistic (linear), too complex to learn,\nor can only learn latent spaces from \u201csimple data\u201d, i.e., single activities such as\nwalking or running. In this paper, we present an efficient stochastic gradient descent\nalgorithm that is able to learn probabilistic non-linear latent spaces composed\nof multiple activities. 
Furthermore, we derive an incremental algorithm for\nthe online setting which can update the latent space without extensive relearning.\nWe demonstrate the effectiveness of our approach on the task of monocular and\nmulti-view tracking and show that our approach outperforms the state-of-the-art.\n\n1 Introduction\n\nTracking human 3D articulated motions from video sequences is well known to be a challenging\nmachine vision problem. Estimating the 3D location and orientation of the human body\u2019s joints\nis notoriously difficult because it is a high-dimensional problem and is riddled with ambiguities\ncoming from noise, monocular imagery and occlusions. To reduce the complexity of the task, it has\nbecome very popular to use prior models of human pose and dynamics [20, 25, 27, 28, 8, 13, 22].\nLinear models (e.g. PCA) are among the simplest priors [20, 15, 26], though linearity also restricts a\nmodel\u2019s expressiveness and results in inaccuracies when learning complex motions. Priors generated\nfrom non-linear dimensionality reduction techniques such as Isomap [23] and LLE [18] have also\nbeen used for tracking [5, 8]. These techniques try to preserve the local structure of the manifold\nbut tend to fail when manifold assumptions are violated, e.g., in the presence of noise or multiple\nactivities. Moreover, LLE and Isomap provide neither a probability distribution over the space of\npossible poses nor a mapping from the latent space to the high dimensional space. While such a\ndistribution and/or mapping can be learned post hoc, learning them separately from the latent space\ntypically results in suboptimal solutions.\nProbabilistic latent variable models (e.g. probabilistic PCA) have the advantage of taking\nuncertainties into account when learning latent representations. Taylor et al. 
[22] introduced the use of\nConditional Restricted Boltzmann Machines (CRBM) and implicit mixtures of CRBM (imCRBM),\nwhich are composed of large collections of discrete latent variables. Unfortunately, learning this\ntype of model is a highly complex task. A more commonly used latent variable model is the\nGaussian Process Latent Variable Model (GPLVM) [9], which has been applied to animation [27] and\ntracking [26, 25, 6, 7]. While the GPLVM is very successful at modeling small training sets with\nsingle activities, it often struggles to learn latent spaces from larger datasets, especially those with\nmultiple activities. The main reason is that the GPLVM is a non-parametric model; learning requires\nthe optimization of a non-convex function, for which complexity grows with the number of training\nsamples. As such, having a good initialization is key for success [9], though good initializations\nare not always available [6], especially with complex data. Additionally, GPLVM learning scales\ncubically with the number of training examples, and application to large datasets is computationally\nintractable, making it necessary to use sparsification techniques to approximate learning [17, 10].\nAs a consequence, the GPLVM has been mainly applied to single activities, e.g., walking or running.\n\n\u2217This research was supported by the Swiss National Foundation NCCR project IM2, NSERC Canada and\nNSF #1017626. Source code is available at www.vision.ee.ethz.ch/yaoa\n\nFigure 1: Representative poses, data (Euclidean) distance matrices and learned latent spaces\nfrom walking, jumping, exercise stretching and basketball signal sequences. GPLVM was initialized\nusing probabilistic PCA, while stochastic GPLVM was initialized randomly.\n\nMore recent works have focused on handling multiple activities, most often with mixture\nmodels [14, 12, 13] or switching models [16, 8, 2]. 
However, coordinating the different components\nof the mixture models requires special care to ensure that they are aligned in the latent space [19],\nthereby complicating the learning process. In addition, both mixture and switching models require a\ndiscrete notion of activity which is not always available, e.g. dancing motions are not a discrete set.\nOthers have tried to couple discriminative action classifiers with action-specific models [1, 5], though\naccuracy of such systems does not scale well with the number of actions.\nA good prior model for tracking should be accurate, expressive enough to capture a wide range\nof human poses, and easy and tractable for both learning and inference. Unfortunately, none of the\naforementioned approaches exhibit all of these properties. In this paper, we are interested in learning\na probabilistic model that fulfills all of these criteria. Towards this end, we propose a stochastic\ngradient descent algorithm for the GPLVM which can learn latent spaces from random initializations.\nWe draw inspiration for our work from two main sources. The first, [24], approximates Gaussian\nprocess regression for large training sets by doing online predictions based on local neighborhoods.\nThe second, [11], maximizes the likelihood function for GPLVM by considering one dimension of\nthe gradient at a time in the context of collaborative filtering. Based on these two works, we propose\na similar strategy to approximate the gradient computation within each step of the stochastic\ngradient descent algorithm. Local estimation of the gradients allows our approach to efficiently learn\nmodels from large and complex training sets while mitigating the problem of local minima.\nFurthermore, we propose an online algorithm that can effectively learn latent spaces incrementally without\nextensive relearning. 
We demonstrate the effectiveness of our approach on the task of monocular\nand multi-view tracking and show that our approach outperforms the state-of-the-art on the standard\nbenchmark HumanEva [21].\n\n[Figure 1 panel labels: Distance Matrix, PCA, GPLVM, stochastic GPLVM; Walking, Jumping, Exercise Stretching, Basketball Signals]\n\n2 Stochastic learning\n\nWe first review the GPLVM, the basis of our work, and then introduce our optimization method for\nlearning with stochastic local updates. Finally, we derive an extension of the algorithm which can\nbe applied to the online setting.\n\n2.1 GPLVM Review\n\nThe GPLVM assumes that the observed data has been generated by some unobserved latent random\nvariables. More formally, let Y = [y_1, \u00b7\u00b7\u00b7, y_N]^T be the set of observations y_i \u2208 \u211d^D, and X =\n[x_1, \u00b7\u00b7\u00b7, x_N]^T be the set of latent variables x_i \u2208 \u211d^Q, with Q \u226a D. The GPLVM relates the latent\nvariables and the observations via the probabilistic mapping y^(d) = f(x) + \u03b7, with \u03b7 being i.i.d.\nGaussian noise, and y^(d) the d-th coordinate of the observations. In particular, the GPLVM places a\nGaussian process prior over the mapping f such that marginalization of the mapping can be done in\nclosed form. The resulting conditional distribution becomes\n\n$$p(Y \\mid X, \\beta) = \\frac{1}{\\sqrt{(2\\pi)^{N D} |K|^{D}}} \\exp\\left( -\\frac{1}{2} \\mathrm{tr}\\left( K^{-1} Y Y^{T} \\right) \\right), \\qquad (1)$$\n\nwhere K is the kernel matrix with elements K_ij = k(x_i, x_j) and the kernel k has parameters \u03b2.\nHere, we follow existing approaches [26, 25] and use a kernel compounded from an RBF, a bias,\nand Gaussian noise, i.e.,\n\n$$k(x, x') = \\beta_1 \\exp\\left( -\\frac{\\|x - x'\\|^2}{\\beta_2} \\right) + \\beta_3 + \\frac{\\delta_{x,x'}}{\\beta_4}.$$\n\nThe GPLVM is usually learned by maximum likelihood estimation of the latent coordinates X and\nthe kernel hyperparameters \u03b2 = {\u03b2_1, \u00b7\u00b7\u00b7, \u03b2_4}. This is equivalent to minimizing the negative log\nlikelihood L:\n\n$$L = -\\ln p(Y \\mid X, \\beta) = \\frac{D N}{2} \\ln 2\\pi + \\frac{D}{2} \\ln |K| + \\frac{1}{2} \\mathrm{tr}\\left( K^{-1} Y Y^{T} \\right). \\qquad (2)$$\n\nTypically a gradient descent algorithm is used for the minimization. The gradient of L with respect\nto X can be obtained via the chain rule, where\n\n$$\\frac{\\partial L}{\\partial X} = \\frac{\\partial L}{\\partial K} \\cdot \\frac{\\partial K}{\\partial X} = -\\left( K^{-1} Y Y^{T} K^{-1} - D K^{-1} \\right) \\cdot \\frac{\\partial K}{\\partial X}. \\qquad (3)$$\n\nSimilarly, the gradient of L with respect to \u03b2 can be found by substituting \u2202K/\u2202X with \u2202K/\u2202\u03b2 in Eq. (3)\n(see [9] for the exact derivation). As N gets large, however, computing the gradients becomes\ncomputationally expensive, because inverting K is of O(N^3), with N the number of training examples.\nMore importantly, as the negative log likelihood L is highly non-convex, especially with respect to\nX, standard gradient descent approaches tend to get stuck in local minima, and rely on having good\ninitializations for success.\nWe now demonstrate how a stochastic gradient descent approach can be used to reduce\ncomputational complexity as well as decrease the chances of getting trapped in local minima. In particular,\nas shown in our experiments (Section 3), we are able to obtain smooth and accurate manifolds (see\nFig. 1) from random initialization.\n\n2.2 Stochastic Gradient Descent\n\nIn standard gradient descent, all points are taken into account at the same time when computing\nthe gradient; stochastic gradient descent approaches, on the other hand, approximate the gradient at\neach point individually. Typically, a loop goes over the points in a series or by randomly sampling\nfrom the training set. Note that after iterating over all the points, the gradient is exact. 
As the\nGPLVM is a non-parametric approach, the gradient computation at each point does not decompose,\nmaking it necessary to invert K, an O(N^3) operation, at every iteration. We propose, however, to\napproximate the gradient computation within each step of the stochastic gradient descent algorithm.\nTherefore, the gradient of L can be estimated locally for some neighborhood of points X_R, centered\nat a reference point x_r, rather than over all of X. Eq. (3) can then be evaluated only for the points\nwithin the neighborhood, i.e.,\n\n$$\\frac{\\partial L}{\\partial X_R} \\approx -\\left( K_R^{-1} Y_R Y_R^{T} K_R^{-1} - D K_R^{-1} \\right) \\cdot \\frac{\\partial K_R}{\\partial X_R}, \\qquad (4)$$\n\nwhere K_R is the kernel matrix for X_R and Y_R contains the corresponding neighborhood data points.\nWe employ a random strategy for choosing the reference point x_r. The neighborhood R can be\ndetermined by any type of distance measure, such as Euclidean distance in the latent space and/or\ndata space, or temporal neighbors when working with time series. More critical than the specific\ntype of distance measure, however, is allowing sufficient coverage of the latent space so that each\nneighborhood is not restricted too locally. To keep the complexity low, it is beneficial to sample\nrandomly from a larger set of neighbors (see supplementary material).\nThe use of stochastic gradient descent has several desirable traits that correct for the aforementioned\ndrawbacks of GPLVMs. First, computational complexity is greatly reduced, making it feasible to\nlearn latent spaces with much larger amounts of data. Secondly, estimating the gradients\nstochastically and locally improves robustness of the learning process against local minima, making it\npossible to have a random initialization. An algorithmic summary of stochastic gradient descent learning\nfor GPLVMs is given in Fig. 2.\n\nAlgorithm 1: Stochastic GPLVM\nRandomly initialize X\nSet \u03b2 with an initial guess\nfor t = 1 : T\n    Randomly select x_r\n    Find R neighbors around x_r: X_R = X \u2208 R\n    Compute \u2202L/\u2202X_R and \u2202L/\u2202\u03b2_R (see Eq. (3))\n    Update X and \u03b2:\n        \u2206X_t = \u00b5_X \u00b7 \u2206X_{t\u22121} + \u03b7_X \u00b7 \u2202L/\u2202X_R\n        X_t \u2190 X_{t\u22121} + \u2206X_t\n        \u2206\u03b2_t = \u00b5_\u03b2 \u00b7 \u2206\u03b2_{t\u22121} + \u03b7_\u03b2 \u00b7 \u2202L/\u2202\u03b2_R\n        \u03b2_t \u2190 \u03b2_{t\u22121} + \u2206\u03b2_t\nend\n\nAlgorithm 2: Incremental stochastic GPLVM\nfor t = 1 : T1\n    Learn X_orig and \u03b2_orig as per Algorithm 1.\nend\nInitialize X_incr using nearest neighbors.\nSet \u03b2 = \u03b2_orig\nGroup data:\n    Y = [Y_orig, Y_incr]\n    X = [X_orig, X_incr]\nfor t = T1 + 1 : T2\n    Randomly select x_r \u2208 X_incr\n    Find R neighbors around x_r: X_R = X \u2208 R\n    Compute \u2202L_incr/\u2202X_R and \u2202L_incr/\u2202\u03b2_R (see Eq. (6))\n    Update X and \u03b2:\n        \u2206X_t = \u00b5_X \u00b7 \u2206X_{t\u22121} + \u03b7_X \u00b7 \u2202L_incr/\u2202X_R\n        X_t \u2190 X_{t\u22121} + \u2206X_t\n        \u2206\u03b2_t = \u00b5_\u03b2 \u00b7 \u2206\u03b2_{t\u22121} + \u03b7_\u03b2 \u00b7 \u2202L_incr/\u2202\u03b2_R\n        \u03b2_t \u2190 \u03b2_{t\u22121} + \u2206\u03b2_t\nend\n\nFigure 2: Stochastic gradient descent and incremental learning for the GPLVM; \u00b5(\u00b7) is a\nmomentum parameter and \u03b7(\u00b7) is the learning rate. Note that R, \u00b5, and \u03b7 can also vary with t.\n\n2.3 Incremental Learning\n\nIn this section, we derive an incremental learning algorithm based on the stochastic gradient descent\napproach of the previous section. In this setting, we have an initial model which we would like\nto update as new data comes in on the fly. More formally, let Y_orig be the initial training data,\nand X_orig and \u03b2_orig be a model learned from Y_orig using stochastic GPLVM. For every step in\nthe online learning, let Y_incr be new data, which can be as little as a single point or an entire set\nof training points. 
Let Y = [Y_orig, Y_incr] \u2208 \u211d^{(N+M)\u00d7D} be the set of training points containing\nboth the already trained data Y_orig and the new incoming data Y_incr, and let X = [X_orig, X_incr] \u2208\n\u211d^{(N+M)\u00d7Q} be the corresponding latent coordinates, where M is the number of newly added training\nexamples. Let X\u0302_orig be the estimate of the latent coordinates that has already been learned.\n\nFigure 3: Within- and cross-subject 3D tracking errors for each type of activity sequence with\nrespect to amount of additive noise for different numbers of particles, where error bars represent the\nstandard deviation over repeated runs.\n\nA possible strategy is to update only the incoming points; however, we would like to exploit the\nnew data for improving the estimate of the entire manifold, therefore we propose to learn the full\nX. To prevent the already-learned manifold from diverging and also to speed up learning, we add a\nregularizer to the log-likelihood to encourage original points to not deviate too far from their initial\nestimate. To this end, we use the Frobenius norm of the deviation from the estimate X\u0302_orig. Learning\nis then done by minimizing the regularized negative log-likelihood\n\n$$L_{\\mathrm{incr}} = L + \\lambda \\cdot \\frac{1}{N} \\left\\| X_{1:N,:} - \\hat{X}_{\\mathrm{orig}} \\right\\|_F^2. \\qquad (5)$$\n\nHere, X_{1:N,:} indicates the first N rows of X, while \u03bb is a weighting on the regularization term. The\ngradient of L_incr with respect to X_R\u00b9 can then be computed as\n\n$$\\frac{\\partial L_{\\mathrm{incr}}}{\\partial X_R} = \\frac{\\partial L}{\\partial X_R} + \\lambda \\cdot \\frac{2}{N} \\cdot \\left( X_{1:N,:} - \\hat{X}_{\\mathrm{orig}} \\right) \\frac{\\partial X_{1:N,:}}{\\partial X_R}. \\qquad (6)$$\n\nWe employ a stochastic gradient descent approach for our incremental learning, where the points are\nsampled randomly from X_incr. Note that while x_r is only sampled from X_incr in the subsequent\nlearning step, this does not exclude points in X_orig from being a part of the neighbourhood R,\nand thus from being updated. 
We have chosen a nearest neighbor approach by comparing Y_incr to\nY_orig for estimating an initial X_incr, though other possibilities include performing a grid search in\nthe latent space and selecting locations with the highest global log-likelihood (Eq. (2)) or training a\nregressor from Y_orig to X_orig to be applied to Y_incr. An algorithmic summary of the incremental\nmethod is provided in Fig. 2.\n\n\u00b9 \u2202L_incr/\u2202\u03b2_R = \u2202L/\u2202\u03b2_R, since the regularization term does not depend on \u03b2_R.\n\n2.4 Tracking Framework\n\nDuring training, a latent variable model M is learned from Y_M, where Y_M are relative joint\nlocations with respect to a root node. We designate the learned latent points as X_M. During inference,\ntracking is performed in the latent space using a particle filter. The corresponding pose is computed\nby projecting back to the data space via the Gaussian process mapping learned in the GPLVM.\n\n[Figure 3 panels: Walking, Jumping, Exercise Stretching, Basketball Signal; y-axis: Error (mm); noise levels 0%, 0.05%, 0.1% for 25 and 150 particles; Within-Subject and Cross-Subject; legend: PCA, GPLVM, stochastic GPLVM]\n\n(a) manifolds  (b) 3D tracking error\n\nFigure 4: (a) Learned manifolds from regular GPLVM, stochastic GPLVM and incremental\nstochastic GPLVM from an exercise stretching sequence, where blue, red, green indicate jumping\njacks, jogging and squats respectively and (b) the associated 3D tracking errors (mm), where error\nbars indicate standard deviation over repeated runs.\n\nWe model the state s at time t as s_t = (x_t, g_t, r_t), where x_t denotes position in the latent space,\nwhile g_t and r_t are the global position and rotation of the root node. 
Particles are initialized in the\nlatent space by a nearest neighbor search between the observed 2D image pose in the first frame of\nthe sequence and the projected 2D poses of Y_M. Particles are then propagated from frame to frame\nusing a first-order Markov model\n\n$$x_t^i = x_{t-1}^i + \\dot{x}_t^i, \\quad g_t^i = g_{t-1}^i + \\dot{g}_t^i, \\quad r_t^i = r_{t-1}^i + \\dot{r}_t^i. \\qquad (7)$$\n\nWe approximate the derivative x\u0307^i with the difference between temporally sequential points of the\nnearest neighbors in X_M, while g\u0307^i and r\u0307^i are drawn from individual Gaussians with means and\nstandard deviations estimated from the training data. The tracked latent position x\u0302_t at time t is then\napproximated as the mode over all particles in the latent space while y\u0302_t is estimated via the mean\nGaussian process estimate\n\n$$\\hat{y}_t = \\mu_M + Y_M^{T} K^{-1} k(\\hat{x}_t, X_M), \\qquad (8)$$\n\nwith \u00b5_M the mean of Y_M and k(x\u0302_t, X_M) the vector with elements k(x\u0302_t, x_m) for all x_m in X_M.\nNote that the computation of K^{-1} needs to be performed only once and can be stored.\n\n3 Experimental Evaluation\n\nWe demonstrate the effectiveness of our model when applied to tracking in both monocular and\nmulti-view scenarios.\nIn all cases, the latent models were learned with \u00b5_X = 0.8, \u00b5_\u03b2 = 0.5,\n\u03b7_X = 10e-4, \u03b7_\u03b2 = 10e-8; we annealed these parameters over the iterations. To further smooth the\nlearned models, we incorporate a Gaussian Process prior over the dynamics of the training data\nin the latent space [27] for the GPLVM and the stochastic GPLVM. 
We refer the reader to the\nsupplementary material for a visualization of the learning process as well as the results.\n\n3.1 Monocular Tracking\n\nWe compare in the monocular setting the use of PCA, regular GPLVM and our stochastic GPLVM\nto learn latent spaces from motion capture sequences (from the CMU Motion Capture Database [3]).\nWe chose simple single-activity sequences, such as walking (3 subjects, 18 sequences) and jumping\n(2 subjects, 8 sequences), as well as complex multi-activity sequences, such as stretching exercises\n(2 subjects, 6 sequences) and basketball refereeing signals (7 subjects, 13 sequences). The stretching\nexercise and basketball signal sequences were cut to each contain four types of activities. We\nsynthesized 2D data by projecting the mocap from 3D to 2D and then corrupting the location of each\njoint with different levels of additive Gaussian noise. We then recover the 3D locations of each joint\nfrom the noisy images by tracking with the particle filter described in the previous section.\n\n[Figure 4 panel labels: GPLVM, stochastic GPLVM, incremental stochastic GPLVM; noise levels 0%, 0.05%, 0.1%; error axis 50 to 250]\n\nFigure 5: Example poses from tracked results on HumanEva.\n\nExamples of learned latent spaces for each type of sequence (i.e., walking, jumping, exercise,\nbasketball) are shown in Fig. 1. We used a neighborhood of 60 points for the single activity sequences,\nwhich have on average 250 training examples, and 100 points for the multiple activity sequences,\nwhich have on average 800 training examples. For a sequence of 800 training examples, the stochastic\nGPLVM takes only 27s to learn (neighborhood of 100 points, 2500 iterations); in comparison,\nthe regular GPLVM takes 2560s for 312 iterations, while the GPLVM with FITC approximations [10] takes on\naverage 1700s (100 active points, 2500 iterations)\u00b2. In general, as illustrated by Fig. 
1, the manifolds\nlearned with stochastic GPLVM have smoother trajectories than those learned from PCA and\nGPLVM, with better separation between the activities in the multi-activity sequences.\nWe evaluate the effectiveness of the learned latent pose models for tracking by comparing the\naverage tracking error per joint per frame between PCA, GPLVM and stochastic GPLVM in two sets\nof experiments. In the first, training and test sequences are performed by the same subject; in the\nsecond, to test generalization properties of the different latent spaces, we train and test on different\nsubjects. We report results averaged over 10 sequences, each repeated over 10 different runs of the\ntracker. We use importance sampling and weight each particle at time t proportionally to a likelihood\ndefined by the reprojection error:\n\n$$w_t^i \\propto \\exp\\left( -\\alpha \\sum_j \\left\\| p_{j,t}^i - q_{j,t} \\right\\|^2 \\right),$$\n\nwhere p^i_{j,t} is the projected 2D position of joint j in y^i_t from x^i_t (see Eq. (8)) and q_{j,t} is the\nobserved 2D position of joint j, assuming\nthat the camera projection and correspondences between joints are already known. \u03b1 is a parameter\ndetermining selectivity of the weight function (we use \u03b1 = 5 \u00b7 10^{\u22125}).\nFig. 3 depicts 3D tracking error as a function of the amount of Gaussian noise for different numbers of\nparticles employed in the particle filter for the within- and cross-subject experiments. As expected,\ntracking error is lower within-subject than cross-subject for all types of latent models. For the simple\nactivities such as walking and jumping, GPLVM generally outperforms PCA, but for the complex\nactivities, it performs only comparably or worse than PCA (with the exception of cross-subject\nbasketball signals). Our stochastic GPLVM, on the other hand, consistently outperforms PCA and\nmatches or outperforms the regular GPLVM in all experimental conditions, with significantly better\nperformance in the complex, multi-activity sequences. Additional experiments are provided in the\nsupplementary material.\n\n\u00b2 Note that none of the models have completed training. For timing purposes, we take here a fixed number of\niterations for the stochastic method and the FITC approximation and the \u201cequivalent\u201d for the regular GPLVM,\ni.e., 2500 iterations / 8, where 8 comes from the fact that 8\u00d7 more points are used in computing K.\n\n3.2 Online Tracking\n\nWe took two stretching exercise sequences with three different activities from the same subject and\napplied the online learning algorithm (see Sec. 2.3), setting \u03bb = 2. We consider each activity as a new\nbatch of data, and learn the latent space on the first sequence and then track on the second and vice\nversa. We find the online algorithm less accurate for tracking than the stochastic GPLVM learned\nwith all data. This is expected since the latent space is biased towards the initial set of activities.\nWe note, however, that the incremental stochastic GPLVM still outperforms the regular GPLVM, as\nillustrated in Fig. 4(b). Examples of the learned manifolds are shown in Fig. 4(a).\n\n3.3 Multi-view Tracking on HumanEva\n\nWe also evaluate our learning algorithm on the HumanEva benchmark [21] on the activities walking\nand boxing. For all experiments, we use a particle filter as described in Sec. 2.4 with 25 particles as\nwell as an additional annealing component [4] of 15 layers. To maintain consistency with previous\nworks, we use the images from the 3 color cameras and the simple silhouette and edge likelihoods\nprovided in the HumanEva baseline algorithm [21].\nHumanEva-I Walking: As per [22, 28, 13], we track the walking validation sequences of subjects S1,\nS2, and S3. The latent variable models are learned on the training sequences, being either\nsubject-specific or with all three subjects combined. Subject-specific models have \u223c1200-2000 training\nexamples each, for which we used a neighborhood of 60 points, while the combined model has\n\u223c4000 training examples with a neighborhood of 150 points. 3D tracking errors averaged over the\n15 joints as specified in [21] and over all frames in the full sequence are reported in Table 1. Sample\nframes of the estimated poses are shown in Fig. 5. In four of the six training/test combinations, the\nstochastic GPLVM model outperforms the state-of-the-art CRBM and imCRBM model from [22],\nwhile in the other two cases, our model is comparable. These results are remarkable, given that we\nuse only a simple first-order Markov model for estimating dynamics, and our success can only be\nattributed to the latent model\u2019s accuracy in encoding the body poses from the training data.\n\n[Figure 5 panels: S1 Boxing (C1 and C3, frames 27 and 72); S3 Walking (C1 and C3, frames 30 and 60)]\n\nTrain | Test | [28] | [13] | GPLVM | CRBM [22] | imCRBM [22] | Ours\nS1 | S1 | - | - | 57.6 \u00b1 11.6 | 48.8 \u00b1 3.7 | 58.6 \u00b1 3.9 | 44.0 \u00b1 1.8\nS1,2,3 | S1 | 140.3 | - | 64.3 \u00b1 19.2 | 55.4 \u00b1 0.8 | 54.3 \u00b1 0.5 | 41.6 \u00b1 0.8\nS2 | S2 | - | 68.7 \u00b1 24.7 | 98.2 \u00b1 15.8 | 47.4 \u00b1 2.9 | 67.0 \u00b1 0.7 | 54.4 \u00b1 1.8\nS1,2,3 | S2 | 149.4 | - | 155.9 \u00b1 48.8 | 99.1 \u00b1 23.0 | 69.3 \u00b1 3.3 | 64.0 \u00b1 2.9\nS3 | S3 | - | 69.6 \u00b1 22.2 | 71.6 \u00b1 10.0 | 49.8 \u00b1 2.2 | 51.4 \u00b1 0.9 | 45.4 \u00b1 1.1\nS1,2,3 | S3 | 156.3 | - | 123.8 \u00b1 16.7 | 70.9 \u00b1 2.1 | 43.4 \u00b1 4.1 | 46.5 \u00b1 1.4\n\nTable 1: Comparison of 3D tracking errors (mm) on the entire walking validation sequence with\nsubject-specific models, where \u00b1 indicates standard deviation over runs, except for [13], who reports\ntracking results for 200 frames of the sequences, with standard deviation over frames.\n\nHumanEva-I boxing: We also track the validation sequence of S1 for boxing to assess the ability of\nthe stochastic GPLVM for learning acyclic motions. 3D tracking errors are shown in Table 2 and are\ncompared with [14, 13, 22]. Our results are slightly better than state-of-the-art.\n\nModel | Tracking Error\n[16] as reported in [12] | 569.90 \u00b1 209.18\n[14] as reported in [12] | 380.02 \u00b1 74.97\nGPLVM | 121.44 \u00b1 30.7\n[12] | 117.0 \u00b1 5.5\nBest CRBM [22] | 75.4 \u00b1 9.7\nOurs | 74.1 \u00b1 3.3\n\nTable 2: Comparison of 3D tracking errors (mm) on boxing validation sequence for S1, where \u00b1\nindicates standard deviation over runs. Our results are comparable to the state-of-the-art [22].\n\n4 Conclusion and Future Work\n\nIn this paper, we try to learn a probabilistic prior model which is accurate yet expressive, and is\ntractable for both learning and inference. Our proposed stochastic GPLVM fulfills all these criteria:\nit effectively learns latent spaces of complex multi-activity datasets in a computationally efficient\nmanner. 
When applied to tracking, our model outperforms the state-of-the-art on the HumanEva\nbenchmark, despite the use of very few particles and only a simple first-order Markov model for handling\ndynamics. In addition, we have also derived a novel approach for learning latent spaces\nincrementally. One of the great criticisms of current latent variable models is that they cannot handle new\ntraining examples without relearning; given the sometimes cumbersome learning process, this is not\nalways feasible. Our incremental method can be easily applied to an online setting without\nextensive relearning, which may have impact in applications such as robotics where domain adaptation\nmight be key for accurate prediction. In the future, we plan to further investigate the incorporation\nof dynamics into the stochastic model, particularly for multiple activities.\n\nReferences\n[1] A. Baak, M. Mueller, B. Rosenhahn, and H.-P. Seidel. Stabilizing motion tracking using retrieved motion priors. In ICCV, 2009.\n[2] J. Chen, M. Kim, Y. Wang, and Q. Ji. Switching gaussian process dynamic models for simultaneous composite motion tracking and recognition. In CVPR, 2009.\n[3] CMU Mocap Database. http://mocap.cs.cmu.edu/.\n[4] J. Deutscher and I. Reid. Articulated body motion capture by stochastic search. IJCV, 61(2), 2005.\n[5] J. Gall, A. Yao, and L. Van Gool. 2d action recognition serves 3d human pose estimation. In ECCV, 2010.\n[6] A. Geiger, R. Urtasun, and T. Darrell. Rank priors for continuous non-linear dimensionality reduction. In CVPR, 2009.\n[7] S. Hou, A. Galata, F. Caillette, N. Thacker, and P. Bromiley. Real-time body tracking using a gaussian process latent variable model. In ICCV, 2007.\n[8] T. Jaeggli, E. Koller-Meier, and L. Van Gool. Learning generative models for multi-activity body pose estimation. IJCV, 83(2):121\u2013134, 2009.\n[9] N. Lawrence. 
Probabilistic non-linear principal component analysis with gaussian process latent variable models. JMLR, 6:1783\u20131816, 2005.\n[10] N. Lawrence. Learning for larger datasets with the gaussian process latent variable model. In AISTATS, 2007.\n[11] N. Lawrence and R. Urtasun. Non-linear matrix factorization with gaussian processes. In ICML, 2009.\n[12] R. Li, T. Tian, and S. Sclaroff. Simultaneous learning of non-linear manifold and dynamical models for high-dimensional time series. In ICCV, 2007.\n[13] R. Li, T.-P. Tian, S. Sclaroff, and M.-H. Yang. 3d human motion tracking with a coordinated mixture of factor analyzers. IJCV, 87:170\u2013190, 2010.\n[14] R.S. Lin, C.B. Liu, M.H. Yang, N. Ahuja, and S. Levinson. Learning nonlinear manifolds from time series. In ECCV, 2006.\n[15] D. Ormoneit, C. Lemieux, and D. Fleet. Lattice particle filters. In UAI, 2001.\n[16] V. Pavlovic, J. Rehg, and J. Maccormick. Learning switching linear models of human motion. In NIPS, pages 981\u2013987, 2000.\n[17] J. Quinonero-Candela and C. Rasmussen. A unifying view of sparse approximate gaussian process regression. JMLR, page 2005, 2006.\n[18] S. Roweis and L. Saul. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, 290(5500):2323\u20132326, 2000.\n[19] S. Roweis, L. Saul, and G. Hinton. Global coordination of local linear models. In NIPS, 2002.\n[20] H. Sidenbladh, M. Black, and D. Fleet. Stochastic tracking of 3d human figures using 2d image motion. In ECCV, 2000.\n[21] L. Sigal, A. Balan, and M. Black. Humaneva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. IJCV, 87(1-2):4\u201327, 2010.\n[22] G. Taylor, L. Sigal, D. Fleet, and G. Hinton. Dynamical binary latent variable models for 3d human pose tracking supplementary material. In CVPR, 2010.\n[23] J. Tenenbaum, V. de Silva, and J. Langford. 
A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 2000.\n[24] R. Urtasun and T. Darrell. Sparse probabilistic regression for activity-independent human pose inference. In CVPR, 2008.\n[25] R. Urtasun, D. Fleet, and P. Fua. 3d people tracking with gaussian process dynamical models. In CVPR, 2006.\n[26] R. Urtasun, D. Fleet, A. Hertzmann, and P. Fua. Priors for people tracking from small training sets. In ICCV, 2005.\n[27] J. Wang, D. Fleet, and A. Hertzmann. Gaussian process dynamical models for human motion. PAMI, 30(2):283\u2013298, 2008.\n[28] X. Xu and B. Li. Learning motion correlation for tracking articulated human body with a rao-blackwellised particle filter. In ICCV, 2007.\n", "award": [], "sourceid": 784, "authors": [{"given_name": "Angela", "family_name": "Yao", "institution": null}, {"given_name": "Juergen", "family_name": "Gall", "institution": null}, {"given_name": "Luc", "family_name": "Gool", "institution": null}, {"given_name": "Raquel", "family_name": "Urtasun", "institution": null}]}