{"title": "Bayesian Reconstruction of 3D Human Motion from Single-Camera Video", "book": "Advances in Neural Information Processing Systems", "page_first": 820, "page_last": 826, "abstract": null, "full_text": "Bayesian Reconstruction of 3D Human Motion \n\nfrom Single-Camera Video \n\nNicholas R. Howe \n\nDepartment of Computer Science \n\nCornell University \nIthaca, NY 14850 \n\nnihowe@cs.comell.edu \n\nMichael E. Leventon \n\nArtificial Intelligence Lab \n\nMassachusetts Institute of Technology \n\nCambridge, MA 02139 \nleventon@ai.mit.edu \n\nWilliam T. Freeman \n\nMERL - a Mitsubishi Electric Research Lab \n\n201 Broadway \n\nCambridge, MA 02139 \n\nfreeman@merL.com \n\nAbstract \n\nThe three-dimensional motion of humans is underdetermined when the \nobservation is limited to a single camera, due to the inherent 3D ambi(cid:173)\nguity of 2D video. We present a system that reconstructs the 3D motion \nof human subjects from single-camera video, relying on prior knowledge \nabout human motion, learned from training data, to resolve those am(cid:173)\nbiguities. After initialization in 2D, the tracking and 3D reconstruction \nis automatic; we show results for several video sequences. The results \nshow the power of treating 3D body tracking as an inference problem. \n\n1 \n\nIntroduction \n\nWe seek to capture the 3D motions of humans from video sequences. The potential appli(cid:173)\ncations are broad, including industrial computer graphics, virtual reality, and improved \nhuman-computer interaction. Recent research attention has focused on unencumbered \ntracking techniques that don't require attaching markers to the subject's body [4, 5], see \n[12] for a survey. Typically, these methods require simultaneous views from multiple cam(cid:173)\neras. \n\nMotion capture from a single camera is important for several reasons. First, though under(cid:173)\ndetermined, it is a problem people can solve easily, as anyone viewing a dancer in a movie \ncan confirm. 
Single-camera shots are the most convenient to obtain, and, of course, apply to the world's film and video archives. It is an appealing computer vision problem that emphasizes inference as much as measurement. \n\nThis problem has received less attention than motion capture from multiple cameras. Goncalves et al. rely on perspective effects to track only a single arm, and thus need not deal with complicated models, shadows, or self-occlusion [7]. Bregler & Malik develop a body tracking system that may apply to a single camera, but performance in that domain is not clear; most of the examples use multiple cameras [4]. Wachter & Nagel use an iterated extended Kalman filter, although their body model is limited in degrees of freedom [12]. Brand [3] uses a learning-based approach, although with representational expressiveness restricted by the number of HMM states. An earlier version of the work reported here [10] required manual intervention for the 2D tracking. \n\nThis paper presents our system for single-camera motion capture, a learning-based approach relying on prior information learned from a labeled training set. The system tracks joints and body parts as they move in the 2D video, then combines the tracking information with the prior model of human motion to form a best estimate of the body's motion in 3D. Our reconstruction method can work with incomplete information, because the prior model allows spurious and distracting information to be discarded. The 3D estimate provides feedback to influence the 2D tracking process to favor more likely poses. \n\nThe 2D tracking and 3D reconstruction modules are discussed in Sections 2 and 3, respectively. Section 4 describes the system operation and presents performance results. Finally, Section 5 concludes with possible improvements. 
\n\n2 2D Tracking \n\nThe 2D tracker processes a video stream to determine the motion of body parts in the image plane over time. The tracking algorithm used is based on one presented by Ju et al. [9], and performs a task similar to one described by Morris & Rehg [11]. Fourteen body parts are modeled as planar patches, whose positions are controlled by 34 parameters. Tracking consists of optimizing the parameter values in each frame so as to minimize the mismatch between the image data and a projection of the body part maps. The 2D parameter values for the first frame must be initialized by hand, by overlaying a model onto the 2D image of the first frame. \n\nWe extend Ju et al.'s tracking algorithm in several ways. We track the entire body, and build a model of each body part that is a weighted average of several preceding frames, not just the most recent one. This helps eliminate tracking errors due to momentary glitches that last for a frame or two. \n\nWe account for self-occlusions through the use of support maps [4, 1]. It is essential to address this problem, as limbs and other body parts will often partly or wholly obscure one another. For the single-camera case, there are no alternate views to be relied upon when a body part cannot be seen. \n\nThe 2D tracker returns the coordinates of each limb in each successive frame. These in turn yield the positions of joints and other control points needed to perform 3D reconstruction. \n\n3 3D Reconstruction \n\n3D reconstruction from 2D tracking data is underdetermined. At each frame, the algorithm receives the positions in two dimensions of 20 tracked body points, and must infer the correct depth of each point. We rely on a training set of 3D human motions to determine which reconstructions are plausible. Most candidate projections are unnatural motions, if not anatomically impossible, and can be eliminated on this basis. 
We adopt a Bayesian framework, and use the training data to compute prior probabilities of different 3D motions. \n\nWe model plausible motions as a mixture of Gaussian probabilities in a high-dimensional space. Motion capture data gathered in a professional studio provide the training data: frame-by-frame 3D coordinates for 20 tracked body points at 20-30 frames per second. We want to model the probabilities of human motions of some short duration, long enough to be informative, but short enough to characterize probabilistically from our training data. We assembled the data into short motion elements we called snippets of 11 successive frames, about a third of a second. We represent each snippet from the training data as a large column vector of the 3D positions of each tracked body point in each frame of the snippet. \n\nWe then use those data to build a mixture-of-Gaussians probability density model [2]. For computational efficiency, we used a clustering approach to approximate the fitting of an EM algorithm. We use k-means clustering to divide the snippets into m groups, each of which will be modeled by a Gaussian probability cloud. For each cluster, the matrix M_j is formed, where the columns of M_j are the n_j individual motion snippets after subtracting the mean \\mu_j. The singular value decomposition (SVD) gives M_j = U_j S_j V_j^T, where S_j contains the singular values along the diagonal, and U_j contains the basis vectors. (We truncate the SVD to include only the 50 largest singular values.) The cluster can be modeled by a multidimensional Gaussian with covariance \\Lambda_j = \\frac{1}{n_j} U_j S_j^2 U_j^T. The prior probability of a snippet x over all the models is a sum of the Gaussian probabilities weighted by the probability of each model. 
\n\nP(x) = \\sum_{j=1}^{m} k \\pi_j e^{-\\frac{1}{2}(x-\\mu_j)^T \\Lambda_j^{-1}(x-\\mu_j)} \n\n(1) \n\nHere k is a normalization constant, and \\pi_j is the a priori probability of model j, computed as the fraction of snippets in the knowledge base that were originally placed in cluster j. Given this approximately derived mixture-of-factors model [6], we can compute the prior probability of any snippet. \n\nTo estimate the data term (likelihood) in Bayes' law, we assume that the 2D observations include some Gaussian noise with variance \\sigma^2. Combined with the prior, the expression for the probability of a given snippet x given an observation y becomes \n\nP(x, \\theta, s, \\vec{v} \\mid y) = k' \\left( e^{-\\|y - R_{\\theta,s,\\vec{v}}(x)\\|^2/(2\\sigma^2)} \\right) \\left( \\sum_{j=1}^{m} k \\pi_j e^{-\\frac{1}{2}(x-\\mu_j)^T \\Lambda_j^{-1}(x-\\mu_j)} \\right) \n\n(2) \n\nIn this equation, R_{\\theta,s,\\vec{v}}(x) is a rendering function which maps a 3D snippet x into the image coordinate system, performing scaling s, rotation about the vertical axis \\theta, and image-plane translation \\vec{v}. We use the EM algorithm to find the probabilities of each Gaussian in the mixture and the corresponding snippet x that maximizes the probability given the observations [6]. This allows the conversion of eleven frames of 2D tracking measurements into the most probable corresponding 3D snippet. In cases where the 2D tracking is poor, the reconstruction may be improved by matching only the more reliable points in the likelihood term of Equation 2. This adds a second noise process to explain the outlier data points in the likelihood term. \n\nTo perform the full 3D reconstruction, the system first divides the 2D tracking data into snippets, which provides the y values of Eq. 2, then finds the best (MAP) 3D snippet for each of the 2D observations. The 3D snippets are stitched together, using a weighted interpolation for frames where two snippets overlap. The result is a Bayesian estimate of the subject's motion in three dimensions. 
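The clustered snippet prior above can be sketched in code. This is a minimal illustration, not the authors' implementation: it uses a plain Lloyd's-algorithm k-means as a stand-in for the paper's clustering step, hypothetical array shapes (each snippet flattened to one row vector), and it evaluates the Mahalanobis term of Eq. 1 by projecting onto the truncated SVD basis, since the rank-truncated covariance \Lambda_j is only invertible within that span.

```python
import numpy as np

def build_snippet_prior(snippets, m=10, rank=50, seed=0):
    """Fit an approximate mixture-of-Gaussians prior over motion snippets.

    snippets: (N, D) array, one flattened snippet per row (for the paper's
    setup D would be 20 points * 3 coords * 11 frames = 660; here D is free).
    Returns one model per nonempty cluster: mean, truncated SVD basis,
    per-direction variances, and mixture weight pi_j.
    """
    rng = np.random.default_rng(seed)
    n, _ = snippets.shape
    # Simple k-means (Lloyd's algorithm) as a stand-in for the clustering step.
    centers = snippets[rng.choice(n, m, replace=False)]
    for _ in range(20):
        dist = ((snippets[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for j in range(m):
            if (labels == j).any():
                centers[j] = snippets[labels == j].mean(0)
    models = []
    for j in range(m):
        cluster = snippets[labels == j]
        if len(cluster) == 0:
            continue
        mu = cluster.mean(0)
        # SVD of the mean-subtracted cluster matrix M_j = U S V^T,
        # truncated to the `rank` largest singular values.
        U, S, _ = np.linalg.svd((cluster - mu).T, full_matrices=False)
        r = min(rank, len(S))
        # Covariance Lambda_j = (1/n_j) U S^2 U^T, kept in factored form.
        models.append({"mu": mu, "U": U[:, :r],
                       "var": S[:r] ** 2 / len(cluster),
                       "pi": len(cluster) / n})
    return models

def log_prior(x, models, floor=1e-6):
    """log P(x) up to the normalization constant k in Eq. 1."""
    logps = []
    for mdl in models:
        # Mahalanobis distance within the truncated basis (pseudo-inverse).
        c = mdl["U"].T @ (x - mdl["mu"])
        maha = (c ** 2 / np.maximum(mdl["var"], floor)).sum()
        logps.append(np.log(mdl["pi"]) - 0.5 * maha)
    return np.logaddexp.reduce(logps)
```

In this factored form, evaluating the prior never materializes the full D-by-D covariance, which is what makes the 50-component truncation computationally attractive.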
\n\n4 Performance \n\nThe system as a whole will track and successfully reconstruct in 3D simple, short video clips with no human intervention, apart from 2D pose initialization. It is not currently reliable enough to track difficult footage for significant lengths of time. However, analysis of short clips demonstrates that the system can successfully reconstruct 3D motion from ambiguous 2D video. We evaluate the two stages of the algorithm independently at first, and then consider their operation as a system. \n\n4.1 Performance of the 3D reconstruction \n\nThe 3D reconstruction stage is the heart of the system. To our knowledge, no similar 2D to 3D reconstruction technique relying on prior information has been published. ([3], developed simultaneously, also uses an inference-based approach.) Our tests show that the module can restore deleted depth information that looks realistic and is close to the ground truth, at least when the knowledge base contains some examples of similar motions. This makes the 3D reconstruction stage itself an important result, which can easily be applied in conjunction with other tracking technologies. \n\nTo test the reconstruction with known ground truth, we held back some of the training data for testing. We artificially provided perfect 2D marker position data, y in Eq. 2, and tested the 3D reconstruction stage in isolation. After removing depth information from the test sequence, the sequence is reconstructed as if it had come from the 2D tracker. Sequences produced in this manner look very much like the original. They show some rigid motion error along the line of sight. An analysis of the uncertainty in the posterior probability predicts high uncertainty for the body motion mode of rigid motion parallel to the orthographic projection [10]. 
This slipping can be corrected by enforcing ground-contact constraints. Figure 1 shows a reconstructed running sequence corrected for rigid motion error and superimposed on the original. The missing depth information is reconstructed well, although it sometimes lags or anticipates the true motion slightly. Quantitatively, this error is a relatively small effect. After subtracting rigid motion error, the mean residual 3D errors in limb position are the same order of magnitude as the small frame-to-frame changes in those positions. \n\nFigure 1: Original and reconstructed running sequences superimposed (frames 1, 7, 14, and 21). \n\n4.2 Performance of the 2D tracker \n\nThe 2D tracker performs well under constant illumination, providing quite accurate results from frame to frame. The main problem it faces is the slow accumulation of error. On longer sequences, the errors can build up to the point where the module is no longer tracking the body parts it was intended to track. The problem is worsened by low contrast, occlusion and lighting changes. More careful body modeling [5], lighting models, and modeling of the background may address these issues. The sequences we used for testing were several seconds long and had fairly good contrast. Although adequate to demonstrate the operation of our system, the 2D tracker contains the most open research issues. \n\n4.3 Overall system performance \n\nThree example reconstructions are given, showing a range of different tracking situations. The first is a reconstruction of a stationary figure waving one arm, with most of the motion in the image plane. The second shows a figure bringing both arms together towards the camera, resulting in a significant amount of foreshortening. 
The third is a reconstruction of a figure walking sideways, and includes significant self-occlusion. \n\nFigure 2: First clip and its reconstruction (frames 1, 25, 50, and 75). \n\nThe first video is the easiest to track because there is little or no occlusion and change in lighting. The reconstruction is good, capturing the stance and motion of the arm. There is some rigid motion error, which is corrected through ground friction constraints. The knees are slightly bent; this may be because the subject in the video has different body proportions than those represented in the training database. \n\nFigure 3: Second clip and its reconstruction (frames 1, 25, 50, and 75). \n\nThe second video shows a figure bringing its arms together towards the camera. The only indication of this is in the foreshortening of the limbs, yet the 3D reconstruction correctly captures this in the right arm. (Lighting changes and contrast problems cause the 2D tracker to lose the left arm partway through, confusing the reconstruction of that limb, but the right arm is tracked accurately throughout.) \n\nThe third video shows a figure walking to the right in the image plane. This clip is the hardest for the 2D tracker, due to repeated and prolonged occlusion of some body parts. The tracker loses the left arm after 15 frames due to severe occlusion, yet the remaining tracking information is still sufficient to perform an adequate reconstruction. At about frame 45, the left leg has crossed behind the right several times and is lost, at which point the reconstruction quality begins to degrade. The key to a more reliable reconstruction on this sequence is better tracking. \n\nFigure 4: Third clip and its reconstruction (frames 6, 16, 26, and 36). 
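The full reconstructions shown above are assembled by the snippet-stitching step described at the end of Section 3. The following sketch shows one plausible form of that weighted interpolation; the hop length between consecutive snippet start frames and the triangular blending weights are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def stitch_snippets(snippets, hop, snippet_len=11):
    """Blend overlapping reconstructed 3D snippets into one motion sequence.

    snippets: list of (snippet_len, D) arrays of 3D poses per frame, where
    consecutive snippets are assumed to start `hop` frames apart (hop is a
    hypothetical parameter). Overlapping frames are combined by a weighted
    average; the weights fade out toward each snippet's ends so that, in an
    overlap, the snippet centered nearer a frame dominates.
    """
    total = hop * (len(snippets) - 1) + snippet_len
    D = snippets[0].shape[1]
    acc = np.zeros((total, D))
    wsum = np.zeros(total)
    # Triangular weights peaking at the center of each snippet.
    w = 1.0 - np.abs(np.linspace(-1.0, 1.0, snippet_len))
    w = np.maximum(w, 1e-3)  # keep endpoint weights nonzero
    for i, s in enumerate(snippets):
        start = i * hop
        acc[start:start + snippet_len] += w[:, None] * s
        wsum[start:start + snippet_len] += w
    return acc / wsum[:, None]
```

Any blending profile that sums frames with normalized weights would serve; the triangular choice simply downweights snippet boundaries, where the MAP estimate is conditioned on the least temporal context.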
\n\n5 Conclusion \n\nWe have demonstrated a system that tracks human figures in short video sequences and reconstructs their motion in three dimensions. The tracking is unassisted, although 2D pose initialization is required. The system uses prior information learned from training data to resolve the inherent ambiguity in going from two to three dimensions, an essential step when working with a single-camera video source. To achieve this end, the system relies on prior knowledge, extracted from examples of human motion. Such a learning-based approach could be combined with more sophisticated measurement-based approaches to the tracking problem [12, 8, 4]. \n\nReferences \n\n[1] J. R. Bergen, P. Anandan, K. J. Hanna, and R. Hingorani. Hierarchical model-based motion estimation. In European Conference on Computer Vision, pages 237-252, 1992. \n\n[2] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford, 1995. \n\n[3] M. Brand. Shadow puppetry. In Proc. 7th Intl. Conf. on Computer Vision, pages 1237-1244. IEEE, 1999. \n\n[4] C. Bregler and J. Malik. Tracking people with twists and exponential maps. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Santa Barbara, 1998. \n\n[5] D. M. Gavrila and L. S. Davis. 3D model-based tracking of humans in action: A multi-view approach. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, 1996. \n\n[6] Z. Ghahramani and G. E. Hinton. The EM algorithm for mixtures of factor analyzers. Technical report, Department of Computer Science, University of Toronto, May 21, 1996 (revised Feb. 27, 1997). \n\n[7] L. Goncalves, E. Di Bernardo, E. Ursella, and P. Perona. Monocular tracking of the human arm in 3D. In Proceedings of the Third International Conference on Computer Vision, 1995. \n\n[8] M. Isard and A. Blake. Condensation - conditional density propagation for visual tracking. 
International Journal of Computer Vision, 29(1):5-28, 1998. \n\n[9] S. X. Ju, M. J. Black, and Y. Yacoob. Cardboard people: A parameterized model of articulated image motion. In 2nd International Conference on Automatic Face and Gesture Recognition, 1996. \n\n[10] M. E. Leventon and W. T. Freeman. Bayesian estimation of 3-D human motion from an image sequence. Technical Report TR98-06, Mitsubishi Electric Research Lab, 1998. \n\n[11] D. D. Morris and J. Rehg. Singularity analysis for articulated object tracking. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Santa Barbara, 1998. \n\n[12] S. Wachter and H.-H. Nagel. Tracking of persons in monocular image sequences. In Nonrigid and Articulated Motion Workshop, 1997. \n", "award": [], "sourceid": 1698, "authors": [{"given_name": "Nicholas", "family_name": "Howe", "institution": null}, {"given_name": "Michael", "family_name": "Leventon", "institution": null}, {"given_name": "William", "family_name": "Freeman", "institution": null}]}