{"title": "Learning and Tracking Cyclic Human Motion", "book": "Advances in Neural Information Processing Systems", "page_first": 894, "page_last": 900, "abstract": null, "full_text": "Learning and Tracking Cyclic Human \n\nMotion \n\nD.Ormoneit \n\nH. Sidenbladh \n\nDept. of Computer Science \n\nRoyal Institute of Technology (KTH), \n\nStanford University \nStanford, CA 94305 \n\normoneitOcs.stanford.edu \n\nCVAP/NADA, \n\nS-100 44 Stockholm, Sweden \n\nhedvigOnada.kth.se \n\nM. J. Black \n\nDept. of Computer Science \nBrown University, Box 1910 \n\nProvidence, RI 02912 \nblackOcs.brown.edu \n\nT. Hastie \n\nDept. of Statistics \nStanford University \nStanford, CA 94305 \n\nhastieOstat.stanford.edu \n\nAbstract \n\nWe present methods for learning and tracking human motion in \nvideo. We estimate a statistical model of typical activities from a \nlarge set of 3D periodic human motion data by segmenting these \ndata automatically into \"cycles\". Then the mean and the princi(cid:173)\npal components of the cycles are computed using a new algorithm \nthat accounts for missing information and enforces smooth tran(cid:173)\nsitions between cycles. The learned temporal model provides a \nprior probability distribution over human motions that can be used \nin a Bayesian framework for tracking human subjects in complex \nmonocular video sequences and recovering their 3D motion. \n\n1 \n\nIntroduction \n\nThe modeling and tracking of human motion in video is important for problems as \nvaried as animation, video database search, sports medicine, and human-computer \ninteraction. Technically, the human body can be approximated by a collection of \narticulated limbs and its motion can be thought of as a collection of time-series \ndescribing the joint angles as they evolve over time. A key challenge in modeling \nthese joint angles involves decomposing the time-series into suitable temporal prim(cid:173)\nitives. 
For example, in the case of repetitive human motion such as walking, motion sequences decompose naturally into a sequence of \"motion cycles\". In this work, we present a new set of tools that carry out this segmentation automatically using the signal-to-noise ratio of the data in an aligned reference domain. This procedure allows us to use the mean and the principal components of the individual cycles in the reference domain as a statistical model. Technical difficulties include missing information in the motion time-series (resulting from occlusions) and the necessity of enforcing smooth transitions between different cycles. To deal with these problems, we develop a new iterative method for functional Principal Component Analysis (PCA). The learned temporal model provides a prior probability distribution over human motions that can be used in a Bayesian framework for tracking. The details of this tracking framework are described in [7] and are briefly summarized here. Specifically, the posterior distribution of the unknown motion parameters is represented using a discrete set of samples and is propagated over time using particle filtering [3, 7]. Here the prior distribution based on the PCA representation improves the efficiency of the particle filter by constraining the samples to the most likely regions of the parameter space. The resulting algorithm is able to track human subjects in monocular video sequences and to recover their 3D motion under changes in their pose and against complex unknown backgrounds.\n\nPrevious work on modeling human motion has focused on the recognition of activities using Hidden Markov Models (HMMs), linear dynamical models, or vector quantization (see [7, 5] for a summary of related work). These approaches typically provide a coarse approximation to the underlying motion. 
Alternatively, explicit temporal curves corresponding to joint motion may be derived from biometric studies or learned from 3D motion-capture data. In previous work on principal component analysis of motion data, the 3D motion curves corresponding to particular activities typically had to be hand-segmented and aligned [1, 7, 8]. By contrast, this paper details an automated method for segmenting the data into individual activities, aligning activities from different examples, modeling the statistical variation in the data, dealing with missing data, enforcing smooth transitions between cycles, and deriving a probabilistic model suitable for a Bayesian interpretation. We focus here on cyclic motions, which are a particularly simple but important class of human activities [6]. While Bayesian methods for tracking 3D human motion have been suggested previously [2, 4], the prior information obtained from the functional PCA proves particularly effective for determining a low-dimensional representation of the possible human body positions [8, 7].\n\n2 Learning\n\nTraining data, provided by a commercial motion capture system, describes the evolution of m = 19 relative joint angles over a period of about 500 to 5000 frames. We refer to the resulting multivariate time-series as a \"motion sequence\" and we use the notation z_i(t) = {z_{a,i}(t) | a = 1, ..., m} for t = 1, ..., T_i to denote the angle measurements. Here T_i denotes the length of sequence i and a = 1, ..., m is the index for the individual angles. Altogether, there are n = 20 motion sequences in our training set. Note that missing observations occur frequently as body markers are often occluded during motion capture. An associated set I_{a,i} = {t ∈ {1, ..., T_i} | z_{a,i}(t) is not missing} indicates the positions of valid data. 
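The notation above can be made concrete with a small sketch (the array names and the simulated data below are hypothetical stand-ins; the real training set comes from the motion capture system): each motion sequence is a T_i x m array of joint angles with NaN marking frames where a body marker was occluded, and I_{a,i} is the set of valid frame indices for angle a.

```python
import numpy as np

# Hypothetical stand-in for one captured motion sequence z_i:
# T_i frames of m = 19 relative joint angles, with NaN where a
# body marker was occluded during capture.
rng = np.random.default_rng(0)
T_i, m = 500, 19
z_i = rng.standard_normal((T_i, m))
z_i[rng.random((T_i, m)) < 0.05] = np.nan   # ~5% missing observations

# I[a] plays the role of the index set I_{a,i} of valid frames for angle a.
I = {a: np.flatnonzero(~np.isnan(z_i[:, a])) for a in range(m)}
```

Downstream steps (cycle segmentation and the functional PCA) then operate only on the indices in I.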
\n\n2.1 Sequence Alignment\n\nPeriodic motion is composed of repetitive \"cycles\" which constitute a natural unit of statistical modeling and which must be identified in the training data prior to building a model. To avoid error-prone manual segmentation we present alignment procedures that segment the data automatically by separately estimating the cycle length and a relative offset parameter for each sequence. The cycle length is computed by searching for the value p that maximizes the \"signal-to-noise ratio\":\n\n    stn_ratio_i(p) = sum_a signal_{i,a}(p) / noise_{i,a}(p),    (1)\n\nwhere noise_{i,a}(p) is the variation in the data that is not explained by the mean cycle and signal_{i,a}(p) measures the signal intensity.^1\n\nFigure 1: Left: Signal-to-noise ratio of a representative set of angles as a function of the candidate period length. Right: Aligned representation of eight walking sequences.\n\nIn Figure 1 we show the individual signal-to-noise ratios for a subset of the angles as well as the accumulated signal-to-noise ratio as functions of p in the range {50, 51, ..., 250}. Note the peak of these values around the optimal cycle length p = 126. Note also that the signal-to-noise ratio of the white noise series in the first row is approximately constant, warranting the unbiasedness of our approach. 
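A minimal sketch of this period search, using variance-based stand-ins for signal_{i,a} and noise_{i,a} (the exact definitions are given in [5]); the helper names fold_mean and sn_ratio are our own:

```python
import numpy as np

def fold_mean(z, p):
    """Mean cycle: fold a 1-D series into the domain {0, ..., p-1},
    averaging the available (non-NaN) values at each phase."""
    phases = np.arange(len(z)) % p
    mean_cycle = np.array([np.nanmean(z[phases == j]) for j in range(p)])
    return mean_cycle, phases

def sn_ratio(z, p):
    """Stand-in for Eq. (1): variation explained by the mean cycle
    over the residual variation, for one angle series."""
    mean_cycle, phases = fold_mean(z, p)
    fitted = mean_cycle[phases]          # mean cycle unfolded to full length
    noise = np.nanvar(z - fitted)        # unexplained variation
    signal = np.nanvar(fitted)           # signal intensity
    return signal / max(noise, 1e-12)

# A clean cyclic series should score highest at (a multiple of) its period.
t = np.arange(700)
z = np.sin(2 * np.pi * t / 7)
best_p = max(range(2, 50), key=lambda p: sn_ratio(z, p))
```

For a full sequence the per-angle ratios are summed over a as in Eq. (1); the white-noise sanity check in Figure 1 corresponds to sn_ratio being roughly flat in p for an i.i.d. series.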
\nNext, we estimate the offset parameters, o, to align multiple motion sequences in a common domain. Specifically, we choose o(1), o(2), ..., o(n) so that the shifted motion sequences minimize the deviation from a common prototype model by analogy to the signal-to-noise criterion (1). An exhaustive search for the optimal offset combination is computationally infeasible. Instead, we suggest the following iterative procedure: We initialize the offset values to zero in Step 1, and we define a reference signal r_a in Step 2 so as to minimize the deviation with respect to the aligned data. This reference signal is a periodically constrained regression spline that ensures smooth transitions at the boundaries between cycles. Next, we choose the offsets of all sequences so that they minimize the prediction error with respect to the reference signal (Step 3). By contrast to the exhaustive search, this operation requires O(sum_{i=1}^n p(i)) comparisons. Because the solution of the first iteration may be suboptimal, we construct an improved reference signal using the current offset estimates, and use this signal in turn to improve the offset estimates. Repeating these steps, we obtain an iterative optimization algorithm that is terminated if the improvement falls below a given threshold. Because Steps 2 and 3 both decrease the prediction error, the algorithm converges monotonically. Figure 1 (right) shows eight joint angles of a walking motion, aligned using this procedure.\n\n2.2 Functional PCA\n\nThe above alignment procedures segment the training data into a collection of cycle-data called \"slices\". Next, we compute the principal components of these slices, which can be interpreted as the major sources of variation in the data. The algorithm is as follows:\n\n^1 The mean cycle is obtained by \"folding\" the original sequence into the domain {1, ..., p}. 
For brevity, we don't provide formal definitions here; see [5].\n\n1. For a = 1, ..., m and i = 1, ..., n:\n\n   (a) Dissect z_{i,a} into K_i cycles of length p(i), marking missing values at both ends. This gives a new set of time series z^(1)_{k,a} for k = 1, ..., K_i, where K_i = ceil(T_i / p(i)) + 1. Let I_{k,a} be the new index set for this series.\n\n   (b) Compute functional estimates in the domain [0, 1].\n\n   (c) Resample the data in the reference domain, imputing missing observations. This gives yet another time series z^(2)_{k,a}(j) := f_{k,a}(j/T) for j = 0, 1, ..., T, where f_{k,a} is the functional estimate from Step 1b.\n\n2. Stack the \"slices\" z^(2)_{k,a} obtained from all sequences row-wise into a (sum_i K_i) x mT design matrix X.\n\n3. Compute the row-mean mu of X, and let X^(1) := X - 1 mu'. Here 1 is a vector of ones.\n\n4. Slice by slice, compute the Fourier coefficients of X^(1), and store them in a new matrix, X^(2). Use the first 20 coefficients only.\n\n5. Compute the Singular Value Decomposition of X^(2): X^(2) = USV'.\n\n6. Reconstruct X^(2), using the rank-q approximation to S: X^(3) = U S_q V'.\n\n7. Apply the Inverse Fourier Transform and add mu to obtain X^(4).\n\n8. Impute the missing values in X using the corresponding values in X^(4).\n\n9. Evaluate ||X - X^(4)||. Stop if the performance improvement is below 10^-6. Otherwise, go to Step 3.\n\nOur algorithm addresses several difficulties. First, even though the individual motion sequences are aligned in Figure 1, they are still sampled at different frequencies in the reference domain due to the different alignment parameters. This problem is accommodated in Step 1c by resampling after computing a functional estimate in continuous time in Step 1b. Second, missing data in the design matrix X means we cannot simply use the Singular Value Decomposition (SVD) of X^(1) to obtain the principal components. 
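Steps 3 through 9 can be sketched as the following alternation between a low-rank reconstruction and re-imputation of the missing entries; for brevity the sketch omits the Fourier-domain smoothing of Steps 4 and 7, and the helper name impute_svd is our own:

```python
import numpy as np

def impute_svd(X, q, tol=1e-6, max_iter=200):
    """Sketch of Steps 3-9 (without the Fourier smoothing): alternate a
    rank-q SVD reconstruction with re-imputation of the missing entries,
    stopping when ||X - X^(4)|| no longer improves by more than tol."""
    missing = np.isnan(X)
    # Initial guess for the missing entries: column means.
    Xf = np.where(missing, np.nanmean(X, axis=0), X)
    prev = np.inf
    for _ in range(max_iter):
        mu = Xf.mean(axis=0)                               # Step 3: row-mean
        U, s, Vt = np.linalg.svd(Xf - mu, full_matrices=False)
        recon = (U[:, :q] * s[:q]) @ Vt[:q] + mu           # Steps 5-7: rank-q X^(4)
        Xf[missing] = recon[missing]                       # Step 8: impute
        err = np.linalg.norm(Xf - recon)                   # Step 9: ||X - X^(4)||
        if prev - err < tol:
            break
        prev = err
    return Xf

# Low-rank data with holes should be recovered closely.
rng = np.random.default_rng(1)
A = rng.standard_normal((40, 3)) @ rng.standard_normal((3, 25))
X = A.copy()
X[rng.random(A.shape) < 0.1] = np.nan
Xhat = impute_svd(X, q=3)
```

Observed entries of X are never overwritten; only the missing positions are updated from the reconstruction, which is what drives the monotone decrease of the stopping criterion.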
Instead we use an iterative approximation scheme [9] in which we alternate between an SVD step (4 through 7) and a data imputation step (8), where each update is designed so as to decrease the matrix distance between X and its reconstruction, X^(4). Finally, we need to ensure that the mean estimates and the principal components produce a smooth motion when recombined into a new sequence. Specifically, the approximation of an individual cycle must be periodic in the sense that its first two derivatives match at the left and the right endpoint. This is achieved by translating the cycles into a Fourier domain and by truncating high-frequency coefficients (Step 4). Then we compute the SVD in the Fourier domain in Step 5, and we reconstruct the design matrix using a rank-q approximation in Steps 6 and 7, respectively. In Step 8 we use the reconstructed values as improved estimates for the missing data in X, and then we repeat Steps 4 through 7 using these improved estimates. This iterative process is continued until the performance improvement falls below a given threshold. As its output, the algorithm generates the imputed design matrix, X, as well as its principal components.\n\n3 Bayesian Tracking\n\nIn tracking, our goal is to calculate the posterior probability distribution over 3D human poses given a sequence of image measurements, I_t. The high dimensionality of the body model makes this calculation computationally demanding. Hence, we use the learned model above to constrain the body motions to valid walking motions. Towards that end, we use the SVD of X^(2) to formulate a prior distribution for Bayesian tracking.\n\nFormally, let θ(t) = (θ_a(t) | a = 1, ..., m) be a random vector of the relative joint angles at time t; i.e., the value of a motion sequence, z_i(t), at time t is interpreted as the i-th realization of θ(t). 
Then θ(t) can be written in the form\n\n    θ(t) = μ̃(ψ_t) + sum_{k=1}^q c_{t,k} ṽ_k(ψ_t),    (2)\n\nwhere ṽ_k is the Fourier inverse of the k-th column of V, rearranged as a T x m matrix; similarly, μ̃ denotes the rearranged mean vector mu. ṽ_k(ψ) is the ψ-th column of ṽ_k, and the c_{t,k} are time-varying coefficients. ψ_t ∈ {0, ..., T-1} maps absolute time onto relative cycle positions or phases, and ρ_t denotes the speed of the motion, such that ψ_{t+1} = (ψ_t + ρ_t) mod T. Given representation (2), body positions are characterized entirely by the low-dimensional state vector φ_t = (c_t, ψ_t, ρ_t, τ^g_t, θ^g_t)', where c_t = (c_{t,1}, ..., c_{t,q}) and where τ^g_t and θ^g_t represent the global 3D translation and rotation of the torso, respectively. Hence the problem is to calculate the posterior distribution of φ_t given images up to time t. Due to the Markovian structure underlying φ_t, this posterior distribution is given recursively by:\n\n    p(φ_t | I_t) ∝ p(I_t | φ_t) ∫ p(φ_t | φ_{t-1}) p(φ_{t-1} | I_{t-1}) dφ_{t-1}.    (3)\n\nHere p(I_t | φ_t) is the likelihood of observing the image I_t given the parameters and p(φ_{t-1} | I_{t-1}) is the posterior probability from the previous instant. p(φ_t | φ_{t-1}) is a temporal prior probability distribution that encodes how the parameters φ_t change over time. The elements of the Bayesian approach are summarized below; for details the reader is referred to [7].\n\nGenerative Image Model. Let M(I_t, φ_t) be a function that takes image texture at time t and, given the model parameters, maps it onto the surfaces of the 3D model using the camera model. Similarly, let M^{-1}(·) take a 3D model and project its texture back into the image. 
Given these functions, the generative model of images at time t+1 can be viewed as a mapping from the image at time t to images at time t+1:\n\n    I_{t+1} = M^{-1}(M(I_t, φ_t), φ_{t+1}) + η,    η ~ G(0, σ),\n\nwhere G(0, σ) denotes a Gaussian distribution with zero mean and standard deviation σ, and σ depends on the viewing angle of the limb with respect to the camera and increases as the limb is viewed more obliquely (see [7] for details).\n\nTemporal Prior. The temporal prior, p(φ_t | φ_{t-1}), models how the parameters describing the body configuration are expected to vary over time. The individual components of φ_t, (c_t, ψ_t, ρ_t, τ^g_t, θ^g_t)', are assumed to follow a random walk with Gaussian increments.\n\nLikelihood Model. Given the generative model above we can compare the image at time t-1 to the image I_t at t. Specifically, we compute this likelihood term separately for each limb. To avoid numerical integration over image regions, we generate n_s pixel locations stochastically. Denoting the i-th sample for limb j as x_{j,i}, we obtain the following measure of discrepancy:\n\n    E = sum_{i=1}^{n_s} (I_t(x_{j,i}) - M^{-1}(M(I_{t-1}, φ_{t-1}), φ_t)(x_{j,i}))^2.    (4)\n\nAs an approximate likelihood term we use\n\n    p(I_t | φ_t) = prod_j [ q(α_j) / (sqrt(2π) σ(α_j)) exp(-E / (2 σ(α_j)^2 n_s)) + (1 - q(α_j)) p_occluded ],    (5)\n\nwhere p_occluded is a constant probability that a limb is occluded, α_j is the angle between the limb j principal axis and the image plane of the camera, σ(α_j) is a function that increases with narrow viewing angles, and q(α_j) = cos(α_j) if limb j is non-occluded, or 0 if limb j is occluded.\n\nFigure 2: Tracking of person walking, 10000 samples. Upper rows: frames 0, 10, 20, 30, 40, 50 with the projection of the expected model configuration overlaid. Lower row: expected 3D configuration in the same frames.\n\nParticle Filter. 
As is typical for tracking problems, the posterior distribution may well be multi-modal due to the nonlinearity of the likelihood function. Hence, we use a particle filter for inference where the posterior is represented as a weighted set of state samples, φ_t^(i), which are propagated in time. In detail, we use N_s ≈ 10^4 particles in our experiments. Details of this algorithm can be found in [3, 7].\n\n4 Experiment\n\nTo illustrate the method we show an example of tracking a walking person in a cluttered scene in Figure 2. The 3D motion is recovered from a monocular sequence using only the motion between frames. To visualize the posterior distribution we display the projection of the 3D model corresponding to the expected value of the model parameters: sum_{i=1}^{N_s} p_i φ_t^(i), where p_i is the normalized likelihood of sample φ_t^(i). All parameters were initialized manually with a Gaussian prior at time t = 0. The learned model is able to generalize to the subject in the sequence, who was not part of the training set.\n\n5 Conclusions\n\nWe described an automated method for learning periodic human motions from training data using statistical methods for detecting the length of the periods in the data, segmenting it into cycles, and optimally aligning the cycles. We also presented a PCA method for building a statistical eigen-model of the motion curves that copes with missing data and enforces smoothness between the beginning and ending of a motion cycle. The learned eigen-curves are used as a prior probability distribution in a Bayesian tracking framework. Tracking in monocular image sequences was performed using a particle filtering technique and results were shown for a cluttered image sequence.\n\nAcknowledgements. We thank M. Gleicher for generously providing the 3D motion-capture data and M. Kamvysselis and D. Fleet for many discussions on human motion and Bayesian estimation. 
Portions of this work were supported by the Xerox Corporation and we gratefully acknowledge their support.\n\nReferences\n\n[1] A. Bobick and J. Davis. An appearance-based representation of action. ICPR, 1996.\n\n[2] T-J. Cham and J. Rehg. A multiple hypothesis approach to figure tracking. CVPR, pp. 239-245, 1999.\n\n[3] M. Isard and A. Blake. Contour tracking by stochastic propagation of conditional density. ECCV, pp. 343-356, 1996.\n\n[4] M. E. Leventon and W. T. Freeman. Bayesian estimation of 3-d human motion from an image sequence. Tech. Report TR-98-06, Mitsubishi Electric Research Lab, 1998.\n\n[5] D. Ormoneit, H. Sidenbladh, M. Black, and T. Hastie. Learning and tracking human motion using functional analysis. Submitted: IEEE Workshop on Human Modeling, Analysis and Synthesis, 2000.\n\n[6] S. M. Seitz and C. R. Dyer. Affine invariant detection of periodic motion. CVPR, pp. 970-975, 1994.\n\n[7] H. Sidenbladh, M. J. Black, and D. J. Fleet. Stochastic tracking of 3D human figures using 2D image motion. To appear, ECCV-2000, Dublin, Ireland.\n\n[8] Y. Yacoob and M. Black. Parameterized modeling and recognition of activities in temporal surfaces. CVIU, 73(2):232-247, 1999.\n\n[9] G. Sherlock, M. Eisen, O. Alter, D. Botstein, P. Brown, T. Hastie, and R. Tibshirani. Imputing missing data for gene expression arrays. Working paper, Department of Statistics, Stanford University, 2000.\n", "award": [], "sourceid": 1938, "authors": [{"given_name": "Dirk", "family_name": "Ormoneit", "institution": null}, {"given_name": "Hedvig", "family_name": "Sidenbladh", "institution": null}, {"given_name": "Michael", "family_name": "Black", "institution": null}, {"given_name": "Trevor", "family_name": "Hastie", "institution": null}]}