{"title": "Learning Body Pose via Specialized Maps", "book": "Advances in Neural Information Processing Systems", "page_first": 1263, "page_last": 1270, "abstract": null, "full_text": "Learning Body Pose via Specialized Maps \n\nRomer Rosales \n\nStan Sclaroff \n\nDepartment of Computer Science \n\nDepartment of Computer Science \n\nBoston University, Boston, MA 02215 \n\nBoston University, Boston, MA 02215 \n\nrrosales@cs.bu.edu \n\nsclaroff@cs.bu.edu \n\nAbstract \n\nA nonlinear supervised learning model, the Specialized Mappings \nArchitecture (SMA), is described and applied to the estimation of \nhuman body pose from monocular images. The SMA consists of \nseveral specialized forward mapping functions and an inverse map(cid:173)\nping function. Each specialized function maps certain domains \nof the input space (image features) onto the output space (body \npose parameters). The key algorithmic problems faced are those of \nlearning the specialized domains and mapping functions in an op(cid:173)\ntimal way, as well as performing inference given inputs and knowl(cid:173)\nedge of the inverse function. Solutions to these problems employ \nthe EM algorithm and alternating choices of conditional indepen(cid:173)\ndence assumptions. Performance of the approach is evaluated with \nsynthetic and real video sequences of human motion. \n\n1 \n\nIntroduction \n\nIn everyday life, humans can easily estimate body part locations (body pose) from \nrelatively low-resolution images of the projected 3D world (e.g., when viewing a \nphotograph or a video). However, body pose estimation is a very difficult computer \nvision problem. It is believed that humans employ extensive prior knowledge about \nhuman body structure and motion in this task [10]. Assuming this, we consider \nhow a computer might learn the underlying structure and thereby infer body pose. \n\nIn computer vision, this task is usually posed as a tracking problem. 
Typically, models comprised of 2D or 3D geometric primitives are designed for tracking a specific articulated body [13, 5, 2, 15]. At each frame, these models are fitted to the image to optimize some cost function. Careful manual placement of the model on the first frame is required, and tracking in subsequent frames tends to be sensitive to errors in initialization and numerical drift. Generally, these systems cannot recover from tracking errors in the middle of a sequence. To address these weaknesses, more complex dynamic models have been proposed [14, 13, 9]; these methods learn a prior over some specific motion (such as walking). This strong prior, however, substantially limits the generality of the motions that can be tracked. \n\nDeparting from the aforementioned tracking paradigm, in [8] a Gaussian probability model was learned for short human motion sequences. In [17] dynamic programming was used to calculate the best global labeling according to the learned joint probability density function of the position and velocity of body features. Still, in these approaches, the joint locations, correspondences, or model initialization must be provided by hand. In [1], the manifold of human body dynamics was modeled via a hidden Markov model and learned via entropic minimization. In all of these approaches, motion models were learned. Although the approach presented here can be used to model dynamics, we argue that when general human motion dynamics are to be learned, the amount of training data, model complexity, and computational resources required are impractical. As a consequence, models with large priors towards specific motions (e.g., walking) are generated. In this paper we describe a non-linear supervised learning algorithm, the Specialized Maps Architecture (SMA), for recovering articulated body pose from single monocular images.
This approach avoids the need for initialization and tracking per se, and reduces the above-mentioned disadvantages. \n\n2 Specialized Maps \n\nThere are at least two key characteristics of the problem we are trying to solve which make it different from other supervised learning problems. First, we have access to the inverse map. We are trying to learn unknown probabilistic maps from the input space to the output space, but we have access to the (in general probabilistic) map from outputs to inputs. In our pose estimation problem, it is easy to see how we can artificially produce some visual features (e.g., body silhouettes) given joint positions, using computer graphics (CG)¹. Second, the problem is one-to-many: one input can be associated with more than one output. Features obtained from silhouettes (and many other visual features) are ambiguous. Consider an occluded arm, or the reflective ambiguity generated by symmetric poses. This last observation precludes the use of standard algorithms for supervised learning that fit a single mapping function to the data. \n\nGiven input and output spaces R^c and R^t, and the inverse function ζ : R^t → R^c, we describe a solution for these supervised learning problems. Our approach consists of generating a series of M functions φ_k : R^c → R^t. Each of these functions is specialized to map only certain inputs (from a specialized sub-domain) better than others. For example, each sub-domain can be a region of the input space. However, the specialized sub-domain of φ_k can be more general than just a connected region in the input space. \n\nSeveral other learning models use a similar concept of fitting surfaces to the observed data by splitting the input space into several regions and approximating simpler functions in these regions (e.g., [11, 7, 6]).
However, in these approaches, the inverse map is not incorporated in the estimation algorithm because it is not considered in the problem definition, and the forward model is usually more complex, making inference and learning more difficult. \n\nThe key algorithmic problems are those of estimating the specialized domains and functions in an optimal way (taking into account the form of the specialized functions), and of using the knowledge of the inverse function to formulate efficient inference and learning algorithms. We propose to determine the specialized domains and functions using an approximate EM algorithm, and to perform inference using, in an alternating fashion, the conditional independence assumptions specified by the forward and inverse models. Fig. 1(a) illustrates a learned forward model. \n\n¹Thus, ζ is a computer graphics rendering, in general called forward kinematics. \n\nFigure 1: SMA diagram illustrating (a) an already learned SMA model with M specialized functions mapping subsets of the training data, each subset drawn with a different color (at initialization, coloring is random), and (b) the mean-output inference process, in which a given observation is mapped by all the specialized functions, and then a feedback matching step, using ζ, is performed to choose the best of the M estimates. \n\n3 Probabilistic Model \n\nLet the training sets of output-input observations be Ψ = {ψ_1, ..., ψ_N} and V = {v_1, ..., v_N} respectively. We will use z_i = (ψ_i, v_i) to denote a given output-input training pair, and Z = {z_1, ..., z_N} as our observed training set. \n\nWe introduce the unobserved random variable y = (y_1, ..., y_N). In our model, each y_i has as its domain the discrete set C = {1, ..., M} of labels for the specialized functions, and can be thought of as the number of the function used to map data point i; thus M is the number of specialized mapping functions.
Our model uses parameters Θ = (θ_1, ..., θ_M, Λ), where θ_k represents the parameters of mapping function k, and Λ = (λ_1, ..., λ_M), where λ_k represents P(y_i = k|Θ): the prior probability that the mapping function with label k will be used to map an unknown point. As an example, P(y_i|z_i, Θ) represents the probability that function number y_i generated data point number i. \n\nUsing Bayes' rule and assuming independence of observations given Θ, we have the log-probability of our data given the model, log p(Z|Θ), which we want to maximize: \n\nargmax_Θ Σ_i log Σ_k p(ψ_i|v_i, y_i = k, Θ) P(y_i = k|Θ) p(v_i), (1) \n\nwhere we used the independence assumption p(v|Θ) = p(v). This is also equivalent to maximizing the conditional likelihood of the model. \n\nBecause of the log-sum encountered, this problem is intractable in general. However, there exist practical approximate optimization procedures; one of them is Expectation Maximization (EM) [3, 4, 12]. \n\n3.1 Learning \n\nThe EM algorithm is well known, therefore here we only provide the derivations specific to SMAs. The E-step consists of finding P^(t)(y) = P(y|Z, Θ). Note that the variables y_i are assumed independent (given z_i). Thus, factorizing P^(t)(y): \n\nP^(t)(y) = Π_i P^(t)(y_i) = Π_i [λ_{y_i} p(ψ_i|v_i, y_i, Θ) / (Σ_{k∈C} λ_k p(ψ_i|v_i, y_i = k, Θ))]. (2) \n\nHowever, p(ψ_i|v_i, y_i = k, Θ) is still undefined. For the implementation described in this paper we use N(ψ_i; φ_k(v_i, θ_k), Σ_k), where θ_k are the parameters of the k-th specialized function, and Σ_k is the error covariance of specialized function k. One way to interpret this choice is to think that the error in estimating ψ, once we know which specialized function to use, follows a Gaussian distribution whose mean is the output of the specialized function and whose covariance is map dependent. This choice also leads to tractable further derivations. Other choices were given in [16].
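As a concrete illustration, the E-step responsibilities of Eq. (2), under the Gaussian choice N(ψ_i; φ_k(v_i, θ_k), Σ_k), can be sketched in numpy. This is a minimal sketch, not the paper's implementation: the φ_k (1-hidden-layer perceptrons in the paper) appear as generic callables, and all names are illustrative.

```python
import numpy as np

def e_step(psi, v, phi, lambdas, sigmas):
    """Responsibilities P(y_i = k | z_i, Theta) of Eq. (2), computed in log space.

    psi: (N, t) outputs; v: (N, c) inputs; phi: list of M callables v -> psi_hat;
    lambdas: (M,) mixing priors; sigmas: list of M (t, t) error covariances.
    """
    N, M = psi.shape[0], len(phi)
    log_r = np.zeros((N, M))
    for k in range(M):
        diff = psi - phi[k](v)                      # psi_i - phi_k(v_i)
        inv = np.linalg.inv(sigmas[k])
        maha = np.einsum('nt,ts,ns->n', diff, inv, diff)
        _, logdet = np.linalg.slogdet(sigmas[k])
        # log lambda_k + log N(psi_i; phi_k(v_i), Sigma_k), dropping shared constants
        log_r[:, k] = np.log(lambdas[k]) - 0.5 * (maha + logdet)
    log_r -= log_r.max(axis=1, keepdims=True)       # stabilize before exponentiating
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)         # rows sum to 1
```

Working in log space avoids underflow when a point is far from all function outputs; the shared Gaussian normalization constant cancels in the ratio of Eq. (2) and is dropped.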
\n\nThe M-step consists of finding B(t) = argmaxoEj>(t) [logp(Z,y IB)]. In our case we \ncan show that this is equivalent to finding: \n\nargmJn 2: 2: P(t)(Yi = k)(1/Ji - \u00a2k(Vi, Bk))T~kl(Zi - \u00a2k(Zi,Bk))\u00b7 \n\n(3) \n\ni \n\nk \n\nThis gives the following update rules for Ak and ~k (where Lagrange multipliers \nwere used to incorporate the constraint that the sum of the Ak'S is 1. \n\n1 - 2: P(Yi = klzi' B) \nn \n\n. \n\n(4) \n\nIn keeping the formulation general, we have not defined the form of the specialized \nfunctions \u00a2k. Whether or not we can find a closed form solution for the update of \nBk depends on the form of \u00a2k. For example if \u00a2k is a non-linear function, we may \nhave to use iterative optimization to find Bit). In case \u00a2k yield a quadratic form, \nthen a closed form update exists. However, in general we have: \n\n(6) \n\nIn our experiments, \u00a2k is a I-hidden layer perceptron. Thus, the M-step is an \napproximate, iterative optimization procedure. \n\n4 \n\nInference \n\nOnce learning is accomplished, each specialized function maps (with different levels \nof accuracy) the input space. We can formally state the inference process as that \nof maximum-a-posteriori (MAP) estimation where we are interested in finding the \nmost likely output h given an input configuration x: \n\nh* = argmaxp(hlx) = argmax '\" p(hly, x)P(y), \n\n(7) \n\nh \n\nh ~ \n\nY \n\nAny further treatment depends on the properties of the probability distributions \ninvolved. If p(hlx, y) = N(h ; \u00a2y(x) , ~y), the MAP estimate involves finding the \nmaximum in a mixture of Gaussians. However, no closed form solution exists and \nmoreover, we have not incorporated the potentially useful knowledge of the inverse \nfunction C. \n\n\f4.1 MAP by Using the Inverse Function ( \n\nThe access to a forward kinematics function ( (called here the inverse function) \nallows to formulate a different inference algorithm. 
We are again interested in finding an optimal h* given an input x (e.g., an optimal body pose given features taken from an image). This can be formulated as: \n\nh* = argmax_h p(h|x) = argmax_h p(x|h) Σ_y p(h|y, x) P(y), (8) \n\nsimply by Bayes' rule, marginalizing over all variables except h. Note that we have made the distribution p(x|h) appear in the solution. This is important because we can now use our knowledge of ζ to define this distribution. This solution is completely general within our architecture; we did not make any assumptions on the form of the distributions or algorithms used. \n\n5 Approximate Inference using ζ \n\nLet us assume that we can approximate Σ_y p(h|y, x) P(y) by a set of samples generated according to p(h|y, x) P(y) and a kernel function K(h, h_s). Denote the set of samples H_spl = {h_s}, s = 1...S. An approximation to Σ_y p(h|y, x) P(y) is formally built by (1/S) Σ_{s=1}^S K(h, h_s), with the normalizing condition ∫ K(h, h_s) dh = 1 for any given h_s. \n\nWe will consider two simple forms of K. If K(h, h_s) = δ(h − h_s), we have ĥ = argmax_h p(x|h) Σ_{s=1}^S δ(h − h_s). After some simple manipulations, this can be reduced to the following equivalent discrete optimization problem, whose goal is to find the most likely sample s*: \n\ns* = argmax_s p(x|h_s) = argmin_s (x − ζ(h_s))^T Σ_ζ^{−1} (x − ζ(h_s)), (9) \n\nwhere the last equivalence used the assumption p(x|h) = N(x; ζ(h), Σ_ζ). \n\nIf K(h, h_s) = N(h; h_s, Σ_spl), we have ĥ = argmax_h p(x|h) Σ_{s=1}^S N(h; h_s, Σ_spl). This case is hard to use in practice because, contrary to the case above (Eq. 9), in general there is no guarantee that the optimal h is among the samples. \n\n5.1 A Deterministic Approximation based on the Functions' Mean Output \n\nThe structure of inference in the SMA, and the choice of probabilities p(h|x, y), allow us to construct a new approximation that is considerably less expensive to compute, and is deterministic.
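The δ-kernel case of Eq. (9) amounts to rendering every sample through ζ and scoring it against the observed features. A minimal sketch, with ζ passed in as a generic callable and all names illustrative:

```python
import numpy as np

def map_by_inverse(x, samples, zeta, sigma_zeta_inv):
    """Discrete MAP of Eq. (9): return the sample h_s whose rendered features
    zeta(h_s) best explain x under the Gaussian p(x|h) = N(x; zeta(h), Sigma_zeta).

    x: observed feature vector; samples: list of pose hypotheses h_s;
    zeta: callable pose -> features; sigma_zeta_inv: inverse feature covariance.
    """
    # Mahalanobis distance (x - zeta(h_s))^T Sigma_zeta^{-1} (x - zeta(h_s)) per sample
    errs = [(x - zeta(h)) @ sigma_zeta_inv @ (x - zeta(h)) for h in samples]
    return samples[int(np.argmin(errs))]
```

Since maximizing the Gaussian log-likelihood is minimizing the Mahalanobis distance, the argmax over samples becomes an argmin over rendering errors.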
Intuitively, the idea consists of asking each of the specialized functions φ_k what its most likely estimate for h is, given the observed input x. The opinions of the specialized functions are then evaluated using our distribution p(x|h), similarly to the sampling method above. \n\nThis can be justified by the observation that the probability of the mean is maximal in a Gaussian distribution. Thus, by considering the means φ_k(x), we consider the most likely output of each specialized function. Of course, in many cases this approximation could be far from the best solution, for example when the uncertainty in a function's estimate is high relative to the difference between means. \n\nWe use Fig. 1(b) to illustrate the mean-output (MO) approximate inference process. When generating an estimate of body pose, denoted ĥ, given an input x (the gray point with a dark contour in the lower plane), the SMA generates a series of output hypotheses H_φ = {h_k}, obtained using h_k = φ_k(x), with k ∈ C (illustrated by each of the points pointed to by the arrows). \n\nGiven the set H_φ, the most accurate hypothesis under the mean-output criterion is the one that minimizes the function: \n\nk* = argmin_k (x − ζ(h_k))^T Σ_ζ^{−1} (x − ζ(h_k)), (10) \n\nwhere in the last equation we have assumed that p(x|h) is Gaussian. \n\n5.2 Bayesian Inference \n\nNote that in many cases there may be no need to provide only a point estimate in terms of a most likely output h. In fact, we could instead use the whole distribution found in the inference process. We can show that, using the above choices for K, we respectively obtain: \n\np(h|x) ∝ (1/S) Σ_{s=1}^S δ(h − h_s) N(x; ζ(h_s), Σ_ζ), (11) \n\np(h|x) ∝ N(x; ζ(h), Σ_ζ) (1/S) Σ_{s=1}^S N(h; h_s, Σ_spl). (12) \n\n6 Experiments \n\nThe described architecture was tested using a computer graphics rendering as our inverse function ζ. The training data set consisted of approx.
7,000 frames of human body poses obtained through motion capture. The output consisted of 20 2D marker positions (i.e., 3D markers projected to the image plane using a perspective model), linearly encoded by 8 real values using Principal Component Analysis (PCA). The input (visual features) consisted of 7 real-valued Hu moments computed on synthetically generated silhouettes of the articulated figure. For training/testing we generated 120,000 data points: our 3D poses from motion capture were projected to 16 views along the view-sphere equator. We took 8,000 for training and the rest for testing. The only free parameter in this test, related to the given SMA, was the number of specialized functions used; this was set to 15. Several model selection approaches could have been used to choose this value instead. Due to space limitations, in this paper we show results using the mean-output inference algorithm only; readers are referred to http://cs-people.bu.edu/rrosales/SMABodyInference where inference using multiple samples is shown. \n\nFig. 2 (left) shows the reconstruction obtained for several single images coming from three different artificial sequences. The agreement between reconstruction and observation is easy to perceive for all sequences. Note that for self-occluding configurations, reconstruction is harder, but the estimate is still close to ground truth. No human intervention nor pose initialization was required. For quantitative results, Fig. 2 (right) shows the average marker error and variance per body orientation, in percentage of body height. Note that the error is bigger for orientations closer to 0 and π radians. This agrees with the intuition that at those angles (side views), there is less visibility of the body parts. We consider this performance promising, given the complexity of the task and the simplicity of the approach.
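The mean-output inference used in these experiments (Sec. 5.1, Eq. 10) reduces to one forward pass per specialized function plus one rendering per hypothesis. A minimal sketch, with the φ_k and ζ as generic callables and all names illustrative:

```python
import numpy as np

def mean_output_inference(x, phi, zeta, sigma_zeta_inv):
    """Mean-output (MO) inference: each specialized function proposes
    h_k = phi_k(x); the proposal whose rendering zeta(h_k) best matches the
    observed features x (Eq. 10) is returned.

    x: feature vector; phi: list of callables features -> pose;
    zeta: callable pose -> features; sigma_zeta_inv: inverse feature covariance.
    """
    hyps = [f(x) for f in phi]                      # one hypothesis per map
    errs = [(x - zeta(h)) @ sigma_zeta_inv @ (x - zeta(h)) for h in hyps]
    return hyps[int(np.argmin(errs))]
```

The cost is O(M) in the number of specialized functions and involves no iteration, which is the constant-time, linear-scaling property claimed in the conclusion.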
By choosing poses at random from the training set, the RMSE was 17% of body height. In related work, quantitative performance has usually been ignored, in part due to the lack of ground truth and standard evaluation data sets. \n\nFigure 2: Left: Example reconstruction of several test sequences with CG-generated silhouettes. Each set consists of input images and reconstruction (every 5th frame). Right: Marker root-mean-square error and variance per camera viewpoint (16 total, every 2π/32 rads.). Units are percentage of body height. Approx. 110,000 test poses were used. \n\n6.1 Experiments using Real Visual Cues \n\nFig. 3 shows examples of system performance with real segmented visual data, obtained from observing a human subject. Reconstructions for several relatively complex sequences are shown. Note that even though the characteristics of the segmented body differ from those used for training, good performance is still achieved. Most reconstructions are visually close to what can be considered the right pose reconstruction. Body orientation is also generally accurate. \n\n7 Conclusion \n\nIn this paper, we have proposed the Specialized Mappings Architecture (SMA). A learning algorithm was developed for this architecture using ideas from ML estimation and latent variable models. Inference was based on the possibility of alternately using different sets of conditional independence assumptions specified by the forward and inverse models. The incorporation of the inverse function in the model allows for simpler forward models. For example, the inverse function is an architectural alternative to the gating networks of Mixtures of Experts [11].
SMA advantages for body pose estimation include: no iterative methods for inference are used; the inference algorithm runs in constant time, scaling only linearly, O(M), with respect to the number of specialized functions M; manual initialization is not required; and, compared to approaches that learn dynamical models, the requirements for data are much smaller, while large priors towards specific motions are prevented, thus improving generalization capabilities. \n\nFigure 3: Reconstruction obtained from observing a human subject (every 10th frame). \n\nReferences \n\n[1] M. Brand. Shadow puppetry. In ICCV, 1999. \n[2] C. Bregler. Tracking people with twists and exponential maps. In CVPR, 1998. \n[3] I. Csiszár and G. Tusnády. Information geometry and alternating minimization procedures. Statistics and Decisions, 1:205-237, 1984. \n[4] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood estimation from incomplete data. Journal of the Royal Statistical Society (B), 39(1), 1977. \n[5] J. Deutscher, A. Blake, and I. Reid. Articulated body motion capture by annealed particle filtering. In CVPR, 2000. \n[6] J. H. Friedman. Multivariate adaptive regression splines. The Annals of Statistics, 19:1-141, 1991. \n[7] G. Hinton, B. Sallans, and Z. Ghahramani. A hierarchical community of experts. In Learning in Graphical Models, M. Jordan (editor), 1998. \n[8] N. Howe, M. Leventon, and W. Freeman. Bayesian reconstruction of 3D human motion from single-camera video. In NIPS-12, 2000. \n[9] M. Isard and A. Blake. Contour tracking by stochastic propagation of conditional density. In ECCV, 1996. \n[10] G. Johansson. Visual perception of biological motion and a model for its analysis. Perception and Psychophysics, 14(2):201-211, 1973. \n[11] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181-214, 1994. \n[12] R. Neal and G.
Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, M. Jordan (editor), 1998. \n[13] D. Ormoneit, H. Sidenbladh, M. J. Black, and T. Hastie. Learning and tracking cyclic human motion. In NIPS-13, 2001. \n[14] V. Pavlovic, J. M. Rehg, and J. MacCormick. Learning switching linear models of human motion. In NIPS-13, 2001. \n[15] J. M. Rehg and T. Kanade. Model-based tracking of self-occluding articulated objects. In ICCV, 1995. \n[16] R. Rosales and S. Sclaroff. Specialized mappings and the estimation of body pose from a single image. In IEEE Human Motion Workshop, 2000. \n[17] Y. Song, X. Feng, and P. Perona. Towards detection of human motion. In CVPR, 2000. \n", "award": [], "sourceid": 2019, "authors": [{"given_name": "R\u00f3mer", "family_name": "Rosales", "institution": null}, {"given_name": "Stan", "family_name": "Sclaroff", "institution": null}]}