{"title": "Combined discriminative and generative articulated pose and non-rigid shape estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 1337, "page_last": 1344, "abstract": null, "full_text": "Combined discriminative and generative articulated pose and non-rigid shape estimation\n\nLeonid Sigal\n\nAlexandru Balan\n\nMichael J. Black\n\nDepartment of Computer Science\nBrown University\nProvidence, RI 02912\n{ls, alb, black}@cs.brown.edu\n\nAbstract\n\nEstimation of three-dimensional articulated human pose and motion from images is a central problem in computer vision. Much of the previous work has been limited by the use of crude generative models of humans represented as articulated collections of simple parts such as cylinders. Automatic initialization of such models has proved difficult and most approaches assume that the size and shape of the body parts are known a priori. In this paper we propose a method for automatically recovering a detailed parametric model of non-rigid body shape and pose from monocular imagery. Specifically, we represent the body using a parameterized triangulated mesh model that is learned from a database of human range scans. We demonstrate a discriminative method to directly recover the model parameters from monocular images using a conditional mixture of kernel regressors. The predicted pose and shape are used to initialize a generative model for more detailed pose and shape estimation. The resulting approach allows fully automatic pose and shape recovery from monocular and multi-camera imagery. Experimental results show that our method is capable of robustly recovering articulated pose, shape and biometric measurements (e.g. height, weight, etc.) 
in both calibrated and uncalibrated camera environments.\n\n1 Introduction\n\nWe address the problem of marker-less articulated pose and shape estimation of the human body from images using a detailed parametric body model [3]. Most prior work on marker-less pose estimation and tracking has concentrated on the use of generative Bayesian methods [8, 15] that exploit crude models of body shape (e.g. cylinders [8, 15], superquadrics, voxels [7]). We argue that a richer representation of shape is needed to make future strides in building better generative models. Discriminative methods [1, 2, 10, 13, 16, 17], more recently introduced specifically for the pose estimation task, do not address estimation of the body shape; in fact, they are specifically designed to be invariant to body shape variations. Any real-world system must be able to estimate both body shape and pose simultaneously.\n\nDiscriminative approaches to pose estimation attempt to learn a direct mapping from image features to 3D pose from either a single image [1, 14, 17] or multiple approximately calibrated views [9]. These approaches tend to use silhouettes [1, 9, 14] and sometimes edges [16, 17] as image features and learn a probabilistic mapping in the form of Nearest Neighbor (NN) search, regression [1], a mixture of regressors [2], a mixture of Bayesian experts [17], or specialized mappings [14]. While effective and fast, they are inherently limited by the amount and quality of the training data. More importantly, they currently do not address estimation of the body shape itself. Body shape estimation (independent of the pose) has many applications in biometric authentication and consumer application domains.\n\nSimplified models of body shape have a long history in computer vision and provide a relatively low-dimensional description of the human form. 
More detailed triangulated mesh models obtained from laser range scans have been viewed as too high dimensional for vision applications. Moreover, mesh models of individuals lack a convenient, low-dimensional parameterization to allow fitting to new subjects. In this paper we use the SCAPE model (Shape Completion and Animation of PEople) [3], which provides a low-dimensional parameterized mesh that is learned from a database of 3D range scans of different people. The SCAPE model captures correlated shape deformations of the body due to the identity of the person, as well as the non-rigid muscle deformations due to articulation. This model has been shown to allow tractable estimation of parameters from multi-view silhouette image features [5, 11] and from monocular images in scenes with point lights and cast shadows [4].\n\nIn [5] the SCAPE model is projected into multiple calibrated images and an iterative importance sampling method is used for inference of the pose and shape that best explain the observed silhouettes. Alternatively, in [11] visual hulls are constructed from many silhouette images and the Iterative Closest Point (ICP) algorithm is used to extract the pose by registering the volumetric features with SCAPE. Both [5] and [11], however, require manual initialization to bootstrap estimation. In this paper we substitute discriminative articulated pose and shape estimation in place of manual initialization. In doing so, we extend current models for discriminative pose estimation to deal with the estimation of shape, and couple the discriminative and generative methods for more robust combined estimation. 
The few combined discriminative and generative pose estimation methods that exist [16] typically require temporal image data and do not address the shape estimation problem.\n\nFor discriminative pose and shape estimation we use a Mixture of Experts model, with kernel linear regressors as experts, to learn a direct probabilistic mapping between monocular silhouette contour features and the SCAPE parameters. To our knowledge this is the first work that attempts to recover the 3D shape of the human body directly from monocular images. While the results are typically noisy, they are appropriate as initialization for the more precise generative refinement process. For generative optimization we make use of the method proposed in [5], where silhouettes are predicted in multiple views given the pose and shape parameters of the SCAPE model and are compared to the observed silhouettes using a Chamfer distance measure. For training data we use the SCAPE model to generate pairs of 3D body shapes and projected image silhouettes. Evaluation is performed on sequences of two subjects performing free-style motion. We are able to predict pose, shape, and simple biometric measurements for the subjects from images captured by 4 synchronized cameras. We also show results for 3D shape estimation from monocular images.\n\nThe contributions of this paper are twofold: (1) we formulate a discriminative model for estimating pose and shape directly from monocular image features, and (2) we couple this discriminative method with a generative stochastic optimization for detailed estimation of pose and shape.\n\n2 SCAPE Body Model\n\nIn this section we briefly introduce the SCAPE body model; for details the reader is referred to [3]. A low-dimensional mesh model is learned using principal component analysis applied to a registered database of range scans. 
The SCAPE model is defined by a set of parameterized deformations that are applied to a reference mesh that consists of T triangles {\u2206xt | t \u2208 [1, ..., T ]} (here T = 25,000). Each of the triangles in the reference mesh is defined by three vertices in 3D space, (vt,1, vt,2, vt,3), and has a corresponding associated body part index pt \u2208 [1, ..., P ] (we work with a model that has P = 15 body parts corresponding to the torso, pelvis, head, and 3 segments for each of the upper and lower extremities). For convenience, the triangles of the mesh are parameterized by their edges, \u2206xt = (vt,2 \u2212 vt,1, vt,3 \u2212 vt,1), instead of the vertices themselves. Estimating the shape and articulated pose of the body amounts to estimating the parameters, Y, of the deformations required to produce the mesh {\u2206yt | t \u2208 [1, ..., T ]}, the projection of which matches the image evidence. The state-space of the model can be expressed by a vector Y = {\u03c4, \u03b8, \u03bd}, where \u03c4 \u2208 R3 is the global 3D position of the body, \u03b8 \u2208 R37 is the joint-angle parameterization of the articulation with respect to the skeleton (encoded using Euler angles), and \u03bd \u2208 R9 contains the shape parameters encoding the identity-specific shape of the person. Given a set of estimated parameters Y, a new mesh {\u2206yt} can be produced using:\n\n\u2206yt = Rpt(\u03b8) S(\u03bd) Q(Rpt(\u03b8)) \u2206xt (1)\n\nFigure 1: Silhouette contour descriptors. 
Radial Distance Function (RDF) encoding of the silhouette contour is illustrated in (a); Shape Context (SC) encoding of a contour sample point in (b).\n\nwhere Rpt(\u03b8) is the rigid 3 \u00d7 3 rotation matrix for a part pt and is a function of the joint angles \u03b8; S(\u03bd) is the linear 3 \u00d7 3 transformation matrix modeling subject-specific shape variation as a function of the shape-space parameters \u03bd; Q(Rpt(\u03b8)) is a 3 \u00d7 3 residual transformation corresponding to the non-rigid articulation-induced deformations (e.g. bulging of muscles). Notice that Q() is simply a learned linear function of the rigid rotation and has no independent parameters. To learn Q() we minimize the residual, in the least-squares sense, over a set of 70 registered scans of one person under different (but known) articulations. It is also worth mentioning that the body shape linear deformation sub-space, S(\u03bd) = Us\u03bd + \u00b5s, is learned from a set of 10 meshes of different people in full correspondence using PCA; hence \u03bd can be interpreted as a vector of linear coefficients corresponding to eigen-directions of the shape-space that characterize a given body shape.\n\n3 Features\n\nIn this work we make use of silhouette features for both discriminative and generative estimation of pose and shape. Silhouettes are commonly used for human pose estimation [1, 2, 13, 15, 17]; while limited in their representational power, they are easy to estimate from images and fast to synthesize from a mesh model. The framework introduced here, however, is general and can easily be extended to incorporate richer features (e.g. edges [15], dense region descriptors [16] such as SIFT or HOG, or hierarchical descriptors [10] like HMAX, Hyperfeatures, or the Spatial Pyramid). The use of such richer feature representations will likely improve both discriminative and generative estimation.\n\nHistograms of shape context. 
Shape contexts (SC) [6] are rich descriptors based on local shape-based histograms of the contour points sampled from the external boundary of the silhouette. At every sampled boundary point the shape context descriptor is parameterized by the number of orientation bins, \u03c6, the number of radial-distance bins, r, and the minimum and maximum radial distances, denoted rin and rout respectively. As in [1] we achieve scale invariance by making rout a function of the overall silhouette height and normalizing the individual shape context histogram by the sum over all histogram bins. Assuming that N contour points are chosen, at random, to encode the silhouette, the full feature vector can be represented using a \u03c6rN-bin histogram. Even for moderate values of N this produces high-dimensional feature vectors that are hard to deal with.\n\nTo reduce the silhouette representation to a more manageable size, a secondary histogramming step was introduced by Agarwal and Triggs in [1]. In this bag-of-words style model, the shape context space is vector quantized into a set of K clusters (a.k.a. codewords). The K = 100 center codebook is learned by running k-means clustering on the combined set of shape context vectors obtained from a large set of training silhouettes. Once the codebook is learned, the quantized K-dimensional histograms are obtained by voting into the histogram bins corresponding to codebook entries. Soft voting has been shown [1] to reduce the effects of spatial quantization. 
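The vector-quantization and soft-voting step above can be sketched in a few lines of numpy. The Gaussian soft-vote kernel and its width `sigma` are assumptions (the soft-voting scheme is not fully specified here), and the random arrays stand in for real shape-context descriptors and a learned k-means codebook:

```python
import numpy as np

def codebook_histogram(sc_vectors, codebook, sigma=1.0):
    """Soft-vote per-point shape-context vectors into a K-entry codebook histogram.

    sc_vectors : (N, D) array of per-contour-point shape-context descriptors
    codebook   : (K, D) array of codewords (would come from k-means on training SCs)
    sigma      : width of the Gaussian soft-voting kernel (an assumption)
    """
    # Squared distances from every descriptor to every codeword: (N, K)
    d2 = ((sc_vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    # Soft votes: each contour point distributes one unit of vote over codewords
    votes = np.exp(-0.5 * d2 / sigma ** 2)
    votes /= votes.sum(axis=1, keepdims=True)
    hist = votes.sum(axis=0)
    # Normalize to unit length so silhouettes with different numbers of
    # contour points remain comparable
    return hist / np.linalg.norm(hist)

# Toy example: 40 random 60-dim descriptors (phi*r = 12*5) against a K=5 codebook
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 60))
C = rng.normal(size=(5, 60))
h = codebook_histogram(X, C)
```

Hard voting would replace the Gaussian votes with a one-hot assignment to the nearest codeword; the soft version spreads each point's vote and so reduces the spatial-quantization effects mentioned above.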
The final descriptor Xsc \u2208 RK is normalized to unit length to ensure that silhouettes that contain different numbers of contour points can be compared.\n\nThe resulting codebook shape context representation is translation and scale invariant by definition. Following prior work [1, 13] we let \u03c6 = 12, r = 5, rin = 3, and rout = \u03bah, where h is the height of the silhouette and \u03ba is typically 1/4, ensuring integration of contour points over regions roughly similar to the limb size [1]. For shape estimation, we found combining features across multiple spatial scales (e.g. \u03ba = {1/4, 1/2, ...}) to be more effective.\n\nRadial distance function. The Radial Distance Function (RDF) features are defined by a feature vector Xrdf = {pc, ||p1 \u2212 pc||, ||p2 \u2212 pc||, ..., ||pN \u2212 pc||}, where pc \u2208 R2 is the centroid of the image silhouette and pi is a point on the silhouette outer contour; hence ||pi \u2212 pc|| \u2208 R measures the maximal object extent from the centroid in the direction indexed by i. For all experiments, we use N = 100 points, resulting in Xrdf \u2208 R102. We explicitly ensure that the dimensionality of the RDF descriptor is comparable to that of the shape context introduced above. Unlike the shape context descriptor, the RDF feature vector is neither scale nor translation invariant. Hence, RDF features are only suited for applications where the camera calibration is known and fixed.\n\n4 Discriminative estimation of pose and shape\n\nTo produce initial estimates for the body pose and/or shape in 3D from image features, we need to model the conditional distribution p(Y|X) of the 3D body state Y given the set of 2D features X. Intuitively this conditional mapping should be related to the inverse of the camera projection matrix and, as with many inverse problems, is highly ambiguous. 
To model this non-linear relationship we use a Mixture of Experts (MoE) model to represent the conditional [2, 17].\n\nThe parameters of the MoE model are learned by maximizing the log-likelihood of the training data set D = {(x(1), y(1)), ..., (x(N), y(N))} consisting of N input-output pairs (x(i), y(i)). We use an iterative Expectation Maximization (EM) algorithm, based on type-II maximum likelihood, to learn the parameters of the MoE. Our model for the conditional can be written as:\n\np(Y|X) \u221d \u2211_{k=1}^{M} pe,k(Y|X, \u0398e,k) pg,k(k|X, \u0398g,k) (2)\n\nwhere pe,k is the probability of choosing pose Y given the input X according to the k-th expert, and pg,k is the probability of that input being assigned to the k-th expert using an input-sensitive gating network; in both cases \u0398 represents the parameters of the mixture and gate distributions respectively. For simplicity and to reduce the complexity of the experts we choose kernel linear regression with a constant offset, Y = \u03b2X + \u03b1, as our expert model, which allows us to solve for the parameters \u0398e,k = {\u03b2k, \u03b1k, \u039bk} analytically using weighted linear regression, where pe,k(Y|X, \u0398e,k) = (1/\u221a((2\u03c0)^n |\u039bk|)) exp(\u2212(1/2) \u2206k^T \u039bk^{\u22121} \u2206k) and \u2206k = Y \u2212 \u03b2kX \u2212 \u03b1k.\n\nPose estimation is a high-dimensional and ill-conditioned problem, so simple least-squares estimation of the linear regression matrix parameters typically produces severe over-fitting and poor generalization. To reduce this, we add smoothness constraints on the learned mapping. We use a damped regularization term R(\u03b2) = \u03bb||\u03b2||2 that penalizes large values in the coefficient matrix \u03b2, where \u03bb is a regularization parameter. 
Larger values of \u03bb result in overdamping, where the solution is underestimated; smaller values of \u03bb result in overfitting and possibly ill-conditioning. Since the solution of the ridge regressors is not symmetric under scaling of the inputs, we normalize the inputs {x(1), x(2), ..., x(N)} by the standard deviation in each dimension before solving. The weighted ridge regression solution for the parameters \u03b2k and \u03b1k can be written in matrix notation as follows,\n\n[\u03b2k \u03b1k]^T = [ DX^T diag(Zk) DX + diag(\u03bb), DX^T Zk ; Zk^T DX, Zk^T Zk ]^{\u22121} [ DX^T ; Zk^T ] diag(Zk) DY, (3)\n\nwhere Zk = [z(1)k, z(2)k, ..., z(N)k]^T is the vector of ownership weights described later in this section and diag(Zk) is a diagonal matrix with Zk on the diagonal; DX = [x(1), x(2), ..., x(N)] and DY = [y(1), y(2), ..., y(N)] collect the inputs and outputs from the training data D.\n\nMaximization for the gate parameters can also be done analytically. Given the gate model, pg,k(k|X, \u0398g,k) = (1/\u221a((2\u03c0)^n |\u03a3k|)) exp(\u2212(1/2) (X \u2212 \u00b5k)^T \u03a3k^{\u22121} (X \u2212 \u00b5k)), maximization of the gate parameters \u0398g,k = {\u03a3k, \u00b5k} becomes similar to mixture of Gaussians estimation, where \u00b5k = \u2211n z(n)k x(n) / \u2211n z(n)k and \u03a3k = (1/\u2211n z(n)k) \u2211n z(n)k [x(n) \u2212 \u00b5k][x(n) \u2212 \u00b5k]^T, and z(n)k is the estimated ownership weight of example n by expert k, computed in the expectation step as\n\nz(n)k = pe,k(y(n)|x(n), \u0398e,k) pg,k(k|x(n), \u0398g,k) / \u2211_{j=1}^{M} pe,j(y(n)|x(n), \u0398e,j) pg,j(j|x(n), \u0398g,j). (4)\n\nThe above outlines the full EM procedure for the MoE model. We learn three separate models: for shape, p(\u03bd|X), articulated pose, p(\u03b8|X), and global position, p(\u03c4|X). 
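A minimal numpy sketch of one EM iteration of this mixture may make the machinery concrete. The isotropic expert noise, diagonal gate covariances, and the toy two-regime data are simplifications and assumptions; the text uses full covariances \u039bk and \u03a3k:

```python
import numpy as np

def weighted_ridge(X, Y, z, lam):
    """Solve one expert's weighted ridge problem: Y ~ X @ B + a."""
    N, d = X.shape
    Xa = np.hstack([X, np.ones((N, 1))])   # constant-offset column
    W = np.diag(z)
    R = lam * np.eye(d + 1)
    R[-1, -1] = 0.0                        # leave the offset undamped
    theta = np.linalg.solve(Xa.T @ W @ Xa + R, Xa.T @ W @ Y)
    return theta[:-1], theta[-1]           # B: (d, m), a: (m,)

def em_step(X, Y, experts, gates, lam=1e-3):
    """One EM iteration for an M-expert mixture with Gaussian gates."""
    N, m = Y.shape
    M = len(experts)
    resp = np.zeros((N, M))
    for k, ((B, a, s2), (mu, var)) in enumerate(zip(experts, gates)):
        res = Y - X @ B - a
        # expert density (isotropic noise s2) times gate density (diagonal cov)
        p_e = np.exp(-0.5 * (res ** 2).sum(1) / s2) / np.sqrt((2 * np.pi * s2) ** m)
        p_g = np.exp(-0.5 * (((X - mu) ** 2) / var).sum(1)) / np.sqrt(np.prod(2 * np.pi * var))
        resp[:, k] = p_e * p_g + 1e-300
    resp /= resp.sum(axis=1, keepdims=True)   # ownership weights z_k^(n), cf. eq. (4)
    experts2, gates2 = [], []
    for k in range(M):                        # M-step
        z = resp[:, k]
        B, a = weighted_ridge(X, Y, z, lam)   # cf. eq. (3)
        res = Y - X @ B - a
        s2 = (z * (res ** 2).sum(1)).sum() / (z.sum() * m) + 1e-9
        mu = (z[:, None] * X).sum(0) / z.sum()
        var = (z[:, None] * (X - mu) ** 2).sum(0) / z.sum() + 1e-6
        experts2.append((B, a, s2))
        gates2.append((mu, var))
    return experts2, gates2, resp

# Toy demo: two linear regimes (y = x0 for inputs near -2, y = -x0 near +2)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
Y = np.where(X[:, :1] < 0, X[:, :1], -X[:, :1]) + 0.01 * rng.normal(size=(100, 1))
experts = [(np.zeros((2, 1)), np.zeros(1), 1.0) for _ in range(2)]
gates = [(X[0].copy(), np.ones(2)), (X[50].copy(), np.ones(2))]
for _ in range(10):
    experts, gates, resp = em_step(X, Y, experts, gates)
```

With the gates initialized near the two input clusters, the responsibilities split the data by regime and each expert recovers its local linear map; in the full model each expert would regress the high-dimensional pose or shape vector rather than a scalar.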
Similar to [2] we initialize the EM learning by clustering the output 3D poses using the K-means procedure.\n\nImplementation details. For articulated pose and shape we experimented with using both RDF and SC features (global position requires RDF features since SC features are translation and scale invariant). SC features tend to work better for pose estimation, whereas RDF features perform better for shape estimation. Hence, we learn p(\u03bd|Xrdf), p(\u03b8|Xsc) and p(\u03c4|Xrdf). In cases where calibration is unavailable, we estimate the shape using p(\u03bd|Xsc), which tends to produce reasonable results but cannot estimate the overall height. We estimate the number of mixture components, M, and the regularization parameter, \u03bb, by learning a number of models and cross-validating on a withheld dataset.\n\n5 Generative stochastic optimization of pose and shape\n\nGenerative stochastic state estimation, as in [5], is handled within an iterative importance sampling framework [8]. To this end, we represent the posterior distribution over the state (that includes both pose and shape), p(Y|I) \u221d p(I|Y)p(Y), using a set of N weighted samples {yi, \u03c0i}, i \u2208 [1, ..., N], where yi \u223c q(Y) is a sample drawn from the importance function q(Y) and \u03c0i \u221d p(I|yi)p(yi)/q(yi) is an associated normalized weight. As in [5] we make no rigorous probabilistic claims about the generative model, but rather use it as an effective means of performing stochastic search. As required by the annealing framework, we define a set of importance functions qk(Y) from which we draw samples at each respective iteration k. We define the importance functions recursively using a smoothed version of the posterior from the previous iteration, qk+1(Y) = \u2211_{i=1}^{N} \u03c0(k)i N(y(k)i, \u03a3(k)), encoded using a kernel Gaussian density with iteration-dependent bandwidth parameter \u03a3(k). 
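One iteration of this kernel-smoothed importance-sampling search can be sketched as follows; the 2D toy likelihood, the sample count, and the bandwidth/temperature schedule are illustrative assumptions, not values from the actual system:

```python
import numpy as np

rng = np.random.default_rng(1)

def anneal_iteration(samples, weights, sigma, temp, log_lik):
    """One iteration of the annealed stochastic search.

    Draw from the kernel-smoothed posterior of the previous iteration
    (resample by weight, then add Gaussian noise with bandwidth `sigma`),
    then reweight by the annealed likelihood [p(I|Y)]^temp.
    """
    N = len(samples)
    idx = rng.choice(N, size=N, p=weights)           # pick mixture components
    prop = samples[idx] + sigma * rng.standard_normal(samples.shape)
    logw = temp * np.array([log_lik(y) for y in prop])
    w = np.exp(logw - logw.max())                    # numerically stable weights
    return prop, w / w.sum()

# Toy search: locate the mode of a Gaussian "likelihood" in a 2D state space
target = np.array([1.0, -2.0])
log_lik = lambda y: -0.5 * np.sum((y - target) ** 2)

samples = rng.standard_normal((200, 2)) * 3.0        # broad initial distribution
weights = np.full(200, 1.0 / 200)
for sigma, temp in zip([1.0, 0.5, 0.25, 0.1],        # shrinking bandwidth
                       [0.25, 0.5, 1.0, 2.0]):       # rising temperature
    samples, weights = anneal_iteration(samples, weights, sigma, temp, log_lik)

estimate = (weights[:, None] * samples).sum(axis=0)  # weighted posterior mean
```

In the actual system the state is the 49-dimensional SCAPE vector {\u03c4, \u03b8, \u03bd}, the initial distribution comes from the discriminative predictors, and log_lik would be the (negated) Chamfer silhouette cost rather than this toy Gaussian.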
To avoid the effects of local optima, the likelihood is annealed as follows: pk(I|Y) = [p(I|Y)]^{Tk} at every iteration, where Tk is the temperature parameter. As a result, the effects of peaks in the likelihood are introduced gradually.\n\nTo initiate the stochastic search an initial distribution is needed. The high dimensionality of the state space requires this initial distribution to be relatively close to the solution in order to reach convergence. Here we make use of the discriminative pose and shape estimates from Section 4 to give us the initial distribution for the posterior. In particular, given the discriminative models for the shape, p(\u03bd|X), position, p(\u03c4|X), and articulated pose, p(\u03b8|X), of the body, we can let (with slight abuse of notation) y(0)i \u223c [p(\u03c4|X), p(\u03b8|X), p(\u03bd|X)] and \u03c0(0)i = 1/N for i \u2208 [1, ..., N].\n\nThe outlined stochastic optimization framework also requires an image likelihood function, p(I|Y), that measures how well our model under a given state Y matches the image evidence, I, obtained from one or multiple synchronized cameras. We adopt the likelihood function introduced in [5] that measures the similarity between observed and hypothesized silhouettes. For a given camera view, a foreground silhouette is computed using a shadow-suppressing background subtraction procedure and is compared to the silhouette obtained by projecting the SCAPE model, subject to the hypothesized state, into the image plane (given the calibration parameters of the camera). Pixels in the non-overlapping regions are penalized by the distance to the closest contour point of the silhouette. This is made efficient by the use of Chamfer distance maps precomputed for both silhouettes.\n\n6 Experiments\n\nDatasets. In this paper we make use of 3 different datasets. 
The training dataset, used to learn the discriminative MoE models and the codeword dictionary for SC, was generated by synthesizing 3000 silhouette images obtained by projecting corresponding SCAPE body models into an image plane using the calibration parameters of the camera. The SCAPE body models, in turn, were generated by randomly sampling the pose from a database of motion capture data (consisting of generally non-cyclic random motions) and the body shape coefficients from a uniform distribution centered at the mean shape. A similar synthetic test dataset was constructed, consisting of 597 silhouette-SCAPE body model pairs.\n\nFigure 2: Discriminative estimation of weight loss. Two images of a subject before and after weight loss are shown in (a) on the left and right respectively. The images were downloaded from the web (Google) and manually segmented (b). The estimated shape and pose obtained by our discriminative estimation procedure are shown in (c). In the bottom row, we manually rotated the model 90 degrees for better visibility of the shape variation. Since camera calibration is unavailable, we use p(\u03bd|Xsc) and normalize the before and after shapes to the same reference height. Our method estimated that the person illustrated in the top row lost 22 lb and the one illustrated in the bottom row 32 lb; the web-reported weight loss for the two subjects was 24 lb and 64 lb respectively. Notice that the neutral posture assumed in the images was not present in our training data set, causing visible artifacts in the estimation of the arm pose. Also, the bottom example pushes the limits of our current shape model, which was trained using only 10 scans of people, none close to the desired body shape.\n\nIn addition, we collected a real dataset consisting of hardware-synchronized motion capture and video captured using 4 cameras. 
Two subjects were captured performing roughly the same class of motions as in the training dataset.\n\nDiscriminative estimation of shape. Results of using an MoE model similar to the one introduced here for pose estimation have previously been reported in [2] and [17]. Our experience with articulated pose estimation was similar and we omit supporting experiments due to lack of space. For discriminative estimation of shape we quantitatively compared SC and RDF features, by training two MoE models, p(\u03bd|Xsc) and p(\u03bd|Xrdf), and found the latter to perform better when camera calibration is available (on average we achieve a 19.3% performance increase over simply using the mean shape). We attribute the superior performance of the RDF features to their sensitivity to the silhouette position and scale, which allows for better estimation of the overall height of the body.\n\nGiven the shape we can also estimate the volume of the body and, assuming the constant density of water, compute the weight of the person. To illustrate this we estimate the approximate weight loss of a person from monocular uncalibrated images (see Figure 2). Please note that this application is a proof of concept and not a rigorous experiment.1 In principle, the SCAPE model is not ideal for weight calculations, since non-rigid deformations caused by articulations of the body will result in (unnatural) variations in weight. In practice, however, we found such variations produce relatively minor artifacts. The weight calculations are, on the other hand, very sensitive to the body shape.\n\nCombining discriminative and generative estimation. Lastly, we tested the performance of the combined discriminative and generative framework by estimating articulated pose, shape and biometric measurements for people in our real dataset. Results of the biometric measurement estimates can be seen in Figure 3; the corresponding visual illustration of results is shown in Figure 4.\n\nAnalysis of errors. 
On rare occasions our system does produce poor pose and/or shape estimates. Typically these cases can be classified into two categories: (1) minor errors that only affect the pose and are artifacts of local optima, or (2) more significant errors that affect the shape and result from a poor initial distribution over the state produced by the discriminative method. The latter arise as a result of the 180-degree view ambiguity and/or pose configuration ambiguities, due to symmetry, in the silhouettes.\n\n1The \u201cground truth\u201d weight change here is self-reported and gathered from the Internet.\n\nBiometric Feature | Actual | Discriminative (Mean / Std) | Disc. + Generative (Mean / Std) | GT + Generative (Mean / Std)\nA (34) Height (mm) | 1780 | 1716.1 / 41.9 | 1776.2 / 43.8 | 1796.9 / 22.9\nA (34) Arm Span (mm) | 1597 | 1553.6 / 39.7 | 1597.3 / 58.0 | 1607.7 / 30.7\nA (34) Weight (kg) | 88 | 83.62 / 8.94 | 83.37 / 8.01 | 85.83 / 3.73\nB (30) Height (mm) | 1825 | 1703.8 / 88.8 | 1751.0 / 95.2 | 1844.1 / 63.8\nB (30) Arm Span (mm) | 1668 | 1537.7 / 69.2 | 1547.5 / 91.4 | 1659.0 / 29.1\nB (30) Weight (kg) | 63 | 80.63 / 18.53 | 64.98 / 9.27 | 66.33 / 4.69\n\nFigure 3: Estimating basic biometric measurements. The figure illustrates basic biometric measurements (height, arm span,3 and weight) recovered for two subjects A and B. Mean and standard deviation are reported over 34 and 30 frames for subjects A and B respectively. Every 25th frame from the two sequences obtained using 4 synchronized cameras was chosen for estimation. The actual measured values for the two subjects are shown in the left column. Estimates obtained using the discriminative-only and discriminative-followed-by-generative shape estimation methods are reported in the next two columns. The discriminative method used only one view for estimation, whereas the generative method used all 4 views to obtain a better fit. 
The last column reports estimates obtained using the ground truth pose and mean shape as initialization for the generative fit (this is the algorithm proposed in [5]). Notice that generative estimation significantly refines the discriminative estimates. In addition, our approach, which unlike [5] does not require manual initialization, performs comparably to (and sometimes marginally better than) [5] in terms of mean performance, but has roughly twice the variance.\n\n7 Discussion and Conclusions\n\nWe have presented a method for automatic estimation of the articulated pose and shape of people from images. Our approach goes beyond prior work in that it is able to estimate a detailed parametric model (SCAPE) directly from images without requiring manual intervention or initialization. We found that the discriminative model produced an effective initialization for the generative optimization procedure and that biometric measurements from the recovered shape were comparable to those produced by prior approaches that required manual initialization [5]. We also introduced and addressed the problem of discriminative estimation of shape from monocular calibrated and uncalibrated images. More accurate shape estimates from monocular data will require richer image descriptors.\n\nA number of straightforward extensions to our model will likely yield immediate improvements in performance. Among these are the use of temporal consistency in the discriminative pose (and perhaps shape) estimation [17] and dense image descriptors [10]. In addition, in this work we estimated the shape space of the SCAPE model from only 10 body scans; as a result, the learned shape space is rather limited in its expressive power. We believe some of the artifacts of this can be observed in Figure 2, where the weight of the heavier woman is underestimated.\n\nAcknowledgments. This work was supported by NSF grants IIS-0534858 and IIS-0535075 and a gift from Intel Corp. 
We also thank James Davis and Dragomir Anguelov for discussions and data.\n\nReferences\n[1] A. Agarwal and B. Triggs. Recovering 3D human pose from monocular images, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 28, No. 1, pp. 44\u201358, 2006.\n[2] A. Agarwal and B. Triggs. Monocular human motion capture with a mixture of regressors, IEEE Workshop on Vision for Human-Computer Interaction, 2005.\n[3] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers and J. Davis. SCAPE: Shape Completion and Animation of PEople, ACM Transactions on Graphics (SIGGRAPH), Vol. 24(3), pp. 408\u2013416, 2005.\n[4] A. Balan, M. J. Black, H. Haussecker and L. Sigal. Shining a light on human pose: On shadows, shading and the estimation of pose and shape, International Conference on Computer Vision (ICCV), 2007.\n[5] A. Balan, L. Sigal, M. Black, J. Davis and H. Haussecker. Detailed human shape and pose from images, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007.\n[6] S. Belongie, J. Malik and J. Puzicha. Matching shapes, ICCV, pp. 454\u2013461, 2001.\n\n3Arm span is defined as the distance between the knuckles of the left and right arms fully extended in \u2018T\u2019-pose [5].\n\nFigure 4: Visualizing pose and shape estimation. Examples of simultaneous pose and shape estimation for subjects A and B are shown on the top and bottom respectively. Results are obtained by discriminatively estimating the distribution over the initial state and then refining this distribution via generative local stochastic search. The left column illustrates the projection of the estimated model into all 4 views. The middle column shows the projection of the model onto the image silhouettes, where light blue denotes the image silhouette, dark red the projection of the model, and orange the non-silhouette regions that overlap with the projection. 
On the right are the two views of the estimated 3D model.\n\n[7] K. M. Cheung, S. Baker and T. Kanade. Shape-from-silhouette of articulated objects and its use for human body kinematics estimation and motion capture, CVPR, Vol. 1, pp. 77\u201384, 2003.\n[8] J. Deutscher, A. Blake and I. Reid. Articulated body motion capture by annealed particle filtering, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 2, pp. 126\u2013133, 2000.\n[9] K. Grauman, G. Shakhnarovich and T. Darrell. Inferring 3D structure with a statistical image-based shape model, IEEE International Conference on Computer Vision (ICCV), pp. 641\u2013648, 2003.\n[10] A. Kanaujia, C. Sminchisescu and D. Metaxas. Semi-supervised hierarchical models for 3D human pose reconstruction, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007.\n[11] L. Muendermann, S. Corazza and T. Andriacchi. Accurately measuring human movement using articulated ICP with soft-joint constraints and a repository of articulated models, CVPR, 2007.\n[12] R. Plankers and P. Fua. Articulated soft objects for video-based body modeling, ICCV, 2001.\n[13] R. W. Poppe and M. Poel. Comparison of silhouette shape descriptors for example-based human pose recovery, IEEE Conference on Automatic Face and Gesture Recognition (FG 2006), pp. 541\u2013546, 2006.\n[14] R. Rosales and S. Sclaroff. Learning body pose via specialized maps, NIPS, 2002.\n[15] L. Sigal, S. Bhatia, S. Roth, M. J. Black and M. Isard. Tracking loose-limbed people, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, pp. 421\u2013428, 2004.\n[16] C. Sminchisescu, A. Kanaujia and D. Metaxas. Learning joint top-down and bottom-up processes for 3D visual inference, CVPR, Vol. 2, pp. 1743\u20131752, 2006.\n[17] C. Sminchisescu, A. Kanaujia, Z. Li and D. Metaxas. Discriminative density propagation for 3D human motion estimation, CVPR, Vol. 1, pp. 
390\u2013397, 2005.", "award": [], "sourceid": 989, "authors": [{"given_name": "Leonid", "family_name": "Sigal", "institution": null}, {"given_name": "Alexandru", "family_name": "Balan", "institution": null}, {"given_name": "Michael", "family_name": "Black", "institution": null}]}