{"title": "MoCap-guided Data Augmentation for 3D Pose Estimation in the Wild", "book": "Advances in Neural Information Processing Systems", "page_first": 3108, "page_last": 3116, "abstract": "This paper addresses the problem of 3D human pose estimation in the wild. A significant challenge is the lack of training data, i.e., 2D images of humans annotated with 3D poses. Such data is necessary to train state-of-the-art CNN architectures. Here, we propose a solution to generate a large set of photorealistic synthetic images of humans with 3D pose annotations. We introduce an image-based synthesis engine that artificially augments a dataset of real images with 2D human pose annotations using 3D Motion Capture (MoCap) data. Given a candidate 3D pose our algorithm selects for each joint an image whose 2D pose locally matches the projected 3D pose. The selected images are then combined to generate a new synthetic image by stitching local image patches in a kinematically constrained manner. The resulting images are used to train an end-to-end CNN for full-body 3D pose estimation. We cluster the training data into a large number of pose classes and tackle pose estimation as a K-way classification problem. Such an approach is viable only with large training sets such as ours. Our method outperforms the state of the art in terms of 3D pose estimation in controlled environments (Human3.6M) and shows promising results for in-the-wild images (LSP). This demonstrates that CNNs trained on artificial images generalize well to real images.", "full_text": "MoCap-guided Data Augmentation\nfor 3D Pose Estimation in the Wild\n\nGr\u00e9gory Rogez\n\nCordelia Schmid\n\nInria Grenoble Rh\u00f4ne-Alpes, Laboratoire Jean Kuntzmann, France\n\nAbstract\n\nThis paper addresses the problem of 3D human pose estimation in the wild. A sig-\nni\ufb01cant challenge is the lack of training data, i.e., 2D images of humans annotated\nwith 3D poses. 
Such data is necessary to train state-of-the-art CNN architectures. Here, we propose a solution to generate a large set of photorealistic synthetic images of humans with 3D pose annotations. We introduce an image-based synthesis engine that artificially augments a dataset of real images with 2D human pose annotations using 3D Motion Capture (MoCap) data. Given a candidate 3D pose, our algorithm selects for each joint an image whose 2D pose locally matches the projected 3D pose. The selected images are then combined to generate a new synthetic image by stitching local image patches in a kinematically constrained manner. The resulting images are used to train an end-to-end CNN for full-body 3D pose estimation. We cluster the training data into a large number of pose classes and tackle pose estimation as a K-way classification problem. Such an approach is viable only with large training sets such as ours. Our method outperforms the state of the art in terms of 3D pose estimation in controlled environments (Human3.6M) and shows promising results for in-the-wild images (LSP). This demonstrates that CNNs trained on artificial images generalize well to real images.\n\n1 Introduction\n\nConvolutional Neural Networks (CNN) have been very successful for many different tasks in computer vision. However, training these deep architectures requires large-scale datasets which are not always available or easily collectable. This is particularly the case for 3D human pose estimation, for which an accurate annotation of 3D articulated poses in large collections of real images is non-trivial: annotating 2D images with 3D pose information is impractical [6] while large-scale 3D pose capture is only available through marker-based systems in constrained environments [13]. The images captured in such conditions do not match real environments well. 
This has limited the development of end-to-end CNN architectures for in-the-wild 3D pose understanding.\nLearning architectures usually augment existing training data by applying synthetic perturbations to the original images, e.g. jittering exemplars or applying more complex affine or perspective transformations [15]. Such data augmentation has proven to be a crucial stage, especially for training deep architectures. Recent work [14, 23, 34, 40] has introduced the use of data synthesis as a solution to train CNNs when only limited data is available. Synthesis can potentially provide infinite training data by rendering 3D CAD models from any camera viewpoint [23, 34, 40]. Fischer et al. [8] generate a synthetic \u201cFlying Chairs\u201d dataset to learn optical flow with a CNN and show that networks trained on this unrealistic data still generalize very well to existing datasets. In the context of scene text recognition, Jaderberg et al. [14] trained solely on data produced by a synthetic text generation engine. In this case, the synthetic data is highly realistic and sufficient to replace real data. Although synthesis seems like an appealing solution, there often exists a large domain shift from synthetic to real data [23]. Integrating a human 3D model in a given background in a realistic way is not trivial. Rendering a collection of photo-realistic images (in terms of color, texture, context, shadow) that would cover the variations in pose, body shape, clothing and scenes is a challenging task.\nInstead of rendering a human 3D model, we propose an image-based synthesis approach that makes use of Motion Capture (MoCap) data to augment an existing dataset of real images with 2D pose annotations. Our system synthesizes a very large number of new in-the-wild images showing more pose configurations and, importantly, it provides the corresponding 3D pose annotations (see Fig. 
1).\nFor each candidate 3D pose in the MoCap library, our system combines several annotated images to\ngenerate a synthetic image of a human in this particular pose. This is achieved by \u201ccopy-pasting\u201d the\nimage information corresponding to each joint in a kinematically constrained manner. Given this\nlarge \u201cin-the-wild\u201d dataset, we implement an end-to-end CNN architecture for 3D pose estimation.\nOur approach \ufb01rst clusters the 3D poses into K pose classes. Then, a K-way CNN classi\ufb01er is trained\nto return a distribution over probable pose classes given a bounding box around the human in the\nimage. Our method outperforms state-of-the-art results in terms of 3D pose estimation in controlled\nenvironments and shows promising results on images captured \u201cin-the-wild\u201d.\n\nFigure 1: Image-based synthesis engine. Input: real images with manual annotation of 2D poses, and 3D poses\ncaptured with a Motion Capture (MoCap) system. Output: 220x220 synthetic images and associated 3D poses.\n\n1.1 Related work\n\n3D human pose estimation in monocular images. Recent approaches employ CNNs for 3D pose\nestimation in monocular images [20] or in videos [44]. Due to the lack of large scale training data,\nthey are usually trained (and tested) on 3D MoCap data in constrained environments [20]. Pose\nunderstanding in natural images is usually limited to 2D pose estimation [7, 36, 37]. Recent work\nalso tackles 3D pose understanding from 2D poses [2, 10]. Some approaches use as input the 2D\njoints automatically provided by a 2D pose detector [32, 38], while others jointly solve the 2D and\n3D pose estimation [31, 43]. Most similar to ours is the approach of Iqbal et al. [42] who use a\ndual-source approach that combines 2D pose estimation with 3D pose retrieval. Our method uses\nthe same two training sources, i.e., images with annotated 2D pose and 3D MoCap data. 
However, we combine both sources off-line to generate a large training set that is used to train an end-to-end CNN 3D pose classifier. This is shown to improve over [42], which can be explained by the fact that training is performed in an end-to-end fashion.\nSynthetic pose data. A number of works have considered the use of synthetic data for human pose estimation. Synthetic data have been used for upper body [29], full-body silhouettes [1], hand-object interactions [28], full-body pose from depth [30] or egocentric RGB-D scenes [27]. Recently, Zuffi and Black [45] used a 3D mesh-model to sample synthetic exemplars and fit 3D scans. In [11], a scene-specific pedestrian detector was learned without real data, while [9] synthesized virtual samples with a generative model to enhance the classification performance of a discriminative model. In [12], pictures of 2D characters were animated by fitting and deforming a 3D mesh model. Later, [25] augmented labelled training images with small perturbations in a similar way. These methods require a perfect segmentation of the humans in the images. Park and Ramanan [22] synthesized hypothetical poses for tracking purposes by applying geometric transformations to the first frame of a video sequence. We also use image-based synthesis to generate images, but our rendering engine combines image regions from several images to create images with associated 3D poses.\n\n2 Image-based synthesis engine\n\nAt the heart of our approach is an image-based synthesis engine that artificially generates \u201cin-the-wild\u201d images with 3D pose annotations. Our method takes as input a dataset of real images with 2D annotations and a library of 3D Motion Capture (MoCap) data, and generates a large number of synthetic images with associated 3D poses (Fig. 1). We introduce an image-based rendering engine that augments the existing database of annotated images with a very large set of photorealistic images covering more body pose configurations than the original set. This is done by selecting and stitching image patches in a kinematically constrained manner using the MoCap 3D poses. Our synthesis process consists of two stages: a MoCap-guided mosaic construction stage that stitches image patches together and a pose-aware blending process that improves image quality and erases patch seams. These are discussed in the following subsections. Fig. 2 summarizes the overall process.\n\nFigure 2: Synthesis engine. From left to right: for each joint j of a 2D query pose p (centered in a 220 \u00d7 220 bounding box), we align all the annotated 2D poses w.r.t. the limb and search for the best pose match, obtaining a list of n matches {(I'_j, q'_j), j = 1...n}, where I'_j is obtained after transforming I_j with T_{q_j \u2192 q'_j}. For each retrieved pair, we compute a probability map p_j[u, v]. These n maps are used to compute index[u, v] \u2208 {1...n}, pointing to the image I'_j that should be used for a particular pixel (u, v). Finally, our blending algorithm computes each pixel value of the synthetic image M[u, v] as the weighted sum over all aligned images I'_j, the weights being calculated using a histogram of indexes in a squared region R_{u,v} around (u, v).\n\n2.1 MoCap-guided image mosaicing\n\nGiven a 3D pose with n joints P \u2208 R^{n\u00d73}, and its projected 2D joints p = {p_j, j = 1...n} in a particular camera view, we want to find for each joint j \u2208 {1...n} an image whose annotated 2D pose presents a similar kinematic configuration around j. 
To do so, we define a distance function between two different 2D poses p and q, conditioned on joint j, as:\n\nD_j(p, q) = \u2211_{k=1}^{n} d_E(p_k, q'_k)    (1)\n\nwhere d_E is the Euclidean distance. q' is the aligned version of q with respect to joint j after applying a rigid transformation T_{q_j \u2192 q'_j}, which respects q'_j = p_j and q'_i = p_i, where i is the farthest directly connected joint to j in p. This function D_j measures the similarity around joint j by aligning the poses there and taking the entire poses into account. To increase the influence of neighboring joints, we weight the distances d_E between each pair of joints {(p_k, q'_k), k = 1...n} according to their distance to the query joint j in both poses. Eq. 1 becomes:\n\nD_j(p, q) = \u2211_{k=1}^{n} (w^j_k(p) + w^j_k(q)) d_E(p_k, q'_k)    (2)\n\nwhere the weight w^j_k is inversely proportional to the distance between joint k and the query joint j, i.e., w^j_k(p) = 1/d_E(p_k, p_j), normalized so that \u2211_k w^j_k(p) = 1. For each joint j of the query pose p, we retrieve from our dataset Q = {(I_1, q_1) . . . (I_N, q_N)} of images and annotated 2D poses1:\n\nq_j = argmin_{q \u2208 Q} D_j(p, q)  \u2200j \u2208 {1...n}.    (3)\n\nWe obtain a list of n matches {(I'_j, q'_j), j = 1...n}, where I'_j is the cropped image obtained after transforming I_j with T_{q_j \u2192 q'_j}. Note that a same pair (I, q) can appear multiple times in the list of candidates, i.e., being a good match for several joints.\n\n1 In practice, we do not search for occluded joints.\n\nFinally, to render a new image, we need to select the candidate image I'_j to be used for each pixel (u, v). Instead of using regular patches, we compute a probability map p_j[u, v] associated with each pair (I'_j, q'_j) based on local matches measured by d_E(p_k, q'_k) in Eq. 1. 
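As an illustration, the joint-conditioned matching of Eqs. (1)-(3) can be sketched in a few lines. This is a hypothetical implementation, not the authors' code: it assumes the rigid transformation fixed by the two joint correspondences is a 2D similarity transform, and it excludes the query joint itself from the weighting of Eq. (2), since its weight would involve a division by d_E(p_j, p_j) = 0; all names are illustrative.

```python
import numpy as np

def similarity_from_two_points(src_a, src_b, dst_a, dst_b):
    """Return a function applying the 2D similarity transform (rotation +
    uniform scale + translation) mapping src_a -> dst_a and src_b -> dst_b.
    Points are treated as complex numbers: z -> alpha * z + beta."""
    alpha = (complex(*dst_b) - complex(*dst_a)) / (complex(*src_b) - complex(*src_a))
    beta = complex(*dst_a) - alpha * complex(*src_a)

    def apply(pts):
        z = pts[:, 0] + 1j * pts[:, 1]
        w = alpha * z + beta
        return np.stack([w.real, w.imag], axis=1)

    return apply

def joint_conditioned_distance(p, q, j, i):
    """D_j(p, q) of Eq. (2): align q so that q'_j = p_j and q'_i = p_i,
    where i is the farthest directly connected joint to j, then sum the
    per-joint Euclidean distances weighted by inverse distance to joint j."""
    align = similarity_from_two_points(q[j], q[i], p[j], p[i])
    q_aligned = align(q)
    d = np.linalg.norm(p - q_aligned, axis=1)                 # d_E(p_k, q'_k)
    w_p = 1.0 / np.maximum(np.linalg.norm(p - p[j], axis=1), 1e-9)
    w_q = 1.0 / np.maximum(np.linalg.norm(q_aligned - q_aligned[j], axis=1), 1e-9)
    w_p[j] = w_q[j] = 0.0          # the query joint contributes d_E = 0 anyway
    w_p, w_q = w_p / w_p.sum(), w_q / w_q.sum()               # each sums to 1
    return float(((w_p + w_q) * d).sum())
```

With this distance in hand, the retrieval of Eq. (3) amounts to an argmin of joint_conditioned_distance over the annotated dataset, once per joint of the query pose.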
To compute these probability maps, we first apply a Delaunay triangulation to the set of 2D joints in {q'_j}, obtaining a partition of the image into triangles, according to the selected pose. Then, we assign the probability p_j(q'_k) = exp(\u2212d_E(p_k, q'_k)^2/\u03c3^2) to each vertex q'_k. We finally compute a probability map p_j[u, v] by interpolating values from these vertices using barycentric interpolation inside each triangle. The resulting n probability maps are concatenated and an index map index[u, v] \u2208 {1...n} can be computed as follows:\n\nindex[u, v] = argmax_{j \u2208 {1...n}} p_j[u, v],    (4)\n\nthis map pointing to the training image I'_j that should be used for each pixel (u, v). A mosaic M[u, v] can be generated by \u201ccopy-pasting\u201d image information at pixel (u, v) indicated by index[u, v]:\n\nM[u, v] = I'_{j*}[u, v] with j* = index[u, v].    (5)\n\n2.2 Pose-aware image blending\n\nThe mosaic M[u, v] resulting from the previous stage presents significant artifacts at the boundaries between image regions. Smoothing is necessary to prevent the learning algorithm from interpreting these artifacts as discriminative pose-related features. We first experimented with off-the-shelf image filtering and alpha blending algorithms, but the results were not satisfactory. Instead, we propose a new pose-aware blending algorithm that maintains image information on the human body while erasing most of the stitching artifacts. For each pixel (u, v), we select a surrounding squared region R_{u,v} whose size varies with the distance of pixel (u, v) to the pose: R_{u,v} will be larger when far from the body and smaller nearby. Then, we evaluate how much each image I'_j should contribute to the value of pixel (u, v) by building a histogram of the image indexes inside the region R_{u,v}:\n\nw_j[u, v] = Hist(index(R_{u,v}))  \u2200j \u2208 {1...n},    (6)\n\nwhere the weights are normalized so that \u2211_j w_j[u, v] = 1. The final mosaic M[u, v] (see examples in Fig. 1) is then computed as the weighted sum over all aligned images:\n\nM[u, v] = \u2211_j w_j[u, v] I'_j[u, v].    (7)\n\nThis procedure produces plausible images that are kinematically correct and locally photorealistic.\n\n3 CNN for full-body 3D pose estimation\n\nHuman pose estimation has been addressed as a classification problem in the past [4, 21, 27, 26]. Here, the 3D pose space is partitioned into K clusters and a K-way classifier is trained to return a distribution over pose classes. Such a classification approach allows modeling multimodal outputs in ambiguous cases, and produces multiple hypotheses that can be rescored, e.g. using temporal information. Training such a classifier requires a reasonable amount of data per class, which implies a well-defined and limited pose space (e.g. walking action) [26, 4], a large-scale synthetic dataset [27] or both [21]. Here, we introduce a CNN-based classification approach for full-body 3D pose estimation. Inspired by the DeepPose algorithm [37], where the AlexNet CNN architecture [19] is used for full-body 2D pose regression, we select the same architecture and adapt it to the task of 3D body pose classification. This is done by adapting the last fully-connected layer to output a distribution of scores over pose classes as illustrated in Fig. 3. Training such a classifier requires a large amount of training data that we generate using our image-based synthesis engine.\nGiven a library of MoCap data and a set of camera views, we synthesize for each 3D pose a 220 \u00d7 220 image. This size has proved to be adequate for full-body pose estimation [37]. The 3D poses are then aligned with respect to the camera center and translated to the center of the torso. In that way, we obtain orientated 3D poses that also contain the viewpoint information. 
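To make the pose-aware blending of Eqs. (6)-(7) concrete, here is a minimal sketch. It is illustrative rather than the authors' implementation: the pose-dependent size of the region R_{u,v} is passed in precomputed, since its exact dependence on the distance to the body is not specified in closed form, and all names are assumptions.

```python
import numpy as np

def pose_aware_blend(aligned_images, index_map, half_size):
    """Pose-aware blending, Eqs. (6)-(7): each pixel of the final mosaic is a
    weighted sum of the n aligned images, the weight of image j being the
    frequency of index j in a surrounding square region R_{u,v} of the index
    map. half_size[u, v] is the half-width of R_{u,v} (larger far from the
    body, smaller nearby)."""
    n = len(aligned_images)
    H, W = index_map.shape
    out = np.zeros_like(np.asarray(aligned_images[0], dtype=np.float64))
    for u in range(H):
        for v in range(W):
            r = int(half_size[u, v])
            region = index_map[max(0, u - r):u + r + 1, max(0, v - r):v + r + 1]
            hist = np.bincount(region.ravel(), minlength=n).astype(np.float64)
            w = hist / hist.sum()          # normalized weights w_j[u, v], Eq. (6)
            for j in np.flatnonzero(w):
                out[u, v] += w[j] * aligned_images[j][u, v]   # Eq. (7)
    return out
```

The double loop is written for clarity; near patch seams the histogram mixes several source images, which is exactly what smooths the stitching artifacts away.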
We cluster the resulting 3D poses to define our classes, which correspond to groups of similar orientated 3D poses. We empirically found that K = 5000 clusters is sufficient. For evaluation, we return the average 2D and 3D poses of the top-scoring class.\nTo compare with [37], we also train a holistic pose regressor, which regresses to 2D and 3D poses (not only 2D). To do so, we concatenate the 3D coordinates expressed in meters, normalized to the range [\u22121, 1], with the 2D pose coordinates, also normalized to the range [\u22121, 1] following [37].\n\nFigure 3: CNN-based pose classifier. We show the different layers with their corresponding dimensions, with convolutional layers depicted in blue and fully connected ones in green. The output is a distribution over K pose classes. Pose estimation is obtained by taking the highest score in this distribution. We show on the right the 3D poses for the 3 highest scores.\n\n4 Experiments\n\nWe address 3D pose estimation in the wild. However, there does not exist a dataset of real-world images with 3D annotations. We thus evaluate our method in two different settings using existing datasets: (1) we validate our 3D pose predictions using Human3.6M [13], which provides accurate 3D and 2D poses for 15 different actions captured in a controlled indoor environment; (2) we evaluate on the Leeds Sport dataset (LSP) [16], which presents in-the-wild images together with full-body 2D pose annotations. We demonstrate competitive results with state-of-the-art methods for both of them.\nOur image-based rendering engine requires two different training sources: 1) a 2D source of images with 2D pose annotations and 2) a MoCap 3D source. We consider two different datasets for each: for 3D poses we use the CMU Motion Capture Dataset2 and the Human3.6M 3D poses [13], and for 2D pose annotations the MPII-LSP-extended dataset [24] and the Human3.6M 2D poses and images.\nMoCap 3D source. 
The CMU Motion Capture dataset consists of 2500 sequences and a total of 140,000 3D poses. We align the 3D poses w.r.t. the torso and select a subset of 12,000 poses, ensuring that any two selected poses have at least one joint 5 cm apart. In that way, we densely populate our pose space and avoid repeating common poses (e.g. neutral standing or walking poses, which are over-represented in the dataset). For each of the 12,000 original MoCap poses, we sample 180 random virtual views with azimuth angle spanning 360 degrees and elevation angles in the range [\u221245, 45]. We generate over 2 million pairs of 3D/2D pose configurations (articulated poses + camera position and angle). For Human3.6M, we randomly selected a subset of 190,000 orientated 3D poses, discarding similar poses, i.e., when the average Euclidean distance of the joints is less than 15 mm, as in [42].\n2D source. For the training dataset of real images with 2D pose annotations, we use the MPII-LSP-extended dataset [24], which is a concatenation of the extended LSP [17] and the MPII dataset [3]. Some of the poses were manually corrected, as a non-negligible number of annotations are not accurate enough or completely wrong (e.g., right-left inversions or bad ordering of the joints along a limb). We mirror the images to double the size of the training set, obtaining a total of 80,000 images with 2D pose annotations. For Human3.6M, we consider the 4 cameras and create a pool of 17,000 images and associated 2D poses that we also mirror. We ensure that the most similar poses have at least one joint 5 cm apart in 3D.\n\n4.1 Evaluation on Human3.6M Dataset (H3.6M)\n\nTo compare our results with very recent work in 3D pose estimation [42], we follow the protocol introduced in [18] and employed in [42]: we consider six subjects (S1, S5, S6, S7, S8 and S9) for training, use every 64th frame of subject S11 for testing and evaluate the 3D pose error (mm) averaged over the 13 joints. We refer to this protocol by P1. 
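The 3D error used in these protocols can be sketched as the mean per-joint Euclidean distance, optionally after a Procrustes fit of the prediction to the ground truth. This is an assumed implementation of the "aligned" variant (in particular, including a uniform scale in the alignment is our assumption):

```python
import numpy as np

def mean_joint_error(pred, gt):
    """Absolute 3D pose error: mean Euclidean distance over the joints (mm)."""
    return float(np.linalg.norm(pred - gt, axis=1).mean())

def aligned_pose_error(pred, gt):
    """3D pose error after rigidly aligning pred (n x 3) to gt via Procrustes
    analysis (rotation + translation + uniform scale)."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    P, G = pred - mu_p, gt - mu_g                  # center both poses
    U, S, Vt = np.linalg.svd(P.T @ G)
    D = np.eye(3)
    D[2, 2] = np.sign(np.linalg.det(U @ Vt))       # avoid reflections
    R = U @ D @ Vt                                 # optimal rotation: P @ R ~ G
    s = (S * np.diag(D)).sum() / (P ** 2).sum()    # optimal uniform scale
    aligned = s * P @ R + mu_g
    return mean_joint_error(aligned, gt)
```

A prediction that differs from the ground truth only by a rigid transform and a scale thus scores (numerically) zero under the aligned metric, while its absolute error can remain large.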
As in [42], we consider a 3D pose error that measures the accuracy of the pose after rigid 3D alignment, but we also report the absolute error.\nWe first evaluate the impact of our synthetic data on the performance of both the regressor and the classifier. The results are reported in Tab. 1. We can observe that when considering few training images (17,000), the regressor clearly outperforms the classifier, which, in turn, reaches better performance when trained on larger sets. This can be explained by the fact that the classification approach requires a sufficient amount of examples. We then compare results when training both regressor and classifier on the same 190,000 poses, considering a) synthetic data generated from H3.6M, b) the real images corresponding to the 190,000 poses and c) the synthetic and real images together.\n\n2 http://mocap.cs.cmu.edu\n\nTable 1: 3D pose estimation results on Human3.6M (protocol P1).\n\nMethod | Type of images | 2D source size | 3D source size | Error (mm)\nReg. | Real | 17,000 | 17,000 | 112.9\nClass. | Real | 17,000 | 17,000 | 149.7\nReg. | Synth | 17,000 | 190,000 | 101.9\nClass. | Synth | 17,000 | 190,000 | 97.2\nReg. | Real | 190,000 | 190,000 | 139.6\nClass. | Real | 190,000 | 190,000 | 97.7\nReg. | Synth + Real | 207,000 | 190,000 | 125.5\nClass. | Synth + Real | 207,000 | 190,000 | 88.1\n\nTable 2: Comparison with state-of-the-art results on Human3.6M. The average 3D pose error (mm) is reported before (Abs.) and after rigid 3D alignment for 2 different protocols. See text for details.\n\nMethod | Abs. Error (P1) | Error (P1) | Abs. Error (P2) | Error (P2)\nBo & Sminchisescu [5] | - | 117.9 | - | -\nKostrikov & Gall [18] | - | 115.7 | - | -\nIqbal et al. [42] | - | 108.3 | - | -\nLi et al. [20] | - | - | 121.31 | -\nTekin et al. [35] | - | - | 124.97 | -\nZhou et al. [44] | - | - | 113.01 | -\nOurs | 126 | 88.1 | 121.2 | 87.3\n\n
We observe that the classifier has similar performance when trained on synthetic or real images, which means that our image-based rendering engine synthesizes useful data. Furthermore, we can see that the classifier performs much better when trained on synthetic and real images together. This means that our data is different from the original data and allows the classifier to learn better features. Note that we retrain AlexNet from scratch. We found that it performed better than just fine-tuning a model pre-trained on ImageNet (3D error of 88.1 mm vs 98.3 mm with fine-tuning).\nIn Tab. 2, we compare our results to state-of-the-art approaches. We also report results for a second protocol (P2) employed in [20, 44, 35], where all the frames from subjects S9 and S11 are used for testing and only S1, S5, S6, S7 and S8 are used for training. Our best classifier, trained with a combination of synthetic and real data, outperforms state-of-the-art results in terms of 3D pose estimation for single frames. Zhou et al. [44] report better performance, but they integrate temporal information. Note that our method estimates absolute pose (including orientation w.r.t. the camera), which is not the case for other methods such as Bo et al. [5], who estimate a relative pose and do not provide 3D orientation.\n\n4.2 Evaluation on Leeds Sport Dataset (LSP)\n\nWe now train our pose classifier using different combinations of training sources and use them to estimate 3D poses on images captured in-the-wild, i.e., LSP. Since 3D pose evaluation is not possible on this dataset, we instead compare 2D pose errors expressed in pixels and measure this error on the normalized 220 \u00d7 220 images following [44]. 
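The normalized 2D metric can be sketched as rescaling both poses so that the person bounding box maps onto a 220 × 220 crop before taking the mean per-joint pixel distance. The isotropic-crop convention and all names are assumptions, not the paper's exact code:

```python
import numpy as np

def normalized_2d_error(pred, gt, box, size=220):
    """Mean per-joint 2D error (pixels) after rescaling poses so that the
    person bounding box box = (x0, y0, x1, y1) fits a size x size crop."""
    x0, y0, x1, y1 = box
    scale = size / max(x1 - x0, y1 - y0)           # isotropic normalization
    origin = np.array([x0, y0], dtype=np.float64)
    p = (np.asarray(pred, dtype=np.float64) - origin) * scale
    g = (np.asarray(gt, dtype=np.float64) - origin) * scale
    return float(np.linalg.norm(p - g, axis=1).mean())
```

Normalizing this way makes errors comparable across people imaged at very different scales, which is what allows a single pixel threshold to be reported over the whole test set.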
We compute the average 2D pose error over the 13 joints on both LSP and H3.6M (see Table 3).\nAs expected, we observe that when using a pool of in-the-wild images to generate the synthetic data, the performance increases on LSP and drops on H3.6M, showing the importance of realistic images for good performance in-the-wild and the lack of generalizability of models trained on constrained indoor images. The error slightly increases in both cases when using the same number (190,000) of CMU 3D poses. The same drop was observed by [42] and can be explained by the fact that the CMU data covers a larger portion of the 3D pose space, resulting in a worse fit. The results improve on both test sets when considering more poses and synthetic images (2 million). The larger drop in Abs 3D error and 2D error compared to 3D error means that a better camera view is estimated when using more synthetic data. In all cases, the performance (in pixels) is lower on LSP than on H3.6M due to the fact that the poses observed in LSP differ more from the ones in the CMU MoCap data. In Fig. 4, we visualize the 2D pose error on LSP and Human3.6M 1) for different pools of annotated 2D images, 2) varying the number of synthesized training images and 3) considering different numbers of pose classes K. As expected, using a bigger set of annotated images improves the performance in-the-wild. Pose error converges both on LSP and H3.6M when using 1.5 million images; using more than K = 5000 classes does not further improve the performance.\n\nTable 3: Pose error on LSP and H3.6M using different sources for rendering the synthetic images.\n\n2D source | 3D source | Num. of 3D poses | Abs Error (mm, H3.6M) | Error (mm, H3.6M) | Error (pix, LSP) | Error (pix, H3.6M)\nH3.6M | H3.6M | 190,000 | 130.1 | 97.2 | 31.1 | 8.8\nMPII+LSP | H3.6M | 190,000 | 248.9 | 122.1 | 20.7 | 17.3\nMPII+LSP | CMU | 190,000 | 320.0 | 150.6 | 22.4 | 19.7\nMPII+LSP | CMU | 2,000,000 | 216.5 | 138.0 | 13.8 | 11.2\n\nTable 4: State-of-the-art results on LSP (2D pose error in pixels on normalized 220 \u00d7 220 images).\n\nMethod | Feet | Knees | Hips | Hands | Elbows | Shoulder | Head | All\nWei et al. [39] | 6.6 | 8.6 | 7.0 | 5.3 | 5.2 | 5.3 | 4.8 | 6.2\nPishchulin et al. [24] | 10.0 | 11.1 | 8.2 | 6.8 | 5.7 | 5.9 | 5.0 | 7.6\nChen & Yuille [7] | 15.7 | 15.6 | 12.1 | 11.5 | 8.6 | 6.8 | 8.1 | 11.5\nYang et al. [41] | 15.5 | 14.7 | 12.2 | 11.5 | 8.9 | 7.4 | 8.0 | 11.5\nOurs (AlexNet) | 19.1 | 21.4 | 16.6 | 13.0 | 10.5 | 10.3 | 4.9 | 13.8\nOurs (VGG) | 16.2 | 17.7 | 13.0 | 10.6 | 8.4 | 9.8 | 4.1 | 11.5\n\nFigure 4: 2D pose error on LSP and Human3.6M using different pools of annotated images to generate 2 million synthetic training images (left), varying the number of synthetic training images (center) and considering different numbers of pose classes K (right).\n\nTo further improve the performance, we also experiment with fine-tuning a VGG-16 architecture [33] for pose classification. By doing so, the average (normalized) 2D pose error decreases by 2.3 pixels. In Table 4, we compare our results on LSP to the state-of-the-art 2D pose estimation methods. Although our approach is designed to estimate a coarse 3D pose, its performance is comparable to recent 2D pose estimation methods [7, 41].\nThe qualitative results in Fig. 5 show that our algorithm correctly estimates the global 3D pose. After a visual analysis of the results, we found that failures occur in two cases: 1) when the observed pose does not belong to the MoCap training database, which is a limitation of purely holistic approaches, or 2) when there is a possible right-left or front-back confusion. We observed that in this latter case, subsequent top-scoring poses are often correct. 
This highlights a property of our approach that\ncan keep multiple pose hypotheses which could be rescored adequately, for instance, using temporal\ninformation in videos.\n\n5 Conclusion\n\nIn this paper, we introduce an approach for creating a synthetic training dataset of \u201cin-the-wild\u201d\nimages and their corresponding 3D pose. Our algorithm arti\ufb01cially augments a dataset of real images\nwith new synthetic images showing new poses and, importantly, with 3D pose annotations. We\nshow that CNNs can be trained on arti\ufb01cial images and generalize well to real images. We train\nan end-to-end CNN classi\ufb01er for 3D pose estimation and show that, with our synthetic training\nimages, our method outperforms state-of-the-art results in terms of 3D pose estimation in controlled\nenvironments and shows promising results for in-the-wild images (LSP). In this paper, we have\n\n7\n\n\fFigure 5: Qualitative results on LSP. We show correct 3D pose estimations (top 2 rows) and typical failure\ncases (bottom row) corresponding to unseen poses or right-left and front-back confusions.\n\nestimated a coarse 3D pose by returning the average pose of the top scoring cluster. In future work,\nwe will investigate how top scoring classes could be re-ranked and also how the pose could be re\ufb01ned.\n\nAcknowledgments. This work was supported by the European Commission under FP7 Marie Curie\nIOF grant (PIOF-GA-2012-328288) and partially supported by ERC advanced grant Allegro. We\nacknowledge the support of NVIDIA with the donation of the GPUs used for this research. We thank\nP. Weinzaepfel for his help and the anonymous reviewers for their comments and suggestions.\n\nReferences\n[1] A. Agarwal and B. Triggs. Recovering 3D human pose from monocular images. PAMI, 28(1):44\u201358, 2006.\n[2] I. Akhter and M. Black. Pose-conditioned joint angle limits for 3D human pose reconstruction. In CVPR,\n\n2015.\n\n[3] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 
2D human pose estimation: New benchmark and\n\nstate-of- the-art analysis. In CVPR, 2014.\n\n[4] A. Bissacco, M.-H. Yang, and S. Soatto. Detecting humans via their pose. In NIPS, 2006.\n[5] L. Bo and C. Sminchisescu. Twin Gaussian processes for structured prediction. IJCV, 87(1-2):28\u201352,\n\n2010.\n\n[6] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3D human pose annotations. In ICCV,\n\n2009.\n\n[7] X. Chen and A. L. Yuille. Articulated pose estimation by a graphical model with image dependent pairwise\n\nrelations. In NIPS, 2014.\n\n[8] A. Dosovitskiy, P. Fischer, E. Ilg, P. H\u00e4usser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and\n\nT. Brox. Flownet: Learning optical \ufb02ow with convolutional networks. In ICCV, 2015.\n\n[9] M. Enzweiler and D. M. Gavrila. A mixed generative-discriminative framework for pedestrian classi\ufb01cation.\n\nIn CVPR, 2008.\n\n[10] X. Fan, K. Zheng, Y. Zhou, and S. Wang. Pose locality constrained representation for 3D human pose\n\nreconstruction. In ECCV, 2014.\n\n[11] H. Hattori, V. N. Boddeti, K. M. Kitani, and T. Kanade. Learning scene-speci\ufb01c pedestrian detectors\n\nwithout real data. In CVPR, 2015.\n\n[12] A. Hornung, E. Dekkers, and L. Kobbelt. Character animation from 2D pictures and 3D motion data. ACM\n\nTrans. Graph., 26(1), 2007.\n\n[13] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6m: Large scale datasets and predictive\n\nmethods for 3D human sensing in natural environments. PAMI, 36(7):1325\u20131339, 2014.\n\n[14] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Reading text in the wild with convolutional\n\nneural networks. IJCV, 116(1):1\u201320, 2016.\n\n[15] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. NIPS,\n\n2015.\n\n8\n\n\f[16] S. Johnson and M. Everingham. Clustered pose and nonlinear appearance models for human pose\n\nestimation. In BMVC, 2010.\n\n[17] S. Johnson and M. Everingham. 
Learning effective human pose estimation from inaccurate annotation. In CVPR, 2011.
[18] I. Kostrikov and J. Gall. Depth sweep regression forests for estimating 3D human pose from images. In BMVC, 2014.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[20] S. Li, W. Zhang, and A. B. Chan. Maximum-margin structured learning with deep networks for 3D human pose estimation. In ICCV, 2015.
[21] R. Okada and S. Soatto. Relevant feature selection for human pose estimation and localization in cluttered images. In ECCV, 2008.
[22] D. Park and D. Ramanan. Articulated pose estimation with tiny synthetic videos. In CVPRW, 2015.
[23] X. Peng, B. Sun, K. Ali, and K. Saenko. Learning deep object detectors from 3D models. In ICCV, 2015.
[24] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. V. Gehler, and B. Schiele. DeepCut: Joint subset partition and labeling for multi person pose estimation. In CVPR, 2016.
[25] L. Pishchulin, A. Jain, M. Andriluka, T. Thormählen, and B. Schiele. Articulated people detection and pose estimation: Reshaping the future. In CVPR, 2012.
[26] G. Rogez, J. Rihan, C. Orrite, and P. Torr. Fast human pose detection using randomized hierarchical cascades of rejectors. IJCV, 99(1):25–52, 2012.
[27] G. Rogez, J. Supancic, and D. Ramanan. First-person pose recognition using egocentric workspaces. In CVPR, 2015.
[28] J. Romero, H. Kjellstrom, and D. Kragic. Hands in action: Real-time 3D reconstruction of hands in interaction with objects. In ICRA, 2010.
[29] G. Shakhnarovich, P. A. Viola, and T. Darrell. Fast pose estimation with parameter-sensitive hashing. In ICCV, 2003.
[30] J. Shotton, A. W. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In CVPR, 2011.
[31] E.
Simo-Serra, A. Quattoni, C. Torras, and F. Moreno-Noguer. A joint model for 2D and 3D pose estimation from a single image. In CVPR, 2013.
[32] E. Simo-Serra, A. Ramisa, G. Alenyà, C. Torras, and F. Moreno-Noguer. Single image 3D human pose estimation from noisy observations. In CVPR, 2012.
[33] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[34] H. Su, C. Ruizhongtai Qi, Y. Li, and L. J. Guibas. Render for CNN: Viewpoint estimation in images using CNNs trained with rendered 3D model views. In ICCV, 2015.
[35] B. Tekin, A. Rozantsev, V. Lepetit, and P. Fua. Direct prediction of 3D body poses from motion compensated sequences. In CVPR, 2016.
[36] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS, 2014.
[37] A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In CVPR, 2014.
[38] C. Wang, Y. Wang, Z. Lin, A. L. Yuille, and W. Gao. Robust estimation of 3D human poses from a single image. In CVPR, 2014.
[39] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016.
[40] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In CVPR, 2015.
[41] W. Yang, W. Ouyang, H. Li, and X. Wang. End-to-end learning of deformable mixture of parts and deep convolutional neural networks for human pose estimation. In CVPR, 2016.
[42] H. Yasin, U. Iqbal, B. Krüger, A. Weber, and J. Gall. A dual-source approach for 3D pose estimation from a single image. In CVPR, 2016.
[43] F. Zhou and F. De la Torre. Spatio-temporal matching for human detection in video. In ECCV, 2014.
[44] X. Zhou, M. Zhu, S. Leonardos, K. Derpanis, and K. Daniilidis.
Sparseness meets deepness: 3D human pose estimation from monocular video. In CVPR, 2016.
[45] S. Zuffi and M. J. Black. The stitched puppet: A graphical model of 3D human shape and pose. In CVPR, 2015.