{"title": "Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 1799, "page_last": 1807, "abstract": "This paper proposes a new hybrid architecture that consists of a deep Convolutional Network and a Markov Random Field. We show how this architecture is successfully applied to the challenging problem of articulated human pose estimation in monocular images. The architecture can exploit structural domain constraints such as geometric relationships between body joint locations. We show that joint training of these two model paradigms improves performance and allows us to significantly outperform existing state-of-the-art techniques.", "full_text": "Joint Training of a Convolutional Network and a\n\nGraphical Model for Human Pose Estimation\n\nJonathan Tompson, Arjun Jain, Yann LeCun, Christoph Bregler\n\n{tompson, ajain, yann, bregler}@cs.nyu.edu\n\nNew York University\n\nAbstract\n\nThis paper proposes a new hybrid architecture that consists of a deep Convolu-\ntional Network and a Markov Random Field. We show how this architecture is\nsuccessfully applied to the challenging problem of articulated human pose esti-\nmation in monocular images. The architecture can exploit structural domain con-\nstraints such as geometric relationships between body joint locations. 
We show\nthat joint training of these two model paradigms improves performance and allows\nus to signi\ufb01cantly outperform existing state-of-the-art techniques.\n\n1\n\nIntroduction\n\nDespite a long history of prior work, human body pose estimation, or speci\ufb01cally the localization\nof human joints in monocular RGB images, remains a very challenging task in computer vision.\nComplex joint inter-dependencies, partial or full joint occlusions, variations in body shape, clothing\nor lighting, and unrestricted viewing angles result in a very high dimensional input space, making\nnaive search methods intractable.\nRecent approaches to this problem fall into two broad categories: 1) more traditional deformable\npart models [27] and 2) deep-learning based discriminative models [15, 30]. Bottom-up part-based\nmodels are a common choice for this problem since the human body naturally segments into articu-\nlated parts. Traditionally these approaches have relied on the aggregation of hand-crafted low-level\nfeatures such as SIFT [18] or HoG [7], which are then input to a standard classi\ufb01er or a higher level\ngenerative model. Care is taken to ensure that these engineered features are sensitive to the part that\nthey are trying to detect and are invariant to numerous deformations in the input space (such as vari-\nations in lighting). On the other hand, discriminative deep-learning approaches learn an empirical\nset of low and high-level features which are typically more tolerant to variations in the training set\nand have recently outperformed part-based models [27]. 
However, incorporating priors about the\nstructure of the human body (such as our prior knowledge about joint inter-connectivity) into such\nnetworks is dif\ufb01cult since the low-level mechanics of these networks is often hard to interpret.\nIn this work we attempt to combine a Convolutional Network (ConvNet) Part-Detector \u2013 which\nalone outperforms all other existing methods \u2013 with a part-based Spatial-Model into a uni\ufb01ed learn-\ning framework. Our translation-invariant ConvNet architecture utilizes a multi-resolution feature\nrepresentation with overlapping receptive \ufb01elds. Additionally, our Spatial-Model is able to approx-\nimate MRF loopy belief propagation, which is subsequently back-propagated through, and learned\nusing the same learning framework as the Part-Detector. We show that the combination and joint\ntraining of these two models improves performance, and allows us to signi\ufb01cantly outperform exist-\ning state-of-the-art models on the task of human body pose recognition.\n\n\f2 Related Work\n\nFor unconstrained image domains, many architectures have been proposed, including \u201cshape-\ncontext\u201d edge-based histograms from the human body [20] or just silhouette features [13]. Many\ntechniques have been proposed that extract, learn, or reason over entire body features. Some use a\ncombination of local detectors and structural reasoning ([25] for coarse tracking and [5] for person-\ndependent tracking). In a similar spirit, more general techniques using \u201cPictorial Structures\u201d such\nas the work by Felzenszwalb et al. [10] made this approach tractable with so-called \u2018Deformable\nPart Models (DPM)\u2019. Subsequently a large number of related models were developed [1, 9, 31, 8].\nAlgorithms which model more complex joint relationships, such as Yang and Ramanan [31], use\na \ufb02exible mixture of templates modeled by linear SVMs. 
Johnson and Everingham [16] employ\na cascade of body part detectors to obtain more discriminative templates. Most recent approaches\naim to model higher-order part relationships. Pishchulin et al. [23, 24] propose a model that augments\nthe DPM model with Poselet [3] priors. Sapp and Taskar [27] propose a multi-modal model which\nincludes both holistic and local cues for mode selection and pose estimation. Following the Pose-\nlets approach, the Armlets approach by Gkioxari et al. [12] employs a semi-global classi\ufb01er for part\ncon\ufb01guration, and shows good performance on real-world data; however, it is tested only on arms.\nFurthermore, all these approaches suffer from the fact that they use hand-crafted features such as\nHoG features, edges, contours, and color histograms.\nThe best performing algorithms today for many vision tasks, and human pose estimation in partic-\nular ([30, 15, 29]), are based on deep convolutional networks. Toshev et al. [30] show state-of-the-art\nperformance on the \u2018FLIC\u2019 [27] and \u2018LSP\u2019 [17] datasets. However, their method suffers from inac-\ncuracy in the high-precision region, which we attribute to inef\ufb01cient direct regression of pose vectors\nfrom images, which is a highly non-linear and dif\ufb01cult-to-learn mapping.\nJoint training of neural-networks and graphical models has been previously reported by Ning et\nal. [22] for image segmentation, and by various groups in speech and language modeling [4, 21].\nTo our knowledge no such model has been successfully used for the problem of detecting and lo-\ncalizing body part positions of humans in images. Recently, Ross et al. [26] use a message-passing\ninspired procedure for structured prediction on computer vision tasks, such as 3D point cloud clas-\nsi\ufb01cation and 3D surface estimation from single images. 
In contrast to this work, we formulate our\nmessage-passing inspired network in a way that is more amenable to back-propagation and so can be\nimplemented in existing neural networks. Heitz et al. [14] train a cascade of off-the-shelf classi\ufb01ers\nfor simultaneously performing object detection, region labeling, and geometric reasoning. However,\nbecause of the forward nature of the cascade, a later classi\ufb01er is unable to encourage earlier ones\nto focus their effort on \ufb01xing certain error modes, or allow the earlier classi\ufb01ers to ignore mistakes\nthat can be undone by classi\ufb01ers further in the cascade. Bergtholdt et al. [2] propose an approach\nfor object class detection using a parts-based model where they are able to create a fully connected\ngraph on parts and perform MAP-inference using A\u2217 search, but rely on SIFT and color features to\ncreate the unary and pairwise potentials.\n\n3 Model\n\n3.1 Convolutional Network Part-Detector\n\nFigure 1: Multi-Resolution Sliding-Window With Overlapping Receptive Fields\n\n\fThe \ufb01rst stage of our detection pipeline is a deep ConvNet architecture for body part localization.\nThe input is an RGB image containing one or more people and the output is a heat-map, which\nproduces a per-pixel likelihood for key joint locations on the human skeleton.\nA sliding-window ConvNet architecture is shown in Fig 1. The network is slid over the input image\nto produce a dense heat-map output for each body-joint. Our model incorporates a multi-resolution\ninput with overlapping receptive \ufb01elds. The upper convolution bank in Fig 1 sees a standard 64x64\nresolution input window, while the lower bank sees a larger 128x128 input context down-sampled\nto 64x64. 
The input images are then Local Contrast Normalized (LCN [6]) (after down-sampling\nwith anti-aliasing in the lower resolution bank) to produce an approximate Laplacian pyramid. The\nadvantage of using overlapping contexts is that it allows the network to see a larger portion of the\ninput image with only a moderate increase in the number of weights. The role of the Laplacian\nPyramid is to provide each bank with non-overlapping spectral content which minimizes network\nredundancy.\n\nFigure 2: Ef\ufb01cient Sliding Window Model with Single Receptive Field\n\nAn advantage of the Sliding-Window model (Fig 1) is that the detector is translation invariant.\nHowever a major drawback is that evaluation is expensive due to redundant convolutions. Recent\nwork [11, 28] has addressed this problem by performing the convolution stages on the full input\nimage to ef\ufb01ciently create dense feature maps. These dense feature maps are then processed through\nconvolution stages to replicate the fully-connected network at each pixel. An equivalent but ef\ufb01cient\nversion of the sliding window model for a single resolution bank is shown in Fig 2. Note that due\nto pooling in the convolution stages, the output heat-map will be a lower resolution than the input\nimage.\nFor our Part-Detector, we combine an ef\ufb01cient sliding window-based architecture with multi-\nresolution and overlapping receptive \ufb01elds; the subsequent model is shown in Fig 3. Since the\nlarge context (low resolution) convolution bank requires a stride of 1/2 pixels in the lower resolution\nimage to produce the same dense output as the sliding window model, the bank must process four\ndown-sampled images, each with a 1/2 pixel offset, using shared weight convolutions. 
These four\noutputs, along with the high resolution convolutional features, are processed through a 9x9 convolu-\ntion stage (with 512 output features) using the same weights as the \ufb01rst fully connected stage (Fig 1),\nand then the outputs of the low resolution bank are added and interleaved with the output of the high\nresolution bank.\nTo improve training time we simplify the above architecture by replacing the lower-resolution stage\nwith a single convolution bank as shown in Fig 4 and then upscale the resulting feature map. In our\npractical implementation we use 3 resolution banks. Note that the simpli\ufb01ed architecture is no longer\nequivalent to the original sliding-window network of Fig 1 since the lower resolution convolution\nfeatures are effectively decimated and replicated leading into the fully-connected stage; however, we\nhave found empirically that the performance loss is minimal.\nSupervised training of the network is performed using batched Stochastic Gradient Descent (SGD)\nwith Nesterov Momentum. We use a Mean Squared Error (MSE) criterion to minimize the distance\nbetween the predicted output and a target heat-map. The target is a 2D Gaussian with a small\nvariance and mean centered at the ground-truth joint locations. 
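The target construction described above can be sketched in a few lines of numpy. This is a minimal sketch, assuming a 90x60 output heat-map (the model's output resolution) and an illustrative sigma, since the paper only specifies "a small variance":

```python
import numpy as np

def gaussian_target(height, width, joint_uv, sigma=1.5):
    """Render the 2D Gaussian target heat-map for one joint.

    `sigma` is illustrative; the paper only states that the target
    Gaussian has "a small variance".
    """
    u, v = joint_uv                      # (x, y) ground-truth pixel location
    xs = np.arange(width)[None, :]       # 1 x W column indices
    ys = np.arange(height)[:, None]      # H x 1 row indices
    return np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2.0 * sigma ** 2))

def mse_criterion(pred, target):
    """Mean Squared Error between a predicted and a target heat-map."""
    return float(np.mean((pred - target) ** 2))

# A 90x60 target centred on a hypothetical wrist location (45, 30):
target = gaussian_target(height=60, width=90, joint_uv=(45, 30))
```

The heat-map peaks at exactly 1.0 on the ground-truth pixel and decays smoothly, which gives the MSE criterion a dense, well-conditioned training signal compared to regressing pose vectors directly.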
At training time we also perform\nrandom perturbations of the input images (randomly \ufb02ipping and scaling the images) to increase\ngeneralization performance.\n\n\fFigure 3: Ef\ufb01cient Sliding Window Model with Overlapping Receptive Fields\n\nFigure 4: Approximation of Fig 3\n\n3.2 Higher-Level Spatial-Model\n\nOn our validation set, the Part-Detector (Section 3.1) predicts heat-maps that contain\nmany false positives and poses that are anatomically incorrect; for instance, when a peak for face de-\ntection is unusually far from a peak in the corresponding shoulder detection. Therefore, in spite of\nthe improved Part-Detector context, the feed-forward network still has dif\ufb01culty learning an implicit\nmodel of the constraints of the body parts for the full range of body poses. We use a higher-level\nSpatial-Model to constrain joint inter-connectivity and enforce global pose consistency. The ex-\npectation of this stage is not to improve the accuracy of detections that are already close to the\nground-truth pose, but to remove false positive outliers that are anatomically incorrect.\nSimilar to Jain et al. [15], we formulate the Spatial-Model as an MRF-like model over the distribu-\ntion of spatial locations for each body part. However, the biggest drawback of their model is that the\nbody part priors and the graph structure are explicitly hand-crafted. On the other hand, we learn the\nprior model and, implicitly, the structure of the spatial model. Unlike [15], we start by connecting\nevery body part to itself and to every other body part in a pair-wise fashion in the spatial model to\ncreate a fully connected graph. The Part-Detector (Section 3.1) provides the unary potentials for\neach body part location. 
The pair-wise potentials in the graph are computed using convolutional\npriors, which model the conditional distribution of the location of one body part given another. For\ninstance, given that body part B is located at the center pixel, the convolution prior pA|B (i, j) is the\nlikelihood of body part A occurring in pixel location (i, j). For a body part A, we calculate the\n\ufb01nal marginal likelihood \u00afpA as:\n\n\u00afpA = (1/Z) \u220fv\u2208V (pA|v \u2217 pv + bv\u2192A)    (1)\n\nwhere v is the joint location, pA|v is the conditional prior described above, bv\u2192A is a bias term used\nto describe the background probability for the message from joint v to A, and Z is the partition\nfunction. Evaluation of Eq 1 is analogous to a single round of sum-product belief propagation.\nConvergence to a global optimum is not guaranteed given that our spatial model is not tree structured.\nHowever, as can be seen in our results (Fig 8b), the inferred solution is suf\ufb01ciently accurate for\nall poses in our datasets. The learned pair-wise distributions become purely uniform when a pairwise\nedge should be removed from the graph structure. Fig 5 shows a practical example of how the\nSpatial-Model is able to remove an anatomically incorrect strong outlier from the face heat-map by\nincorporating the presence of a strong shoulder detection. 
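A single round of Eq 1 can be sketched directly in numpy. This is a toy illustration, assuming two parts on a 5x5 grid with identity 3x3 prior kernels; the real model uses learned convolutional priors large enough to cover the maximum joint displacement over 90x60 heat-maps:

```python
import numpy as np

def conv2d_same(x, k):
    """Naive zero-padded 'same' 2D convolution (kernel is flipped)."""
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    kf = k[::-1, ::-1]
    out = np.empty_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * kf)
    return out

def message_pass(unaries, priors, biases):
    """One round of Eq 1: p_A = (1/Z) * prod_v (p_{A|v} * p_v + b_{v->A})."""
    marginals = {}
    for A in unaries:
        prod = None
        for v in unaries:
            msg = conv2d_same(unaries[v], priors[(A, v)]) + biases[(A, v)]
            prod = msg if prod is None else prod * msg
        marginals[A] = prod / prod.sum()  # 1/Z normalisation
    return marginals

# Toy setup: a confident face peak and a nearby shoulder peak; every prior is
# set to an identity (delta) kernel purely to keep the example small.
face = np.zeros((5, 5)); face[2, 2] = 1.0
shoulder = np.zeros((5, 5)); shoulder[3, 3] = 1.0
delta = np.zeros((3, 3)); delta[1, 1] = 1.0
priors = {(a, b): delta for a in ("face", "shoulder") for b in ("face", "shoulder")}
biases = {k: 0.1 for k in priors}
marginals = message_pass({"face": face, "shoulder": shoulder}, priors, biases)
```

With learned (non-delta) priors, a shoulder peak in the right relative position reinforces the face marginal while outliers that violate the learned displacement are suppressed, which is exactly the behaviour the Fig 5 example illustrates.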
For simplicity, only the shoulder and face\njoints are shown, however, this example can be extended to incorporate all body part pairs. If the\nshoulder heat-map shown in Fig 5 had an incorrect false-negative (i.e. no detection at the correct\nshoulder location), the addition of the background bias bv\u2192A would prevent the output heat-map\nfrom having no maxima in the detected face region.\n\nFigure 5: Didactic Example of Message Passing Between the Face and Shoulder Joints\n\nFig 5 contains the conditional distributions for face and shoulder parts learned on the FLIC [27]\ndataset. For any part A the distribution PA|A is the identity map, and so the message passed from\nany joint to itself is its unary distribution. Since the FLIC dataset is biased towards front-facing poses\nwhere the right shoulder is directly to the lower right of the face, the model learns the correct spatial\ndistribution between these body parts and has high probability in the spatial locations describing\nthe likely displacement between the shoulder and face. For datasets that cover a larger range of the\npossible poses (for instance the LSP [17] dataset), we would expect these distributions to be less\ntightly constrained, and therefore this simple Spatial-Model will be less effective.\nFor our practical implementation we treat the distributions above as energies to avoid the evalua-\ntion of Z. There are 3 reasons why we do not include the partition function. Firstly, we are only\nconcerned with the maximum output value of our network, and so we only need the output energy\nto be proportional to the normalized distribution. Secondly, since both the part detector and spa-\ntial model parameters contain only shared weight (convolutional) parameters that are equal across\npixel positions, evaluation of the partition function during back-propagation will only add a scalar\nconstant to the gradient weight, which would be equivalent to applying a per-batch learning-rate\nmodi\ufb01er. 
Lastly, since the number of parts is not known a priori (since there can be unlabeled peo-\nple in the image), and since the distributions pv describe the part location of a single person, we\ncannot normalize the Part-Model output. Our \ufb01nal model is a modi\ufb01cation to Eq 1:\n\n\u00afeA = exp( \u2211v\u2208V [ log( SoftPlus(eA|v) \u2217 ReLU(ev) + SoftPlus(bv\u2192A) ) ] )    (2)\n\nwhere: SoftPlus(x) = (1/\u03b2) log(1 + exp(\u03b2x)), 1/2 \u2264 \u03b2 \u2264 2, and\nReLU(x) = max(x, \u03b5), 0 < \u03b5 \u2264 0.01\n\nNote that the above formulation is no longer exactly equivalent to an MRF, but still satisfactorily\nencodes the spatial constraints of Eq 1. The network-based implementation of Eq 2 is shown in\nFig 6. Eq 2 replaces the outer multiplication of Eq 1 with a log space addition to improve numerical\nstability and to prevent coupling of the convolution output gradients (the addition in log space means\nthat the partial derivative of the loss function with respect to the convolution output is not dependent\non the output of any other stages). The inclusion of the SoftPlus and ReLU stages on the weights,\nbiases and input heat-map maintains a strictly greater than zero convolution output, which prevents\nnumerical issues for the values leading into the Log stage. Finally, a SoftPlus stage is used to\nmaintain continuous and non-zero weight and bias gradients during training. With this modi\ufb01ed\nformulation, Eq 2 is trained using back-propagation and SGD.\n\nFigure 6: Single Round Message Passing Network\n\nThe convolution sizes are adjusted so that the largest joint displacement is covered within the con-\nvolution window. 
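The log-space formulation of Eq 2 can also be sketched in numpy. This is a toy sketch, assuming circular FFT convolutions (a simplification of the paper's padded FFT convolutions [19]) and small 4x4 energy maps with single-tap priors:

```python
import numpy as np

def softplus(x, beta=1.0):
    # SoftPlus(x) = (1/beta) * log(1 + exp(beta * x)),  1/2 <= beta <= 2
    return np.log1p(np.exp(beta * np.asarray(x, dtype=float))) / beta

def relu_eps(x, eps=0.01):
    # ReLU(x) = max(x, eps), 0 < eps <= 0.01: keeps energies strictly positive
    return np.maximum(x, eps)

def fft_conv2d(x, k):
    # Circular convolution via FFT; wrap-around is a simplification of the
    # paper's zero-padded FFT convolutions for large kernels.
    return np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(k, s=x.shape)))

def spatial_model(e_unaries, e_priors, e_biases):
    """Eq 2: e_A = exp( sum_v log( SoftPlus(e_{A|v}) * ReLU(e_v) + SoftPlus(b) ) )."""
    out = {}
    for A in e_unaries:
        log_sum = 0.0
        for v in e_unaries:
            msg = fft_conv2d(relu_eps(e_unaries[v]), softplus(e_priors[(A, v)]))
            # SoftPlus/ReLU keep every term strictly positive before the log
            log_sum = log_sum + np.log(msg + softplus(e_biases[(A, v)]))
        out[A] = np.exp(log_sum)
    return out

# Toy 4x4 energies for two parts, identical single-tap priors, zero biases.
e_un = {"face": np.ones((4, 4)), "shoulder": 0.5 * np.ones((4, 4))}
tap = np.array([[1.0]])
e_pr = {(a, b): tap for a in e_un for b in e_un}
e_b = {k: 0.0 for k in e_pr}
energies = spatial_model(e_un, e_pr, e_b)
```

Because every term entering the log is bounded away from zero, the output energies stay finite and positive, which is the numerical property the SoftPlus/ReLU stages are there to guarantee.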
For our 90x60 pixel heat-map output, this results in large 128x128 convolution\nkernels to account for a joint displacement radius of 64 pixels (note that padding is added on the\nheat-map input to prevent pixel loss). Therefore for such large kernels we use FFT convolutions\nbased on the GPU implementation by Mathieu et al. [19].\nThe convolution weights are initialized using the empirical histogram of joint displacements created\nfrom the training examples. This initialization improves learned performance, decreases training\ntime and improves optimization stability. During training we randomly \ufb02ip and scale the heat-map\ninputs to improve generalization performance.\n\n3.3 Uni\ufb01ed Model\n\nSince our Spatial-Model (Section 3.2) is trained using back-propagation, we can combine our Part-\nDetector and Spatial-Model stages in a single Uni\ufb01ed Model. To do so, we \ufb01rst train the Part-\nDetector separately and store the heat-map outputs. We then use these heat-maps to train a Spatial-\nModel. Finally, we combine the trained Part-Detector and Spatial-Models and back-propagate\nthrough the entire network.\nThis uni\ufb01ed \ufb01ne-tuning further improves performance. We hypothesize that because the Spatial-\nModel is able to effectively reduce the output dimension of possible heat-map activations, the Part-\nDetector can use available learning capacity to better localize the precise target activation.\n\n4 Results\n\nThe models from Sections 3.1 and 3.2 were implemented within the Torch7 [6] framework (with\ncustom GPU implementations for the non-standard stages above). Training the Part-Detector takes\napproximately 48 hours, the Spatial-Model 12 hours, and forward-propagation for a single image\nthrough both networks takes 51ms 1.\nWe evaluated our architecture on the FLIC [27] and extended-LSP [17] datasets. These datasets\nconsist of still RGB images with 2D ground-truth joint information generated using Amazon Me-\nchanical Turk. 
The FLIC dataset consists of 5003 images from Hollywood movies with actors\nin predominantly front-facing standing poses (with 1016 images used for testing), while the\nextended-LSP dataset contains a wider variety of poses of athletes playing sport (10442 training and\n1000 test images). The FLIC dataset contains many frames with more than a single person, while\nthe joint locations from only one person in the scene are labeled. Therefore an approximate torso\nbounding box is provided for the single labeled person in the scene. We incorporate this data by\nadding an extra \u201ctorso-joint heat-map\u201d to the input of the Spatial-Model so that it can learn to\nselect the correct feature activations in a cluttered scene.\n\n1We use a 12 CPU workstation with an NVIDIA Titan GPU\n\n\fThe FLIC-full dataset contains 20928 training images; however, many of these training set\nimages contain samples from the 1016 test set scenes and so would allow unfair over-training\non the FLIC test set. Therefore, we propose a new dataset - called FLIC-plus\n(http://cims.nyu.edu/\u223ctompson/\ufb02ic plus.htm) - which is a 17380 image subset of the FLIC-full\ndataset. To create this dataset, we produced unique scene labels for both the FLIC test set and the\nFLIC-full training set using Amazon Mechanical Turk. We then removed all images from the FLIC-full\ntraining set that shared a scene with the test set. Since 253 of the sample images from the original\n3987 FLIC training set came from the same scene as a test set sample (and were therefore removed\nby the above procedure), we added these images back so that the FLIC-plus training set is a superset\nof the original FLIC training set. 
Using this procedure we can guarantee that the additional samples\nin FLIC-plus are suf\ufb01ciently independent of the FLIC test set samples.\nFor evaluation of the test-set performance we use the measure suggested by Sapp et al. [27]. For a\ngiven normalized pixel radius (normalized by the torso height of each sample) we count the number\nof images in the test-set for which the distance of the predicted UV joint location to the ground-truth\nlocation falls within the given radius.\nFig 7a and 7b show our model\u2019s performance on the FLIC test-set for the elbow and wrist joints\nrespectively, trained using both the FLIC and FLIC-plus training sets. Performance on the LSP\ndataset is shown in Fig 7c and 8a. For LSP evaluation we use person-centric (or non-observer-\ncentric) coordinates for fair comparison with prior work [30, 8]. Our model outperforms existing\nstate-of-the-art techniques on both of these challenging datasets by a considerable margin.\n\n(a) FLIC: Elbow\n\n(b) FLIC: Wrist\n\n(c) LSP: Wrist and Elbow\n\nFigure 7: Model Performance (detection rate vs. normalized distance error in pixels)\n\nFig 8b illustrates the performance improvement from our simple Spatial-Model. As expected the\nSpatial-Model has little impact on accuracy for low radii thresholds; however, for large radii it in-\ncreases performance by 8 to 12%. Uni\ufb01ed training of both models (after independent pre-training)\nadds an additional 4-5% detection rate for large radii thresholds.\n\n(a) LSP: Ankle and Knee\n\n(b) FLIC: Wrist\n\n(c) FLIC: Wrist\n\nFigure 8: (a) Model Performance (b) With and Without Spatial-Model (c) Part-Detector Performance\nvs. Number of Resolution Banks (FLIC subset)\n\nThe impact of the number of resolution banks is shown in Fig 8c. As expected, we see a big\nimprovement when multiple resolution banks are added. Also note that the size of the receptive\n\ufb01elds as well as the number and size of the pooling stages in the network have a large impact on\nperformance. We tune the network hyper-parameters using coarse meta-optimization to obtain\nmaximal validation set performance within our computational budget (less than 100ms per forward-\npropagation).\nFig 9 shows the predicted joint locations for a variety of inputs in the FLIC and LSP test-sets. Our\nnetwork produces convincing results on the FLIC dataset (with low joint position error); however,\nbecause our simple Spatial-Model is less effective for a number of the highly articulated poses in\nthe LSP dataset, our detector produces incorrect joint predictions for some images. 
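The detection-rate measure used throughout the evaluation reduces to a few lines. This is a sketch, assuming predictions and ground truth as Nx2 pixel arrays and distances normalized by each sample's torso height (the exact reference scaling of [27] is not reproduced here):

```python
import numpy as np

def detection_rate(pred_uv, gt_uv, torso_heights, radius):
    """Fraction of samples whose predicted joint lies within `radius`
    of the ground truth, with distances normalized per-sample by torso height."""
    pred_uv = np.asarray(pred_uv, dtype=float)
    gt_uv = np.asarray(gt_uv, dtype=float)
    d = np.linalg.norm(pred_uv - gt_uv, axis=1) / np.asarray(torso_heights, dtype=float)
    return float(np.mean(d <= radius))

# Two hypothetical wrist predictions: one exact, one 10px off on a 100px torso.
rate = detection_rate([[10, 10], [60, 40]], [[10, 10], [50, 40]], [100, 100], radius=0.05)
```

Sweeping `radius` over a range of values produces exactly the detection-rate curves plotted in Figs 7 and 8.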
We believe that\nincreasing the size of the training set will improve performance for these dif\ufb01cult cases.\n\nFigure 9: Predicted Joint Positions, Top Row: FLIC Test-Set, Bottom Row: LSP Test-Set\n\n5 Conclusion\n\nWe have shown that the uni\ufb01cation of a novel ConvNet Part-Detector and an MRF inspired Spatial-\nModel into a single learning framework signi\ufb01cantly outperforms existing architectures on the task\nof human body pose recognition. Training and inference of our architecture uses commodity level\nhardware and runs at close to real-time frame rates, making this technique tractable for a wide variety\nof application areas.\nFor future work we expect to further improve upon these results by increasing the complexity and\nexpressiveness of our simple spatial model (especially for unconstrained datasets like LSP).\n\n6 Acknowledgments\n\nThe authors would like to thank Mykhaylo Andriluka for his support. This research was funded in\npart by the Of\ufb01ce of Naval Research ONR Award N000141210327.\n\nReferences\n[1] M. Andriluka, S. Roth, and B. Schiele. Pictorial structures revisited: People detection and articulated\n\npose estimation. In CVPR, 2009.\n\n\f[2] M. Bergtholdt, J. Kappes, S. Schmidt, and C. Schn\u00f6rr. A study of parts-based object class detection using\n\ncomplete graphs. IJCV, 2010.\n\n[3] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3d human pose annotations. In\n\nICCV, 2009.\n\n[4] H. Bourlard, Y. Konig, and N. Morgan. Remap: recursive estimation and maximization of a posteriori\n\nprobabilities in connectionist speech recognition. In EUROSPEECH, 1995.\n\n[5] P. Buehler, A. Zisserman, and M. Everingham. Learning sign language by watching TV (using weakly\n\naligned subtitles). In CVPR, 2009.\n\n[6] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A matlab-like environment for machine learning.\n\nIn BigLearn, NIPS Workshop, 2011.\n\n[7] N. Dalal and B. Triggs. 
Histograms of oriented gradients for human detection. In CVPR, 2005.\n[8] M. Dantone, J. Gall, C. Leistner, and L. Van Gool. Human pose estimation using body parts dependent\n\njoint regressors. In CVPR, 2013.\n\n[9] M. Eichner and V. Ferrari. Better appearance models for pictorial structures. In BMVC, 2009.\n[10] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part\n\nmodel. In CVPR, 2008.\n\n[11] A. Giusti, D. Ciresan, J. Masci, L. Gambardella, and J. Schmidhuber. Fast image scanning with deep\n\nmax-pooling convolutional neural networks. In CoRR, 2013.\n\n[12] G. Gkioxari, P. Arbelaez, L. Bourdev, and J. Malik. Articulated pose estimation using discriminative\n\narmlet classi\ufb01ers. In CVPR, 2013.\n\n[13] K. Grauman, G. Shakhnarovich, and T. Darrell. Inferring 3d structure with a statistical image-based shape\n\nmodel. In ICCV, 2003.\n\n[14] G. Heitz, S. Gould, A. Saxena, and D. Koller. Cascaded classi\ufb01cation models: Combining models for\n\nholistic scene understanding. In NIPS, 2008.\n\n[15] A. Jain, J. Tompson, M. Andriluka, G. Taylor, and C. Bregler. Learning human pose estimation features\n\nwith convolutional networks. In ICLR, 2014.\n\n[16] S. Johnson and M. Everingham. Learning effective human pose estimation from inaccurate annotation.\n\nIn CVPR, 2011.\n\n[17] S. Johnson and M. Everingham. Clustered pose and nonlinear appearance models for human pose esti-\n\nmation. In BMVC, 2010.\n\n[18] D. Lowe. Object recognition from local scale-invariant features. In ICCV, 1999.\n[19] M. Mathieu, M. Henaff, and Y. LeCun. Fast training of convolutional networks through FFTs. In CoRR,\n\n2013.\n\n[20] G. Mori and J. Malik. Estimating human body con\ufb01gurations using shape context matching. In ECCV, 2002.\n[21] F. Morin and Y. Bengio. Hierarchical probabilistic neural network language model. In Proceedings of the\n\nTenth International Workshop on Arti\ufb01cial Intelligence and Statistics, 2005.\n\n[22] F. 
Ning, D. Delhomme, Y. LeCun, F. Piano, L. Bottou, and P. Barbano. Toward automatic phenotyping of\n\ndeveloping embryos from videos. IEEE TIP, 2005.\n\n[23] L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele. Poselet conditioned pictorial structures. In\n\nCVPR, 2013.\n\n[24] L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele. Strong appearance and expressive spatial models\n\nfor human pose estimation. In ICCV, 2013.\n\n[25] D. Ramanan, D. Forsyth, and A. Zisserman. Strike a pose: Tracking people by \ufb01nding stylized poses. In\n\nCVPR, 2005.\n\n[26] S. Ross, D. Munoz, M. Hebert, and J. A. Bagnell. Learning message-passing inference machines for\n\nstructured prediction. In CVPR, 2011.\n\n[27] B. Sapp and B. Taskar. Modec: Multimodal decomposable models for human pose estimation. In CVPR,\n\n2013.\n\n[28] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition,\n\nlocalization and detection using convolutional networks. In ICLR, 2014.\n\n[29] J. Tompson, M. Stein, Y. LeCun, and K. Perlin. Real-time continuous pose recovery of human hands\n\nusing convolutional networks. In TOG, 2014.\n\n[30] A. Toshev and C. Szegedy. Deeppose: Human pose estimation via deep neural networks. In CVPR, 2014.\n[31] Y. Yang and D. Ramanan. Articulated pose estimation with \ufb02exible mixtures-of-parts. In CVPR, 2011.\n\n\f", "award": [], "sourceid": 969, "authors": [{"given_name": "Jonathan", "family_name": "Tompson", "institution": "New York University"}, {"given_name": "Arjun", "family_name": "Jain", "institution": "New York University"}, {"given_name": "Yann", "family_name": "LeCun", "institution": "New York U"}, {"given_name": "Christoph", "family_name": "Bregler", "institution": "New York University"}]}