{"title": "Deep Network for the Integrated 3D Sensing of Multiple People in Natural Images", "book": "Advances in Neural Information Processing Systems", "page_first": 8410, "page_last": 8419, "abstract": "We present MubyNet -- a feed-forward, multitask, bottom up system for the integrated localization, as well as 3d pose and shape estimation, of multiple people in monocular images. The challenge is the formal modeling of the problem that intrinsically requires discrete and continuous computation, e.g. grouping people vs. predicting 3d pose. The model identifies human body structures (joints and limbs) in images, groups them based on 2d and 3d information fused using learned scoring functions, and optimally aggregates such responses into partial or complete 3d human skeleton hypotheses under kinematic tree constraints, but without knowing in advance the number of people in the scene and their visibility relations. We design a multi-task deep neural network with differentiable stages where the person grouping problem is formulated as an integer program based on learned body part scores parameterized by both 2d and 3d information. This avoids suboptimality resulting from separate 2d and 3d reasoning, with grouping performed based on the combined representation. The final stage of 3d pose and shape prediction is based on a learned attention process where information from different human body parts is optimally integrated. 
State-of-the-art results are obtained in large scale datasets like Human3.6M and Panoptic, and qualitatively by reconstructing the 3d shape and pose of multiple people, under occlusion, in difficult monocular images.", "full_text": "Deep Network for the Integrated 3D Sensing of\n\nMultiple People in Natural Images\n\nAndrei Zan\ufb01r2 Elisabeta Marinoiu2 Mihai Zan\ufb01r2 Alin-Ionut Popa2 Cristian Sminchisescu1,2\n\n{andrei.zanfir, elisabeta.marinoiu, mihai.zanfir, alin.popa}@imar.ro,\n\ncristian.sminchisescu@math.lth.se\n\n1Department of Mathematics, Faculty of Engineering, Lund University\n\n2Institute of Mathematics of the Romanian Academy\n\nAbstract\n\nWe present MubyNet \u2013 a feed-forward, multitask, bottom up system for the inte-\ngrated localization, as well as 3d pose and shape estimation, of multiple people\nin monocular images. The challenge is the formal modeling of the problem that\nintrinsically requires discrete and continuous computation, e.g. grouping people vs.\npredicting 3d pose. The model identi\ufb01es human body structures (joints and limbs)\nin images, groups them based on 2d and 3d information fused using learned scoring\nfunctions, and optimally aggregates such responses into partial or complete 3d\nhuman skeleton hypotheses under kinematic tree constraints, but without knowing\nin advance the number of people in the scene and their visibility relations. We\ndesign a multi-task deep neural network with differentiable stages where the person\ngrouping problem is formulated as an integer program based on learned body part\nscores parameterized by both 2d and 3d information. This avoids suboptimality\nresulting from separate 2d and 3d reasoning, with grouping performed based on the\ncombined representation. The \ufb01nal stage of 3d pose and shape prediction is based\non a learned attention process where information from different human body parts\nis optimally integrated. 
State-of-the-art results are obtained in large scale datasets\nlike Human3.6M and Panoptic, and qualitatively by reconstructing the 3d shape\nand pose of multiple people, under occlusion, in dif\ufb01cult monocular images.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n1 Introduction\n\nRecent years have witnessed a resurgence of human sensing methods for body keypoint estimation\n[1; 2; 3; 4; 5; 6; 7] as well as 3d pose and shape reconstruction [8; 9; 10; 11; 12; 13; 14; 15; 16; 17;\n18; 19; 20; 21]. Some of the challenges are in the level of modeling \u2013 shifting towards accurate 3d\npose and shape, not just 2d keypoints or skeletons \u2013 and the integration of 2d and 3d reasoning with\nautomatic person localization and grouping. The discrete aspect of grouping and the continuous\nnature of pose estimation make the formal integration of such computations dif\ufb01cult. In this\npaper we propose a novel, feedforward deep network, supporting different supervision regimes,\nthat predicts the 3d pose and shape of multiple people in monocular images. We formulate and\nintegrate human localization and grouping into the network, as a binary linear integer program with\noptimal solution based on learned body part compatibility functions constructed using 2d and 3d\ninformation. State-of-the-art results on Human3.6M and Panoptic illustrate the feasibility of the\nproposed approach.\n\nRelated Work. Several authors focused on integrating different human sensing tasks into a single-\nshot end-to-end pipeline [22; 23; 24; 13]. The models are usually designed to handle a single person\nand often rely on a prior stage of detection. [13] encode the 3d information of a single person inside\na feature map forcing the network to output 3d joint positions for each semantic body joint at its\ncorresponding 2d location. 
However, if some joints are occluded or hard to recover, their method\nmay not provide accurate estimates. [9] use a discretized 3d space around the person, from which\nthey read 3d joint activations. Their method is designed for one person and cannot easily be extended\nto handle large, crowded scenes. [25] use Region Proposal Networks [26] to obtain human bounding\nboxes and feeds them to predictive networks to obtain 2d and 3d pose. [27] provide a framework for\nthe 3d human pose and shape estimation of multiple people. They start with a feedforward semantic\nsegmentation of body parts, and 3d pose estimates based on DMHS [22], then re\ufb01ne the pose and\nshape parameters of a human body model [12] using non-linear optimization based on semantic\n\ufb01tting \u2013 a form of feedback. In contrast, we provide an integrated, yet feedforward, bottom up deep\nlearning framework for multi-person localization as well as 3d pose and shape estimation.\nOne of the challenges in the visual sensing of multiple people is grouping \u2013 identifying the components\nbelonging to each person, their level of visibility, and the number of people in the scene. Our network\naggregates different body joint proposals represented using both 2d and 3d information, to hypothesize\nlimbs, and later these are assembled into person instances based on the results of a joint optimization\nproblem. To address the arguably simpler, yet challenging problem of locating multiple people\nin the image (but not in 3d) [1] assign their network the task of regressing an additional feature\nmap, where a slice encodes either the x or y coordinate of the normalized direction of ground-truth\nlimbs. The information is redundant, as it is placed on the 2d projection of the ground truth limbs,\nwithin a distance \u03c3 from the line segment. Such part af\ufb01nity \ufb01elds are used to represent candidate\npart detection associations. 
The authors provide several solutions, one that is global but inef\ufb01cient\n(running times of 6 minutes/image being common) and one greedy. In the greedy algorithm, larger\nhuman body hypotheses (skeletons) are grown incrementally by solving a series of intermediate\nbipartite matching problems along kinematic trees. While the algorithm is ef\ufb01cient, it cannot be\nimmediately formalized as a cost function with global solution, it relies solely on 2d information, and\nthe af\ufb01nity functions are handcrafted. In contrast, we provide a method that leverages a volumetric 3d\nscene representation with learned scoring functions for the component parts, and an ef\ufb01cient global\nlinear integer programming solution for person grouping with kinematic constraint generation,\namenable to ef\ufb01cient solvers (e.g. 30ms/image).\n\nFigure 1: Our multiple person sensing pipeline, MubyNet. The model is feed-forward and supports\nsimultaneous multiple person localization, as well as 3d pose and shape estimation. Multitask losses\nconstrain the output of the Deep Volume Encoding, Limb Scoring and 3D Pose & Shape Estimation\nmodules. Given an image I, the processing stages are as follows: Deep Feature Extractor to compute\nfeatures MI, Deep Volume Encoding to regress volumes containing 2d and 3d pose information VI.\nLimb Scoring collects all possible kinematic connections between 2d detected joints given their type,\nand predicts corresponding scores c. Skeleton Grouping performs multiple person localization, by\nassembling limbs into skeletons, V^p_I, by solving a binary integer linear program. For each person,\nthe 3D Pose Decoding & Shape Estimation module estimates the 3d pose and shape (j^p_3d, \u03b8^p, \u03b2^p).\n\n2 Methodology\n\nOur modeling and computational pipeline is shown in \ufb01g. 1. The image is processed using a deep\nconvolutional feature extractor to produce a representation MI. 
This is passed to a deep volume\nencoding module containing multiple 2d and 3d feature maps concatenated as VI = MI \u2295 M2d \u2295\nM3d. The volume encoding is passed to a limb scoring module that identi\ufb01es different human body\njoint hypotheses and their type in images, connects ones that are compatible (can form parent-child\nrelations in a human kinematic tree), and assigns them scores (see footnote 1) given input features sampled in the\ndeep volume encoding VI, along the spatial direction connecting their putative image locations. The\nresulting scores are assembled in a vector c and passed to a binary integer programming module\nthat optimally computes skeletons for multiple people under kinematic constraints. The output is the\noriginal dense deep volume encoding V^p_I, annotated now with additional person skeleton grouping\ninformation. This is passed to a \ufb01nal stage, producing 3d pose and shape estimates for each person\nbased on attention maps and deep auto-encoders.\n\nFigure 2: The volume encoding of multiple 3d ground truth skeletons in a scene. We associate a slice\n(column) in the volume to each one of the NJ \u00d7 3 joint components. We encode each 3d skeleton j^p_3d\nassociated to a person p, by writing each of its components in the corresponding slice, but only for\ncolumns \u2018intercepting\u2019 spatial locations associated with the image projection of the skeleton.\n\nFigure 3: (a) Detailed view of a single stage t of our multi-stage Deep Volume Encoding (2d/3d)\nmodule. The image features MI, as well as predictions from the previous stage, M^{t\u22121}_3d and M^{t\u22121}_2d,\nare used to re\ufb01ne the current representations M^t_3d and M^t_2d. The multi-stage module outputs VI,\nwhich represents the concatenation of MI, M2d = \u2211_t M^t_2d and M3d = \u2211_t M^t_3d. (b) Detail of the\n3D Pose Decoding & Shape Estimation module. 
Given the estimated volume encoding VI, and the\nadditional information from the estimation of person partitions V^p_I, we decode the 3d pose j^p_3d. By\nusing auto-encoders, we recover the model pose and shape parameters (\u03b8^p, \u03b2^p).\n\nGiven a monocular RGB image I \u2208 R^{h\u00d7w\u00d73}, our goal is to recover the set of persons P present\nin the image, where (j^p_2d, j^p_3d, \u03b8^p, \u03b2^p, t^p) \u2208 P with 1 \u2264 p \u2264 |P|, j2d \u2208 R^{2NJ\u00d71} is the 2d skeleton,\nj3d \u2208 R^{3NJ\u00d71} is the 3d skeleton, (\u03b8, \u03b2) \u2208 R^{82\u00d71} is the SMPL [12] shape embedding, and t \u2208 R^{3\u00d71}\nis the person\u2019s scene translation.\n\n2.1 Deep Volume Encoding (2d/3d)\n\nGiven the resolution of the input image, the resolution of the \ufb01nal maps produced by the network\nis H \u00d7 W with the network resolution of \ufb01nal maps h \u00d7 w. For the 2d and 3d skeletons, we\nadopt the Human3.6M [28] representation, with NJ = 17 joints and NL = 16 limbs. We refer to\nK \u2208 {0, 1}^{NJ\u00d7NJ} as the kinematic tree, where K(i, j) = 1 means that nodes i and j are endpoints\nof a limb, in which i is the parent and j the child node. We denote by MI \u2208 R^{h\u00d7w\u00d7128} the image\nfeatures, by M2d \u2208 R^{h\u00d7w\u00d7NJ} the 2d human joint activation maps, and by M3d \u2208 R^{h\u00d7w\u00d73NJ} a\nvolume that densely encodes 3d information.\n\nFootnote 1: At this stage there is no person assignment so body joints very far apart have a putative connection as long\nas they are kinematically compatible.\n\nWe start with a deep feature extractor (e.g. VGG-16, module MI). Our pipeline progressively\nencodes volumes of intermediate 2d and 3d signals (i.e. M2d and M3d) which are decoded by\nspecialized layers later on. An illustration is given in \ufb01g. 3. The processing in modules such as Deep\nVolume Encoding is multi-stage [29; 22]. 
The input and the results of previous stages of processing\nare iteratively fed into the next, for re\ufb01nement. We use different internal representations than [29; 22],\nand rely on a single supervised loss at the end of the multi-stage chain of processing, where outputs\n(activation maps) are fused and compared against ground truth. We found this approach to converge\nfaster and produce slightly better results than the more standard per-stage supervision [29; 22].\nWe construct a representation combining 2d and 3d information capable of encoding multiple people\nin the following way: given an input image I, our Deep Volume Encoding module produces an\noutput tensor M3d containing the 3d structure of all people present in the image. At training\ntime, for any ground-truth person p with g^p_2d, g^p_3d joints, kinematic tree structure K, and limbs\nL2d = {(i, j)|K(i, j) = 1, i, j \u2208 N, 1 \u2264 i, j \u2264 NJ}, we de\ufb01ne a ground-truth volume G3d.\nFor all points (x, y) on the line segment of any limb l \u2208 L2d connecting joints in g^p_2d, we set\nG3d(x, y) = (g^p_3d)^\u22a4. The procedure is illustrated in \ufb01g. 2. This module also produces M2d, which\nencodes 2d joint activations. The corresponding ground-truth volume, G2d, will be composed of\ncon\ufb01dence maps, one for each joint type aggregating Gaussian peaks placed at the corresponding 2d\njoint positions. The loss function will then measure the error between the predicted M3d, M2d and\nthe ground-truth G3d, G2d\n\nLI = \u2211_{1\u2264x\u2264h, 1\u2264y\u2264w} \u03c12d(M2d(x, y), G2d(x, y)) + \u2211_{1\u2264x\u2264h, 1\u2264y\u2264w, G3d(x,y) is valid} \u03c13d(M3d(x, y), G3d(x, y))    (1)\n\nWe choose \u03c12d to be the squared Euclidean loss, and \u03c13d the mean per-joint 3d position error\n(MPJPE). We explicitly train the network to output redundant 3d information along the 2d projection\nof a person\u2019s limbs. 
In this way, occluded, blurry or hard-to-infer cases do not negatively impact the\n\ufb01nal, estimated {j^p_3d} in a signi\ufb01cant way.\n\n2.2 Skeleton Grouping\n\nOur skeleton grouping strategy relies on detecting potential human body joints and their type in\nimages, assembling putative limbs, scoring them using trainable functions, and solving a global,\noptimal assignment problem (binary integer linear program) to \ufb01nd all connected components\nsatisfying strong kinematic tree constraints \u2013 each component being a different person.\n\nLimb Scoring\n\nBy accessing the M2d maps in VI we extract, via non-max suppression, N joint proposals J =\n{i|1 \u2264 i \u2264 N}, with t an index set of J such that if i \u2208 J, t(i) \u2208 {1, . . . , NJ} is the joint type of\ni (e.g. shoulder, elbow, knee, etc.). The list of all the feasible kinematic connections (i.e. limbs) is\nthen L = {(i, j)|K(i, j) = 1, i, j \u2208 N, 1 \u2264 i, j \u2264 |J|}. One needs a function to assess the quality of\ndifferent limb hypotheses. In order to learn the scoring c, an additional network layer Limb Scoring is\nintroduced. This takes as input the volume encoding map VI, and passes it through a series\nof Conv/ReLU operations to build a map Mc \u2208 R^{h\u00d7w\u00d7128}. A subsequent process over Mc and M2d\nbuilds the candidate limb list L by sampling a \ufb01xed number Nsamples of features from Mc for every\nl \u2208 L, along the 2d direction of l. The resulting features have dimensions NL \u00d7 Nsamples \u00d7 128, and\nare passed to a multi-layer perceptron head followed by a softmax non-linearity to regress the \ufb01nal\nscoring c \u2208 [0, 1]^{NL\u00d71}. Supervision is applied to the outputs of Limb Scoring, via a cross-entropy\nloss Lc. 
Any dataset containing 2d annotations of human body limbs can be used for training.\n\nPeople Grouping as a Binary Integer Programming Problem\n\nThe problem of identifying the skeletons of multiple people is posed as estimating the optimal L\u2217 \u2286 L\nsuch that graph G = (J, L\u2217) has properties (i) any connected component of G falls on a single person,\n(ii) \u2200p, q \u2208 L\u2217, p = (i1, j1) and q = (i2, j2) with t(i1) = t(i2) and t(j1) = t(j2): if j1 = j2 then\ni1 \u2260 i2 and if i1 = i2 then j1 \u2260 j2 \u2013 these constraints ensure that connected components select at\nmost one joint of a given type, and (iii) the connected components are as large as possible.\nComputing L\u2217 is equivalent to \ufb01nding a binary indicator x \u2208 {0, 1}^{|L|\u00d71} in the set L. We can encode\nthe kinematic constraints (ii), by iterating over all limbs p \u2208 L, and \ufb01nding all the limbs q \u2208 L\nthat connect the same type of joints as p, and also share an end-point with it. Clearly, for any p,\nthe solution x can select at most one of these limbs q. This can be written row-by-row, as a sparse\nmatrix A \u2208 {0, 1}^{|L|\u00d7|L|}, that constrains x such that Ax \u2264 b, where b is the all-ones vector 1_{|L|}.\nIn order to satisfy requirement (i), we need a cost that well quali\ufb01es the semantic and geometrical\nrelationships between elements of the scene, learned as explained in the Limb Scoring paragraph\nabove. The limb score c(l), \u2200l \u2208 L measures how likely it is that l is a limb of a person with the\nparticular joint endpoint types. To satisfy requirement (iii), we need to encourage the selection of as\nmany limbs as possible while still satisfying kinematic constraints. 
Given all these, the problem can\nnaturally be modeled as a binary integer program\n\nx\u2217(c) = arg max_x c^\u22a4 x, subject to Ax \u2264 b, x \u2208 {0, 1}^{NL\u00d71}    (2)\n\nwhere an approximation to the optimal c\u2217 = arg max_c x\u2217(c)^\u22a4 gc is learned within the Limb Scoring\nmodule. At testing time, given the learned scoring c, we apply (2) to compute the \ufb01nal, binarized\nsolution x\u2217 obeying all constraints. The global solution is very fast and accurate, taking in the order\nof milliseconds per image with multiple people, as shown in the experimental section.\n\n2.3 3d Pose Decoding and Shape Estimation\n\nAn immediate way of decoding the M3d volume into 3d pose estimates for all the persons in the\nimage, is to simply average 3d skeleton predictions at spatial locations given by the limbs selected\nby the people grouping module. However, this does not take into account the differences in joint\nvisibility. We propose to learn a function that selectively attends to different regions in the M3d\nvolume, when decoding the 3d position for each joint. Given the feature volume V^p_I of a person and\nits identi\ufb01ed skeleton, we collect a \ufb01xed number of samples along the direction of each 2d limb (see\n\ufb01g. 3 (b)). We train a multilayer perceptron to assign a score (weight) to each sample and for each\n3d joint. The \ufb01nal predicted 3d skeleton j^p_3d is the weighted sum of the 3d samples encoded in M^p_3d.\nThe loss L^p_3d is computed as the MPJPE between j^p_3d and the ground truth skeleton g^p_3d.\n\nIn order to further represent the 3d shape of each person, we use a SMPL-based human model\nrepresentation [12] controlled by a set of parameters \u03b8 \u2208 R^{72\u00d71}, which encode joint angle rotations,\nand \u03b2 \u2208 R^{10\u00d71} body dimensions, respectively. 
The model vertices are obtained as a function\nV(\u03b8, \u03b2) \u2208 R^{6890\u00d73}, and the joints as the product js = R vec(V) \u2208 R^{3NJ\u00d71}, where R is a regression\nmatrix and V is the matrix of all 3d vertices in the \ufb01nal mesh. Our goal is to map the predicted j3d\ninto a pair (\u03b8, \u03b2) that best explains the 3d skeleton. Previous approaches [14] formulated this task\nas a supervised problem of regressing (\u03b8, \u03b2), or forcing the projection of js to match 2d estimates\nj2d. The problem is at least two-fold: 1) \u03b8 encodes axis-angle transformations that are cyclic in the\nangle and not unique in the axis, 2) regression on (\u03b8, \u03b2) does not balance the importance of each\nparameter (e.g., the global rotation encoded in \u03b80 is more important than the right foot rotation,\nencoded in a \u03b8i) in correctly inferring the full 3d body model. To address such dif\ufb01culties, we model\nthe problem as unsupervised auto-encoding inside a deep network, where the code is (\u03b8, \u03b2), and the\ndecoder is R vec(V(\u03b8, \u03b2)), which is a specialized layer. This is a natural approach, as the loss is then\nsimply L^s_3d = \u03c13d(j3d, js), which does not force \u03b8 to have a unique or a speci\ufb01c value, and naturally\nbalances the importance of each parameter. Additionally, the task is unsupervised. The encoder is a\nsimple MLP with ReLUs as non-linearities. To account for the unnatural twists along limb directions\nthat may appear, and the fact that \u03b2 is not unique for a 3d skeleton (it is unique only for a given V),\nwe also include in the loss function a GMM prior on the \u03b8 parameters, and an L2 prior on \u03b2. For those\nexamples where we have \u2018ground-truth\u2019 (\u03b8, \u03b2) parameters available, they are \ufb01tted in a supervised\nmanner using images and their corresponding shape and pose targets. 
The 3d loss can be derived as\n\nL3d = L^s_3d + L^p_3d    (3)\n\n3 Experiments\n\nWe provide quantitative results on two datasets, Human3.6M [28] and CMU Panoptic [30] as well as\nqualitative 3d reconstructions for complex images.\n\nHuman3.6M [28] is a large, single person dataset with accurate 2d and 3d ground truth obtained\nfrom a motion capture system. The dataset contains 15 actions performed by 11 actors in a laboratory\nenvironment. We provide results on the withheld, of\ufb01cial test set which has over 900,000 images, as\nwell as on the Human80K test set [31]. Human80K is a smaller, representative subset of Human3.6M.\nIt contains 80,000 images (55,144 for training and 24,416 for testing). While this dataset does not\nallow us to assess the quality of sensing multiple people, it allows us to extensively evaluate the\nperformance of our single-person component pipeline.\n\nCMU Panoptic [30] is a dataset that contains multiple people performing different social activities\n(e.g. society games, playing instruments, dancing) in an indoor dome where multiple cameras are\nplaced. We will only consider the monocular case for both training and testing. The setup is dif\ufb01cult\ndue to multiple people interacting, partial views and challenging camera angles. Following [27], we\ntest our method on 9,600 sequences selected from four activities: Haggling, Ma\ufb01a, Ultimatum, and\nPizza and two cameras, 16 and 30 (we run the monocular system on each camera, independently, and\nthe total errors are averaged).\n\nTraining Procedure. We use multiple datasets with different types of annotations for training our\nnetwork. Speci\ufb01cally, we \ufb01rst train our M2d component on COCO [32] which contains a large variety\nof images with multiple people in natural scenes. Then, we use the Human80K dataset to learn the\n3d volume M3d, once M2d is \ufb01xed. 
Human80K contains accurate 3d joint positions annotations,\nhowever limited to single person, fully visible images, recorded in a laboratory setup. We further\nuse the CMU Panoptic dataset for \ufb01ne-tuning the M3d component on images with multiple people\nand for a large variety of (monocular) viewpoints, which results in occlusions and partial views.\nBased on the learned 2d joint map activations M2d, the 3d volume M3d and the image features\nMI, we proceed to learn the limb scoring function, c. For this task we use the COCO dataset. The\nattention-based decoding is learned on CMU Panoptic since having multiple people in the same scene\nhelps the decoder understand the dif\ufb01cult poses and learn where to focus in the case of occlusions\nand partially visible people. Finally, Human80K is used to learn the 3d shape auto-encoder due to its\nvariability in body pose con\ufb01gurations. We use a Nesterov solver with a learning rate of 1e \u2212 5 and a\nmomentum of 0.9. Our models are implemented in Caffe [33].\n\nFigure 4: (Left) The distribution of the learned score c compared to the distribution of the selected\nlimbs x\u2217 after optimizing (2). Note that in the learned limb probability scores there are already\nmany components close to 0 or 1 which indicates a well tuned function. (Right) Running time of\nour Binary Linear Integer Programming solver as a function of the number of components, dim(x).\nNotice fast running times for global solutions.\n\nExperimental Results. Table 2 (left) provides results of our method on Human80K. Firstly, we\nshow the performance without the attention-based decoding (simply averaging 3d skeletons at 2d\nlocations in M3d volume). This setup already performs better than DMHS [22]. Note that DMHS\nuses ground truth person bounding boxes while our method runs on full images. Our method with\nattention-based decoding obtains state-of-the art results on Human80K dataset. 
We also provide\nresults on the withheld test set of Human3.6M (over 900,000 images), where we considerably improve\nover the state-of-the-art (60 mm compared to 73 mm error). Results are detailed for each action in\ntable 1. For the CMU Panoptic dataset, results are shown for each action in table 2 (right). When our\nmethod uses only Human80K as supervision for the 3d task, it already performs better than [27]. In\ntable 2 (right) we also show results with a \ufb01ne-tuned version of our method on the Panoptic dataset.\nWe sampled data from the Haggling, Ma\ufb01a, Ultimatum actions (different recordings than those in test\ndata) and from all cameras for \ufb01ne-tuning our model. We obtained a total of 74,936 data samples\nwhere the number of people per image ranges from 1 to 8. Our \ufb01ne-tuned method improves the\nprevious results by a large margin \u2013 notice errors of 72.1 mm down from 150.3 mm.\nWe show visual results of our method on natural images with complex interactions in \ufb01g. 5. We are\nable to correctly identify all the persons in an image as well as their associated 3d pose con\ufb01guration,\neven when they are far in depth (\ufb01rst row) or severely occluded (last four rows).\n\nLimb Scoring. In \ufb01g. 4 (left) we show the distribution of the learned scores c for the kinematically\nadmissible putative limbs and the distribution of the optimal limb indicator vector components x\u2217\nover 100 images from the COCO validation set. The learned limb scoring already has many of its\ncomponents close to either 0 or 1, although a considerable number still are \u2018undecided\u2019 resulting in\nnon-trivial integer programming problems. We tested the average time taken by the binary integer\nprogramming solver as a function of the number of detected limbs (the length of the score vector c).\nFigure 4 (right) shows that the method scales favorably with the number of components, i.e. 
dim(x).\n\nMethod A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 Mean\n[22] 60 56 68 64 78 67 68 106 119 77 85 64 57 78 62 73\n[27] 54 54 63 59 72 61 68 101 109 74 81 62 55 75 60 69\nMubyNet 49 47 51 52 60 56 56 82 94 64 69 61 48 66 49 60\n\nTable 1: Mean per joint 3d position error (in mm) on the Human3.6M dataset. MubyNet improves\nthe state-of-the-art by a large margin for all actions.\n\n(Left)\nMethod MPJPE(mm)\n[22] 63.35\nMubyNet 59.31\nMubyNet attention 58.40\n\n(Right)\nMethod Haggling Ma\ufb01a Ultimatum Pizza Mean\n[22] 217.9 187.3 193.6 221.3 203.4\n[27] 140.0 165.9 150.7 156.0 153.4\nMubyNet 141.4 152.3 145.0 162.5 150.3\nMubyNet \ufb01ne-tuned 72.4 78.8 66.8 94.3 72.1\n\nTable 2: Mean per joint 3d position error (in mm). (Left) Human80K dataset. Our method with a mean\ndecoding of the 3d volume obtains state-of-the-art results. Adding the attention mechanism further improves\nthe performance. (Right) CMU Panoptic dataset. Our method with 3d supervision only from Human80K\nperforms better than previous works. Fine-tuning on the CMU Panoptic dataset drastically reduces the error.\n\n4 Conclusions\n\nWe have presented a bottom up trainable model for the 2d and 3d human sensing of multiple people in\nmonocular images. The proposed model, MubyNet, is multitask and feed-forward, differentiable, and\nthus conveniently supports training all component parameters. The dif\ufb01cult problem of localizing and\ngrouping people is formulated as a binary linear integer program, and solved globally and optimally\nunder kinematic problem domain constraints and based on learned scoring functions that combine\n2d and 3d information for accurate reasoning. 
Both 3d human pose and shape are computed in a\n\ufb01nal predictive stage that fuses information based on learned attention maps and deep auto-encoders.\nAblation studies and model component analysis illustrate the adequacy of various design choices\nincluding the ef\ufb01ciency of our global, binary integer linear programming solution, under kinematic\nconstraints, for the human grouping problem. Our large-scale experimental evaluation in datasets like\nHuman3.6M and Panoptic, and for withheld test sets of over 1 million samples, offers competitive\nresults. Qualitative examples show that our model can reliably estimate the 3d properties of multiple\npeople in natural scenes, with occlusion, partial views, and complex backgrounds.\nAcknowledgments: This work was supported in part by the European Research Council Consolidator grant\nSEED, CNCS-UEFISCDI (PN-III-P4-ID-PCE-2016-0535, PN-III-P4-ID-PCCF-2016-0180), the EU Horizon\n2020 grant DE-ENIGMA (688835), and SSF.\n\n7\n\n\fFigure 5: Human pose and shape reconstruction of multiple people produced by MubyNet illustrate\ngood 3d estimates for distant people, complex poses or occlusion. For global translations, we optimize\nthe Euclidean loss between the 2d joint detections and the projections predicted by our 3d models.\n\n8\n\n\fReferences\n[1] Z. Cao, T. Simon, S. Wei, and Y. Sheikh, \u201cRealtime multi-person 2d pose estimation using part\n\naf\ufb01nity \ufb01elds,\u201d in CVPR, 2017.\n\n[2] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele, \u201cDeeperCut: A deeper,\n\nstronger, and faster multi-person pose estimation model,\u201d in ECCV, 2016.\n\n[3] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy,\n\n\u201cTowards accurate multi-person pose estimation in the wild,\u201d in CVPR, 2017.\n\n[4] J. J. Tompson, A. Jain, Y. LeCun, and C. 
Bregler, \u201cJoint training of a convolutional network and\n\na graphical model for human pose estimation,\u201d in NIPS, 2014.\n\n[5] A. Toshev and C. Szegedy, \u201cDeeppose: Human pose estimation via deep neural networks,\u201d in\n\nCVPR, 2014.\n\n[6] G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik, \u201cUsing k-poselets for detecting people\n\nand localizing their keypoints,\u201d in CVPR, pp. 3582\u20133589, 2014.\n\n[7] U. Iqbal and J. Gall, \u201cMulti-person pose estimation with local joint-to-person associations,\u201d in\n\nECCV, 2016.\n\n[8] J. Martinez, R. Hossain, J. Romero, and J. J. Little, \u201cA simple yet effective baseline for 3d\n\nhuman pose estimation,\u201d in ICCV, 2017.\n\n[9] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis, \u201cCoarse-to-\ufb01ne volumetric prediction\n\nfor single-image 3d human pose,\u201d in CVPR, 2017.\n\n[10] X. Zhou, M. Zhu, K. Derpanis, and K. Daniilidis, \u201cSparseness meets deepness: 3d human pose\n\nestimation from monocular video,\u201d in CVPR, 2016.\n\n[11] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black, \u201cKeep it SMPL:\n\nAutomatic estimation of 3d human pose and shape from a single image,\u201d in ECCV, 2016.\n\n[12] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black, \u201cSMPL: A skinned\n\nmulti-person linear model,\u201d SIGGRAPH, vol. 34, no. 6, pp. 248:1\u201316, 2015.\n\n[13] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Sha\ufb01ei, H.-P. Seidel, W. Xu, D. Casas,\nand C. Theobalt, \u201cVnect: Real-time 3d human pose estimation with a single rgb camera,\u201d ACM\nTransactions on Graphics (TOG), vol. 36, no. 4, p. 44, 2017.\n\n[14] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik, \u201cEnd-to-end recovery of human shape\n\nand pose,\u201d in CVPR, 2018.\n\n[15] C.-H. Chen and D. Ramanan, \u201c3d human pose estimation= 2d pose estimation+ matching,\u201d in\n\nCVPR, 2017.\n\n[16] F. 
Moreno-Noguer, \u201c3d human pose estimation from a single image via distance matrix regression,\u201d in CVPR, 2017.\n\n[17] I. Katircioglu, B. Tekin, M. Salzmann, V. Lepetit, and P. Fua, \u201cLearning latent representations of 3d human pose with deep neural networks,\u201d IJCV, pp. 1\u201316, 2018.\n\n[18] X. Zhou, Q. Huang, X. Sun, X. Xue, and Y. Wei, \u201cTowards 3d human pose estimation in the wild: a weakly-supervised approach,\u201d in ICCV, 2017.\n\n[19] S. Li and A. B. Chan, \u201c3d human pose estimation from monocular images with deep convolutional neural network,\u201d in ACCV, 2014.\n\n[20] E. Brau and H. Jiang, \u201c3d human pose estimation via deep learning from 2d annotations,\u201d in 3D Vision (3DV), 2016 Fourth International Conference on, pp. 582\u2013591, IEEE, 2016.\n\n[21] W. Chen, H. Wang, Y. Li, H. Su, Z. Wang, C. Tu, D. Lischinski, D. Cohen-Or, and B. Chen, \u201cSynthesizing training images for boosting human 3d pose estimation,\u201d in 3D Vision (3DV), 2016.\n\n[22] A. I. Popa, M. Zanfir, and C. Sminchisescu, \u201cDeep multitask architecture for integrated 2d and 3d human sensing,\u201d in CVPR, 2017.\n\n[23] G. Rogez and C. Schmid, \u201cMocap-guided data augmentation for 3d pose estimation in the wild,\u201d in NIPS, 2016.\n\n[24] B. Tekin, P. Marquez Neila, M. Salzmann, and P. Fua, \u201cLearning to fuse 2d and 3d image cues for monocular body pose estimation,\u201d in ICCV, 2017.\n\n[25] G. Rogez, P. Weinzaepfel, and C. Schmid, \u201cLCR-Net: Localization-classification-regression for human pose,\u201d in CVPR, 2017.\n\n[26] S. Ren, K. He, R. Girshick, and J. Sun, \u201cFaster R-CNN: Towards real-time object detection with region proposal networks,\u201d in NIPS, pp. 91\u201399, 2015.\n\n[27] A. Zanfir, E. Marinoiu, and C.
Sminchisescu, \u201cMonocular 3D Pose and Shape Estimation of Multiple People in Natural Scenes \u2013 The Importance of Multiple Scene Constraints,\u201d in CVPR, 2018.\n\n[28] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, \u201cHuman3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments,\u201d PAMI, 2014.\n\n[29] S. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, \u201cConvolutional pose machines,\u201d in CVPR, June 2016.\n\n[30] H. Joo, H. Liu, L. Tan, L. Gui, B. Nabbe, I. Matthews, T. Kanade, S. Nobuhara, and Y. Sheikh, \u201cPanoptic studio: A massively multiview system for social motion capture,\u201d in ICCV, 2015.\n\n[31] C. Ionescu, J. Carreira, and C. Sminchisescu, \u201cIterated second-order label sensitive pooling for 3d human pose estimation,\u201d in CVPR, pp. 1661\u20131668, 2014.\n\n[32] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll\u00e1r, and C. L. Zitnick, \u201cMicrosoft COCO: Common objects in context,\u201d in ECCV, 2014.\n\n[33] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, \u201cCaffe: Convolutional architecture for fast feature embedding,\u201d arXiv preprint arXiv:1408.5093, 2014.", "award": [], "sourceid": 5104, "authors": [{"given_name": "Andrei", "family_name": "Zanfir", "institution": "Institute of Mathematics of the Romanian Academy"}, {"given_name": "Elisabeta", "family_name": "Marinoiu", "institution": "IMAR"}, {"given_name": "Mihai", "family_name": "Zanfir", "institution": "IMAR"}, {"given_name": "Alin-Ionut", "family_name": "Popa", "institution": "IMAR"}, {"given_name": "Cristian", "family_name": "Sminchisescu", "institution": "LTH"}]}