{"title": "CRF-CNN: Modeling Structured Information in Human Pose Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 316, "page_last": 324, "abstract": "Deep convolutional neural networks (CNN) have achieved great success. On the other hand, modeling structural information has been proved critical in many vision problems. It is of great interest to integrate them effectively. In a classical neural network, there is no message passing between neurons in the same layer. In this paper, we propose a CRF-CNN framework which can simultaneously model structural information in both output and hidden feature layers in a probabilistic way, and it is applied to human pose estimation. A message passing scheme is proposed, so that in various layers each body joint receives messages from all the others in an efficient way. Such message passing can be implemented with convolution between features maps in the same layer, and it is also integrated with feedforward propagation in neural networks. Finally, a neural network implementation of end-to-end learning CRF-CNN is provided. Its effectiveness is demonstrated through experiments on two benchmark datasets.", "full_text": "CRF-CNN: Modeling Structured Information in\n\nHuman Pose Estimation\n\nXiao Chu\n\nThe Chinese University of Hong Kong\n\nxchu@ee.cuhk.edu.hk\n\nHongsheng Li\n\nThe Chinese University of Hong Kong\n\nhsli@ee.cuhk.edu.hk\n\nWanli Ouyang\n\nThe Chinese University of Hong Kong\n\nwlouyang@ee.cuhk.edu.hk\n\nXiaogang Wang\n\nThe Chinese University of Hong Kong\n\nxgwang@ee.cuhk.edu.hk\n\nAbstract\n\nDeep convolutional neural networks (CNN) have achieved great success. On\nthe other hand, modeling structural information has been proved critical in many\nvision problems. It is of great interest to integrate them effectively. In a classical\nneural network, there is no message passing between neurons in the same layer. 
In\nthis paper, we propose a CRF-CNN framework that simultaneously models\nstructural information in both output and hidden feature layers in a probabilistic way,\nand apply it to human pose estimation. A message passing scheme is proposed\nso that, in various layers, each body joint receives messages from all the others\nin an efficient way. Such message passing can be implemented with convolution\nbetween feature maps in the same layer, and it is also integrated with feedforward\npropagation in neural networks. Finally, a neural network implementation of\nend-to-end learning CRF-CNN is provided. Its effectiveness is demonstrated through\nexperiments on two benchmark datasets.\n\n1 Introduction\n\nMuch effort has been devoted to the structural design of convolutional neural networks (CNNs). This work\ncan be divided into two groups. One achieves higher expressive power by making CNNs deeper\n[19, 10, 20]. The other models structures among features and outputs, either as post-processing\n[6, 2] or as extra information to guide the learning of CNNs [29, 22, 24]. The two are complementary.\nHuman pose estimation aims to estimate body joint locations from 2D images, and could be applied to\nassist other tasks such as [4, 14, 26]. The very first attempt to adopt a CNN for human pose estimation\nis DeepPose [23]. It used a CNN to regress joint locations repeatedly without directly modeling the\noutput structure. However, the prediction of each body joint location relies both on its own appearance\nscore and on the predictions of other joints. Hence, the output space for human pose estimation is\nstructured. Later, Chen and Yuille [2] used a graphical model for the spatial relationship between\nbody joints and applied it as post-processing after the CNN. Learning CNN features and structured output\ntogether was proposed in [22, 21, 24]. Researchers were also aware of the importance of introducing\nstructures at the feature level [3]. 
However, the design of CNNs for structured output and structured\nfeatures was heuristic, without principled guidance on how information should be passed. As deep\nmodels have been shown to be effective in many practical applications, researchers in statistical learning and\ndeep learning have tried to use probabilistic models to explain the ideas behind deep models [9, 7, 29].\nMotivated by these works, we provide a CRF framework, called CRF-CNN, that models structures in both output and\nhidden feature layers of a CNN. It provides a principled illustration of how\nto model structured information at various levels in a probabilistic way, and of what assumptions are\nmade when incorporating different CRFs into a CNN. Existing works can be illustrated as special\nimplementations of CRF-CNN. DeepPose [23] only considered the feature-output relationship, and\nthe approaches in [2, 22] considered feature-output and output-output relationships. In contrast, our\nproposed full CRF-CNN model takes feature-output, output-output, and feature-feature relationships\ninto consideration, which is novel in pose estimation.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nIt also allows us to borrow the idea behind the sum-product algorithm and develop a message\npassing scheme in which each body joint receives messages from all the others efficiently by\nsaving intermediate messages. Given a set of body joints as vertices on a graph, there is no consensus\non whether a tree structured model [28, 8] or a loopy structured model [25, 16] is the best choice.\nA tree structure admits exact inference, while a loopy structure can model more complex relationships\namong vertices. Our proposed message passing scheme is applicable to both.\nOur contributions can be summarized as follows. (1) A CRF is proposed to simultaneously model\nstructured features and structured body part spatial relationships. 
We show step by step how\napproximations are made so that an end-to-end learning CNN can implement such a CRF model. (2)\nMotivated by the efficient algorithm for marginalization on tree structures, we provide a message\npassing scheme for our CRF-CNN so that every vertex receives messages from all the others in an\nefficient way. Message passing can be implemented with convolution between feature maps in the\nsame layer. Because of the approximation used, this message passing can be used for both tree and\nloopy structures. (3) CRF-CNN is applied to two human pose estimation benchmark datasets and\nachieves better performance on both datasets compared with previous methods.\n\n2 CRF-CNN\n\nThe power of combining statistical models with CNNs has been demonstrated [6, 3]. In this section\nwe start with a brief review of CRFs and study how the pose estimation problem can be formulated\nunder the proposed CRF-CNN framework. This includes estimating body joints independently from\nCNN features, modeling the spatial relationship of body joints in the output layer of the CNN, and\nmodeling the spatial relationship of features in the hidden layers of the CNN.\nLet I denote an image, and z = {z1, ..., zN} denote the locations of N body joints. We are interested in\nmodeling the conditional probability p(z|I, \u0398) parameterized by \u0398, expressed as a Gibbs distribution:\n\n$$ p(z \\mid I, \\Theta) = \\frac{e^{-\\mathrm{En}(z, I, \\Theta)}}{Z} = \\frac{e^{-\\mathrm{En}(z, I, \\Theta)}}{\\sum_{z \\in \\mathcal{Z}} e^{-\\mathrm{En}(z, I, \\Theta)}}, \\tag{1} $$\n\nwhere En(z, I, \u0398) is the energy function and Z is the partition function. By introducing latent variables\nh = {h1, h2, ..., hK}, the conditional distribution can be modeled as follows:\n\n$$ p(z \\mid I, \\Theta) = \\sum_{h} p(z, h \\mid I, \\Theta), \\quad \\text{where} \\quad p(z, h \\mid I, \\Theta) = \\frac{e^{-\\mathrm{En}(z, h, I, \\Theta)}}{\\sum_{z \\in \\mathcal{Z}, h \\in \\mathcal{H}} e^{-\\mathrm{En}(z, h, I, \\Theta)}}. \\tag{2} $$\n\nEn(z, h, I, \u0398) is the energy function to be defined later. 
The latent variables correspond to features\nobtained from a neural network in our implementation. We define an undirected graph G = (V, E),\nwhere V = z \u222a h and E = Ez \u222a Eh \u222a Ezh. Ez, Eh, and Ezh denote the sets of edges connecting body joints,\nconnecting latent variables, and connecting latent variables with body joints, respectively.\n\n2.1 Model 1\nLet \u2205 denote the empty set. If we suppose there is no edge connecting joints and no edge connecting\nlatent variables in the graphical model, i.e. Ez = \u2205, Eh = \u2205, then\n\n$$ p(z, h \\mid I, \\Theta) = \\prod_{i} p(z_i \\mid h, I, \\Theta) \\prod_{k} p(h_k \\mid I, \\Theta), \\tag{3} $$\n\n$$ \\mathrm{En}(z, h, I, \\Theta) = \\sum_{(i,k) \\in E_{zh}} \\psi_{zh}(z_i, h_k) + \\sum_{k} \\phi_h(h_k, I), \\tag{4} $$\n\nwhere \u03c6h(\u2217) denotes the unary/data term for image I, and \u03c8zh(\u2217,\u2217) denotes the terms for the correlations\nbetween latent variables h and body joint configurations z. It corresponds to the model in Fig. 1(a)\nand it is a typical feedforward neural network.\n\nFigure 1: Different implementations of the CRF-CNN framework.\n\nExample. In DeepPose [23], CNN features h in the top hidden layer were obtained from images, and\ncould be treated as latent variables and illustrated by the term \u03c6h(hk, I) in (4). There is no connection\nbetween neurons in hidden layers. Body joint locations were estimated from CNN features in [23],\nwhich could be illustrated by the term \u03c8zh(zi, hk). The body joints are estimated independently,\nwithout considering their correlations, which means Ez = \u2205.\n\n2.2 Model 2\nIf we suppose Eh = \u2205 in the graphical model, p(z, h|I, \u0398) becomes\n\n$$ p(z, h \\mid I, \\Theta) = p(z \\mid h, I, \\Theta) \\prod_{k} p(h_k \\mid I, \\Theta). \\tag{5} $$\n\nCompared with (3), joint locations are no longer independent. 
The energy function for this model is\n\nEn(z, h, I, \u0398) =\n\n\u03c8z(zi, zj) +\n\n\u03c8zh(zi, hk) +\n\n\u03c6h(hk, I).\n\n(6)\n\n(cid:88)\n\nk\n\n(cid:88)\n\n(i,j)\u2208Ez\n\ni