{"title": "Deeply Learning the Messages in Message Passing Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 361, "page_last": 369, "abstract": "Deep structured output learning shows great promise in tasks like semantic image segmentation. We proffer a new, efficient deep structured model learning scheme, in which we show how deep Convolutional Neural Networks (CNNs) can be used to directly estimate the messages in message passing inference for structured prediction with Conditional Random Fields (CRFs). With such CNN message estimators, we obviate the need to learn or evaluate potential functions for message calculation. This confers significant efficiency for learning, since otherwise when performing structured learning for a CRF with CNN potentials it is necessary to undertake expensive inference for every stochastic gradient iteration. The network output dimension of message estimators is the same as the number of classes, rather than exponentially growing in the order of the potentials. Hence it is more scalable for cases that involve a large number of classes. We apply our method to semantic image segmentation and achieve impressive performance, which demonstrates the effectiveness and usefulness of our CNN message learning method.", "full_text": "Deeply Learning the Messages in Message Passing Inference\n\nGuosheng Lin, Chunhua Shen, Ian Reid, Anton van den Hengel\n\nThe University of Adelaide, Australia; and Australian Centre for Robotic Vision\nE-mail: {guosheng.lin,chunhua.shen,ian.reid,anton.vandenhengel}@adelaide.edu.au\n\nAbstract\n\nDeep structured output learning shows great promise in tasks like semantic image segmentation. We proffer a new, ef\ufb01cient deep structured model learning scheme, in which we show how deep Convolutional Neural Networks (CNNs) can be used to directly estimate the messages in message passing inference for structured prediction with Conditional Random Fields (CRFs). 
With such CNN message estimators, we obviate the need to learn or evaluate potential functions for message calculation. This confers signi\ufb01cant ef\ufb01ciency for learning, since otherwise when performing structured learning for a CRF with CNN potentials it is necessary to undertake expensive inference for every stochastic gradient iteration. The network output dimension of message estimators is the same as the number of classes, rather than exponentially growing in the order of the potentials. Hence it is more scalable for cases that involve a large number of classes. We apply our method to semantic image segmentation and achieve impressive performance, which demonstrates the effectiveness and usefulness of our CNN message learning method.\n\n1\n\nIntroduction\n\nLearning deep structured models has attracted considerable research attention recently. One popular approach to deep structured models is to formulate conditional random \ufb01elds (CRFs) with deep Convolutional Neural Networks (CNNs) as the potential functions. This combines the power of CNNs for feature representation learning with the ability of CRFs to model complex relations. The typical approach for the joint learning of CRFs and CNNs [1, 2, 3, 4, 5] is to learn the CNN potential functions by optimizing the CRF objective, e.g., maximizing the log-likelihood. CNN and CRF joint learning has shown impressive performance for semantic image segmentation.\nFor the joint learning of CNNs and CRFs, stochastic gradient descent (SGD) is typically applied for optimizing the conditional likelihood. This approach requires marginal inference for calculating the gradient. For loopy graphs, marginal inference is generally expensive even when using approximate solutions. Given that learning the CNN potential functions typically requires a large number of gradient iterations, repeated marginal inference would make the training intractably slow. 
Applying an approximate training objective is a solution to avoid repeated inference; pseudo-likelihood learning [6] and piecewise learning [7, 3] are examples of this kind of approach. In this work, we advocate a new direction for ef\ufb01cient deep structured model learning.\nIn conventional CRF approaches, the \ufb01nal prediction is the result of inference based on the learned potentials. However, our ultimate goal is the \ufb01nal prediction (not the potentials themselves), so we propose to directly optimize the inference procedure for the \ufb01nal prediction. Our focus here is on the extensively studied message passing based inference algorithms. As discussed in [8], we can directly learn message estimators to output the required messages in the inference procedure, rather than learning the potential functions as in conventional CRF learning approaches. With the learned message estimators, we then obtain the \ufb01nal prediction by performing message passing inference.\nOur main contributions are as follows:\n1) We explore a new direction for ef\ufb01cient deep structured learning. We propose to directly learn the messages in message passing inference by training deep CNNs in an end-to-end fashion. Message learning does not require any inference step for the gradient calculation, which allows ef\ufb01cient training. Furthermore, when cast as a traditional classi\ufb01cation task, the network output dimension for message estimation is the same as the number of classes (K), while the network output for general CNN potential functions in CRFs is K^a, which is exponential in the order (a) of the potentials (for example, a = 2 for pairwise potentials, a = 3 for triple-cliques, etc). 
Hence CNN based message learning has signi\ufb01cantly fewer network parameters and thus is more scalable, especially in cases which involve a large number of classes.\n2) The number of iterations in message passing inference can be explicitly taken into consideration in the message learning procedure. In this paper, we are particularly interested in learning messages that are able to offer high-quality CRF prediction results with only one message passing iteration, making the message passing inference very fast.\n3) We apply our method to semantic image segmentation on the PASCAL VOC 2012 dataset and achieve impressive performance.\nRelated work Combining the strengths of CNNs and CRFs for segmentation has been explored in several recent methods. Some methods resort to a simple combination of CNN classi\ufb01ers and CRFs without joint learning. DeepLab-CRF in [9] \ufb01rst trains a fully convolutional network for pixel classi\ufb01cation and then applies a dense CRF [10] as a post-processing step. Later, the method in [2] extends DeepLab by jointly learning the dense CRFs and CNNs. CRF-RNN in [1] also performs joint learning of CNNs and the dense CRFs. They implement the mean-\ufb01eld inference as Recurrent Neural Networks, which facilitates end-to-end learning. These methods usually use CNNs for modelling the unary potentials only. The work in [3] trains CNNs to model both the unary and pairwise potentials in order to capture contextual information. Jointly learning CNNs and CRFs has also been explored for other applications like depth estimation [4, 11]. The work in [5] explores joint training of Markov random \ufb01elds and deep networks for predicting words from noisy images and image classi\ufb01cation.\nAll these above-mentioned methods that combine CNNs and CRFs are based upon conventional CRF approaches. They aim to jointly learn or incorporate pre-trained CNN potential functions, and then perform inference/prediction using the potentials. 
In contrast, our method here directly learns CNN message estimators for the message passing inference, rather than learning the potentials.\nThe inference machine proposed in [8] is relevant to our work in that it has discussed the idea of directly learning message estimators instead of learning potential functions for structured prediction. They train traditional logistic regressors with hand-crafted features as message estimators. Motivated by the tremendous success of CNNs, we propose to train deep CNN based message estimators in an end-to-end learning style without using hand-crafted features. Unlike the approach in [8], which aims to learn variable-to-factor message estimators, our proposed method aims to learn the factor-to-variable message estimators. Thus we are able to naturally formulate the variable marginals \u2013 which are the ultimate goal for CRF inference \u2013 as the training objective (see Sec. 3.3). The approach in [12] jointly learns CNNs and CRFs for pose estimation, in which they learn the marginal likelihood of body parts but ignore the partition function in the likelihood. Message learning is not discussed in that work, and the exact relationship between this pose estimation approach and message learning remains unclear.\n\n2 Learning CRF with CNN potentials\n\nBefore describing our message learning method, we review the CRF-CNN joint learning approach and discuss its limitations. An input image is denoted by x \u2208 X and the corresponding labeling mask is denoted by y \u2208 Y. The energy function is denoted by E(y, x), which measures the score of the prediction y given the input image x. We consider the following form of conditional likelihood:\n\n P(y|x) = \frac{1}{Z(x)} \exp[-E(y, x)] = \frac{\exp[-E(y, x)]}{\sum_{y'} \exp[-E(y', x)]}. \quad (1)\n\nHere Z is the partition function. The CRF model is decomposed by a factor graph over a set of factors \mathcal{F}. Generally, the energy function is written as a sum of potential functions (factor functions):\n\n E(y, x) = \sum_{F \in \mathcal{F}} E_F(y_F, x_F). \quad (2)\n\nHere F indexes one factor in the factor graph; yF denotes the variable nodes which are connected to the factor F; EF is the (log-)potential function (factor function). The potential function can be a unary, pairwise, or high-order potential function. The recent method in [3] describes examples of constructing general CNN based unary and pairwise potentials.\nTake semantic image segmentation as an example. To predict the pixel labels of a test image, we can \ufb01nd the mode of the joint label distribution by solving the maximum a posteriori (MAP) inference problem: y^\star = \operatorname{argmax}_y P(y|x). We can also obtain the \ufb01nal prediction by calculating the label marginal distribution of each variable, which requires solving a marginal inference problem:\n\n \forall p \in \mathcal{N}: \quad P(y_p|x) = \sum_{y \backslash y_p} P(y|x). \quad (3)\n\nHere y\\yp indicates the output variables y excluding yp. For a general CRF graph with cycles, the above inference problems are known to be NP-hard, thus approximate inference algorithms are applied. Message passing is a class of widely applied algorithms for approximate inference: loopy belief propagation (BP) [13], tree-reweighted message passing [14] and mean-\ufb01eld approximation [13] are examples of the message passing methods.\nCRF-CNN joint learning aims to learn CNN potential functions by optimizing the CRF objective, typically the negative conditional log-likelihood:\n\n -\log P(y|x; \theta) = E(y, x; \theta) + \log Z(x; \theta). \quad (4)\n\nThe energy function E(y, x) is constructed by CNNs, for which all the network parameters are denoted by \u03b8. 
Adding regularization, minimizing negative log-likelihood for CRF learning is:\n\n \min_\theta \; \frac{\lambda}{2} \|\theta\|_2^2 + \sum_{i=1}^{N} \big[ E(y^{(i)}, x^{(i)}; \theta) + \log Z(x^{(i)}; \theta) \big]. \quad (5)\n\nHere x^{(i)}, y^{(i)} denote the i-th training image and its segmentation mask; N is the number of training images; \u03bb is the weight decay parameter. We can apply stochastic gradient descent (SGD) to optimize the above problem for learning \u03b8. The energy function E(y, x; \u03b8) is constructed from CNNs, and its gradient \u2207\u03b8E(y, x; \u03b8) can be easily computed by applying the chain rule as in conventional CNNs. However, the partition function Z brings dif\ufb01culties for optimization. Its gradient is:\n\n \nabla_\theta \log Z(x; \theta) = \sum_y \frac{\exp[-E(y, x; \theta)]}{\sum_{y'} \exp[-E(y', x; \theta)]} \nabla_\theta [-E(y, x; \theta)] = -\mathbb{E}_{y \sim P(y|x; \theta)} \nabla_\theta E(y, x; \theta). \quad (6)\n\nDirect calculation of the above gradient is computationally infeasible for general CRF graphs. Usually it is necessary to perform approximate marginal inference to calculate the gradients at each SGD iteration [13]. However, repeated marginal inference can be extremely expensive, as discussed in [3]. CNN training usually requires a huge number of SGD iterations (hundreds of thousands, or even millions), hence this inference based learning approach is in general not scalable or even infeasible.\n\n3 Learning CNN message estimators\n\nIn conventional CRF approaches, the potential functions are \ufb01rst learned, and then inference is performed based on the learned potential functions to generate the \ufb01nal prediction. In contrast, our approach directly optimizes the inference procedure for \ufb01nal prediction. 
We propose to learn CNN estimators to directly output the required intermediate values in an inference algorithm.\nHere we focus on the message passing based inference algorithm which has been extensively studied and widely applied. In the CRF prediction procedure, the \u201cmessage\u201d vectors are recursively calculated based on the learned potentials. We propose to construct and learn CNNs to directly estimate these messages in the message passing procedure, rather than learning the potential functions. In particular, we directly learn factor-to-variable message estimators. Our message learning framework is general and can accommodate all message passing based algorithms such as loopy belief propagation (BP) [13], mean-\ufb01eld approximation [13] and their variants. Here we discuss using loopy BP for calculating variable marginals. As shown by Yedidia et al. [15], loopy BP has a close relation with Bethe free energy approximation.\nTypically, the message is a K-dimensional vector (K is the number of classes) which encodes the information of the label distribution. For each variable-factor connection, we need to recursively compute the variable-to-factor message: \u03b2p\u2192F \u2208 RK, and the factor-to-variable message: \u03b2F\u2192p \u2208 RK. The unnormalized variable-to-factor message is computed as:\n\n \bar{\beta}_{p \to F}(y_p) = \sum_{F' \in \mathcal{F}_p \backslash F} \beta_{F' \to p}(y_p). \quad (7)\n\nHere Fp is a set of factors connected to the variable p; Fp\\F is the set of factors Fp excluding the factor F. For loopy graphs, the variable-to-factor message is normalized at each iteration:\n\n \beta_{p \to F}(y_p) = \log \frac{\exp \bar{\beta}_{p \to F}(y_p)}{\sum_{y'_p} \exp \bar{\beta}_{p \to F}(y'_p)}. \quad (8)\n\nThe factor-to-variable message is computed as:\n\n \beta_{F \to p}(y_p) = \log \sum_{y'_F \backslash y'_p,\, y'_p = y_p} \exp \Big[ -E_F(y'_F) + \sum_{q \in \mathcal{N}_F \backslash p} \beta_{q \to F}(y'_q) \Big]. \quad (9)\n\nHere NF is a set of variables connected to the factor F; NF\\p is the set of variables NF excluding the variable p. Once we get all the factor-to-variable messages of one variable node, we are able to calculate the marginal distribution (beliefs) of that variable:\n\n P(y_p|x) = \sum_{y \backslash y_p} P(y|x) = \frac{1}{Z_p} \exp \Big[ \sum_{F \in \mathcal{F}_p} \beta_{F \to p}(y_p) \Big], \quad (10)\n\nin which Zp is a normalizer: Z_p = \sum_{y_p} \exp \big[ \sum_{F \in \mathcal{F}_p} \beta_{F \to p}(y_p) \big].\n\n3.1 CNN message estimators\n\nThe calculation of factor-to-variable message \u03b2F\u2192p depends on the variable-to-factor messages \u03b2p\u2192F. 
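The message updates above can be sketched for pairwise factors on a small cyclic graph. This is an illustrative toy implementation, not the paper's system: unary terms are folded in as extra factor-to-variable messages, and all potential tables are made-up numbers:

```python
import math

# Illustrative synchronous loopy BP with pairwise factors on a 3-cycle.
K = 2
edges = [(0, 1), (1, 2), (0, 2)]                       # cyclic, so "loopy"
unary = {0: [0.1, 0.9], 1: [0.8, 0.2], 2: [0.5, 0.5]}  # E_p(y_p)
pair = [[0.0, 1.0], [1.0, 0.0]]                        # E_pq(y_p, y_q)

def normalize_log(v):
    # Eq. (8): log-normalize so that exp(v) sums to 1
    lse = math.log(sum(math.exp(x) for x in v))
    return [x - lse for x in v]

# factor-to-variable messages beta[(edge, node)], initialized to zero
beta = {(e, p): [0.0] * K for e in edges for p in e}

for _ in range(10):                                    # inference iterations
    new = {}
    for e in edges:
        for src, dst in (e, e[::-1]):
            # Eqs. (7)-(8): sum incoming factor-to-variable messages at src,
            # excluding the current factor, then normalize
            v2f = normalize_log([-unary[src][k] + sum(
                beta[(e2, src)][k] for e2 in edges
                if src in e2 and e2 != e) for k in range(K)])
            # Eq. (9): push through the pairwise potential
            new[(e, dst)] = [math.log(sum(
                math.exp(-pair[ks][kd] + v2f[ks]) for ks in range(K)))
                for kd in range(K)]
    beta = new                      # synchronous update: one inference pass

def belief(p):
    # Eq. (10): normalized exp of the summed factor-to-variable messages
    b = [math.exp(-unary[p][k] + sum(beta[(e, p)][k]
         for e in edges if p in e)) for k in range(K)]
    s = sum(b)
    return [x / s for x in b]
```

The `beta = new` line is the synchronous passing strategy discussed below: all messages are computed from the previous iteration's messages before any are passed on.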
Substituting the de\ufb01nition of \u03b2p\u2192F in (8), \u03b2F\u2192p can be re-written as:\n\n \beta_{F \to p}(y_p) = \log \sum_{y'_F \backslash y'_p,\, y'_p = y_p} \exp \Big\{ -E_F(y'_F) + \sum_{q \in \mathcal{N}_F \backslash p} \log \frac{\exp \bar{\beta}_{q \to F}(y'_q)}{\sum_{y''_q} \exp \bar{\beta}_{q \to F}(y''_q)} \Big\} = \log \sum_{y'_F \backslash y'_p,\, y'_p = y_p} \exp \Big\{ -E_F(y'_F) + \sum_{q \in \mathcal{N}_F \backslash p} \log \frac{\exp \sum_{F' \in \mathcal{F}_q \backslash F} \beta_{F' \to q}(y'_q)}{\sum_{y''_q} \exp \sum_{F' \in \mathcal{F}_q \backslash F} \beta_{F' \to q}(y''_q)} \Big\}. \quad (11)\n\nHere q denotes a variable node which is connected to the node p by the factor F in the factor graph. We refer to the variable node q as a neighboring node of p. NF\\p is a set of variables connected to the factor F excluding the node p. Clearly, for a pairwise factor which only connects two variables, the set NF\\p only contains one variable node. The above equations show that the factor-to-variable message \u03b2F\u2192p depends on the potential EF and the messages \u03b2F'\u2192q. Here \u03b2F'\u2192q is the factor-to-variable message which is calculated from a neighboring node q and a factor F' \u2260 F.\nConventional CRF learning approaches learn the potential function and then follow the above equations to compute the messages for calculating marginals. As discussed in [8], given that the goal is to estimate the marginals, it is not necessary to exactly follow the above equations, which involve learning potential functions, to calculate messages. We can directly learn message estimators, rather than indirectly learning the potential functions as in conventional methods.\nConsider the calculation in (11). 
The message \u03b2F\u2192p depends on the observation xpF and the messages \u03b2F'\u2192q. Here xpF denotes the observations that correspond to the node p and the factor F. We are able to formulate a factor-to-variable message estimator which takes xpF and \u03b2F'\u2192q as inputs and outputs the message vector, and we directly learn such estimators. Since one message \u03b2F\u2192p depends on a number of previous messages \u03b2F'\u2192q, we can formulate a sequence of message estimators to model the dependence. Thus the output from a previous message estimator will be the input of the following message estimator.\nThere are two message passing strategies for loopy BP: synchronous and asynchronous passing. We here focus on synchronous message passing, for which all messages are computed before being passed to the neighbors. The synchronous passing strategy results in much simpler message dependences than the asynchronous strategy, which simpli\ufb01es the training procedure. We de\ufb01ne one inference iteration as one pass of the graph with the synchronous passing strategy.\nWe propose to learn CNN based factor-to-variable message estimators. The message estimator models the interaction between neighboring variable nodes. We denote by M a message estimator. The factor-to-variable message is calculated as:\n\n \beta_{F \to p}(y_p) = M_F(x_{pF}, d_{pF}, y_p). \quad (12)\n\nWe refer to dpF as the dependent message feature vector, which encodes all dependent messages from the neighboring nodes that are connected to the node p by F. Note that the dependent messages are the output of message estimators at the previous inference iteration. In the case of running only one message passing iteration, there are no dependent messages for MF, and thus we do not need to incorporate dpF. 
To have a general exposition, we here describe the case of running arbitrarily many inference iterations.\nWe can choose any effective strategy to generate the feature vector dpF from the dependent messages. Here we discuss a simple example. According to (11), we de\ufb01ne the feature vector dpF as a K-dimensional vector which aggregates all dependent messages. In this case, dpF is computed as:\n\n d_{pF}(y) = \sum_{q \in \mathcal{N}_F \backslash p} \log \frac{\exp \sum_{F' \in \mathcal{F}_q \backslash F} M_{F'}(x_{qF'}, d_{qF'}, y)}{\sum_{y'} \exp \sum_{F' \in \mathcal{F}_q \backslash F} M_{F'}(x_{qF'}, d_{qF'}, y')}. \quad (13)\n\nWith the de\ufb01nition of dpF in (13) and \u03b2F\u2192p in (12), it is clear that message estimation requires evaluating a sequence of message estimators. Another example is to concatenate all dependent messages to construct the feature vector dpF.\nThere are different strategies to formulate the message estimators in different iterations. One strategy is to use the same message estimator across all inference iterations. In this case the message estimator becomes a recursive function, and thus the CNN based estimator becomes a recurrent neural network (RNN). Another strategy is to formulate a different estimator for each inference iteration.\n\n3.2 Details for message estimator networks\n\nWe formulate the estimator MF as a CNN, thus the estimation is the network output:\n\n \beta_{F \to p}(y_p) = M_F(x_{pF}, d_{pF}, y_p; \theta_F) = \sum_{k=1}^{K} \delta(k = y_p)\, z_{pF,k}(x, d_{pF}; \theta_F). \quad (14)\n\nHere \u03b8F denotes the network parameters which we need to learn. \u03b4(\u00b7) is the indicator function, which equals 1 if the input is true and 0 otherwise. 
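For a pairwise factor, the aggregation in Eq. (13) reduces to a log-softmax over the single neighbor's summed incoming messages. A small illustrative sketch, with made-up message values standing in for the previous iteration's estimator outputs:

```python
import math

# Illustrative computation of the dependent-message feature d_pF of
# Eq. (13) for a pairwise factor F: the single neighbor q contributes a
# log-softmax over its summed incoming messages. The values below are made
# up; in the real system they are estimator outputs from the previous
# inference iteration.
K = 3

def log_softmax(v):
    lse = math.log(sum(math.exp(x) for x in v))
    return [x - lse for x in v]

# outputs M_{F'}(x_qF', d_qF', .) of the factors F' != F touching q
msgs_to_q = [[0.5, -0.1, 0.4], [0.0, 0.3, -0.2]]

summed = [sum(m[k] for m in msgs_to_q) for k in range(K)]
d_pF = log_softmax(summed)   # the K-dimensional feature of Eq. (13)
```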
We denote by zpF \u2208 RK the K-dimensional output vector (K is the number of classes) of the message estimator network for the node p and the factor F; zpF,k is the k-th value in the network output zpF, corresponding to the k-th class.\nWe can consider any possible strategy for implementing zpF with CNNs. For example, we here describe a strategy which is analogous to the network design in [3]. We denote by C(1) a fully convolutional network (FCNN) [16] for convolutional feature generation, and by C(2) a traditional fully connected network for message estimation.\nGiven an input image x, the network output C(1)(x) \u2208 R^{N1 \u00d7 N2 \u00d7 r} is a convolutional feature map, in which N1 \u00d7 N2 = N is the feature map size and r is the dimension of one feature vector. Each spatial position (each feature vector) in the feature map C(1)(x) corresponds to one variable node in the CRF graph. We denote by C(1)(x, p) \u2208 Rr the feature vector corresponding to the variable node p. Likewise, C(1)(x, NF\\p) \u2208 Rr is the averaged vector of the feature vectors that correspond to the set of nodes NF\\p. Recall that NF\\p is a set of nodes connected by the factor F excluding the node p. For pairwise factors, NF\\p contains only one node.\nWe construct the feature vector z^{C(1)}_{pF} \u2208 R^{2r} for the node-factor pair (p, F) by concatenating C(1)(x, p) and C(1)(x, NF\\p). Finally, we concatenate the node-factor feature vector z^{C(1)}_{pF} and the dependent message feature vector dpF as the input for the second network C(2). Thus the input dimension for C(2) is (2r + K). For running only one inference iteration, the input for C(2) is z^{C(1)}_{pF} alone. 
The \ufb01nal output from the second network C(2) is the K-dimensional message vector zpF. To sum up, we generate the \ufb01nal message vector zpF as:\n\n z_{pF} = C^{(2)}\big\{ \big[\, C^{(1)}(x, p)^\top;\; C^{(1)}(x, \mathcal{N}_F \backslash p)^\top;\; d_{pF}^\top \big]^\top \big\}. \quad (15)\n\nFor a general CNN based potential function in conventional CRFs, the potential network is usually required to have a large number of output units (exponential in the order of the potentials). For example, it requires K^2 (K is the number of classes) outputs for the pairwise potentials [3]. A large number of output units would signi\ufb01cantly increase the number of network parameters. It leads to expensive computations and tends to over-\ufb01t the training data. In contrast, for learning our CNN message estimator, we only need to formulate K output units for the network. Clearly it is more scalable in the case of a large number of classes.\n\n3.3 Training CNN message estimators\n\nOur goal is to estimate the variable marginals in (3), which can be re-written with the estimators:\n\n P(y_p|x) = \sum_{y \backslash y_p} P(y|x) = \frac{1}{Z_p} \exp \Big[ \sum_{F \in \mathcal{F}_p} \beta_{F \to p}(y_p) \Big] = \frac{1}{Z_p} \exp \sum_{F \in \mathcal{F}_p} M_F(x_{pF}, d_{pF}, y_p; \theta_F).\n\nHere Zp is the normalizer. The ideal variable marginal, for example, has the probability of 1 for the ground truth class and 0 for the remaining classes. Here we consider the cross entropy loss between the ideal marginal and the estimated marginal:\n\n J(x, \hat{y}; \theta) = -\sum_{p \in \mathcal{N}} \sum_{y_p=1}^{K} \delta(y_p = \hat{y}_p) \log P(y_p|x; \theta) = -\sum_{p \in \mathcal{N}} \sum_{y_p=1}^{K} \delta(y_p = \hat{y}_p) \log \frac{\exp \sum_{F \in \mathcal{F}_p} M_F(x_{pF}, d_{pF}, y_p; \theta_F)}{\sum_{y'_p} \exp \sum_{F \in \mathcal{F}_p} M_F(x_{pF}, d_{pF}, y'_p; \theta_F)}, \quad (16)\n\nin which \u02c6yp is the ground truth label for the variable node p. 
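The marginal and loss above can be sketched numerically. In the toy code below, a random linear map stands in for the second network C(2) (it is not the paper's network); it maps the concatenated feature to K message scores, the node marginal is the softmax of the summed factor-to-variable messages, and the loss is the cross entropy at one node:

```python
import math
import random

# Illustrative only: a random linear map plays the role of C(2); all
# feature values and shapes are made up.
random.seed(0)
K, r = 3, 4
W = [[random.uniform(-0.1, 0.1) for _ in range(2 * r + K)] for _ in range(K)]

def estimator(feat_p, feat_nbr, d_pF):
    z = feat_p + feat_nbr + d_pF       # concatenation, cf. Eq. (15)
    return [sum(w * x for w, x in zip(row, z)) for row in W]  # K scores

def marginal(messages):
    # softmax of the summed factor-to-variable messages at one node
    s = [sum(m[k] for m in messages) for k in range(K)]
    mx = max(s)
    e = [math.exp(x - mx) for x in s]
    tot = sum(e)
    return [x / tot for x in e]

# one node connected to two factors, with made-up features (zero dependent
# messages, as in the one-iteration setting)
feats = [([0.5] * r, [0.1] * r, [0.0] * K), ([0.2] * r, [0.3] * r, [0.0] * K)]
msgs = [estimator(*f) for f in feats]
P = marginal(msgs)
loss = -math.log(P[1])      # cross entropy term for ground truth label 1
```

Because the marginal is a differentiable function of the estimator outputs, the loss can be back-propagated into the message networks directly, with no inference step in the gradient computation.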
Given a set of N training images and label masks, the optimization problem for learning the message estimator network is:\n\n \min_\theta \; \frac{\lambda}{2} \|\theta\|_2^2 + \sum_{i=1}^{N} J(x^{(i)}, \hat{y}^{(i)}; \theta). \quad (17)\n\nThe work in [8] proposed to learn the variable-to-factor message (\u03b2p\u2192F). Unlike their approach, we aim to learn the factor-to-variable message (\u03b2F\u2192p), for which we are able to naturally formulate the variable marginals, which are the ultimate goal for prediction, as the training objective. Moreover, for learning \u03b2p\u2192F in their approach, the message estimator will depend on all neighboring nodes (connected by any factors). Given that variable nodes will have different numbers of neighboring nodes, they only consider a \ufb01xed number of neighboring nodes (e.g., 20) and concatenate their features to generate a \ufb01xed-length feature vector for classi\ufb01cation. In our case, for learning \u03b2F\u2192p, the message estimator only depends on a \ufb01xed number of neighboring nodes (connected by one factor), thus we do not have this problem. Most importantly, they learn message estimators by training traditional probabilistic classi\ufb01ers (e.g., simple logistic regressors) with hand-crafted features; in contrast, we train deep CNNs in an end-to-end learning style without using hand-crafted features.\n\n3.4 Message learning with inference-time budgets\n\nOne advantage of message learning is that we are able to explicitly incorporate the expected number of inference iterations into the learning procedure. The number of inference iterations de\ufb01nes the learning sequence of message estimators. This is particularly useful if we aim to learn estimators which are capable of high-quality predictions within only a few inference iterations. 
In contrast, conventional potential function learning in CRFs is not able to directly incorporate the expected number of inference iterations.\nWe are particularly interested in learning message estimators for use with only one message passing iteration, because of the speed of such inference. In this case it might be preferable to have large-range neighborhood connections, so that large range interaction can be captured within one inference pass.\n\nTable 1: Segmentation results on the PASCAL VOC 2012 \u201cval\u201d set. We compare with several recent CNN based methods with available results on the \u201cval\u201d set. Our method performs the best.\n\nmethod | training set | # train (approx.) | IoU val set\nContextDCRF [3] | VOC extra | 10k | 70.3\nZoom-out [17] | VOC extra | 10k | 63.5\nDeep-struct [2] | VOC extra | 10k | 64.1\nDeepLab-CRF [9] | VOC extra | 10k | 63.7\nDeepLab-MCL [9] | VOC extra | 10k | 68.7\nBoxSup [18] | VOC extra | 10k | 63.8\nBoxSup [18] | VOC extra + COCO | 133k | 68.1\nours | VOC extra | 10k | 71.1\nours+ | VOC extra | 10k | 73.3\n\n4 Experiments\n\nWe evaluate the proposed CNN message learning method for semantic image segmentation. We use the publicly available PASCAL VOC 2012 dataset [19]. There are 20 object categories and one background category in the dataset. It contains 1464 images in the training set, 1449 images in the \u201cval\u201d set and 1456 images in the test set. Following the common practice in [20, 9], the training set is augmented to 10582 images by including the extra annotations provided in [21] for the VOC images. 
We use the intersection-over-union (IoU) score [19] to evaluate the segmentation performance. For the learning and prediction of our method, we only use one message passing iteration.\nThe recent work in [3] (referred to as ContextDCRF) learns multi-scale fully convolutional CNNs (FCNNs) for unary and pairwise potential functions to capture contextual information. We follow this CRF learning method and replace the potential functions by the proposed message estimators.\nWe consider 2 types of spatial relations for constructing the pairwise connections of variable nodes. One is the \u201csurrounding\u201d spatial relation, for which one node is connected to its surrounding nodes. The other one is the \u201cabove/below\u201d spatial relation, for which one node is connected to the nodes that lie above. For the pairwise connections, the neighborhood size is de\ufb01ned by a range box. We learn one type of unary message estimator and 3 types of pairwise message estimators in total. One type of pairwise message estimator is for the \u201csurrounding\u201d spatial relations, and the other two are for the \u201cabove/below\u201d spatial relations. We formulate one network for one type of message estimator.\nWe formulate our message estimators as multi-scale FCNNs, for which we apply a similar network con\ufb01guration as in [3]. The network C(1) (see Sec. 3.2 for details) has 6 convolution blocks and C(2) has 2 fully connected layers (with K output units). Our networks are initialized using the VGG-16 model [22]. We train all layers using back-propagation. Our system is built on MatConvNet [23].\nWe \ufb01rst evaluate our method on the VOC 2012 \u201cval\u201d set. We compare with several recent CNN based methods with available results on the \u201cval\u201d set. Results are shown in Table 1. Our method achieves the best performance. 
The competing method ContextDCRF follows a conventional CRF learning and prediction scheme: they \ufb01rst learn potentials and then perform inference based on the learned potentials to output \ufb01nal predictions. The result shows that learning the CNN message estimators is able to achieve similar performance compared to learning CNN potential functions in CRFs. Note that since here we only use one message passing iteration for the training and prediction, the inference is particularly ef\ufb01cient.\nTo further improve the performance, we perform simple data augmentation in training. We generate 4 extra scales ([0.8, 0.9, 1.1, 1.2]) of the training images and their \ufb02ipped images for training. This result is denoted by \u201cours+\u201d in the result table.\n\nTable 2: Category results on the PASCAL VOC 2012 test set. Our method performs the best.\n\nmethod | mean | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | potted | sheep | sofa | train | tv\nDeepLab-CRF [9] | 66.4 | 78.4 | 33.1 | 78.2 | 55.6 | 65.3 | 81.3 | 75.5 | 78.6 | 25.3 | 69.2 | 52.7 | 75.2 | 69.0 | 79.1 | 77.6 | 54.7 | 78.3 | 45.1 | 73.3 | 56.2\nDeepLab-MCL [9] | 71.6 | 84.4 | 54.5 | 81.5 | 63.6 | 65.9 | 85.1 | 79.1 | 83.4 | 30.7 | 74.1 | 59.8 | 79.0 | 76.1 | 83.2 | 80.8 | 59.7 | 82.2 | 50.4 | 73.1 | 63.7\nFCN-8s [16] | 62.2 | 76.8 | 34.2 | 68.9 | 49.4 | 60.3 | 75.3 | 74.7 | 77.6 | 21.4 | 62.5 | 46.8 | 71.8 | 63.9 | 76.5 | 73.9 | 45.2 | 72.4 | 37.4 | 70.9 | 55.1\nCRF-RNN [1] | 72.0 | 87.5 | 39.0 | 79.7 | 64.2 | 68.3 | 87.6 | 80.8 | 84.4 | 30.4 | 78.2 | 60.4 | 80.5 | 77.8 | 83.1 | 80.6 | 59.5 | 82.8 | 47.8 | 78.3 | 67.1\nours | 73.4 | 90.1 | 38.6 | 77.8 | 61.3 | 74.3 | 89.0 | 83.4 | 83.3 | 36.2 | 80.2 | 56.4 | 81.2 | 81.4 | 83.1 | 82.9 | 59.2 | 83.4 | 54.3 | 80.6 | 70.8\n\nTable 3: Segmentation results on the PASCAL VOC 2012 test set. 
Compared to methods that use the same augmented VOC dataset, our method has the best performance.

method | training set | # train (approx.) | IoU test set
ContextDCRF [3] | VOC extra | 10k | 70.7
Zoom-out [17] | VOC extra | 10k | 64.4
FCN-8s [16] | VOC extra | 10k | 62.2
SDS [20] | VOC extra | 10k | 51.6
DeconvNet-CRF [24] | VOC extra | 10k | 72.5
DeepLab-CRF [9] | VOC extra | 10k | 66.4
DeepLab-MCL [9] | VOC extra | 10k | 71.6
CRF-RNN [1] | VOC extra | 10k | 72.0
DeepLab-CRF [25] | VOC extra + COCO | 133k | 70.4
DeepLab-MCL [25] | VOC extra + COCO | 133k | 72.7
BoxSup (semi) [18] | VOC extra + COCO | 133k | 71.0
CRF-RNN [1] | VOC extra + COCO | 133k | 74.7
ours | VOC extra | 10k | 73.4

We further evaluate our method on the VOC 2012 test set, comparing with recent state-of-the-art CNN methods with competitive performance. The results are shown in Table 3. Since the ground-truth labels are not available for the test set, we evaluate our method through the VOC evaluation server. We achieve very competitive performance on the test set: a 73.4 IoU score^1, which is to date the best performance amongst methods that use the same augmented VOC training dataset [21] (marked as "VOC extra" in the table). These results validate the effectiveness of direct message learning with CNNs. We also include a comparison with methods trained on the much larger COCO dataset (around 133k training images). Our performance is comparable with these methods, even though we use many fewer training images.

The results for each category are shown in Table 2. We compare with several recent methods that transfer layers from the same VGG-16 model and use the same training data. Our method performs the best in 13 of the 20 categories.

5 Conclusion

We have proposed a new deep message learning framework for structured CRF prediction. Learning deep message estimators for message passing inference reveals a new direction for learning deep structured models.
Learning CNN message estimators is efficient, since it does not involve expensive inference steps for gradient calculation. The network output dimension for message estimation equals the number of classes and does not grow with the order of the potentials; hence CNN message learning has fewer network parameters and is more scalable in the number of classes than conventional potential function learning. Our impressive performance on semantic segmentation demonstrates the effectiveness and usefulness of the proposed deep message learning. Our framework is general and can be readily applied to other structured prediction applications.

Acknowledgements This research was supported by the Data to Decisions Cooperative Research Centre and by the Australian Research Council through the ARC Centre for Robotic Vision CE140100016 and through a Laureate Fellowship FL130100102 to I. Reid. Correspondence should be addressed to C. Shen.

^1 The result link provided by the VOC evaluation server: http://host.robots.ox.ac.uk:8080/anonymous/DBD0SI.html

References
[1] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr, "Conditional random fields as recurrent neural networks," 2015. [Online]. Available: http://arxiv.org/abs/1502.03240
[2] A. Schwing and R. Urtasun, "Fully connected deep structured networks," 2015. [Online]. Available: http://arxiv.org/abs/1503.02351
[3] G. Lin, C. Shen, I. Reid, and A. van den Hengel, "Efficient piecewise training of deep structured models for semantic segmentation," 2015. [Online]. Available: http://arxiv.org/abs/1504.01013
[4] F. Liu, C. Shen, and G. Lin, "Deep convolutional neural fields for depth estimation from a single image," in Proc. IEEE Conf. Comp. Vis. Pattern Recogn., 2015.
[5] L. Chen, A. Schwing, A. Yuille, and R. Urtasun, "Learning deep structured models," 2014.
[Online]. Available: http://arxiv.org/abs/1407.2538
[6] J. Besag, "Efficiency of pseudolikelihood estimation for simple Gaussian fields," Biometrika, 1977.
[7] C. Sutton and A. McCallum, "Piecewise training for undirected models," in Proc. Conf. Uncertainty in Artificial Intelligence, 2005.
[8] S. Ross, D. Munoz, M. Hebert, and J. Bagnell, "Learning message-passing inference machines for structured prediction," in Proc. IEEE Conf. Comp. Vis. Pattern Recogn., 2011.
[9] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. Yuille, "Semantic image segmentation with deep convolutional nets and fully connected CRFs," 2014. [Online]. Available: http://arxiv.org/abs/1412.7062
[10] P. Krähenbühl and V. Koltun, "Efficient inference in fully connected CRFs with Gaussian edge potentials," in Proc. Adv. Neural Info. Process. Syst., 2012.
[11] F. Liu, C. Shen, G. Lin, and I. Reid, "Learning depth from single monocular images using deep convolutional neural fields," 2015. [Online]. Available: http://arxiv.org/abs/1502.07411
[12] J. Tompson, A. Jain, Y. LeCun, and C. Bregler, "Joint training of a convolutional network and a graphical model for human pose estimation," in Proc. Adv. Neural Info. Process. Syst., 2014.
[13] S. Nowozin and C. Lampert, "Structured learning and prediction in computer vision," Found. Trends. Comput. Graph. Vis., 2011.
[14] V. Kolmogorov, "Convergent tree-reweighted message passing for energy minimization," IEEE T. Pattern Analysis & Machine Intelligence, 2006.
[15] J. S. Yedidia, W. T. Freeman, Y. Weiss et al., "Generalized belief propagation," in Proc. Adv. Neural Info. Process. Syst., 2000.
[16] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proc. IEEE Conf. Comp. Vis. Pattern Recogn., 2015.
[17] M.
Mostajabi, P. Yadollahpour, and G. Shakhnarovich, "Feedforward semantic segmentation with zoom-out features," 2014. [Online]. Available: http://arxiv.org/abs/1412.0774
[18] J. Dai, K. He, and J. Sun, "BoxSup: exploiting bounding boxes to supervise convolutional networks for semantic segmentation," 2015. [Online]. Available: http://arxiv.org/abs/1503.01640
[19] M. Everingham, L. V. Gool, C. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," Int. J. Comp. Vis., 2010.
[20] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, "Simultaneous detection and segmentation," in Proc. European Conf. Computer Vision, 2014.
[21] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik, "Semantic contours from inverse detectors," in Proc. Int. Conf. Comp. Vis., 2011.
[22] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014. [Online]. Available: http://arxiv.org/abs/1409.1556
[23] A. Vedaldi and K. Lenc, "MatConvNet: convolutional neural networks for MATLAB," in Proc. ACM Int. Conf. Multimedia, 2015.
[24] H. Noh, S. Hong, and B. Han, "Learning deconvolution network for semantic segmentation," in Proc. IEEE Conf. Comp. Vis. Pattern Recogn., 2015.
[25] G. Papandreou, L. Chen, K. Murphy, and A. Yuille, "Weakly- and semi-supervised learning of a DCNN for semantic image segmentation," 2015. [Online].
Available: http://arxiv.org/abs/1502.02734