{"title": "Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials", "book": "Advances in Neural Information Processing Systems", "page_first": 109, "page_last": 117, "abstract": "", "full_text": "Ef\ufb01cient Inference in Fully Connected CRFs with\n\nGaussian Edge Potentials\n\nPhilipp Kr\u00a8ahenb\u00a8uhl\n\nComputer Science Department\n\nStanford University\n\nphilkr@cs.stanford.edu\n\nVladlen Koltun\n\nComputer Science Department\n\nStanford University\n\nvladlen@cs.stanford.edu\n\nAbstract\n\nMost state-of-the-art techniques for multi-class image segmentation and labeling\nuse conditional random \ufb01elds de\ufb01ned over pixels or image regions. While region-\nlevel models often feature dense pairwise connectivity, pixel-level models are con-\nsiderably larger and have only permitted sparse graph structures. In this paper, we\nconsider fully connected CRF models de\ufb01ned on the complete set of pixels in an\nimage. The resulting graphs have billions of edges, making traditional inference\nalgorithms impractical. Our main contribution is a highly ef\ufb01cient approximate\ninference algorithm for fully connected CRF models in which the pairwise edge\npotentials are de\ufb01ned by a linear combination of Gaussian kernels. Our experi-\nments demonstrate that dense connectivity at the pixel level substantially improves\nsegmentation and labeling accuracy.\n\n1\n\nIntroduction\n\nMulti-class image segmentation and labeling is one of the most challenging and actively studied\nproblems in computer vision. The goal is to label every pixel in the image with one of several prede-\ntermined object categories, thus concurrently performing recognition and segmentation of multiple\nobject classes. 
A common approach is to pose this problem as maximum a posteriori (MAP) inference in a conditional random field (CRF) defined over pixels or image patches [8, 12, 18, 19, 9]. The CRF potentials incorporate smoothness terms that maximize label agreement between similar pixels, and can integrate more elaborate terms that model contextual relationships between object classes.

Basic CRF models are composed of unary potentials on individual pixels or image patches and pairwise potentials on neighboring pixels or patches [19, 23, 7, 5]. The resulting adjacency CRF structure is limited in its ability to model long-range connections within the image and generally results in excessive smoothing of object boundaries. In order to improve segmentation and labeling accuracy, researchers have expanded the basic CRF framework to incorporate hierarchical connectivity and higher-order potentials defined on image regions [8, 12, 9, 13]. However, the accuracy of these approaches is necessarily restricted by the accuracy of the unsupervised image segmentation that is used to compute the regions on which the model operates. This limits the ability of region-based approaches to produce accurate label assignments around complex object boundaries, although significant progress has been made [9, 13, 14].

In this paper, we explore a different model structure for accurate semantic segmentation and labeling. We use a fully connected CRF that establishes pairwise potentials on all pairs of pixels in the image. Fully connected CRFs have been used for semantic image labeling in the past [18, 22, 6, 17], but the complexity of inference in fully connected models has restricted their application to sets of hundreds of image regions or fewer. The segmentation accuracy achieved by these approaches is again limited by the unsupervised segmentation that produces the regions.
In contrast, our model connects all pairs of individual pixels in the image, enabling greatly refined segmentation and labeling. The main challenge is the size of the model, which has tens of thousands of nodes and billions of edges even on low-resolution images.

[Figure 1: Pixel-level classification with a fully connected CRF. (a) Input image from the MSRC-21 dataset. (b) The response of unary classifiers used by our models. (c) Classification produced by the Robust P^n CRF [9]. (d) Classification produced by MCMC inference [17] in a fully connected pixel-level CRF model; the algorithm was run for 36 hours and only partially converged for the bottom image. (e) Classification produced by our inference algorithm in the fully connected model in 0.2 seconds.]

Our main contribution is a highly efficient inference algorithm for fully connected CRF models in which the pairwise edge potentials are defined by a linear combination of Gaussian kernels in an arbitrary feature space. The algorithm is based on a mean field approximation to the CRF distribution. This approximation is iteratively optimized through a series of message passing steps, each of which updates a single variable by aggregating information from all other variables. We show that a mean field update of all variables in a fully connected CRF can be performed using Gaussian filtering in feature space. This allows us to reduce the computational complexity of message passing from quadratic to linear in the number of variables by employing efficient approximate high-dimensional filtering [16, 2, 1].
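As a rough illustration of this idea, the sketch below performs one mean field update for a dense CRF with a single spatial Gaussian kernel and Potts compatibility, realizing the message passing step as a Gaussian filtering of each label's marginal plane. The function name, the parameters `sigma` and `w`, and the single-kernel setup are illustrative simplifications, not the paper's full algorithm.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mean_field_step(Q, unary, sigma=3.0, w=1.0):
    """One simplified mean field update for a dense CRF.

    Q     : (H, W, K) current label marginals, each pixel's row sums to 1
    unary : (H, W, K) unary potentials psi_u(x_i = l)
    """
    # Message passing: for a Gaussian edge potential, summing
    # k(f_i, f_j) Q_j(l) over all other pixels j is a Gaussian filtering
    # of each label plane -- linear rather than quadratic cost.
    msg = np.stack(
        [gaussian_filter(Q[:, :, l], sigma) for l in range(Q.shape[2])],
        axis=-1,
    )
    msg -= Q  # approximately remove the self-contribution k(f_i, f_i) Q_i(l)

    # Compatibility transform with a Potts model mu(l, l') = [l != l']:
    # label l is penalized by the filtered mass of all other labels.
    pairwise = w * (msg.sum(axis=-1, keepdims=True) - msg)

    # Local update and normalization (a per-pixel softmax).
    logits = -unary - pairwise
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    Q_new = np.exp(logits)
    return Q_new / Q_new.sum(axis=-1, keepdims=True)
```

Iterating this update from the unary-only marginals sketches the overall inference loop; the paper's actual algorithm uses bilateral (color and position) kernels and high-dimensional filtering rather than a purely spatial blur.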
The resulting approximate inference algorithm is sublinear in the number of edges in the model.

Figure 1 demonstrates the benefits of the presented algorithm on two images from the MSRC-21 dataset for multi-class image segmentation and labeling. Figure 1(d) shows the results of approximate MCMC inference in fully connected CRFs on these images [17]. The MCMC procedure was run for 36 hours and only partially converged for the bottom image. We have also experimented with graph cut inference in the fully connected models [11], but it did not converge within 72 hours. In contrast, a single-threaded implementation of our algorithm produces a detailed pixel-level labeling in 0.2 seconds, as shown in Figure 1(e). A quantitative evaluation on the MSRC-21 and the PASCAL VOC 2010 datasets is provided in Section 6. To the best of our knowledge, we are the first to demonstrate efficient inference in fully connected CRF models at the pixel level.

2 The Fully Connected CRF Model

Consider a random field X defined over a set of variables {X_1, ..., X_N}. The domain of each variable is a set of labels L = {l_1, l_2, ..., l_k}. Consider also a random field I defined over variables {I_1, ..., I_N}. In our setting, I ranges over possible input images of size N and X ranges over possible pixel-level image labelings. I_j is the color vector of pixel j and X_j is the label assigned to pixel j.

A conditional random field (I, X) is characterized by a Gibbs distribution P(X|I) = \frac{1}{Z(I)} \exp(-\sum_{c \in C_G} \phi_c(X_c|I)), where G = (V, E) is a graph on X and each clique c in a set of cliques C_G in G induces a potential \phi_c [15]. The Gibbs energy of a labeling x \in L^N is E(x|I) = \sum_{c \in C_G} \phi_c(x_c|I). The maximum a posteriori (MAP) labeling of the random field is x^* = \arg\max_{x \in L^N} P(x|I).
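To make these definitions concrete, the brute-force sketch below enumerates all labelings of a toy four-variable field, computes the Gibbs distribution exp(-E)/Z, and reads off the MAP labeling as the lowest-energy configuration. The energy function here is an arbitrary stand-in for E(x|I), chosen only to illustrate the definitions, and is not the paper's model.

```python
import itertools
import numpy as np

N, labels = 4, [0, 1]  # toy field: N = 4 variables, k = 2 labels

def energy(x):
    # Toy Gibbs energy: a unary cost for label 1 plus a Potts-style
    # pairwise cost on every pair, mirroring the fully connected structure.
    unary = sum(0.5 * xi for xi in x)
    pairwise = sum(1.0 * (x[i] != x[j])
                   for i in range(N) for j in range(i + 1, N))
    return unary + pairwise

configs = list(itertools.product(labels, repeat=N))
weights = np.array([np.exp(-energy(x)) for x in configs])
Z = weights.sum()                     # partition function Z(I)
P = weights / Z                       # P(x|I) = exp(-E(x|I)) / Z(I)
x_map = configs[int(np.argmax(P))]    # MAP labeling = argmin of the energy
```

Enumeration costs k^N and is only viable for toy fields; exactly this blow-up is what motivates the approximate inference developed in the paper.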
For notational convenience we will omit the conditioning in the rest of the paper and use \psi_c(x_c) to denote \phi_c(x_c|I).

In the fully connected pairwise CRF model, G is the complete graph on X and C_G is the set of all unary and pairwise cliques. The corresponding Gibbs energy is

    E(x) = \sum_i \psi_u(x_i) + \sum_{i<j} \psi_p(x_i, x_j),    (1)
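A direct, quadratic-time evaluation of Eq. (1) for a tiny instance can be sketched as follows. The Gaussian-kernel Potts form of the pairwise potential, and the parameters `w` and `theta`, are illustrative choices that anticipate the Gaussian edge potentials named in the abstract; they are not fixed by Eq. (1) itself.

```python
import numpy as np

def gibbs_energy(x, unary, feats, w=1.0, theta=1.0):
    """Evaluate Eq. (1) for a tiny fully connected pairwise CRF.

    x     : (N,) integer labeling
    unary : (N, K) table of unary potentials psi_u
    feats : (N, D) per-pixel feature vectors f_i
    Illustrative pairwise potential (Gaussian-kernel Potts term):
      psi_p(x_i, x_j) = [x_i != x_j] * w * exp(-|f_i - f_j|^2 / (2 theta^2))
    """
    N = len(x)
    E = sum(unary[i, x[i]] for i in range(N))       # sum_i psi_u(x_i)
    for i in range(N):                               # sum_{i<j} psi_p(x_i, x_j)
        for j in range(i + 1, N):
            if x[i] != x[j]:
                d2 = np.sum((feats[i] - feats[j]) ** 2)
                E += w * np.exp(-d2 / (2 * theta ** 2))
    return E
```

The double loop makes the N^2 pairwise cost explicit; the filtering-based inference described in the introduction exists precisely to avoid ever materializing this sum.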