{"title": "Scene Segmentation with CRFs Learned from Partially Labeled Images", "book": "Advances in Neural Information Processing Systems", "page_first": 1553, "page_last": 1560, "abstract": null, "full_text": "Scene Segmentation with Conditional Random Fields\n\nLearned from Partially Labeled Images\n\nINRIA and Laboratoire Jean Kuntzmann, 655 avenue de l\u2019Europe, 38330 Montbonnot, France\n\nJakob Verbeek and Bill Triggs\n\nAbstract\n\nConditional Random Fields (CRFs) are an effective tool for a variety of different\ndata segmentation and labeling tasks including visual scene interpretation, which\nseeks to partition images into their constituent semantic-level regions and assign\nappropriate class labels to each region. For accurate labeling it is important to\ncapture the global context of the image as well as local information. We in-\ntroduce a CRF based scene labeling model that incorporates both local features\nand features aggregated over the whole image or large sections of it. Secondly,\ntraditional CRF learning requires fully labeled datasets which can be costly and\ntroublesome to produce. We introduce a method for learning CRFs from datasets\nwith many unlabeled nodes by marginalizing out the unknown labels so that the\nlog-likelihood of the known ones can be maximized by gradient ascent. Loopy\nBelief Propagation is used to approximate the marginals needed for the gradi-\nent and log-likelihood calculations and the Bethe free-energy approximation to\nthe log-likelihood is monitored to control the step size. Our experimental results\nshow that effective models can be learned from fragmentary labelings and that\nincorporating top-down aggregate features signi\ufb01cantly improves the segmenta-\ntions. 
The resulting segmentations are compared to the state of the art on three different image datasets.\n\n1 Introduction\n\nIn visual scene interpretation the goal is to assign image pixels to one of several semantic classes or scene elements, thus jointly performing segmentation and recognition. This is useful in a variety of applications ranging from keyword-based image retrieval (using the segmentation to automatically index images) to autonomous vehicle navigation [1].\n\nRandom field approaches are a popular way of modelling spatial regularities in images. Their applications range from low-level noise reduction [2] to high-level object or category recognition (this paper) and semi-automatic object segmentation [3]. Early work focused on generative modeling using Markov Random Fields, but recently Conditional Random Field (CRF) models [4] have become popular owing to their ability to directly predict the segmentation/labeling given the observed image and the ease with which arbitrary functions of the observed features can be incorporated into the training process. CRF models can be applied either at the pixel level [5, 6, 7] or at the coarser level of super-pixels or patches [8, 9, 10]. In this paper we label images at the level of small patches, using CRF models that incorporate both purely local (single-patch) feature functions and more global \u2018context capturing\u2019 feature functions that depend on aggregates of observations over the whole image or large regions.\n\nTraditional CRF training algorithms require fully labeled training data. In practice it is difficult and time-consuming to label every pixel in an image, and most of the available image interpretation datasets contain unlabeled pixels. Working at the patch level exacerbates this problem because many patches contain several different pixel-level labels. 
Our CRF training algorithm handles this by allowing partial and mixed labelings and optimizing the probability for the model segmentation to be consistent with the given labeling constraints.\n\nThe rest of the paper is organized as follows: we describe our CRF model in Section 2, present our training algorithm in Section 3, provide experimental results in Section 4, and conclude in Section 5.\n\n2 A Conditional Random Field using Local and Global Image Features\n\nWe represent images as rectangular grids of patches at a single scale, associating a hidden class label with each patch. Our CRF models incorporate 4-neighbor couplings between patch labels. The local image content of each patch is encoded using texture, color and position descriptors as in [10]. For texture we compute the 128-dimensional SIFT descriptor [11] of the patch and vector quantize it by nearest-neighbour assignment against a ks = 1000 word texton dictionary learned by k-means clustering of all patches in the training dataset. Similarly, for color we take the 36-D hue descriptor of [12] and vector quantize it against a kh = 100 word color dictionary learned from the training set. Position is encoded by overlaying the image with an m \u00d7 m grid of cells (m = 8) and using the index of the cell in which the patch falls as its position feature. Each patch is thus coded by three binary vectors with respectively ks, kh and kp = m^2 bits, each with a single bit set corresponding to the observed visual word. Our CRF observation functions are simple linear functions of these three vectors. Generatively, the three modalities are modelled as being independent given the patch label.\n\nThe naive Bayes model of the image omits the 4-neighbor couplings and thus assumes that each patch label depends only on its three observation functions. Parameter estimation reduces to trivially counting observed visual word frequencies for each label class and feature type. 
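As an illustration, the counting estimator for one feature type can be sketched as follows. This is a minimal pure-Python sketch, not the paper's implementation; the function names, Laplace smoothing and the toy label/word values are our own choices.

```python
from collections import Counter, defaultdict

def fit_naive_bayes(patches, num_words, smoothing=1.0):
    '''Estimate p(word | label) for one feature type by counting
    visual-word frequencies per class, with Laplace smoothing.'''
    counts = defaultdict(Counter)
    for label, word in patches:
        counts[label][word] += 1
    probs = {}
    for label, c in counts.items():
        total = sum(c.values()) + smoothing * num_words
        probs[label] = [(c[w] + smoothing) / total for w in range(num_words)]
    return probs

def classify(probs, word):
    # Most probable label for an isolated patch (uniform class prior assumed).
    return max(probs, key=lambda lab: probs[lab][word])
```

In the real model the same counting is done independently for the texture, color and position codebooks, since the three modalities are modelled as independent given the patch label.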
On the MSRC 9-class image dataset this model returns an average classification rate of 67.1% (see Section 4), so isolated appearance alone does not suffice for reliable patch labeling.\n\nIn recent years models based on histograms of visual words have proven very successful for image categorization (deciding whether or not the image as a whole belongs to a given category of scenes) [13]. Motivated by this, many of our models take the global image context into account by including observation functions based on image-wide histograms of the visual words of their patches. The hope is that this will help to overcome the ambiguities that arise when patches are classified in isolation. To this end, we define a conditional model for patch labels that incorporates both local patch-level features and global aggregate features. Let xi \u2208 {1, . . . , C} denote the label of patch i, let yi denote the W-dimensional concatenated binary indicator vector of its three visual words (W = ks + kh + kp), and let h denote the normalized histogram of all visual words in the image, i.e. the sum of the yi over all patches i, normalized to sum to one. The conditional probability of the label xi is then modeled as\n\np(xi = l | yi, h) \u221d exp( \u2211_{w=1}^{W} (\u03b1_{wl} y_{iw} + \u03b2_{wl} h_w) ),    (1)\n\nwhere \u03b1 and \u03b2 are W \u00d7 C matrices of coefficients to be learned. We can think of this as a multiplicative combination of a local classifier based on the patch-level observation yi and a global context or bias based on the image-wide histogram h.\n\nTo account for correlations among spatially neighboring patch labels, we add couplings between the labels of neighboring patches to the single-patch model (1). Let X denote the collection of all patch labels in the image and Y denote the collected patch features. 
Then our CRF model for the coupled patch labels is:\n\np(X|Y) \u221d exp(\u2212E(X|Y)),    E(X|Y) = \u2211_i ( \u2212\u2211_{w=1}^{W} (\u03b1_{w,xi} y_{iw} + \u03b2_{w,xi} h_w) ) + \u2211_{i\u223cj} \u03c6_{ij}(xi, xj),    (2)\n\nwhere i \u223c j denotes the set of all adjacent (4-neighbor) pairs of patches i, j. We can write E(X|Y) without explicitly including h as an argument because h is a deterministic function of Y. We have explored two forms of pairwise potential:\n\n\u03c6_{ij}(xi, xj) = \u03b3_{xi,xj} [xi \u2260 xj],    and    \u03c6_{ij}(xi, xj) = (\u03c3 + \u03c4 d_{ij}) [xi \u2260 xj],    (3)\n\nwhere [\u00b7] is one if its argument is true and zero otherwise, and d_{ij} is a similarity measure over the appearance of the patches i and j. In the first form, \u03b3_{xi,xj} is a general symmetric weight matrix that needs to be learned. The second potential is designed to favor label transitions at image locations with high contrast. As in [3] we use d_{ij} = exp(\u2212\u2016zi \u2212 zj\u2016\u00b2/(2\u03bb)), with zi \u2208 R\u00b3 denoting the average RGB value in the patch and \u03bb = \u27e8\u2016zi \u2212 zj\u2016\u00b2\u27e9, the average L2 norm between neighboring RGB values in the image. Models using the first form of potential will be denoted \u2018CRF\u03b3\u2019 and those using the second will be denoted \u2018CRF\u03c4\u2019, or \u2018CRF\u03c3\u2019 if \u03c4 has been fixed to zero. A graphical representation of the model is given in Figure 1.\n\nFigure 1: Graphical representation of the model with a single image-wide aggregate feature function denoted by h. Squares denote feature functions and circles denote variable nodes xi (here connected in a 4-neighbor grid covering the image). 
Arrows denote single-node potentials due to feature functions, and undirected edges represent pairwise potentials. The dashed lines indicate the aggregation of the single-patch observations yi into h.\n\n3 Estimating a Conditional Random Field from Partially Labeled Images\n\nConditional models p(X|Y) are usually trained by maximizing the log-likelihood of correct classification of the training data, \u2211_{n=1}^{N} log p(Xn|Yn). This requires completely labeled training data, i.e. a collection of N pairs (Xn, Yn), n = 1, . . . , N, with completely known Xn. In practice this is restrictive and it is useful to develop methods that can learn from partially labeled examples \u2013 images that include either completely unlabeled patches or ones with a restricted but nontrivial set of possible labels. Formally, we will assume that an incomplete labeling X is known to belong to an associated set of admissible labelings A, and we maximize the log-likelihood for the model to predict any labeling in A:\n\nL = log p(X \u2208 A | Y) = log \u2211_{X \u2208 A} p(X|Y) = log( \u2211_{X \u2208 A} exp(\u2212E(X|Y)) ) \u2212 log( \u2211_X exp(\u2212E(X|Y)) ).    (4)\n\nNote that the log-likelihood is the difference between the partition functions of the restricted and unrestricted labelings, p(X | Y, X \u2208 A) and p(X|Y). For completely labeled training images this reduces trivially to the standard labeled log-likelihood, while for partially labeled ones both terms of the log-likelihood are typically intractable because the set A contains O(C^k) distinct labelings X, where k is the number of unlabeled patches and C is the number of possible labels. 
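For intuition, Eq. (4) can be evaluated exactly on a model small enough to enumerate. The sketch below uses an illustrative Potts-style energy on a toy 3-node chain with 2 labels; the energy function and all names are our own choices, not the paper's image model.

```python
import math
from itertools import product

def log_sum_exp(vals):
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def marginalized_log_likelihood(energy, num_nodes, num_labels, admissible):
    '''L = log sum_{X in A} exp(-E(X)) - log sum_X exp(-E(X)), Eq. (4),
    by brute-force enumeration; admissible[i] is the set of labels
    allowed at node i (all labels for an unlabeled node).'''
    all_X = list(product(range(num_labels), repeat=num_nodes))
    log_Z = log_sum_exp([-energy(X) for X in all_X])
    log_Z_A = log_sum_exp([-energy(X) for X in all_X
                           if all(X[i] in admissible[i] for i in range(num_nodes))])
    return log_Z_A - log_Z

# Toy 3-node chain, 2 labels, Potts-style smoothing energy (sigma = 1):
def chain_energy(X):
    return sum(1.0 for i in range(len(X) - 1) if X[i] != X[i + 1])

# Nodes 0 and 2 observed as label 0, node 1 unlabeled (both labels admissible).
L = marginalized_log_likelihood(chain_energy, 3, 2, [{0}, {0, 1}, {0}])
```

The enumeration over `product(...)` is exactly the O(C^k) blow-up described above, which is why the paper resorts to the Bethe/Loopy-BP approximation for real images.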
Similarly, to find maximum likelihood parameter estimates using gradient descent we need to calculate partial derivatives with respect to each parameter \u03b8, and in general both terms are again intractable:\n\n\u2202L/\u2202\u03b8 = \u2211_X ( p(X|Y) \u2212 p(X | Y, X \u2208 A) ) \u2202E(X|Y)/\u2202\u03b8.    (5)\n\nHowever the situation is not actually much worse than the fully labeled case. In any case we need to approximate the full partition function log(\u2211_X exp(\u2212E(X|Y))) or its derivatives, and any method for doing so can also be applied to the more restricted sum log(\u2211_{X \u2208 A} exp(\u2212E(X|Y))) to give a contrast-of-partition-function based approximation. Here we will use the Bethe free energy approximation for both partition functions [14]:\n\nL \u2248 F_Bethe(p(X|Y)) \u2212 F_Bethe(p(X | Y, X \u2208 A)).    (6)\n\nThe Bethe approximation is a variational method based on approximating the complete distribution p(X|Y) as the product of its pair-wise marginals (normalized by single-node marginals) that would apply if the graph were a tree. The necessary marginals are approximated using Loopy Belief Propagation (LBP) and the log-likelihood and its gradient are then evaluated using them [14]. Here LBP is run twice (with the singleton marginals initialized from the single-node potentials), once to estimate the marginals of p(X|Y) and once for p(X | Y, X \u2208 A). We used standard undamped LBP with uniform initial messages without encountering any convergence problems. In practice the approximate gradient and objective were consistent enough to allow parameter estimation using standard conjugate gradient optimization with adaptive step lengths based on monitoring the Bethe free energy.\n\nComparison with excision of unlabeled nodes. The above training procedure requires two runs of loopy BP. 
Model                      | Build | Grass | Tree | Cow  | Sky  | Plane | Face | Car  | Bike | Pixel\nClass frequency            | 16.1% | 32.4% | 12.3%| 6.2% | 15.4%| 2.2%  | 4.4% | 9.5% | 1.5% |\nIND loc only               | 63.8  | 88.3  | 51.9 | 56.7 | 88.4 | 28.6  | 64.0 | 60.7 | 24.9 | 67.1\nIND loc+glo                | 69.2  | 88.1  | 70.1 | 69.3 | 89.1 | 44.8  | 78.1 | 67.8 | 40.8 | 74.4\nCRF\u03c3 loc only            | 75.0  | 88.6  | 72.7 | 70.5 | 94.7 | 55.5  | 83.2 | 81.4 | 69.1 | 80.7\nCRF\u03c3 loc+glo             | 73.6  | 91.1  | 82.1 | 73.6 | 95.7 | 78.3  | 89.5 | 84.5 | 81.4 | 84.9\nCRF\u03c3 loc+glo del unlab.  | 84.6  | 91.0  | 76.6 | 70.6 | 91.3 | 43.9  | 77.8 | 71.4 | 30.6 | 78.4\nCRF\u03b3 loc only            | 71.4  | 86.8  | 80.2 | 81.0 | 94.2 | 63.8  | 86.3 | 85.7 | 77.3 | 82.3\nCRF\u03b3 loc+glo             | 74.6  | 88.7  | 82.5 | 82.2 | 93.9 | 61.7  | 88.8 | 82.8 | 76.8 | 83.3\nCRF\u03c4 loc only            | 65.6  | 85.4  | 78.2 | 74.3 | 95.4 | 61.8  | 84.8 | 85.2 | 79.4 | 80.3\nCRF\u03c4 loc+glo             | 75.0  | 88.5  | 82.3 | 81.0 | 94.4 | 60.6  | 88.7 | 82.2 | 76.1 | 83.1\nSchroff et al. [15]        | 56.7  | 84.8  | 76.4 | 83.8 | 81.1 | 53.8  | 68.5 | 71.4 | 72.0 | 75.2\nPLSA-MRF [10]              | 74.0  | 88.7  | 64.4 | 77.4 | 95.7 | 92.2  | 88.8 | 81.1 | 78.7 | 82.3\n\nTable 1: Classification accuracies (%) on the 9 MSRC classes using different models; \u2018Pixel\u2019 is the per-pixel average. For each class its frequency in the ground-truth labeling is also given.\n\nA simple and often-used alternative is to discard unlabeled patches by excising nodes that correspond to unlabeled or partially labeled patches from the graph. This leaves a random field with one or more completely labeled connected components whose log-likelihood p(X\u2032|Y\u2032) we maximize directly using gradient-based methods. 
Equivalently, we can use the complete model but set all of the pair-wise potentials connected to unlabeled nodes to zero: this decouples the labels of the unlabeled nodes from the rest of the field. As a result p(X|Y) and p(X | Y, X \u2208 A) are equivalent for the unlabeled nodes and their contribution to the log-likelihood in Eq. (4) and the gradient in Eq. (5) vanishes.\n\nThe problem with this approach is that it systematically overestimates spatial coupling strengths. Looking at the training labelings in Figure 3 and Figure 4, we see that pixels near class boundaries often remain unlabeled. Since we leave patches unlabeled if they contain unlabeled pixels, label transitions are underrepresented in the training data, which causes the strength of the pairwise couplings to be greatly overestimated. In contrast, the full CRF model provides realistic estimates because it is forced to include a (fully coupled) label transition somewhere in the unlabeled region.\n\n4 Experimental Results\n\nThis section analyzes the performance of our segmentation models in detail and compares it to other existing methods. In our first set of experiments we use the Microsoft Research Cambridge (MSRC) dataset. This consists of 240 images of 213 \u00d7 320 pixels and their partial pixel-level labelings. The labelings assign pixels to one of nine classes: building, grass, tree, cow, sky, plane, face, car, and bike. About 30% of the pixels are unlabeled. Some sample images and labelings are shown in Figure 4. In our experiments we divide the dataset into 120 images for training and 120 for testing, reporting average results over 20 random train-test partitions. We used 20 \u00d7 20 pixel patches with centers at 10 pixel intervals. (For the patch size see the red disc in Figure 4.)\n\nTo obtain a labeling of the patches, pixels are assigned to the nearest patch center. 
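The pixel-to-patch assignment and the admissible-label rule for partially labeled patches can be sketched as follows. This assumes patch centers on a regular grid every 10 pixels starting at pixel 10 (the 20 x 20 patch / 10-pixel-spacing MSRC setting); the paper does not specify tie-breaking or image-border handling, so those details and all names here are our own.

```python
def nearest_patch(px, py, spacing=10, offset=10):
    '''Index (row, col) of the patch whose center is nearest to pixel
    (px, py), for centers laid out on a regular grid at
    (offset + row * spacing, offset + col * spacing); ties round up.'''
    row = int(max(0, (py - offset + spacing / 2) // spacing))
    col = int(max(0, (px - offset + spacing / 2) // spacing))
    return row, col

def admissible_labels(pixel_labels, num_classes):
    '''Admissible label set for a patch: any label seen among its
    pixels; an unlabeled pixel (None) makes every label admissible.'''
    allowed = set()
    for lab in pixel_labels:
        if lab is None:
            return set(range(num_classes))
        allowed.add(lab)
    return allowed
```

The admissible sets produced this way are exactly the per-node label constraints A used in the marginalized likelihood of Eq. (4).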
Patches are allowed to have any label seen among their pixels, with unlabeled pixels being allowed to have any label. Learning and inference take place at the patch level. To map the patch-level segmentation back to the pixel level we assign each pixel the marginal of the patch with the nearest center. (In Figure 4 the segmentations were post-processed by applying a Gaussian filter over the pixel marginals with the scale set to half the patch spacing.) The performance metrics ignore unlabeled test pixels.\n\nThe relative contributions of the different components of our model are summarized in Table 1. Models that incorporate 4-neighbor spatial couplings are denoted \u2018CRF\u2019 while ones that incorporate only (local or global) patch-level potentials are denoted \u2018IND\u2019. Models that include global aggregate features are denoted \u2018loc+glo\u2019, while ones that include only local patch-level features are denoted \u2018loc only\u2019.\n\n(The MSRC dataset is available from http://research.microsoft.com/vision/cambridge/recognition.)\n\nFigure 2: Classification accuracy as a function of the aggregation fineness c, for the \u2018IND\u2019 (individual patch) classifier using a single training and test set. Aggregate features (AF) were computed in each cell of a c \u00d7 c image partition. Results are given for models with no AFs (solid line), with AFs of a single c (dotted curve), with AFs on grids 1\u00d71 up to c\u00d7c (solid curve), and with AFs on grids c \u00d7 c up to 10 \u00d7 10 (dashed curve).\n\nBenefits of aggregate features. The first main conclusion is that including global aggregate features helps, for example improving the average classification rate on the MSRC dataset from 67.1% to 74.4% for the spatially uncoupled \u2018IND\u2019 model and from 80.7% to 84.9% for the \u2018CRF\u03c3\u2019 spatial model.\n\nThe idea of aggregation can be generalized to scales smaller than the complete image. 
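A sketch of the per-cell aggregate histograms over a c x c image partition; with c = 1 this reduces to the image-wide histogram h of Eq. (1). The mapping of patch grid positions to cells and all names are our own illustrative choices.

```python
def cell_histograms(patch_words, grid_rows, grid_cols, c, vocab_size):
    '''Normalized visual-word histogram for each cell of a c x c image
    partition; patch_words[r][s] is the visual word of patch (r, s).
    Each patch contributes to the cell containing its grid position.'''
    hists = [[[0.0] * vocab_size for _ in range(c)] for _ in range(c)]
    for r in range(grid_rows):
        for s in range(grid_cols):
            ci = min(c - 1, r * c // grid_rows)
            cj = min(c - 1, s * c // grid_cols)
            hists[ci][cj][patch_words[r][s]] += 1.0
    for row in hists:
        for h in row:
            total = sum(h)
            if total > 0:
                for w in range(vocab_size):
                    h[w] /= total
    return hists
```

Each patch in a cell then gets an extra energy term based on that cell's histogram, in the same way as for the image-wide histogram in Eq. (1).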
We experimented with dividing the image into c \u00d7 c grids for a range of values of c. In each cell of the grid we compute a separate histogram over the visual words, and for each patch in the cell we include an energy term based on this histogram in the same way as for the image-wide histogram in Eq. (1). Figure 2 shows how the performance of the individual patch classifier depends on the use of aggregate features. From the dotted curve in the figure we see that although using larger cells to aggregate features is generally more informative, even fine 10\u00d710 subdivisions (containing only 6\u201312 patches per cell) provide a significant performance increase. Furthermore, including aggregates computed at several different scales does help, but the performance increment is small compared to the gain obtained with just image-wide aggregates. Therefore we included only image-wide aggregates in the subsequent experiments.\n\nBenefits of including spatial coupling. The second main conclusion from Table 1 is that including spatial couplings (pairwise CRF potentials) helps, increasing the accuracy of \u2018CRF\u03c3\u2019 relative to \u2018IND\u2019 by 10.5% for \u2018loc+glo\u2019 and by 13.6% for \u2018loc only\u2019. The improvement is particularly noticeable for rare classes when global aggregate features are not included: in this case the single-node potentials are less informative and frequent classes tend to be unduly favored due to their large a priori probability.\n\nWhen the image-wide aggregate features are included (\u2018loc+glo\u2019), the simplest pairwise potential, the \u2018CRF\u03c3\u2019 Potts model, works better than the more general models \u2018CRF\u03b3\u2019 and \u2018CRF\u03c4\u2019, while if only the local features are included (\u2018loc only\u2019), the class-dependent pairwise potential \u2018CRF\u03b3\u2019 works best. 
The performance increment from global features is smallest for \u2018CRF\u03b3\u2019, the model that also includes local contextual information. The overall influence of the local label transition preferences expressed in \u2018CRF\u03b3\u2019 appears to be similar to that of the global contextual information provided by image-wide aggregate features.\n\nBenefits of training by marginalizing partial labelings. Our third main conclusion from Table 1 is that our marginalization-based training method for handling missing labels is superior to the common heuristic of deleting any unlabeled patches. Learning a \u2018CRF\u03c3 loc+glo\u2019 model by removing all unlabeled patches (\u2018del unlabeled\u2019 in the table) leads to an estimate \u03c3 \u2248 11.5, whereas the maximum likelihood estimate of (4) leads to \u03c3 \u2248 1.9. In particular, with \u2018delete unlabeled\u2019 training the accuracy of the model drops significantly for the classes plane and bike, both of which have a relatively small area compared to their boundary length and thus many partially labeled patches. It is interesting to note that even though \u03c3 has been severely over-estimated in the \u2018delete unlabeled\u2019 model, the CRF still improves over the individual patch classification obtained with \u2018IND loc+glo\u2019 for most classes, albeit not for bike and only marginally for plane.\n\nRecognition as a function of the amount of labeling. We now consider how the performance drops as the fraction of labeled pixels decreases. We applied a morphological erosion operator to the manual annotations, where we varied the size of the disk-shaped structuring element from 0, 5, . . . , 50.\n\nFigure 3: Recognition performance when learning from increasingly eroded label images (left). Example image with its original annotation, and erosions thereof with disks of size 10 and 20 (right).\n\nIn this way we obtain a series of annotations that resemble increasingly sloppy manual annotations, see Figure 3. The figure also shows the recognition performance of \u2018CRF\u03c3 loc+glo\u2019 and \u2018IND loc+glo\u2019 as a function of the fraction of labeled pixels. In addition to its superior performance when trained on well labeled images, the CRF maintains its performance better as the labelling becomes sparser. Note that \u2018CRF\u03c3 loc+glo\u2019 learned from label images eroded with a disc of radius 30 (only 28% of pixels labeled) still outperforms \u2018IND loc+glo\u2019 learned from the original labeling (71% of pixels labeled). Also, the CRF actually performs better with 5 pixels of erosion than with the original labeling, presumably because ambiguities related to training patches with mixed pixel labels are reduced.\n\nComparison with related work. Table 1 also compares our recognition results on the MSRC dataset with those reported in [15, 10]. Our CRF model clearly outperforms the approach of [15], which uses aggregate features of an optimized scale but lacks spatial coupling in a random field, giving a performance very similar to that of our \u2018IND loc+glo\u2019 model. Our CRF model also performs slightly better than our generative approach of [10], which is based on the same feature set but differs in its implementation of image-wide contextual information ([10] also used a 90%\u201310% training-test partition, not the 50%\u201350% used here).\n\nUsing the Sowerby dataset and a subset of the Corel dataset we also compare our model with two CRF models that operate at the pixel level. 
The Sowerby dataset consists of 104 images of 96 \u00d7 64 pixels of urban and rural scenes labeled with 7 different classes: sky, vegetation, road marking, road surface, building, street objects and cars. The subset of the Corel dataset contains 100 images of 180 \u00d7 120 pixels of natural scenes, also labeled with 7 classes: rhino/hippo, polar bear, water, snow, vegetation, ground, and sky. Here we used 10 \u00d7 10 pixel patches, with a spacing of respectively 2 and 5 pixels for the Sowerby and Corel datasets. The other parameters were kept as before. Table 2 compares the recognition accuracies averaged over pixels for our CRF and independent patch models to the results reported on these datasets for TextonBoost [7] and the multi-scale CRF model of [5]. In this table \u2018IND\u2019 stands for results obtained when only the single-node potentials are used in the respective models, disregarding the spatial random field couplings. The total training time and test time per image are listed for the full CRF models. The results show that on these datasets our model performs comparably to pixel-level approaches while being much faster to train and test, since it operates at patch level and uses standard features as opposed to the boosting procedure of [7].\n\n5 Conclusion\n\nWe presented several image-patch-level CRF models for semantic image labeling that incorporate both local patch-level observations and more global contextual features based on aggregates of observations at several scales. We showed that partially labeled training images could be handled by maximizing the total likelihood of the image segmentations that comply with the partial labeling, using Loopy BP and Bethe free-energy approximations for the calculations. This allowed us to learn effective CRF models from images where only a small fraction of the pixels were labeled and class transitions were not observed. 
Experiments on the MSRC dataset showed that including image-wide aggregate features is very helpful, while including additional aggregates at finer scales gives relatively little further improvement. Comparative experiments showed that our patch-level CRFs have comparable performance to state-of-the-art pixel-level models while being much more efficient because the number of patches is much smaller than the number of pixels.\n\n                 |        Sowerby               |        Corel\n                 | IND    CRF   | train | test  | IND    CRF   | train | test\nTextonBoost [7]  | 85.6%  88.6% | 5h    | 10s   | 68.4%  74.6% | 12h   | 30s\nHe et al. [5]    | 82.4%  89.5% | Gibbs | Gibbs | 66.9%  80.0% | Gibbs | Gibbs\nCRF\u03c3 loc+glo   | 86.0%  87.4% | 20min | 5s    | 66.9%  74.6% | 15min | 3s\n\nTable 2: Recognition accuracies (IND = single-node potentials only, CRF = full model) and training/test speeds on the Sowerby and Corel datasets.\n\nReferences\n\n[1] P. Jansen, W. van der Mark, W. van den Heuvel, and F. Groen. Colour based off-road environment and terrain type classification. In Proceedings of the IEEE Conference on Intelligent Transportation Systems, pages 216\u2013221, 2005.\n\n[2] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):712\u2013741, 1984.\n\n[3] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics, 23(3):309\u2013314, 2004.\n\n[4] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning, volume 18, pages 282\u2013289, 2001.\n\n[5] X. He, R. Zemel, and M. Carreira-Perpi\u00f1\u00e1n. Multiscale conditional random fields for image labelling. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 695\u2013702, 2004.\n\n[6] S. Kumar and M. Hebert. A hierarchical field framework for unified context-based classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1284\u20131291, 2005.\n\n[7] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost: joint appearance, shape and context modeling for multi-class object recognition and segmentation. In Proceedings of the European Conference on Computer Vision, pages 1\u201315, 2006.\n\n[8] A. Quattoni, M. Collins, and T. Darrell. Conditional random fields for object recognition. In Advances in Neural Information Processing Systems, volume 17, pages 1097\u20131104, 2005.\n\n[9] P. Carbonetto, G. Dork\u00f3, C. Schmid, H. K\u00fcck, and N. de Freitas. A semi-supervised learning approach to object recognition with spatial integration of local features and segmentation cues. In Toward Category-Level Object Recognition, pages 277\u2013300, 2006.\n\n[10] J. Verbeek and B. Triggs. Region classification with Markov field aspect models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.\n\n[11] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91\u2013110, 2004.\n\n[12] J. van de Weijer and C. Schmid. Coloring local feature extraction. In Proceedings of the European Conference on Computer Vision, pages 334\u2013348, 2006.\n\n[13] The 2005 PASCAL visual object classes challenge. In F. d\u2019Alche-Buc, I. Dagan, and J. Quinonero, editors, Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification, and Recognizing Textual Entailment, First PASCAL Machine Learning Challenges Workshop. Springer, 2006.\n\n[14] J. Yedidia, W. Freeman, and Y. Weiss. Understanding belief propagation and its generalizations. 
Technical Report TR-2001-22, Mitsubishi Electric Research Laboratories, 2001.\n\n[15] F. Schroff, A. Criminisi, and A. Zisserman. Single-histogram class models for image segmentation. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, 2006.\n\nFigure 4: Samples from the MSRC, Sowerby, and Corel datasets with segmentation and labeling.\n", "award": [], "sourceid": 243, "authors": [{"given_name": "Bill", "family_name": "Triggs", "institution": null}, {"given_name": "Jakob", "family_name": "Verbeek", "institution": null}]}