{"title": "Pylon Model for Semantic Segmentation", "book": "Advances in Neural Information Processing Systems", "page_first": 1485, "page_last": 1493, "abstract": "Graph cut optimization is one of the standard workhorses of image segmentation since for binary random field representations of the image, it gives globally optimal results and there are efficient polynomial time implementations. Often, the random field is applied over a flat partitioning of the image into non-intersecting elements, such as pixels or super-pixels. In the paper we show that if, instead of a flat partitioning, the image is represented by a hierarchical segmentation tree, then the resulting energy combining unary and boundary terms can still be optimized using graph cut (with all the corresponding benefits of global optimality and efficiency). As a result of such inference, the image gets partitioned into a set of segments that may come from different layers of the tree. We apply this formulation, which we call the pylon model, to the task of semantic segmentation where the goal is to separate an image into areas belonging to different semantic classes. The experiments highlight the advantage of inference on a segmentation tree (over a flat partitioning) and demonstrate that the optimization in the pylon model is able to flexibly choose the level of segmentation across the image. Overall, the proposed system has superior segmentation accuracy on several datasets (Graz-02, Stanford background) compared to previously suggested approaches.", "full_text": "A Pylon Model for Semantic Segmentation\n\nVictor Lempitsky\n\nAndrea Vedaldi\n\nVisual Geometry Group, University of Oxford\u2217\n{vilem,vedaldi,az}@robots.ox.ac.uk\n\nAndrew Zisserman\n\nAbstract\n\nGraph cut optimization is one of the standard workhorses of image segmentation since for\nbinary random \ufb01eld representations of the image, it gives globally optimal results and there\nare ef\ufb01cient polynomial time implementations. 
Often, the random \ufb01eld is applied over a\n\ufb02at partitioning of the image into non-intersecting elements, such as pixels or super-pixels.\nIn the paper we show that if, instead of a \ufb02at partitioning, the image is represented by a\nhierarchical segmentation tree, then the resulting energy combining unary and boundary\nterms can still be optimized using graph cut (with all the corresponding bene\ufb01ts of global\noptimality and ef\ufb01ciency). As a result of such inference, the image gets partitioned into a\nset of segments that may come from different layers of the tree.\nWe apply this formulation, which we call the pylon model, to the task of semantic seg-\nmentation where the goal is to separate an image into areas belonging to different semantic\nclasses. The experiments highlight the advantage of inference on a segmentation tree (over\na \ufb02at partitioning) and demonstrate that the optimization in the pylon model is able to \ufb02ex-\nibly choose the level of segmentation across the image. Overall, the proposed system has\nsuperior segmentation accuracy on several datasets (Graz-02, Stanford background) com-\npared to previously suggested approaches.\n\n1\n\nIntroduction\n\nSemantic segmentation (i.e. the task of assigning each pixel of a photograph to a semantic class label) is\noften tackled via a \u201c\ufb02at\u201d conditional random \ufb01eld model [10, 29]. This model considers the subdivision\nof an image into small non-overlapping elements (pixels or small superpixels). It then learns and evaluates\nthe likelihood of each element as belonging to one of the semantic classes (unary terms) and combine these\nlikelihoods with pairwise terms that encourage neighboring elements to take the same labels, and in this way\npropagates the information from elements that are certain about their labels to uncertain ones. 
The appeal of\nthe \ufb02at CRF model is the availability of ef\ufb01cient MAP inference based on graph cut [7], which is exact for\ntwo-label problems with submodular pairwise terms [4, 16] and gets very close to global optima for many\npractical cases of multi-label segmentation [31].\nThe main limitation of the \ufb02at CRF model is that since each superpixel takes only one semantic label, super-\npixels have to be small, so that they do not straddle class boundaries too often. Thus, the amount of visual\ninformation inside the superpixel is limited. The best performing CRF models therefore consider wider local\ncontext around each superpixel, but as the object and class boundaries are not known in advance, the support\narea over which such context information is aggregated is not adapted. For this reason, such context-based\ndescriptors have limited repeatability and may not allow reliable classi\ufb01cation. This is, in fact, a manifesta-\ntion of a well-known chicken-and-egg problem between segmentation and recognition (given spatial support\nbased on proper segmentation, recognition is easy [20], but to get the proper segmentation prior recognition\nis needed).\nRecently, several semantic segmentation methods that explicitly interleave segmentation and recognition have\nbeen proposed. Such methods [8, 11, 18] consider a large pool of overlapping segments that are much bigger\n\u2217Victor Lempitsky is currently with Yandex, Moscow. This work was supported by ERC grant VisRec no. 228180\n\nand by the PASCAL Network of Excellence.\n\n1\n\n\fFigure 1: Pool-based binary segmentation. For binary semantic segmentation, the pylon model is able\nto \ufb01nd a globally optimal subset of segments and their labels (bottom row), while optimizing unary and\nboundary costs. Here we show a result of such inference for images from each of the Graz-02 [23] datasets\n(people and bikes \u2013 left, cars \u2013 right).\n\nthan superpixels in \ufb02at CRF approaches. 
These methods then perform joint optimization over the choice of\nseveral non-overlapping segments from the pool and the semantic labels of the chosen segments. As a result,\nin the ideal case, a photograph is pieced from a limited number of large segments, each of which can be\nunambiguously assigned to one of the semantic classes, based on the information contained in it. Essentially,\nthe photograph is then \u201cexplained\u201d by these segments that often correspond to objects or their parts. Such\nscene explanation can then be used as a basis for more high-level scene understanding than just semantic\nsegmentation.\nIn this work, we present a pylon model for semantic segmentation which largely follows the pool-based\nsemantic segmentation approach from [8, 11, 18]. Our goal is to overcome the main problem of existing\npool-based approaches, which is the fact that they all face very hard optimization problems and tackle them\nwith rather inexact and slow algorithms (greedy local moves for [11], loose LP relaxations in [8, 18]). Our\naim is to integrate the exact and ef\ufb01cient inference employed by \ufb02at CRF methods with the strong scene\ninterpretation properties of the pool-based approaches.\nLike previous pool-based approaches, the pylon model \u201cexplains\u201d each image as a union of non-intersecting\nsegments. We achieve the tractability of the inference by restricting the pool of segments to come from a\nsegmentation tree. Segmentation trees have been investigated for a long time, and several ef\ufb01cient algorithms\nhave been developed [1, 2, 38, 27]. Furthermore, any binary unsupervised algorithm (e.g. normalized cut\n[28]) can be used to obtain a segmentation tree via iterative application. As segmentation trees re\ufb02ect the\nhierarchical nature of visual scenes, algorithms based on segmentation-trees achieved very impressive results\nfor visual-recognition tasks [13, 22, 34]. 
For our purpose, the important property of tree-based segment pool\nis that each image region is covered by segments of very different sizes and there is a good chance that one\nsuch segment does not straddle object boundaries but is still big enough to contain enough visual information\nfor a reliable class identi\ufb01cation.\nInference in pylons optimizes the sum of the real-valued costs of the segments selected to explain the im-\nage. Similarly to random \ufb01eld approaches, pylons also include spatial smoothness terms that encourage\nthe boundary compactness of the resulting segmentations (this could be e.g. the popular contrast-dependent\nPotts-potentials). Such boundary terms often remedy the imperfections of segmentation trees by propagating\nthe information from big segments that \ufb01t within object boundaries to smaller ones that have to supplement\nthe big segments to \ufb01t class boundaries accurately.\nThe most important advantage of pylons over previous pool-based methods [8, 11, 18] is the tractability of\ninference. Similarly to \ufb02at CRFs, in the two-class (e.g. foreground-background) case, the globally optimal\nset of segments can be found exactly and ef\ufb01ciently via graph cut (Figure 1). Such inference can then be\nextended to multi-label problems via an alpha-expansion procedure [7] that gives solutions close to a global\noptimum. Effectively, inference in pylons is as \u201ceasy\u201d as in the \ufb02at CRF approach. We then utilize such a\n\u201cfree lunch\u201d to achieve a better than state-of-the-art performance on several datasets (Graz-02 datasets[23]\nfor binary label segmentations, Stanford background dataset [11] for multi-label segmentation). 
At least\nin part, the excellent performance of our system is explained by the fact that we can learn both unary and\nboundary term parameters within a standard max-margin approach developed for CRFs [32, 33, 35], which is\n\n2\n\n\fnot easily achievable with the approximate and slow inference in previous pool-based methods [17]. We also\ndemonstrate that the pylon model achieves higher segmentation accuracy than \ufb02at CRFs, or non-loopy pylon\nmodels without boundary terms, given the same features and the same learning procedure.\nOther related work. The use of segmentation trees for semantic segmentation has a long history. The\nolder works of [5] and [9] as well as a recent work [22] use a sequence of top-down inference processes\non a segmentation tree to infer the class labels at the leaf level. Our work is probably more related to the\napproaches performing MAP estimation in tree-structured/hierarchical random \ufb01elds. For this, Awasthi et\nal. [3], Reynolds and Murphy [25] and Plath et al. [24] use pure tree-based random \ufb01elds without boundary\nterms, while Schnitzspan et al. [26] and Ladicky et al. [19] incorporate boundary terms and perform semantic\nsegmentation at different levels of granularity. The weak consistency between levels is then enforced with\nhigher-order potentials. Overall, our philosophy is different from all these works as we obtain an explicit\nscene interpretation as a union of few non-intersecting segments, while the tree-structured/hierarchical CRF\nworks assign class labels and aggregate unary terms over all segments in the tree/hierarchy. Our inference\nhowever is similar to that of [19].\nIn fact, while below we demonstrate how inference in pylons can be\nreduced to submodular pseudo-boolean quadratic optimization, it can also be reduced to the hierarchical\nassociative CRFs introduced in [19]. 
We also note that another interesting approach to joint segmentation and classification based on this class of CRFs has been recently proposed by Singaraju and Vidal [30].

2 Random fields, Pool-based models, and Pylons

We now derive a joint framework covering the flat random field models, the preceding pool-based models, and the pylon model introduced in this paper.
We consider a semantic segmentation problem for an image I and a set of K semantic classes, so that each part of the image domain has to be assigned to one of the classes. Let S = {Si | i = 1 . . . N} be a pool of segments, i.e. a set of sub-regions of the image domain. For a traditional (flat) random field approach, this pool comes from an image partitioned into a set of small non-intersecting segments (or pixels); in the case of the pool-based models this is an arbitrary set of many segments coming from multiple flat segmentations [18] or explored via local moves [11]. In the pylon case, S contains all segments in a segmentation tree computed for an image I.
A segmentation f then assigns each Si an integer label fi within the range from 0 to K. A special label fi = 0 means that the segment is not included in the segmentation, while the remaining labels mean that the segment participates in the explanation of the scene and is assigned to the semantic class fi. Not all labelings are consistent and correspond to valid segmentations. First of all, the completeness constraint requires that each image pixel p is covered by a segment with non-zero label:

∀p ∈ I, ∃i : Si ∋ p, fi > 0 .    (1)

Second, the non-overlap constraint requires that overlapping segments cannot both take non-zero labels:

∀i ≠ j : Si ∩ Sj ≠ ∅ ⇒ fi · fj = 0 .    (2)

For the flat random field case, completeness means that zero labels are prohibited and each segment has to be assigned some non-zero label. For pool-based methods and the pylon model, this is not the case, as each pixel has a multitude of segments in S covering it. 
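As a toy illustration (a hypothetical helper, not code from the paper; segment and function names are made up), both validity requirements on a labeling f can be checked directly for a small segment pool:

```python
# Hypothetical sketch: checking the completeness and non-overlap
# constraints on a labeling f over a segment pool. Segments are sets of
# pixel ids; f[i] == 0 means "segment i not selected", f[i] > 0 is its
# semantic class label.

def is_valid_labeling(segments, f, pixels):
    """segments: list of frozensets of pixel ids; f: list of int labels."""
    # Completeness (1): every pixel is covered by a selected segment.
    covered = set()
    for S, fi in zip(segments, f):
        if fi > 0:
            covered |= S
    if covered != set(pixels):
        return False
    # Non-overlap (2): two overlapping segments cannot both be selected.
    chosen = [S for S, fi in zip(segments, f) if fi > 0]
    for a in range(len(chosen)):
        for b in range(a + 1, len(chosen)):
            if chosen[a] & chosen[b]:
                return False
    return True
```

For a toy pool with a root {0,1,2,3} and two leaves {0,1} and {2,3}, selecting the two leaves with any non-zero labels is valid, while selecting the root together with a leaf violates (2), and selecting a single leaf violates (1).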
Thus, zero labels are allowed. Furthermore, non-zero labels are controlled by the non-overlap constraint (2), which requires that overlapping segments cannot simultaneously take non-zero labels. Once again, constraint (2) is not needed for flat CRFs, as their pools do not contain overlapping segments. It is, however, non-trivial for the existing pool-based models and for the pylon model, where overlapping (nested) segments exist. Under the constraints (1) and (2), each pixel p in the image is covered by exactly one segment with non-zero label, and we denote the index of this segment by i(p). The semantic label f(p) of the pixel p is then determined as fi(p).
To formulate the energy function, we define a set of real-valued unary terms Ui(fi), where each Ui specifies the cost of including the segment Si into the segmentation with the label fi > 0. Furthermore, we associate a non-negative boundary cost Vpq with any pair of pixels adjacent in the image domain, (p, q) ∈ N. For any segmentation f we then define the boundary cost as the sum of the costs Vpq over the adjacent pixel pairs (p, q) that straddle the class boundaries induced by this segmentation (i.e. (p, q) ∈ N : f(p) ≠ f(q)). In other words, the boundary terms are accumulated along the boundaries between pool segments that are assigned different non-zero semantic labels.
Overall, the energy that we are interested in is defined as:

E(f) = ∑_{i ∈ 1..N : fi > 0} Ui(fi) + ∑_{(p,q) ∈ N : f(p) ≠ f(q)} Vpq    (3)

Figure 2: Inference in the pylon model (best viewed in color): a tree segmentation of an image (left) and a corresponding graphical model for the 2-class pylon (right). Each pair of nodes in the graphical model corresponds to a segment in the segmentation tree, while each edge corresponds to a pairwise term in the pseudo-boolean energy (9)–(10). 
Blue edges (4) enforce the segment cost potentials (U-terms) as well as\nconsistency of x (children of a shaded node have to be shaded). Red edges (6) and magenta edges (7)\nenforce non-overlap and completeness. Green edges (8) encode boundary terms. Shading gives an example\nvalid labeling for x variables (xt\ni=1 are shaded). Left \u2013 the corresponding semantic segmentation on the\nsegmentation tree consisting of three segments is highlighted.\n\nand we wish to minimize this subject to the constraints (1) and (2). The energy (3) contains the contribution\nof unary terms only from those segments that are selected to explain the image (fi > 0).\nNote that the energy functional has the same form as that of a traditional random \ufb01eld (with weighted Potts\nboundary terms). The pool-based model in [18] is also similar, but lacks the boundary terms. It is well-known\nthat for \ufb02at random \ufb01elds, the optimal segmentation f in the binary case K = 2 with Vpq \u2265 0 can be found\nwith graph cut [7, 12, 16]. Furthermore, for K > 2 one can get very close to global optimum (within a factor\n2 with guarantee [7], but much closer in practice [31]) by applying graph cut-based alpha-expansions [7].\nFor pylons as well as for the pool-based approaches [11, 18], the segment pool is much richer. As a con-\nsequence, the constraints (1) and (2) that are trivial to enforce in the case of the \ufb02at random \ufb01eld, become\nnon-trivial. In the next section, we demonstrate that in the case of a tree-based pool of segments (pylon\nmodel), one still can \ufb01nd the globally optimal f in the case K = 2 and Vpq \u2265 0, and use alpha-expansions in\nthe case K > 2.\n1-class model. Before discussing the inference and learning in the pylon model, we brie\ufb02y introduce a\nmodi\ufb01cation of the generic model derived above, which we call a 1-class model. A 1-class model can be\nused for semantic foreground-background segmentation tasks (e.g. 
segmenting out people in an image). The 2-class model defined in (1)–(3) for K = 2 can of course also be used for this purpose. The difference is that the 1-class model treats the foreground and background in an asymmetric way. Namely, in the 1-class case the labels fi can only take the values 0 or 1 (i.e. K = 1), and the completeness constraint (1) is omitted. As such, each segmentation f defines the foreground as the set of segments with fi = 1, and the semantic label of a pixel f(p) is defined to be 1 if p belongs to some segment Si with fi = 1, and f(p) = 0 otherwise. In the 1-class case, each segment thus has a single unary cost Ui = Ui(1) associated with it. The energy remains the same as in (3).
For the flat random field case, the 1-class and 2-class models are equivalent (one can just define U^1class_i = U^2class_i(2) − U^2class_i(1) to get the same energy up to an additive constant). For pool-based models and pylons, this is no longer the case, and the 1-class model is non-trivially different from the 2-class model. Intuitively, a 1-class model only "explains" the foreground as a union of segments, while leaving the background part "unexplained". As shown in our experiments, this may be beneficial, e.g. when the visual appearance of the foreground is more repeatable than that of the background.

3 Inference in pylon models

Two-class case. We first demonstrate how the energy (3) can be globally minimized subject to (1)–(2) in the case of a tree-based pool and K = 2. Later, we will outline inference in the case K > 2 and in the case of a 1-class model (K = 1). For each segment number i = 1..N we define p(i) to be the number of its parent segment in the tree. We further assume that the first L segments correspond to leaves of the segmentation tree and that the last segment SN is the root (i.e. the entire image).
For each segment i, we introduce two binary variables x^1_i and x^2_i indicating whether the segment falls entirely into the image area assigned to class 1 or to class 2. The exact semantic meaning of these variables, and their relation to the variables f, is as follows: x^t_i equals 1 if and only if one of its ancestors j up the tree (including the segment i itself) has the label fj = t. We now re-express the constraints (1)–(2) and the energy (3) via a real-valued (i.e. pseudo-boolean) energy of the newly-introduced variables that involves pairwise terms only (Figure 2).
First of all, the definition of the x variables implies that if x^t_i is zero, then x^t_{p(i)} has to be zero as well. Furthermore, x^t_i = 1 and x^t_{p(i)} = 0 implies that the segment i has the label fi = t (incurring the cost Ui(t) in (3)). These two conditions can be expressed with the following bottom-up pairwise term on the variables x^t_i and x^t_{p(i)} (one term for each t = 1, 2):

E^t_i(0, 0) = 0,  E^t_i(1, 1) = 0,  E^t_i(1, 0) = Ui(t),  E^t_i(0, 1) = +∞ .    (4)

These potentials express almost all unary terms in (3) except for the unary term of the root node, which can be expressed as a sum of two unary terms on the new variables (one term for each t = 1, 2):

E^t_N(0) = 0,  E^t_N(1) = UN(t) .    (5)

The non-overlap constraint (2) can be enforced by demanding that at most one of x^1_i and x^2_i can be 1 at the same time (as otherwise there are two overlapping segments with non-zero f-variables), introducing the following exclusion pairwise term on the variables x^1_i and x^2_i:

E^EXC_i(0, 0) = E^EXC_i(0, 1) = E^EXC_i(1, 0) = 0,  E^EXC_i(1, 1) = +∞ .    (6)

The completeness constraint (1) can be expressed by demanding that each leaf segment is covered by either an ancestor segment with label 1 or one with label 2. Consequently, at each leaf node at least one of x^1_i and x^2_i has to be 1, hence the following pairwise completeness potential for all leaf segments i = 1..L:

E^CPL_i(0, 0) = +∞,  E^CPL_i(0, 1) = E^CPL_i(1, 0) = E^CPL_i(1, 1) = 0 .    (7)

Finally, the only unexpressed part of the optimization problem is the boundary term in (3). To express it, we consider the set P of pairs of indices of adjacent leaf segments. For each such pair (i, j) of leaf segments (Si, Sj), the boundary cost Vij between Si and Sj is defined as the sum of pixel-level pairwise costs, Vij = ∑ Vpq, over all pairs of adjacent pixels (p, q) ∈ N such that p ∈ Si and q ∈ Sj or vice versa (i.e. p ∈ Sj and q ∈ Si). The boundary terms can then be expressed with pairwise terms over the variables x^1_i and x^1_j for all (i, j) ∈ P:

E^BND_ij(0, 0) = E^BND_ij(1, 1) = 0,  E^BND_ij(1, 0) = E^BND_ij(0, 1) = Vij .    (8)

Overall, the constrained minimization problem (1)–(3) for the variables f is expressed as the unconstrained minimization of the following energy of the boolean variables x^1, x^2:

E(x^1, x^2) = ∑_{t=1,2} ∑_{i=1}^{N−1} E^t_i(x^t_i, x^t_{p(i)}) + ∑_{t=1,2} E^t_N(x^t_N) + ∑_{(i,j)∈P} E^BND_ij(x^1_i, x^1_j)    (9)

+ ∑_{i=1}^{N} E^EXC_i(x^1_i, x^2_i) + ∑_{i=1}^{L} E^CPL_i(x^1_i, x^2_i)    (10)

The energy (9)–(10) contains two parts. The pairwise terms in the first part (9) involve only such pairs of variables that both come either from the x^1 set or from the x^2 set. All the pairwise terms in (9) are submodular, i.e. they obey E(0, 0) + E(1, 1) ≤ E(0, 1) + E(1, 0). 
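As a toy illustration (a hypothetical sketch, not the authors' code; the unary costs and boundary weight below are made up), the pairwise tables (4) and (8) can be built for a small tree with two leaves and a root, and the submodularity of the part-(9) terms checked directly:

```python
# Toy sketch: pairwise tables of part (9) for a tree with leaves 1, 2 and
# root 3, and a direct check of submodularity. The root term (5) is unary
# and needs no check. All costs here are invented for illustration.
import math

INF = math.inf
U = {1: {1: 2.0, 2: 5.0}, 2: {1: 4.0, 2: 1.0}, 3: {1: 3.0, 2: 3.0}}  # U_i(t)
V12 = 0.5  # boundary cost between the adjacent leaves 1 and 2

def E_tree(i, t):
    # Term (4) on (x^t_i, x^t_p(i)): forbids child=0 under parent=1 and
    # charges U_i(t) when segment i is the highest selected ancestor.
    return {(0, 0): 0.0, (1, 1): 0.0, (1, 0): U[i][t], (0, 1): INF}

def E_bnd(v):
    # Term (8) on (x^1_i, x^1_j) for adjacent leaves.
    return {(0, 0): 0.0, (1, 1): 0.0, (1, 0): v, (0, 1): v}

def submodular(T):
    return T[(0, 0)] + T[(1, 1)] <= T[(0, 1)] + T[(1, 0)]

terms9 = [E_tree(i, t) for i in (1, 2) for t in (1, 2)] + [E_bnd(V12)]
assert all(submodular(T) for T in terms9)
```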
The pairwise terms in the second part (10) involve only such pairs of variables where one term comes from the x^1 set and the other from the x^2 set. All terms in (10) are supermodular, i.e. they obey E(0, 0) + E(1, 1) ≥ E(0, 1) + E(1, 0).
Thus, in the energy (9)–(10), submodular terms act within the x^1 and x^2 sets of variables, and supermodular terms act only across the two sets. One can then perform the variable substitution x^2 = 1 − x̃^2 and obtain a new energy function E(x^1, x̃^2). Under this substitution, the terms (9) remain submodular, while the terms (10) change from supermodular to submodular in the new variables. As a result, one gets a pseudo-boolean pairwise energy with submodular terms only, which can therefore be minimized exactly, in time low-polynomial in N, through a graph cut in a specially constructed graph [4, 6, 16]. Given the optimal values of x^1 and x̃^2, it is trivial to infer the optimal values of x^2 and, ultimately, of the f variables (for the latter step, one goes up the tree and sets fi = t whenever x^t_i = 1 and x^t_{p(i)} = 0).

Figure 3: Several examples from the Stanford background dataset [11], where the ability of the pylon model (middle row) to choose big enough segments allowed it to obtain better semantic segmentation compared to a flat CRF defined on leaf segments (bottom row). Colors: grey=sky, olive=tree, purple=road, green=grass, blue=water, red=building, orange=foreground.

One-class case. Inference in the one-class case is simpler than in the two-class case. As one may expect, it is sufficient to introduce just a single set of binary variables {x^1_i} and to omit the pairwise terms (6) and (7) altogether. 
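Terms (6) and (7) are exactly the tables made submodular by the substitution x^2 = 1 − x̃^2 described above; this can be checked numerically on a toy example (hypothetical code, with a large finite constant standing in for +∞):

```python
# Toy check of the variable flip: the exclusion term (6) and completeness
# term (7) are supermodular over (x^1_i, x^2_i), but become submodular
# once the second argument is flipped. BIG is a stand-in for +infinity.
BIG = 1e9

EXC = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): BIG}   # term (6)
CPL = {(0, 0): BIG, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.0}   # term (7)

def flip_second(T):
    # Re-express T(a, b) as T~(a, b~) with b~ = 1 - b.
    return {(a, 1 - b): v for (a, b), v in T.items()}

def submodular(T):
    return T[(0, 0)] + T[(1, 1)] <= T[(0, 1)] + T[(1, 0)]

for T in (EXC, CPL):
    assert not submodular(T)            # supermodular before the flip
    assert submodular(flip_second(T))   # submodular after the flip
```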
The resulting energy function is then:

E(x^1) = ∑_{i=1}^{N−1} E^1_i(x^1_i, x^1_{p(i)}) + E^1_N(x^1_N) + ∑_{(i,j)∈P} E^BND_ij(x^1_i, x^1_j)    (11)

In this case, the non-overlap constraint is enforced by the infinite terms within (4). The pseudo-boolean energy (11) is submodular and, hence, can be optimized directly via graph cut.
Multi-class case. As in the flat CRF case, the alpha-expansion procedure [7] can be used to extend the 2-class inference procedure to the case K > 2. Alpha-expansion is an iterative convergent process in which 2-class inference is applied at each iteration. In our case, given the current labeling f and a particular α ∈ 1 . . . K, each segment has the following options: (1a) a segment with a non-zero label can retain it; (1b) a segment with a zero label can change it to the current non-zero label of its ancestor (if any); (2) the label fi can be changed to α; (3) the label fi can be changed to 0 (or kept at 0 if already there). Thus, each step results in a 2-class inference task, where the U and V potentials of the 2-class inference are induced by the U and V potentials of the multi-label problem (in fact, some boundary terms then become asymmetric if one of the adjacent segments has the current label α; we do not detail this case here since it is handled in exactly the same way as in [7]). Alpha-expansion then performs a series of 2-class inferences, with α sweeping the range 1 . . . K multiple times until convergence.

4 Implementation and Experiments

Segmentation tree. For this paper, we used the popular segmentation tree approach [2], which is based on the powerful gPb edge detector and is known to produce high-quality segmentation trees. The implementation [2] is rather slow (orders of magnitude slower than our inference), and we plan to explore faster segmentation tree approaches.
Features. 
We use the following features to describe a segment Si: (1) a histogram h^SIFT_i of densely sampled visual SIFT words computed with vl_feat [36] (we use a codebook of size 512 and soft-assign each word to the 5 nearest codewords via locality-constrained linear coding [39]); (2) a histogram h^COL_i of RGB colors (codebook size 128; hard-assignment); (3) a histogram h^LOC_i of locations (where each pixel corresponds to a number from 1 to 36 depending on its position in a uniform 6 × 6 grid); (4) the "contour shape" descriptor h^SHP_i from [13] (a binned histogram of oriented gPb edge detector responses). Each of the four histograms is then normalized and mapped by a non-linear coordinate-wise mapping H(·) to a higher-dimensional space, where the inner product (linear kernel) closely approximates the χ2-kernel in the original space [37]. The unary term U^t_i is then computed as a scalar product of the stacked descriptor and the parameter weight vector w^t_U:

U^t_i = si · [H(h^SIFT_i)^T  H(h^COL_i)^T  H(h^LOC_i)^T  H(h^SHP_i)^T  1] · w^t_U .    (12)

Note that each unary term is also multiplied by si, the size of the segment Si. Without such multiplication, the inference process would be biased towards small segments (leaves of the segmentation tree).
The boundary cost for a pair of pixels (p, q) ∈ N is set based on the local boundary strength Δpq estimated with the gPb edge detector. The exact value of Vpq is then computed as a linear combination of exponentiated Δpq with several bandwidths:

Vpq = [exp(−Δpq/10)  exp(−Δpq/40)  exp(−Δpq/100)  1] · wV .    (13)

We discuss the learning of parameters w below. 
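As a toy illustration of the boundary cost (13) (the weight vector w_V below is made up; in the model it is learned, with non-negativity enforced so that Vpq stays non-negative):

```python
# Sketch of the boundary cost (13): a non-negative linear combination of
# exponentiated gPb boundary strengths at three bandwidths plus a constant
# feature. The weights w_V here are invented for illustration.
import math

def boundary_cost(delta_pq, w_V=(1.0, 1.0, 1.0, 0.1)):
    feats = [math.exp(-delta_pq / 10.0),
             math.exp(-delta_pq / 40.0),
             math.exp(-delta_pq / 100.0),
             1.0]  # constant feature
    return sum(f * w for f, w in zip(feats, w_V))
```

With non-negative weights the cost decreases monotonically in the edge strength, so cutting along strong gPb edges is cheaper than cutting across flat regions.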
The meta-parameters (codebook sizes, number of words in the soft-assignment, numbers of bins for the location and contour shape descriptors, bandwidths) were not tweaked (we set them based on previous experience and did not try other values).
Max-margin learning of parameters. Denote by w = [w^1_U, . . . , w^K_U, wV], wV ≥ 0, the parameters of the pylon model, and let (x̂1(w), x̂2(w)) be the minimizer of the energy E(x^1, x^2) given in (9)–(10). The goal is to find a parameter vector w such that (x̂1(w), x̂2(w)) has a small Hamming distance Δ(x̂1(w), x̂2(w)) to the segmentation x̄1, x̄2 of a training image. The Hamming distance is simply the number of incorrectly labeled pixels. To obtain a convex optimization problem and regularize its solution, we use the large-margin formulation of [33, 14]. The first step is to rewrite the optimization task (9)–(10) as:

(x̂1(w), x̂2(w)) = argmax_{x1,x2} −E(x^1, x^2) = argmax_{x1,x2} F(x^1, x^2) + ⟨Ψ(x^1, x^2), w⟩ ,    (14)

where Ψ(x^1, x^2) is a concatenation of the summed coefficients of (12) and (13), and F(x^1, x^2) accounts for the terms of E(x^1, x^2) that do not depend on w. Then margin rescaling [14] is used to construct a convex upper bound of the Hamming loss Δ(x̂1(w), x̂2(w)):

Δ′(w) = max_{x1,x2} Δ(x^1, x^2) + F(x^1, x^2) − F(x̄1, x̄2) + ⟨Ψ(x^1, x^2), w⟩ − ⟨Ψ(x̄1, x̄2), w⟩ .    (15)

The function Δ′(w) is convex because it is the upper envelope of a family of planes, one for each setting of x^1, x^2. This allows learning the parameter vector w as the minimizer of the convex objective function λ‖w‖²/2 + Δ′(w), where λ controls overfitting. 
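The pieces of this objective can be sketched schematically (hypothetical code: the small candidate set below stands in for what loss-augmented inference over all labelings would return, and every number is made up):

```python
# Schematic sketch of the margin-rescaled bound (15) and the regularized
# objective lambda*||w||^2/2 + Delta'(w). Each labeling is summarized by a
# triple (loss Delta, offset F, feature vector Psi); real use would replace
# the enumeration with loss-augmented graph-cut inference.

def delta_prime(w, candidates, gt):
    """candidates, gt: (loss, F, psi) triples; psi is a feature vector."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    score = lambda c: c[1] + dot(c[2], w)           # F(x) + <Psi(x), w>
    return max(c[0] + score(c) for c in candidates) - score(gt)

def objective(w, candidates, gt, lam=0.1):
    return lam * sum(v * v for v in w) / 2.0 + delta_prime(w, candidates, gt)

gt = (0.0, 0.0, (1.0, 0.0))                          # ground-truth labeling
cands = [gt, (2.0, 0.0, (0.0, 1.0)), (1.0, 0.0, (0.5, 0.5))]
assert delta_prime((3.0, 0.0), cands, gt) >= 0.0     # bound is non-negative
```

Because the ground truth (with zero loss) is itself a candidate, Δ′(w) is always at least zero, and it upper-bounds the Hamming loss of the energy minimizer.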
Optimization uses the cutting plane algorithm described in [14], which gradually approximates Δ′(w) by selecting a small representative subset of the exponential number of planes that figure in (15). These representative planes are found by maximizing (15), which can be done by the algorithm described in Sect. 3 after accounting for the loss Δ(x^1, x^2) in a suitable adjustment of the potentials.
Datasets. We consider the three Graz-02 datasets [23], which, to the best of our knowledge, represent the most challenging datasets for semantic binary (foreground-background) segmentation. Each Graz-02 dataset has one class of interest (bicycles, cars, and people). The datasets are loosely annotated at the pixel level. Previous methods reported performance for fixed splits comprising 150 training and 150 testing images. The customary performance measure is the equal recall-precision rate averaged over all pixels in the test set. In general, when trained with Hamming loss, our method produces recall slightly lower than precision. We therefore retrained our system with a weighted Hamming loss (so that false negatives are penalized more than false positives), tuning the balancing constant to achieve approximately equal recall and precision (an alternative would be to use parametric maxflow [15]).
We also consider the Stanford background dataset [11], containing 715 images of outdoor scenes with pixel-accurate annotations into 8 semantic classes (sky, tree, road, grass, water, building, mountain, and foreground object). Similarly to previous approaches, we report the percentage of correctly labeled pixels on 5 random splits of a fixed size (572 training, 143 testing).
Results. We compare the performance of our system with the state-of-the-art in Table 1. 
We note that our approach performs considerably better than the state of the art, including the CRF-based method [10], the pool-based methods [11, 18], and the approach based on the same gPb-based tree [22]. There are probably three reasons for this higher performance: superior features, a superior learning procedure, and a superior model (pylon).

Graz-02 dataset [23] (equal recall-precision):

Method                 | Bikes | Cars | People
Marszalek&Schmid [21]  | 53.8  | 44.1 | 61.8
Fulkerson et al. [10]  | 72.2  | 72.2 | 66.3
1-class pylon          | 83.4  | 84.9 | 81.5
2-class pylon          | 83.7  | 83.3 | 82.5

Stanford background dataset [11]:

Method             | correct %
Gould et al. [11]  | 76.4 ± 1.22
Munoz et al. [22]  | 76.9
Kumar&Koller [18]  | 79.42 ± 1.41
8-class pylon      | 81.90 ± 1.09

Table 1: Comparison with state-of-the-art. Left – equal recall-precision on the Graz datasets (pylon models were trained with class-weighted Hamming loss to achieve approximately equal recall-precision). Right – percentage of correctly labelled pixels on the Stanford dataset. For all datasets, our system achieves a considerable improvement over the state-of-the-art.

                         | Graz-02 Bikes     | Graz-02 Cars      | Graz-02 People    | Stanford background
Model                    | rec.  prec.  Ham. | rec.  prec.  Ham. | rec.  prec.  Ham. | mean   diff. to full
1-class pylon            | 80.8  86.9   7.7  | 81.7  87.0   3.1  | 77.3  85.0   6.4  |   –        –
2/8-class pylon          | 80.4  85.6   7.8  | 81.2  86.1   3.4  | 78.7  84.4   6.3  | 81.90   0.00 ± 0.00
Flat CRF – 0             | 80.7  86.8   8.8  | 79.4  83.8   3.3  | 73.7  79.8   7.9  | 80.07  −1.84 ± 0.15
Flat CRF – 20            | 81.1  83.7   8.2  | 81.3  84.6   3.6  | 76.7  80.7   7.3  | 81.13  −0.78 ± 0.42
Flat CRF – 40            | 78.3  85.4   8.6  | 81.2  82.1   3.8  | 76.0  80.6   7.4  | 80.25  −1.65 ± 0.69
Flat CRF – 60            | 71.2  84.2  10.3  | 79.5  80.8   4.1  | 71.6  79.0   8.4  | 77.99  −3.91 ± 0.74
Flat CRF – 80            | 64.5  81.1  12.4  | 74.7  76.8   4.9  | 68.9  80.2   8.4  | 75.01  −6.89 ± 0.47
1-class pylon (no bnd)   | 78.3  85.7   8.5  | 76.7  83.9   3.9  | 76.3  84.9   6.6  |   –        –
2/8-class pylon (no bnd) | 79.6  85.7   8.3  | 77.9  84.0   3.8  | 76.6  82.9   6.9  | 81.29  −0.62 ± 0.24

Table 2: Comparison with baseline methods with the same features and the same training procedure (unweighted Hamming loss was used in all cases). 'Flat CRF – X' corresponds to flat random fields trained and evaluated on the segmentations obtained by thresholding the segmentation tree at level X. The last two lines correspond to the pylon model trained and evaluated with boundary terms disabled. For Graz-02, recall, precision, and Hamming error for the predefined splits are given. For Stanford background, the percentage of correctly-labeled pixels is measured over 5 random splits; the mean and the difference to the full pylon model are given. For all datasets, the full pylon models perform better than the baselines (the best baseline for each dataset is underlined).

To clarify the benefit of the pylon model alone, we perform an extensive comparison with baselines (Table 2). We compare with flat CRF approaches, where the partitions are obtained by thresholding the segmentation tree at different levels. We also determine the benefit of having boundary terms by comparing with the pylon model without these terms. All baseline models used the same features and the same max-margin learning procedure. The full pylon model performs better than the baselines, although the advantage is not as large as that over the preceding methods.
Efficiency. The runtime of the entire framework is dominated by the pre-computation of segmentation trees and the features.
After such pre-computation, our graph cut inference is extremely fast: less than 0.1s per image/label, which is orders of magnitude faster than inference in previous pool-based methods. Training the model (after the precomputation) takes 85 minutes for one split of the Stanford background dataset (compared to 55 minutes for the flat CRF).

5 Discussion

Despite the very strong performance of our system in the experiments, we believe that the main appeal of the pylon model lies in the combination of interpretability, tractability, and flexibility. The interpretability is not adequately measured by the quantitative evaluation, but it may be observed in qualitative examples (Figures 1 and 3), where many segments chosen by the pylon model to "explain" a photograph correspond to objects or their high-level parts. The pylon model generalizes the flat CRF model for semantic segmentation that operates with small low-level structural elements. Notably, despite such generalization, inference and max-margin learning in the pylon model are as easy as in the flat CRF model.

References
[1] N. Ahuja. A transform for multiscale image segmentation by integrated edge and region detection. IEEE Trans. Pattern Anal. Mach. Intell., 18(12), 1996.
[2] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 33(5):898–916, 2011.
[3] P. Awasthi, A. Gagrani, and B. Ravindran. Image modeling using tree structured conditional random fields. In IJCAI, pages 2060–2065, 2007.
[4] E. Boros and P. L. Hammer. Pseudo-boolean optimization. Discrete Applied Mathematics, 123(1-3):155–225, 2002.
[5] C. A. Bouman and M. Shapiro. A multiscale random field model for Bayesian image segmentation. IEEE Transactions on Image Processing, 3(2):162–177, 1994.
[6] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans. Pattern Anal. Mach. Intell., 26(9):1124–1137, 2004.
[7] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Anal. Mach. Intell., 23(11):1222–1239, 2001.
[8] X. Chen, A. Jain, A. Gupta, and L. Davis. Piecing together the segmentation jigsaw using context. In CVPR, 2011.
[9] X. Feng, C. K. I. Williams, and S. N. Felderhof. Combining belief networks and neural networks for scene segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 24(4):467–483, 2002.
[10] B. Fulkerson, A. Vedaldi, and S. Soatto. Class segmentation and object localization with superpixel neighborhoods. In ICCV, pages 670–677, 2009.
[11] S. Gould, R. Fulton, and D. Koller. Decomposing a scene into geometric and semantically consistent regions. In ICCV, pages 1–8, 2009.
[12] D. M. Greig, B. T. Porteous, and A. H. Seheult. Exact maximum a posteriori estimation for binary images. Journal of the Royal Statistical Society, 51(2), 1989.
[13] C. Gu, J. J. Lim, P. Arbelaez, and J. Malik. Recognition using regions. In CVPR, pages 1030–1037, 2009.
[14] T. Joachims, T. Finley, and C.-N. J. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1), 2009.
[15] V. Kolmogorov, Y. Boykov, and C. Rother. Applications of parametric maxflow in computer vision. In ICCV, pages 1–8, 2007.
[16] V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? IEEE Trans. Pattern Anal. Mach. Intell., 26(2):147–159, 2004.
[17] A. Kulesza and F. Pereira. Structured learning with approximate inference. In NIPS, 2007.
[18] M. P. Kumar and D. Koller. Efficiently selecting regions for scene understanding. In CVPR, 2010.
[19] L. Ladicky, C. Russell, P. Kohli, and P. H. S. Torr. Associative hierarchical CRFs for object class image segmentation. In ICCV, pages 739–746, 2009.
[20] T. Malisiewicz and A. A. Efros. Improving spatial support for objects via multiple segmentations. In BMVC, September 2007.
[21] M. Marszalek and C. Schmid. Accurate object localization with shape masks. In CVPR, 2007.
[22] D. Munoz, J. A. Bagnell, and M. Hebert. Stacked hierarchical labeling. In ECCV (6), pages 57–70, 2010.
[23] A. Opelt, A. Pinz, M. Fussenegger, and P. Auer. Generic object recognition with boosting. IEEE Trans. Pattern Anal. Mach. Intell., 28(3):416–431, 2006.
[24] N. Plath, M. Toussaint, and S. Nakajima. Multi-class image segmentation using conditional random fields and global classification. In ICML, page 103, 2009.
[25] J. Reynolds and K. Murphy. Figure-ground segmentation using a hierarchical conditional random field. In CRV, pages 175–182, 2007.
[26] P. Schnitzspan, M. Fritz, and B. Schiele. Hierarchical support vector random fields: Joint training to combine local and global features. In ECCV (2), pages 527–540, 2008.
[27] E. Sharon, A. Brandt, and R. Basri. Fast multiscale image segmentation. In CVPR, 2000.
[28] J. Shi and J. Malik. Normalized cuts and image segmentation. In CVPR, pages 731–737, 1997.
[29] J. Shotton, J. M. Winn, C. Rother, and A. Criminisi. TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In ECCV (1), pages 1–15, 2006.
[30] D. Singaraju and R. Vidal. Using global bag of features models in random fields for joint categorization and segmentation of objects. In CVPR, 2011.
[31] R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agarwala, M. F. Tappen, and C. Rother. A comparative study of energy minimization methods for Markov random fields with smoothness-based priors. IEEE Trans. Pattern Anal. Mach. Intell., 30(6):1068–1080, 2008.
[32] M. Szummer, P. Kohli, and D. Hoiem. Learning CRFs using graph cuts. In ECCV, 2008.
[33] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS, 2003.
[34] S. Todorovic and N. Ahuja. Learning subcategory relevances for category recognition. In CVPR, 2008.
[35] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In ICML, 2004.
[36] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms. http://www.vlfeat.org/, 2008.
[37] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. In CVPR, 2010.
[38] O. Veksler. Image segmentation by nested cuts. In CVPR, pages 1339–, 2000.
[39] J. Wang, J. Yang, K. Yu, F. Lv, T. S. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In CVPR, pages 3360–3367, 2010.