{"title": "Recursive Segmentation and Recognition Templates for 2D Parsing", "book": "Advances in Neural Information Processing Systems", "page_first": 1985, "page_last": 1992, "abstract": null, "full_text": "Recursive Segmentation and Recognition Templates for 2D Parsing\n\nLong (Leo) Zhu, CSAIL MIT, leozhu@csail.mit.edu\nYuanhao Chen, USTC, yhchen4@ustc.edu.cn\nYuan Lin, Shanghai Jiaotong University, loirey@sjtu.edu.cn\nChenxi Lin, Microsoft Research Asia, chenxil@microsoft.com\nAlan Yuille, UCLA, yuille@stat.ucla.edu\n\nAbstract\n\nLanguage and image understanding are two major goals of artificial intelligence which can both be conceptually formulated in terms of parsing the input signal into a hierarchical representation. Natural language researchers have made great progress by exploiting the 1D structure of language to design efficient polynomial-time parsing algorithms. By contrast, the two-dimensional nature of images makes it much harder to design efficient image parsers and the form of the hierarchical representations is also unclear. Attempts to adapt representations and algorithms from natural language have only been partially successful.\n\nIn this paper, we propose a Hierarchical Image Model (HIM) for 2D image parsing which outputs image segmentation and object recognition. This HIM is represented by recursive segmentation and recognition templates in multiple layers and has advantages for representation, inference, and learning. Firstly, the HIM has a coarse-to-fine representation which is capable of capturing long-range dependency and exploiting different levels of contextual information. Secondly, the structure of the HIM allows us to design a rapid inference algorithm, based on dynamic programming, which enables us to parse the image rapidly in polynomial time. Thirdly, we can learn the HIM efficiently in a discriminative manner from a labeled dataset. 
We demonstrate that HIM outperforms other state-of-the-art methods by evaluation on the challenging public MSRC image dataset. Finally, we sketch how the HIM architecture can be extended to model more complex image phenomena.\n\n1 Introduction\n\nLanguage and image understanding are two major tasks in artificial intelligence. Natural language researchers have formalized this task in terms of parsing an input signal into a hierarchical representation. They have made great progress in both representation and inference (i.e. parsing). Firstly, they have developed probabilistic grammars (e.g. stochastic context free grammar (SCFG) [1] and beyond [2]) which are capable of representing complex syntactic and semantic language phenomena. For example, speech contains elementary constituents, such as nouns and verbs, that can be recursively composed into a hierarchy of constituents (e.g. noun phrases or verb phrases) of increasing complexity. Secondly, they have exploited the one-dimensional structure of language to obtain efficient polynomial-time parsing algorithms (e.g. the inside-outside algorithm [3]).\n\nBy contrast, the nature of images makes it much harder to design efficient image parsers which are capable of simultaneously performing segmentation (parsing an image into regions) and recognition (labeling the regions). Firstly, it is unclear what hierarchical representations should be used to model images and there are no direct analogies to the syntactic categories and phrase structures that occur in speech. Secondly, the inference problem is formidable due to the well-known complexity and ambiguity of segmentation and recognition. Unlike most languages (Chinese is an exception), whose constituents are well-separated words, the boundaries between different image regions are usually highly unclear. 
Exploring all the different image partitions results in combinatorial explosions because of the two-dimensional nature of images (which makes it impossible to order these partitions to enable dynamic programming). Overall it has been hard to adapt methods from natural language parsing and apply them to vision despite the high-level conceptual similarities (except for restricted problems such as text [4]).\n\nAttempts at image parsing must make trade-offs between the complexity of the models and the complexity of the computation (for inference and learning). Broadly speaking, recent attempts can be divided into two different styles. The first style emphasizes the modeling problem and develops stochastic grammars [5, 6] capable of representing a rich class of visual relationships and conceptual knowledge about objects, scenes, and images. This style of research pays less attention to the complexity of computation. Learning is usually performed, if at all, only for individual components of the models. Parsing is performed by MCMC sampling and is only efficient provided effective proposal probabilities can be designed [6]. The second style builds on the success of conditional random fields (CRF's) [7] and emphasizes efficient computation. This yields simpler (discriminative) models which are less capable of representing complex image structures and long range interactions. Efficient inference (e.g. belief propagation and graph-cuts) and learning (e.g. AdaBoost, MLE) are available for basic CRF's and make these methods attractive. But these inference algorithms become less effective, and can fail, if we attempt to make the CRF models more powerful. For example, TextonBoost [8] requires the parameters of the CRF to be tuned manually. 
Overall, it seems hard to extend the CRF style methods to include long-range relationships and contextual knowledge without significantly altering the models and the algorithms.\n\nIn this paper, we introduce Hierarchical Image Models (HIM's) for image parsing. HIM's balance the trade-off between model and inference complexity by introducing a hierarchy of hidden states. In particular, we introduce recursive segmentation and recognition templates which represent complex image knowledge and serve as elementary constituents analogous to those used in speech. As in speech, we can recursively compose these constituents at lower levels to form more complex constituents at higher levels. Each node of the hierarchy corresponds to an image region (whose size depends on the level in the hierarchy). The state of each node represents both the partitioning of the corresponding region into segments and the labeling of these segments (i.e. in terms of objects). Segmentations at the top levels of the hierarchy give coarse descriptions of the image which are refined by the segmentations at the lower levels. Learning and inference (parsing) are made efficient by exploiting the hierarchical structure (and the absence of loops). In short, this novel architecture offers two advantages: (I) Representation – the hierarchical model using segmentation templates is able to capture long-range dependency and exploit different levels of contextual information; (II) Computation – the hierarchical tree structure enables rapid inference (polynomial time) and learning by variants of dynamic programming (with pruning) and the use of machine learning (e.g. structured perceptrons [9]).\n\nTo illustrate the HIM we implement it for parsing images and we evaluate it on the public MSRC image dataset [8]. 
Our results show that the HIM outperforms the other state-of-the-art approaches. We discuss ways that HIM's can be extended naturally to model more complex image phenomena.\n\n2 Hierarchical Image Model\n\n2.1 The Model\n\nWe represent an image by a hierarchical graph defined by parent-child relationships. See figure 1. The hierarchy corresponds to the image pyramid (with 5 layers in this paper). The top node of the hierarchy represents the whole image. The intermediate nodes represent different sub-regions of the image. The leaf nodes represent local image patches (27 × 27 in this paper). We use a to index nodes of the hierarchy. A node a has only one parent node, denoted by Pa(a), and four child nodes, denoted by Ch(a). Thus, the hierarchy is a quad tree and Ch(a) encodes all its vertical edges. The image region represented by node a is denoted by R(a), and a pixel in R(a) is indexed by r. The set of pairs of neighboring pixels in R(a) is denoted by E(a).\n\nA configuration of the hierarchy is an assignment of state variables y = {ya} with ya = (sa, ca) at each node a, where s and c denote region partition and object labeling, respectively; (s, c) is called the “Segmentation and Recognition” pair, which we call an S-R pair. All state variables are unobservable.\n\nFigure 1: The left panel shows the structure of the Hierarchical Image Model. The grey circles are the nodes of the hierarchy. All nodes, except the top node, have only one parent node. All nodes except the leafs are connected to four child nodes. The middle panel shows a dictionary of 30 segmentation templates. The color of the sub-parts of each template indicates the object class. Different sub-parts may share the same label. For example, three sub-parts may have only two distinct labels. The last panel shows that the ground truth pixel labels (upper right panel) can be well approximated by composing a set of labeled segmentation templates (bottom right panel).\n\nFigure 2: This figure illustrates how the segmentation templates and object labels (S-R pair) represent image regions in a coarse-to-fine way. The left figure is the input image which is followed by global, mid-level and local S-R pairs. The global S-R pair gives a coarse description of the object identity (horse), its background (grass), and its position in the image (central). The mid-level S-R pair corresponds to the region bounded by the black box in the input image. It represents (roughly) the shape of the horse's leg. The four S-R pairs at the lower level combine to represent the same leg more accurately.\n\nMore precisely, each region R(a) is described by a segmentation template which is selected from a dictionary DS. Each segmentation template consists of a partition of the region into K non-overlapping sub-parts, see figure 1. In this paper K ≤ 3, |DS| = 30, and the segmentation templates are designed by hand to cover the taxonomy of shape segmentations that happen in images, such as T-junctions, Y-junctions, and so on. The variable s refers to the index of the segmentation template in the dictionary, i.e., sa ∈ {1..|DS|}. c gives the object labels of the K sub-parts (i.e. it labels one sub-part as “horse”, another as “dog”, and another as “grass”). Hence ca is a K-dimensional vector whose components take values 1, ..., M, where M is the number of object classes. The labeling of a pixel r in region R(a) is denoted by o^r_a ∈ {1..M} and is directly obtained from (sa, ca): any two pixels belonging to the same sub-part share the same label. The labeling o^r_a is defined at the level of node a. In other words, each level of the hierarchy has a separate labeling field. 
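The S-R pair machinery can be made concrete with a toy sketch (illustrative assumptions only: the 2×2 grid template and the class names below are invented, whereas the real dictionary holds 30 hand-designed templates over image regions):

```python
# Toy illustration of an S-R pair: a segmentation template assigns each
# pixel to one of K sub-parts, and the label vector c names each sub-part.
# The 2x2 'template' below is a made-up stand-in for the hand-designed
# dictionary of 30 templates described in the text.

def pixel_labels(template, c):
    # template: grid of sub-part indices (0..K-1); c: one label per sub-part
    return [[c[part] for part in row] for row in template]

# A vertical-split template: sub-part 0 on the left, sub-part 1 on the right
template = [[0, 1],
            [0, 1]]
labels = pixel_labels(template, c=['horse', 'grass'])
# Every pixel in the same sub-part shares one label, as in the model.

# Size of the state space of one node: |DS| templates times M^K labelings
M, K, num_templates = 21, 3, 30
states_per_node = num_templates * M ** K
```

With M = 21 classes, K = 3 sub-parts and 30 templates this gives 277,830 candidate S-R states per node, which matches the O(M^K |DS|) state-space size quoted in section 2.2.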
We will show how our model encourages the labelings o^r_a at different levels to be consistent.\n\nA novel feature of this hierarchical representation is the multi-level S-R pairs which explicitly model both the segmentation and labeling of their corresponding regions, while traditional vision approaches [8, 10, 11] use labeling only. The S-R pairs defined in a hierarchical form provide a coarse-to-fine representation which captures the “gist” (semantical meaning) of image regions. As one can see in figure 2, the global S-R pair gives a coarse description (the identities of objects and their spatial layout) of the whole image which is accurate enough to encode high level image properties in a compact form. The mid-level one represents the leg of a horse roughly. The four templates at the lower level further refine the interpretations. We will show this approximation quality empirically in section 3.\n\nThe conditional distribution over all the states is given by:\n\np(y|x; α) = (1/Z(x; α)) exp{−E1(x, s, c; α1) − E2(x, s, c; α2) − E3(s, c; α3) − E4(c; α4) − E5(s; α5) − E6(s, c; α6)}   (1)\n\nwhere x refers to the input image, y is the parse tree, α are the parameters to be estimated, Z(x; α) is the partition function and Ei(x, y) are energy terms. Equivalently, the conditional distribution can be reformulated in a log-linear form:\n\nlog p(y|x; α) = ψ(x, y) · α − log Z(x; α)   (2)\n\nEach energy term is of linear form, Ei(x, y) = −ψi(x, y) · αi, where the inner product is calculated on potential functions defined over the hierarchical structure. 
There are six types of energy terms defined as follows.\n\nThe first term E1(x, s, c) is an object specific data term which represents image features of regions. We set E1(x, s, c) = −Σ_a α1 ψ1(x, sa, ca), where Σ_a is the summation over all nodes at different levels of the hierarchy, and ψ1(x, sa, ca) is of the form:\n\nψ1(x, sa, ca) = (1/|R(a)|) Σ_{r∈R(a)} log p(o^r_a|x)   (3)\n\nwhere p(o^r_a|x) = exp{F(x_r, o^r_a)} / Σ_{o′} exp{F(x_r, o′)}, x_r is a local image region centered at the location of r, and F(·,·) is a strong classifier output by multi-class boosting [12]. The image features used by the classifier (47 in total) are the greyscale intensity, the color (R, G, B channels), the intensity gradient, the Canny edge, the responses of DOG (difference of Gaussians) and DOOG (difference of offset Gaussians) filters at different scales (13×13 and 22×22) and orientations (0, 30, 60, ...), and so on. We use 55 types of shape (spatial) filters (similar to [8]) to calculate the responses of the 47 image features. There are 47 × 55 = 2585 features in total.\n\nThe second term (segmentation specific) E2(x, s, c) = −Σ_a α2 ψ2(x, sa, ca) is designed to favor the segmentation templates in which the pixels belonging to the same partitions (i.e., having the same labels) have similar appearance. We define:\n\nψ2(x, sa, ca) = (1/|E(a)|) Σ_{(q,r)∈E(a)} φ(x_r, x_q|o^r_a, o^q_a)   (4)\n\nwhere E(a) is the set of edges connecting pixels q, r in a neighborhood, and φ(x_r, x_q|o^r_a, o^q_a) takes the form φ(x_r, x_q|o^r_a, o^q_a) = γ(r, q) if o^r_a = o^q_a, and 0 if o^r_a ≠ o^q_a, where γ(r, q) = λ exp{−g²(r, q) / (2γ²)} / dist(r, q), g(·, ·) 
is a distance measure on the colors x_r, x_q, and dist(r, q) measures the spatial distance between r and q. φ(x_r, x_q|o^r_a, o^q_a) is the so-called contrast-sensitive Potts model, which is widely used in graph-cut algorithms [13] as an edge potential (at one level only) to favor pixels with similar colour having the same labels.\n\nThe third term, defined as E3(s, c) = −Σ_{a, b=Pa(a)} α3 ψ3(sa, ca, sb, cb) (i.e. the nodes a at all levels are considered and b is the parent of a), is proposed to encourage consistency between the configurations of every pair of parent-child nodes in two consecutive layers. ψ3(sa, ca, sb, cb) is defined via the Hamming distance:\n\nψ3(sa, ca, sb, cb) = (1/|R(a)|) Σ_{r∈R(a)} δ(o^r_a, o^r_b)   (5)\n\nwhere δ(o^r_a, o^r_b) is the Kronecker delta, which equals one whenever o^r_a = o^r_b and zero otherwise. This function glues the segmentation templates (and their labels) at different levels together in a consistent hierarchical form. This energy term is a generalization of the interaction energy in the Potts model. However, E3(s, c) has a hierarchical form which allows multi-level interactions.\n\nThe fourth term E4(c) is designed to model the co-occurrence of two object classes (e.g., a cow is unlikely to appear next to an aeroplane):\n\nE4(c) = −Σ_a Σ_{i,j=1..M} α4(i, j) ψ4(i, j, ca, ca) − Σ_{a, b=Pa(a)} Σ_{i,j=1..M} α4(i, j) ψ4(i, j, ca, cb)   (6)\n\nwhere ψ4(i, j, ca, cb) is an indicator function which equals one when i ≡ ca and j ≡ cb (i ≡ ca means i is a component of ca) both hold true, and zero otherwise. α4 is a matrix where each entry α4(i, j) encodes the compatibility between two classes i and j. 
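The contrast-sensitive weight γ(r, q) and the edge potential φ from equation 4 can be sketched as follows (an illustrative toy, not the paper's implementation: the color representation, λ = 1.0, and the bandwidth value 0.1 are made-up stand-ins):

```python
import math

# Contrast-sensitive Potts potential (equation 4), as a toy sketch.
# lam (lambda) and sigma (the bandwidth, written gamma in the paper)
# are placeholder values, not the settings used in the paper.

def edge_weight(color_r, color_q, dist, lam=1.0, sigma=0.1):
    # g(r, q): squared color difference between the two pixels
    g2 = sum((cr - cq) ** 2 for cr, cq in zip(color_r, color_q))
    return lam * math.exp(-g2 / (2.0 * sigma ** 2)) / dist

def potts_potential(color_r, color_q, dist, label_r, label_q):
    # phi is gamma(r, q) when the labels agree, and 0 otherwise, so
    # similar-looking neighbors are rewarded for sharing a label
    if label_r == label_q:
        return edge_weight(color_r, color_q, dist)
    return 0.0
```

For identical colors and the same label the potential attains its maximum λ/dist; for very different colors the reward for sharing a label decays towards zero, which is what lets the model keep distinct labels across strong color edges.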
The \ufb01rst term on the r.h.s encodes the classes\nin a single template while the second term encodes the classes in two templates of the parent-child\nnodes. It is worth noting that class dependency is encoded at all levels to capture both short-range\nand long-range interactions.\n\n4\n\n(cid:88)\n\n(cid:88)\n\n\fa\n\nj\u2261ca\n\n(cid:80)\n\nThe \ufb01fth term E5(s) = \u2212(cid:80)\nthe segmentation template. Similarly the sixth term E6(s, c) = \u2212(cid:80)\n\na \u03b15\u03c85(sa), where \u03c85(sa) = log p(sa) encode the generic prior of\n\u03b16\u03c86(sa, j), where\n\u03c86(sa, j) = log p(sa, j), models the co-occurrence of the segmentation templates and the object\nclasses. \u03c85(sa) and \u03c86(sa, j) are directly obtained from training data by label counting. The pa-\nrameters \u03b15 and \u03b16 are both scalars.\nJusti\ufb01cations. The HIM has several partial similarities with other work. HIM is a coarse-to-\ufb01ne\nrepresentation which captures the \u201cgist\u201d of image regions by using the S-R pairs at multiple levels.\nBut the traditional concept of \u201cgist\u201d [14] relies only on image features and does not include segmen-\ntation templates. Levin and Weiss [15] use a segmentation mask which is more object-speci\ufb01c than\nour segmentation templates (and they do not have a hierarchy). It is worth nothing that, in contrast\nto TextonBoost [8], we do not use \u201clocation features\u201d in order to avoid the dangers of over\ufb01tting to\na restricted set of scene layouts. Our approach has some similarities to some hierarchical models\n(which have two-layers only) [10],[11] \u2013 but these models also lack segmentation templates. The\nhierarchial model proposed by [16] is an interesting alternative but which does not perform explicit\nsegmentation.\n2.2 Parsing by Dynamic Programming\nParsing an image is performed as inference of the HIM. 
More precisely, the task of parsing is to obtain the maximum a posteriori (MAP) estimate:\n\ny* = arg max_y p(y|x; α) = arg max_y ψ(x, y) · α   (7)\n\nThe size of the state space of each node is O(M^K |DS|), where K = 3, M = 21, |DS| = 30 in our case. Since y takes the form of a tree, Dynamic Programming (DP) can be applied to calculate the best parse tree y* according to equation 7. Note that the pixel labeling o_a is determined by (s, c), so we only need to consider a subset of pixel labelings. This is unlike a flat MRF representation, where we would need to do exhaustive search over all pixel labels o (which would be impractical for DP). The final output of the model for segmentation is the pixel labeling determined by the (s, c) of the lowest level.\n\nIt is straightforward to see that the computational complexity of DP is O(M^{2K} |DS|² H), where H is the number of edges of the hierarchy. Although DP can be performed in polynomial time, the huge number of states makes exact DP still impractical. Therefore, we resort to a pruned version of DP similar to the method described in [17]. For brevity we omit the details.\n\n2.3 Learning the Model\n\nSince HIM is a conditional model, in principle, estimation of its parameters can be achieved by any discriminative learning approach, such as maximum likelihood learning as used in Conditional Random Fields (CRFs) [7], max-margin learning [18], and structure-perceptron learning [9]. In this paper, we adopt structure-perceptron learning, which has been applied to learning the recursive deformable template (see [19]). Note that structure-perceptron learning is simple to implement and only needs to calculate the most probable configurations (parses) of the model. By contrast, maximum likelihood learning requires calculating the expectation of features, which is difficult due to the large state space of the HIM. 
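The tree-structured max-sum computation behind equation 7 can be sketched as follows (an illustrative toy, not the authors' implementation: the node/state/score interfaces and the pruning-free recursion are simplifying assumptions, with unary standing in for the data terms and pairwise for the parent-child consistency terms):

```python
# Toy max-sum dynamic programming over a tree of S-R states (equation 7).

def parse_tree(children, states, unary, pairwise):
    # children: node -> list of child nodes (tree rooted at node 0)
    # states: node -> list of candidate S-R states
    # unary(node, s): data score for assigning state s to node
    # pairwise(ps, cs): consistency score between parent and child states
    best = {}      # (node, state) -> best score of the subtree
    choice = {}    # (node, state) -> chosen child states
    def score(node, s):
        if (node, s) in best:
            return best[(node, s)]
        total = unary(node, s)
        picks = []
        for c in children.get(node, []):
            b, bc = max((score(c, cs) + pairwise(s, cs), cs)
                        for cs in states[c])
            total += b
            picks.append((c, bc))
        best[(node, s)] = total
        choice[(node, s)] = picks
        return total
    # pick the best root state, then read back the argmax assignment
    root_score, root_state = max((score(0, s), s) for s in states[0])
    assign = {0: root_state}
    stack = [(0, root_state)]
    while stack:
        node, s = stack.pop()
        for c, cs in choice[(node, s)]:
            assign[c] = cs
            stack.append((c, cs))
    return root_score, assign
```

Because each edge is visited once per parent-child state pair, the cost is quadratic in the per-node state count times the number of edges, mirroring the O(M^{2K} |DS|² H) complexity above; the paper additionally prunes this state space.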
Therefore, structure-perceptron learning is more flexible and computationally simpler. Moreover, Collins [9] proved theoretical results on convergence properties, for both separable and non-separable cases, and on generalization.\n\nStructure-perceptron learning does not compute the partition function Z(x; α), so we do not have a formal probabilistic interpretation. The goal of structure-perceptron learning is to learn a mapping from inputs x ∈ X to output structures y ∈ Y. In our case, X is a set of images and Y is a set of possible parse trees which specify the labels of image regions in a hierarchical form. At first sight, the ground-truth parse trees require labels for both the segmentation templates and the pixel labelings; in our experiments we will show how to obtain this ground truth directly from the segmentation labels without extra human labeling. We use a set of training examples {(xi, yi) : i = 1...n} and a set of functions ψ which map each (x, y) ∈ X × Y to a feature vector ψ(x, y) ∈ R^d. The task is to estimate a parameter vector α ∈ R^d for the weights of the features. The feature vectors ψ(x, y) can include arbitrary features of parse trees, as discussed in section 2.1. The loss function used in structure-perceptron learning is usually of the form:\n\nLoss(α) = max_ȳ ψ(x, ȳ) · α − ψ(x, y) · α   (8)\n\nInput: A set of training images with ground truth (xi, yi) for i = 1..N. 
Initialize the parameter vector α = 0.\nFor t = 1..T, i = 1..N:\n• Find the best state of the model on the i'th training image with the current parameter setting, i.e., y* = arg max_y ψ(xi, y) · α\n• Update the parameters: α = α + ψ(xi, yi) − ψ(xi, y*)\n• Store: α_{t,i} = α\nOutput: the averaged parameters γ = Σ_{t=1}^T Σ_{i=1}^N α_{t,i} / (NT)\n\nFigure 3: Structure-perceptron learning\n\nwhere, in equation 8, y is the correct structure for input x and ȳ is a dummy variable ranging over candidate structures.\n\nThe basic structure-perceptron algorithm is designed to minimize the loss function. We adopt “the averaged parameters” version, whose pseudo-code is given in figure 3. The algorithm proceeds in a simple way (similar to the perceptron algorithm for classification). The parameters are initialized to zero and the algorithm loops over the training examples. If the highest scoring parse tree for input x is not correct, then the parameters α are updated by an additive term. The most difficult step of the method is finding y* = arg max_y ψ(xi, y) · α. This is precisely the parsing (inference) problem. Hence the practicality of structure-perceptron learning, and its computational efficiency, depend on the inference algorithm. As discussed in section 2.2, the inference algorithm has polynomial computational complexity for an HIM, which makes structure-perceptron learning practical. The averaged parameters are defined to be γ = Σ_{t=1}^T Σ_{i=1}^N α_{t,i} / (NT), where T is the number of epochs and NT is the total number of iterations. It is straightforward to store these averaged parameters and output them as the final estimates.\n\n3 Experimental Results\n\nDataset. We use a standard public dataset, the MSRC 21-class Image Dataset [8], to perform experimental evaluations for the HIM. 
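The procedure of figure 3 can be sketched as follows (an illustrative toy, not the authors' code: psi and argmax_parse are hypothetical stand-ins, with the DP parser of section 2.2 playing the role of argmax_parse in the real system):

```python
# Toy sketch of averaged structure-perceptron learning (figure 3).
# 'psi' (joint feature map) and 'argmax_parse' (the inference step,
# i.e. the parser) are stand-ins supplied by the caller.

def averaged_perceptron(train, psi, argmax_parse, dim, epochs=20):
    # train: list of (x, y_true) pairs; psi(x, y) -> list of dim floats
    alpha = [0.0] * dim
    avg = [0.0] * dim
    count = 0
    for t in range(epochs):
        for x, y_true in train:
            y_star = argmax_parse(x, alpha)  # best parse under current alpha
            if y_star != y_true:
                f_true = psi(x, y_true)
                f_star = psi(x, y_star)
                for d in range(dim):
                    alpha[d] += f_true[d] - f_star[d]
            # accumulate alpha_{t,i} for the averaged output
            for d in range(dim):
                avg[d] += alpha[d]
            count += 1
    return [a / count for a in avg]  # gamma = sum of alpha_{t,i} / (N*T)
```

The only model-specific ingredients are psi and argmax_parse, which is why the polynomial-time DP parser is what makes this learning rule practical for the HIM.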
This dataset is designed to evaluate scene labeling including both image segmentation and multi-class object recognition. The ground truth only gives the labeling of the image pixels. To supplement this ground truth (to enable learning), we estimate the true labels (states of the S-R pairs) of the nodes in the five-layer hierarchy of the HIM by selecting the S-R pairs which have maximum overlap with the labels of the image pixels. This approximation only results in 2% error in labeling image pixels. There are a total of 591 images. We use the identical splitting as [8], i.e., 45% for training, 10% for validation, and 45% for testing. The parameters learnt from the training set with the best performance on the validation set are selected.\n\nImplementation Details. For a given image x, the parsing result is obtained by estimating the best configuration y* of the HIM. To evaluate the performance of parsing we use the global accuracy measured in terms of all pixels and the average accuracy over the 21 object classes (global accuracy pays most attention to frequently occurring objects and penalizes infrequent objects). A computer with 8 GB memory and a 2.4 GHz CPU was used for training and testing. For each class, there are around 4,500 weak classifiers selected by multi-class boosting. The boosting learning takes about 35 hours, of which 27 hours are spent on I/O processing and 8 hours on computing. The structure-perceptron learning takes about 20 hours to converge in 5520 (T = 20, N = 276) iterations. In the testing stage, it takes 30 seconds to parse an image of size 320 × 200 (6s for extracting image features, 9s for computing the strong classifier of boosting and 15s for parsing the HIM).\n\nResults. Figure 4 (best viewed in color) shows several parsing results obtained by the HIM and by the classifier by itself (i.e. p(o^r_a|x) learnt by boosting). One can see that the HIM is able to roughly capture different shaped segmentation boundaries (see the legs of the cow and sheep in rows 1 and 3, and the boundary curve between sky and building in row 4). Table 1 shows that HIM improves the results obtained by the classifier by 6.9% for average accuracy and 5.3% for global accuracy. In particular, in rows 6 and 7 of figure 4, one can observe that boosting gives many incorrect labels. It is impossible to correct such large mislabeled regions without the long-range interactions in the HIM, which improves the results by 20% and 32%.\n\nComparisons. In table 1, we compare the performance of our approach with other successful methods [8, 20, 21]. Our approach outperforms those alternatives by 6% in average accuracy and 4% in global accuracy. Our boosting results are better than TextonBoost [8] because of the image features. Would we get better results if we used a flat CRF with our boosting instead of a hierarchy? We argue that we would not, because the CRF only improves TextonBoost's performance by 3 percent [8], while we gain 5 percent by using the hierarchy (and we start with a higher baseline). Some other methods [22, 11, 10], which are worse than [20, 21] and evaluated on simpler datasets [10, 11] (less than 10 classes), are not listed here due to lack of space. In summary, our results are significantly better than the state-of-the-art methods.\n\nFigure 4: This figure is best viewed in color. The colors indicate the labels of 21 object classes as in [8]. The columns (except the fourth “accuracy” column) show the input images, ground truth, the labels obtained by HIM and the boosting classifier respectively. The “accuracy” column shows the global accuracy obtained by HIM (left) and the boosting classifier (right). In these 7 examples, HIM improves boosting by 1%, -1% (an outlier!), 1%, 10%, 18%, 20% and 32% in terms of global accuracy.\n\nTable 1: Performance comparisons for average accuracy and global accuracy. “Classifier only” gives the results where the pixel labels are predicted by the classifier obtained by boosting only.\n\n        | Textonboost [8] | PLSA-MRF [20] | Auto-context [21] | Classifier only | HIM\nAverage |            57.7 |          64.0 |                68 |            67.2 | 74.1\nGlobal  |            72.2 |          73.5 |              77.7 |            75.9 | 81.2\n\nDiagnosis on the function of the S-R Pair. Figure 5 shows how the S-R pairs (which include the segmentation templates) can be used to (partially) parse an object into its constituent parts, via the correspondence between S-R pairs and specific parts of objects. We plot the states of a subset of S-R pairs for some images. For example, the S-R pair consisting of two horizontal bars labeled “cow” and “grass” respectively indicates the cow's stomach consistently across different images. Similarly, the cow's tail can be located according to the configuration of another S-R pair with vertical bars. In principle, the whole object can be parsed into its constituent parts which are aligned consistently. Developing this idea further is an exciting aspect of our current research.\n\n4 Conclusion\n\nThis paper describes a novel hierarchical image model (HIM) for 2D image parsing. The hierarchical nature of the model, and the use of recursive segmentation and recognition templates, enables the HIM to represent complex image structures in a coarse-to-fine manner. We can perform inference (parsing) rapidly in polynomial time by exploiting the hierarchical structure. Moreover, we can learn the HIM probability distribution from labeled training data by adapting the structure-perceptron algorithm. We demonstrated the effectiveness of HIM's by applying them to the challenging task of segmentation and labeling of the public MSRC image database. 
Our results show that we outperform other state-of-the-art approaches.\n\nFigure 5: The S-R pairs can be used to parse the object into parts. The colors indicate the identities of objects. The shapes (spatial layout) of the segmentation templates distinguish the constituent parts of the object. Observe that the same S-R pairs (e.g. stomach above grass, and tail to the left of grass) correspond to the same object part in different images.\n\nThe design of the HIM was motivated by drawing parallels between language and vision processing. We have attempted to capture the underlying spirit of the successful language processing approaches – the hierarchical representations based on the recursive composition of constituents and efficient inference and learning algorithms. Our current work attempts to extend HIM's to improve their representational power while maintaining computational efficiency.\n\n5 Acknowledgments\n\nThis research was supported by NSF grant 0413214 and the W.M. Keck foundation.\n\nReferences\n\n[1] F. Jelinek and J. D. Lafferty, “Computation of the probability of initial substring generation by stochastic context-free grammars,” Computational Linguistics, vol. 17, no. 3, pp. 315–323, 1991.\n[2] M. Collins, “Head-driven statistical models for natural language parsing,” Ph.D. thesis, University of Pennsylvania, 1999.\n[3] K. Lari and S. J. Young, “The estimation of stochastic context-free grammars using the inside-outside algorithm,” Computer Speech and Language, 1990.\n[4] M. Shilman, P. Liang, and P. A. Viola, “Learning non-generative grammatical models for document analysis,” in Proceedings of IEEE International Conference on Computer Vision, 2005, pp. 962–969.\n[5] Z. Tu and S. C. Zhu, “Image segmentation by data-driven markov chain monte carlo,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 657–673, 2002.\n[6] Z. Tu, X. Chen, A. L. Yuille, and S. C. Zhu, “Image parsing: Unifying segmentation, detection, and recognition,” in Proceedings of IEEE International Conference on Computer Vision, 2003, pp. 18–25.\n[7] J. D. Lafferty, A. McCallum, and F. C. N. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in Proceedings of International Conference on Machine Learning, 2001, pp. 282–289.\n[8] J. Shotton, J. M. Winn, C. Rother, and A. Criminisi, “TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation,” in Proceedings of European Conference on Computer Vision, 2006, pp. 1–15.\n[9] M. Collins, “Discriminative training methods for hidden markov models: theory and experiments with perceptron algorithms,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2002, pp. 1–8.\n[10] X. He, R. S. Zemel, and M. Á. Carreira-Perpiñán, “Multiscale conditional random fields for image labeling,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004, pp. 695–702.\n[11] S. Kumar and M. Hebert, “A hierarchical field framework for unified context-based classification,” in Proceedings of IEEE International Conference on Computer Vision, 2005, pp. 1284–1291.\n[12] E. L. Allwein, R. E. Schapire, and Y. Singer, “Reducing multiclass to binary: A unifying approach for margin classifiers,” Journal of Machine Learning Research, vol. 1, pp. 113–141, 2000.\n[13] Y. Boykov and M.-P. Jolly, “Interactive graph cuts for optimal boundary and region segmentation of objects in n-d images,” in Proceedings of IEEE International Conference on Computer Vision, 2001, pp. 105–112.\n[14] A. Oliva and A. Torralba, “Building the gist of a scene: the role of global image features in recognition,” Progress in Brain Research, vol. 155, pp. 23–36, 2006.\n[15] A. Levin and Y. Weiss, “Learning to combine bottom-up and top-down segmentation,” in Proceedings of European Conference on Computer Vision, 2006, pp. 581–594.\n[16] E. B. Sudderth, A. B. Torralba, W. T. Freeman, and A. S. Willsky, “Learning hierarchical models of scenes, objects, and parts,” in Proceedings of IEEE International Conference on Computer Vision, 2005, pp. 1331–1338.\n[17] Y. Chen, L. Zhu, C. Lin, A. L. Yuille, and H. Zhang, “Rapid inference on a novel and/or graph for object detection, segmentation and parsing,” in Advances in Neural Information Processing Systems, 2007.\n[18] B. Taskar, D. Klein, M. Collins, D. Koller, and C. Manning, “Max-margin parsing,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2004.\n[19] L. Zhu, Y. Chen, X. Ye, and A. L. Yuille, “Structure-perceptron learning of a hierarchical log-linear model,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008.\n[20] J. Verbeek and B. Triggs, “Region classification with markov field aspect models,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2007.\n[21] Z. Tu, “Auto-context and its application to high-level vision tasks,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008.\n[22] J. Verbeek and B. 
Triggs, \u201cScene segmentation with crfs learned from partially labeled images,\u201d in Advances in Neural Information\n\nProcessing Systems, vol. 20, 2008.\n\n8\n\n\f", "award": [], "sourceid": 3450, "authors": [{"given_name": "Leo", "family_name": "Zhu", "institution": null}, {"given_name": "Yuanhao", "family_name": "Chen", "institution": null}, {"given_name": "Yuan", "family_name": "Lin", "institution": null}, {"given_name": "Chenxi", "family_name": "Lin", "institution": null}, {"given_name": "Alan", "family_name": "Yuille", "institution": null}]}