{"title": "Submodular Field Grammars: Representation, Inference, and Application to Image Parsing", "book": "Advances in Neural Information Processing Systems", "page_first": 4307, "page_last": 4317, "abstract": "Natural scenes contain many layers of part-subpart structure, and distributions over them are thus naturally represented by stochastic image grammars, with one production per decomposition of a part. Unfortunately, in contrast to language grammars, where the number of possible split points for a production $A \\rightarrow BC$ is linear in the length of $A$, in an image there are an exponential number of ways to split a region into subregions. This makes parsing intractable and requires image grammars to be severely restricted in practice, for example by allowing only rectangular regions. In this paper, we address this problem by associating with each production a submodular Markov random field whose labels are the subparts and whose labeling segments the current object into these subparts. We call the result a submodular field grammar (SFG). Finding the MAP split of a region into subregions is now tractable, and by exploiting this we develop an efficient approximate algorithm for MAP parsing of images with SFGs. Empirically, we present promising improvements in accuracy when using SFGs for scene understanding, and show exponential improvements in inference time compared to traditional methods, while returning comparable minima.", "full_text": "Submodular Field Grammars: Representation,\nInference, and Application to Image Parsing\n\nAbram L. Friesen and Pedro Domingos\n\nPaul G. Allen School of Computer Science and Engineering\n\nUniversity of Washington\n\nSeattle, WA 98195\n\n{afriesen,pedrod}@cs.washington.edu\n\nAbstract\n\nNatural scenes contain many layers of part-subpart structure, and distributions\nover them are thus naturally represented by stochastic image grammars, with one\nproduction per decomposition of a part. 
Unfortunately, in contrast to language grammars, where the number of possible split points for a production A → BC is linear in the length of A, in an image there are an exponential number of ways to split a region into subregions. This makes parsing intractable and requires image grammars to be severely restricted in practice, for example by allowing only rectangular regions. In this paper, we address this problem by associating with each production a submodular Markov random field whose labels are the subparts and whose labeling segments the current object into these subparts. We call the resulting model a submodular field grammar (SFG). Finding the MAP split of a region into subregions is now tractable, and by exploiting this we develop an efficient approximate algorithm for MAP parsing of images with SFGs. Empirically, we show promising improvements in accuracy when using SFGs for scene understanding, and demonstrate exponential improvements in inference time compared to traditional methods, while returning comparable minima.

1 Introduction

Understanding natural scenes is a challenging problem that requires simultaneously detecting, segmenting, and recognizing each object in a scene despite noise, distractors, and ambiguity. Fortunately, natural scenes possess inherent structure in the form of contextual and part-subpart relationships between objects. Such relationships are well modeled by a grammar, which defines a set of production rules that specify the decomposition of objects into their parts. Natural language is the most common application of such grammars, but the compositional structure of natural scenes makes stochastic image grammars a natural candidate for representing distributions over images (see Zhu and Mumford [1] for a review). 
Importantly, natural language can be parsed efficiently with respect to a grammar because the number of possible split points for each production A → BC is linear in the length of the constituent corresponding to A. However, images cannot be parsed efficiently in this way because there are an exponential number of ways to split an image into arbitrarily-shaped subregions. As such, previous image-grammar approaches could only ensure tractability by severely limiting the possible decompositions of each region either explicitly, for example by allowing only rectangular regions, or by sampling (e.g., Poon and Domingos [2], Zhao and Zhu [3]).

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Due to these limitations, many approaches to scene understanding instead use a Markov random field (MRF) to define a probabilistic model over pixel labels (e.g., Shotton et al. [4], Gould et al. [5]), thereby capturing some natural structure while still permitting objects to have arbitrary shapes. Most such MRFs use planar- or tree-structured graphs in the label space [6, 7]. While these models can improve labeling accuracy, their restricted structures mean that they can capture little of the compositional structure present in natural images without an exponential number of labels. Inference in MRFs is intractable in general [8] but is tractable under certain restrictions. For pairwise binary MRFs, if the energy is submodular [9], meaning that each pair of neighboring pixels prefers to have the same label – a natural assumption for images – then the exact MAP labeling of the MRF can be efficiently recovered with a graph-cut algorithm [10–12]. 
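To make the binary submodularity condition concrete, the following toy sketch (a hypothetical 2×2 grid with Potts-style pairwise terms; the grid, energies, and values are illustrative, not from the paper) verifies the condition θ(Y1,Y1) + θ(Y2,Y2) ≤ θ(Y1,Y2) + θ(Y2,Y1) on every edge and recovers the MAP labeling by brute force. On a submodular energy like this one, a graph cut returns the same labeling without enumeration:

```python
import itertools

# Toy 4-pixel MRF on a 2x2 grid with binary labels {0, 1}.
# unary[p][l] is the cost of giving pixel p label l (hypothetical values).
unary = {0: (0.0, 2.0), 1: (0.0, 1.5), 2: (1.8, 0.0), 3: (2.2, 0.0)}
edges = [(0, 1), (2, 3), (0, 2), (1, 3)]  # 4-connected grid
lam = 1.0  # Potts penalty for disagreeing neighbors

def pairwise(a, b):
    return 0.0 if a == b else lam

# Verify the binary submodularity condition on every edge:
# theta(0,0) + theta(1,1) <= theta(0,1) + theta(1,0).
for (p, q) in edges:
    assert pairwise(0, 0) + pairwise(1, 1) <= pairwise(0, 1) + pairwise(1, 0)

def energy(y):
    e = sum(unary[p][y[p]] for p in unary)
    e += sum(pairwise(y[p], y[q]) for (p, q) in edges)
    return e

# Brute-force MAP over all 2^4 labelings; with a submodular energy a single
# graph cut recovers this exact labeling in low-order polynomial time.
y_map = min(itertools.product((0, 1), repeat=4), key=energy)
```

Here the unaries pull the top row toward label 0 and the bottom row toward label 1, and the Potts terms keep each region contiguous, illustrating the "neighbors prefer the same label" assumption.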
For multi-label problems, a constant-factor approximation can be found efficiently using a move-making algorithm, such as α-expansion [13].

In this work, we define a powerful new class of tractable models that combines the tractability and region-shape flexibility afforded by submodular MRFs with the high-level compositional structure of an image grammar. We associate with each production A → BC a submodular MRF whose labels are the subconstituents (i.e., B, C) of that production. We call the resulting model a submodular field grammar (SFG). Finding the MAP labeling to split a region into arbitrarily-shaped subregions is now tractable and we exploit this to develop an efficient approximate algorithm for MAP parsing of images with SFGs. Our algorithm, SFG-PARSE, is an iterative move-making algorithm that provably converges to a local minimum of the energy and reduces to α-expansion for trivial grammars. Like other move-making algorithms, each step of SFG-PARSE chooses the best move from an exponentially large set of neighbors, thus overcoming many of the main issues with local minima [13]. Empirically, we compare SFG-PARSE to belief propagation and α-expansion. We show that SFG-PARSE parses images in exponentially less time than both of these while returning comparable minima. Using deep convolutional neural network features as inputs, we investigate the modeling capability of SFGs. We show promising improvements in semantic segmentation accuracy when using SFGs in place of standard MRFs and when compared to the neural network features on their own.

Like SFGs, associative hierarchical MRFs [14, 15] also define multi-level MRFs, but use precomputed segmentations to set the regions of the non-terminal variables and thus do not permit arbitrary image regions. 
Neural parsing methods [16, 17] are grammar-like models for scene understanding, but use precomputed superpixels and thus also do not permit arbitrary region shapes. Most relevant is the work of Kumar and Koller [6] and Delong et al. [7], who define tree-structured submodular cost functions and use iterative fusion-style graph-cut algorithms for inference, much like SFG-PARSE. SFGs can be seen as an extension of these works that interprets the labelings at each level as productions in a grammar and permits multiple different productions of each symbol, thus defining a directed-acyclic-graph (DAG) cost function. This allows SFGs to be exponentially more expressive than these models with only a low-order polynomial increase in inference complexity. In the simple case of a tree-structured grammar (i.e., a non-recursive grammar in which each symbol only appears in the body of at most one production), SFGs and SFG-PARSE reduce to these existing approaches, albeit without the label costs of Delong et al. [7]; however, it should be possible to extend SFGs in a similar manner.

In order to clearly describe and motivate SFGs, we present them here in the context of image parsing. However, SFGs are a general and flexible model class that is applicable anywhere grammars or MRFs are used, including social network modeling and probabilistic knowledge bases.

2 Preliminaries

2.1 Submodular MRFs

A Markov random field (MRF) for scene understanding defines a probabilistic model p(y, I) = (1/Z) exp(−E(y, I)) over labeling y ∈ Y^n and image I, where n = |I| is the number of pixels, Z = Σ_{y′∈Y^n} exp(−E(y′, I)) is the partition function, and Y is the set of labels, which encode semantic classes such as Sky or Ground. MRFs for computer vision typically use pairwise energies E(y, I) = Σ_{p∈I} θ_p(y_p, o_p) + Σ_{(p,q)∈I} θ_pq(y_p, y_q), where y = (y_0, . . . , y_n) is a vector of labels; o_p is the intensity value of pixel p; θ_p and θ_pq are the unary and pairwise energy terms for pixels p and edges (p, q), respectively; and, with a slight abuse of notation, we say that I contains both the nodes and edges in the MRF over the image. For binary labels Y = {Y_1, Y_2}, an MRF is submodular if its energy satisfies θ_pq(Y_1, Y_1) + θ_pq(Y_2, Y_2) ≤ θ_pq(Y_1, Y_2) + θ_pq(Y_2, Y_1) for all edges (p, q) ∈ I. If the energy is submodular, the MAP labeling y* = arg max_{y∈Y^n} p(y, I) can be computed exactly with a single graph cut in time c(n), where c(n) is worst-case low-order polynomial (the true complexity depends on the chosen min-cut/max-flow algorithm), but nearly linear time in practice [12, 13]. Thus, submodularity reduces the complexity of an optimization over 2^n states to nearly-linear time. While submodularity is useful for MAP inference, it also captures the fact that neighboring pixels in natural images tend to have the same label (e.g., Sky pixels appear next to other Sky pixels), which means that the MAP labeling in general partitions the image into contiguous regions of each label.

2.2 Image grammars

A context-free grammar (CFG) is a tuple G = (N, Σ, R, S) containing a finite set of nonterminal symbols N; a finite set of terminal symbols Σ; a finite set of productions R = {v : X → Y_1 . . . Y_k} with head symbol X ∈ N and subconstituent symbols Y_i ∈ N ∪ Σ for i = 1 . . . k; and a special start symbol S ∈ N that does not appear on the right-hand side of any production. 
For scene understanding, a grammar for outdoor scenes might contain a production S → Sky Ground, which would partition the image into Sky and Ground subregions.

To extend CFGs to images, we introduce the notion of a region R ⊆ I, which specifies a subset of the pixels and can have arbitrary shape. A parse (tree) t ∈ T_G(I) of image I with respect to grammar G is a tree of nodes n = (v, R), each containing a production v ∈ R and a corresponding image region R ⊆ I, where T_G(I) is the set of valid parse trees for I under G, which we will write as T to simplify notation. For each node n = (v, R) in a parse tree, the regions of its children {c_i = (v_i, R_i) : c_i ∈ ch(n)} partition (segment) their parent's region such that R = ∪_i R_i and ∩_i R_i = ∅. If we let v = X → Y_1 . . . Y_k, then this partition is equivalently defined by a labeling y^v ∈ Y_v^{|R|} where Y_v = {Y_1, . . . , Y_k}, as there is a one-to-one correspondence between labelings and partitions of R. Given a labeling for a production, the region of a subconstituent is simply the subset of pixels labeled as that subconstituent: R_i = {p : y^v_p = Y_i} for any i ∈ {1, . . . , k}.

A stochastic image grammar defines a generative probabilistic model of images by associating with each nonterminal a categorical distribution over the productions of that nonterminal. The generative process samples a production of the current nonterminal from this distribution, starting with the start symbol S with the entire image as its region, and then partitions the current region into disjoint subregions – one for each subconstituent of the production. This process then recurses on each subconstituent-subregion pair, and terminates when a terminal symbol is produced, at which point the pixels for that region are generated. 
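This generative process can be sketched as follows. The grammar, symbol names, and region sizes here are hypothetical, and the partition step is a uniformly random stand-in; in the model described below, the partition of a region among subconstituents would instead be drawn from a distribution associated with the production:

```python
import random

# Hypothetical toy grammar: each nonterminal maps to a list of
# (probability, subconstituents) pairs; symbols absent from the dict
# are terminals. Probabilities for each symbol sum to 1.
GRAMMAR = {
    "S":      [(1.0, ("Sky", "Ground"))],
    "Sky":    [(0.7, ("Cloud", "Blue")), (0.3, ("Blue",))],
    "Ground": [(1.0, ("Grass", "Road"))],
}

def sample_parse(symbol, region, rng):
    """Recursively sample a parse: choose a production from the symbol's
    categorical distribution, partition the region among its
    subconstituents, and recurse until terminals are reached."""
    if symbol not in GRAMMAR:             # terminal: emit pixels for region
        return (symbol, sorted(region))
    r = rng.random()                      # sample a production
    for prob, subs in GRAMMAR[symbol]:
        r -= prob
        if r <= 0:
            break
    # Stand-in for the labeling step: split the region uniformly at random
    # into disjoint parts, one per subconstituent.
    shuffled = list(region)
    rng.shuffle(shuffled)
    k = len(subs)
    parts = [shuffled[i::k] for i in range(k)]
    return (symbol, [sample_parse(s, set(p), rng) for s, p in zip(subs, parts)])

tree = sample_parse("S", set(range(16)), random.Random(0))
```

The recursion terminates because the toy grammar is non-recursive, and the sampled parts are disjoint and cover the parent region, mirroring the partition constraint above.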
Formally, the probability of a parse t ∈ T of an image is p(t, I) = ∏_{(v,R)∈t} p(v | head(v)) · p(y^v | v, R), where p(y^v | v, R) specifies the probability of each labeling y^v ∈ Y_v^{|R|} (i.e., partition) of R. Note that the above distribution over productions is the same categorical distribution as that used in PCFGs for natural language [18], but the distribution over segmentations is assumed to be uniform in PCFGs for natural language and is typically not made explicit. It is this latter distribution that causes representational challenges, as we must now specify a distribution p(y^v | v, R) for each production and for each of the 2^n possible image regions. We show how this can be achieved efficiently in the following section.

3 Submodular Field Grammars

As the main contribution of this work, we define (submodular) field grammars by combining the image grammars defined above with (submodular) MRFs. We do this by defining for each production v an associated MRF over the full image E_v(y^v, I) = Σ_{p∈I} θ^v_p(y^v_p) + Σ_{(p,q)∈I} θ^v_pq(y^v_p, y^v_q). A copy of this MRF is instantiated each time an instance (equivalently, a token, as this relates to the well-known type-token distinction) of X is parsed as v, in the same way that each instance of a symbol in a grammar uses the same categorical distribution to select productions. In particular, an instance of a symbol has an associated region R ⊆ I and the MRF instantiated for that instance is simply the subset of the full-image MRF containing all of the nodes in R and all of the edges between the nodes in R. The energy of this instance is E_v(y^v, R) = Σ_{p∈R} θ^v_p(y^v_p) + Σ_{(p,q)∈R} θ^v_pq(y^v_p, y^v_q). We thus write the labeling distribution as p(y^v | v, R) ∝ exp(−E_v(y^v, R)) and we write the energy of a parse tree (where each node contains production instances) as E(t, I) = Σ_{(v,R)∈t} (w_v + E_v(y^v, R)), where the weights {w_v} parameterize each symbol's categorical distribution over productions and the probability of a parse tree is p(t, I) ∝ exp(−E(t, I)). 
To simplify notation, we will omit v, I, and R when clear from context and sum over just v. We refer to this model as a field grammar G = (N, Σ, R, S, Θ) parameterized by Θ, which contains both the categorical weights and the MRF parameters. As in the image grammar formulation above, the pixels are generated when a terminal symbol is produced. Conversely, when parsing a given image, the unary terms {θ^v_p} can depend directly on the pixels of the image being parsed or on features of the image, as in a conditional random field. In our experiments, however, only the unary terms of the terminal symbols depend on the pixel values.

The MRFs in a field grammar can be parameterized arbitrarily but, in order to permit efficient MAP inference, we require that each term θ^v_pq satisfy the previously-stated binary submodularity condition for all edges (p, q) and all productions v : X → Y_1Y_2 once the grammar has been converted to one in which each production has only two subconstituents, which is always possible and in the worst case increases the grammar size quadratically [18]. Note that it is easy to extend this to the non-binary case by requiring that the pairwise terms satisfy the α-expansion or αβ-swap conditions [13], for example, but we focus on the binary case here for simplicity. We also require that θ^v_pq(y^v_p, y^v_q) ≥ θ^c_pq(y^c_p, y^c_q) for every production v ∈ R, for every production c that is a descendant of v in the grammar, and for all possible labelings (y^v_p, y^v_q, y^c_p, y^c_q) with y^v_p, y^v_q ∈ Y_v and y^c_p, y^c_q ∈ Y_c. This ensures that segmentations of higher-level productions are submodular relative to their descendants, and captures a natural property of composition in images: that objects have larger regions than their parts. This means that the ratio of boundary length to region area is smaller for a symbol relative to its descendants, and thus its pairwise terms should be stronger. A grammar that satisfies these conditions is a submodular field grammar (SFG). Figure 1 shows a partial example of a (submodular) field grammar applied to image parsing, demonstrating the interleaved choices of productions and labelings, and the subregion decompositions resulting from these choices.

3.1 Relationship to other models

Above, we defined an SFG as an image grammar with an MRF at each production. An SFG can be equivalently reformulated as a planar MRF with one label for each path in the grammar. The number of such paths is exponential in the height of the grammar. This reformulation can be seen as follows. A parse tree over a region R has energy E(t, R) = Σ_{v∈t} (w_v + E_v(y^v, R_v)). We can rewrite this as E(t, R) = w(t) + Σ_{p∈R} θ^t_p + Σ_{(p,q)∈R} θ^t_pq, where w(t) = Σ_{v∈t} w_v, 1[·] is the indicator function, θ^t_p = Σ_{v∈t} θ^v_p(y^v_p) · 1[p ∈ R_v], and θ^t_pq = Σ_{v∈t} θ^v_pq(y^v_p, y^v_q) · 1[(p, q) ∈ R_v]. This describes a flat MRF in which θ^t_p and θ^t_pq are the unary and pairwise terms. Inference in this flat MRF is not easier, and is likely harder, because it requires an exponentially-large set of labels and the hard constraints of the grammar must be enforced explicitly. However, this formulation will prove useful for our parsing algorithm.

Another key benefit of our grammar-based formulation is sub-parse reuse, which enables exponential reductions in inference complexity and better sample complexity. For example, consider reusing a Wheel symbol among many vehicle types. Instead of having to learn and perform inference for each Wheel symbol (once per vehicle type and per vehicle-parent type, etc.), only one Wheel need be learned and inference on it performed only once.

Beyond PCFGs and MRFs, SFGs also generalize sum-product networks (SPNs) [19, 2]. Details on this mapping are given in the supplement.¹ Figure 1 shows a partial mapping of an SFG to an SPN.

4 Inference

When trying to understand natural scenes, it is important to recognize and reason about the relationships between objects. These relationships can be identified by finding the MAP parse of an image with respect to a grammar that encodes them, such as a submodular field grammar. The flat semantic labels traditionally used in scene understanding can also be recovered from this parse if they are encoded in the grammar, e.g., as the terminal symbols. We exploit this ability in our experiments.

For natural language, the optimal parse of a PCFG can be recovered exactly in time cubic in the length of the sentence with the CYK algorithm [20], which uses dynamic programming to efficiently parse a sentence in a bottom-up pass through the grammar. 
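As a point of reference, the CYK dynamic program for a PCFG can be sketched as below. The grammar and sentence are hypothetical toys in Chomsky normal form, and `best` stores the highest probability of deriving each symbol over each span:

```python
from collections import defaultdict

# Hypothetical toy PCFG in Chomsky normal form.
# lexicon: word -> [(symbol, prob)]; binary_rules: (B, C) -> [(A, prob)] for A -> B C.
lexicon = {"the": [("Det", 1.0)], "dog": [("N", 1.0)], "barks": [("V", 1.0)]}
binary_rules = {("Det", "N"): [("NP", 1.0)], ("NP", "V"): [("S", 1.0)]}

def cyk(words):
    """Return the probability of the best parse of `words` as symbol S."""
    n = len(words)
    best = defaultdict(float)          # (i, j, symbol) -> best probability of span [i, j)
    for i, w in enumerate(words):      # length-1 spans from the lexicon
        for sym, p in lexicon[w]:
            best[(i, i + 1, sym)] = p
    for length in range(2, n + 1):     # longer spans, bottom up
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):  # only a linear number of split points
                for (B, C), heads in binary_rules.items():
                    pb, pc = best[(i, k, B)], best[(k, j, C)]
                    if pb > 0 and pc > 0:
                        for A, p in heads:
                            cand = p * pb * pc
                            if cand > best[(i, j, A)]:
                                best[(i, j, A)] = cand
    return best[(0, n, "S")]
```

The triple loop over spans and split points is the source of the cubic cost; the infeasibility of enumerating arbitrary image subregions is exactly what breaks this scheme for images.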
This is possible because each sentence only has a linear number of split points, meaning that all sub-spans of the sentence can be efficiently represented and enumerated. The key operation in the CYK algorithm is to compute the optimal parse of a given span s (i.e., contiguous sub-sentence) as a given production v : X → Y Z by explicitly iterating over all split points i of that span, computing the probability of parsing s as v with split i, and choosing the split point with highest probability. The probability of parsing s as v with split i is defined recursively as the product of p(v|head(v)) and the respective probabilities of the optimal parses of the two sub-spans as Y and Z, respectively. CYK uses dynamic programming to cache the optimal parse of each sub-span as each symbol to avoid re-computing these unnecessarily.

Unfortunately, CYK applied to images is intractable because it is infeasible to enumerate all subregions of an image. Instead, we propose to construct (and cache) a parse of the entire image as each production and then use subregions of this parse to define the parse of each subregion, mirroring how distributions over subregions are defined in SFGs. We then exploit submodularity to find a locally optimal parse from an exponentially large set, without enumerating all such parses. Specifically, we optimally combine the parses of the subconstituents of a production to create a parse as that production. We refer to this procedure as fusion as it is analogous to the fusion moves of Lempitsky et al. [21].

Figure 1: A DAG representing some of the possible production and labeling choices when parsing an image with an SFG. Each sum node represents either a choice of production for a particular region or a choice of labeling for the MRF representing a particular production of a region. Product nodes denote the partition of a region as defined by its labeling, where an MRF node's color denotes its label. Red edges denote a partial parse tree for the image shown at the bottom. Best viewed in color.

Figure 2: The main components of SFG-PARSE: (a) Parsing a region R as X → Y Z by fusing a parse of R as Y → AB with a parse of R as Z → CD, and (b) Subsequently improving the parse of R as X → Y Z by independently (re)parsing each of its subregions and then fusing these new parses. See text for more detail.

¹Supplementary material is available at https://homes.cs.washington.edu/~pedrod/papers/neurips18sp.pdf.

4.1 Parse tree construction

Following Lempitsky et al. [21], let y^0, y^1 ∈ Y^n be two labelings of a submodular MRF with energy E(y, R) = Σ_p θ_p(y_p) + Σ_pq θ_pq(y_p, y_q) and let C(y^0, y^1) = {y^c} denote the set of combinations of y^0 and y^1. A labeling y^c = (y^c_0, . . . , y^c_n) is a combination of y^0 = (y^0_0, . . . , y^0_n) and y^1 = (y^1_0, . . . , y^1_n) iff each label in y^c is taken either from y^0 or y^1 such that y^c_p ∈ {y^0_p, y^1_p} for all pixels p = 1 . . . n. The fusion y* of y^0 and y^1 is then defined as the minimum energy combination y* = arg min_{y∈C(y^0,y^1)} E(y, R). Under certain conditions on E, fusion is a submodular minimization.

Recall that each parse tree t equivalently corresponds to a particular labeling of a planar MRF with one label per path in the grammar. With a slight abuse of notation, we use t to represent both the full parse tree and the corresponding planar MRF labeling. 
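A minimal sketch of fusion as the minimum-energy combination of two labelings, on a tiny hypothetical chain MRF (values are illustrative). For clarity it enumerates C(y0, y1) exhaustively, which is exponential; with a submodular energy a single graph cut finds the same minimizer in low-order polynomial time:

```python
import itertools

# Two candidate labelings of a 4-pixel chain MRF (hypothetical values);
# the fusion takes each pixel's label from either y0 or y1.
y0 = [0, 0, 1, 1]
y1 = [1, 1, 1, 0]
unary = [(0.0, 1.0), (0.5, 0.2), (1.0, 0.0), (0.3, 0.6)]  # unary[p][label]
edges = [(0, 1), (1, 2), (2, 3)]

def energy(y, lam=0.4):
    e = sum(unary[p][y[p]] for p in range(len(y)))
    e += sum(0.0 if y[p] == y[q] else lam for p, q in edges)  # Potts terms
    return e

# Enumerate C(y0, y1): every per-pixel choice between y0's and y1's label.
combos = [tuple(y0[p] if pick == 0 else y1[p] for p, pick in enumerate(choice))
          for choice in itertools.product((0, 1), repeat=len(y0))]
y_fused = min(combos, key=energy)
```

Note that the fused labeling can have strictly lower energy than both inputs, which is what lets a move-making algorithm built on fusion monotonically improve its current solution.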
Let v : X → Y_1Y_2 be a production and t_1, t_2 be parses of some region R ⊆ I as productions u_1 : Y_1 → Z_1Z_2 and u_2 : Y_2 → Z_3Z_4, respectively.

Definition 1. For production v : X → Y_1Y_2 and parse trees t_1, t_2 over region R with head symbols Y_1, Y_2, the fusion of t_1 and t_2 as v is the minimum energy parse tree t_v = arg min_{t∈C(t_1,t_2)} E(t, R) constructed from the combination of t_1 and t_2, with (v, R) appended as root.

Because t_1 and t_2 are MRF labelings, we can fuse them to create a new parse tree t_v in which each pixel in R is labeled with the same path that it had in either t_1 or t_2. When we do this, we prepend v to each pixel's label, which is equivalent to adding (v, R) as the new root node of t_v. Figure 2a shows an example of fusing two parse trees to create a new parse tree.

Proposition 1. The fusion of two parse trees is a submodular minimization.

Although fusion requires finding the optimal labeling from an exponentially large set, two parse trees can be fused with a single graph cut by exploiting submodularity. Proofs are given in the supplement.

Finally, we define the union of two parse trees t = t_1 ∪ t_2 that have the same productions but are over disjoint regions (i.e., R_1 ∩ R_2 = ∅) as the parse tree t in which the region of each node in t is the union of the regions of the corresponding nodes in t_1 and t_2.

4.2 SFG-Parse

Pseudocode for our parsing algorithm, SFG-PARSE, is presented in Algorithm 1. SFG-PARSE is an iterative move-making algorithm that efficiently and provably converges to a local minimum of the energy function. Currently, SFG-PARSE applies only to non-recursive grammars, but we believe it will be straightforward to extend it to recursive ones.

Figure 3: One iteration of SFG-PARSE applied to the image shown on the right with respect to the simple grammar on the left. Proceeding from bottom to top, SFG-PARSE first parses the image as each of the terminal symbols (i.e., each pixel in the image is labeled as that terminal symbol), and then fuses these to create parses of the image as symbols B, C, and D. These parses are then fused in turn to create parses of the image as A and finally S. The final full parse tree returned is the parse of S.

To parse an image with respect to a given non-recursive grammar, SFG-PARSE starts at the terminal symbols and moves backwards through the productions towards the start symbol (line 9), constructing and caching a parse of the full image as each production. The parse for each production is constructed by fusing the cached parses of that production's subconstituents (lines 13 and 14). An example of this procedure is shown in Figure 3, where the parses of symbols S, A, B, C, and D are constructed by fusing the parses of the subconstituents of their respective productions. 
For simplicity, the grammar in Figure 3 only contains a single production for each symbol, and no symbol is a subconstituent of multiple productions; in general, however, most symbols in the grammar appear on both the left- and right-hand sides of multiple productions. To accommodate this, SFG-PARSE maintains multiple instances (a.k.a. tokens) of each symbol and chooses the appropriate production and instance during parsing. This is discussed in more detail below. Subsequent iterations of SFG-PARSE simply repeat this bottom-up procedure while ensuring that for each (re)parse of a production, the previous iteration's parse of that production (or a guaranteed lower energy alternative) can be constructed via fusion. This guarantees convergence of SFG-PARSE.

In CYK, each span of a sentence is explicitly parsed as each production, making it straightforward to have multiple instances of a symbol. However, since there are an exponential number of subregions of an image, SFG-PARSE instead constructs a parse of the entire image for each production and reuses that parse for each of these subregions. To ensure consistency of this parse, if only one parse were allowed per production then each instance would have to be parsed with the exact same set of productions, a severe restriction on the expressivity of the model. To avoid this, SFG-PARSE permits multiple instances of a symbol X, one per unique path from the root to a production of X in t̂, where t̂ is the best parse of S from the previous iteration. This allows the number of instances of each symbol to grow or shrink at each iteration according to the parse constructed in the previous iteration. Processing of instances and their corresponding regions occurs on lines 5-6. 
For each instance x of symbol X in t̂ for a production v : X → Y Z, SFG-PARSE records pointers to x's child instances y and z, which are later (line 12) used to determine which instances of Y and Z to fuse when parsing v. In the common scenario that a symbol has no instances – either because it doesn't appear in t̂ or because t̂ was not provided – then that symbol is assigned the region containing the entire image as an instance (line 7), which serves as a powerful initialization method. If a symbol has no instances, then it did not appear in t̂ and its parse can be constructed by fusing any instances of its production's subconstituents without affecting convergence. In the rare case that a symbol has multiple instances, one can be chosen either by estimating a bound on the energy or even randomly (line 12).

Algorithm 1 Compute the (approximate) MAP parse of an image with respect to an SFG.
Input: The image I, a non-recursive SFG G = (N, Σ, R, S, Θ), and an (optional) input parse t̂.
Output: A parse of the image, t*, with energy E(t*, I) ≤ E(t̂, I).
1: function SFG-PARSE(I, G, t̂)
2:   for each terminal T ∈ Σ do t_RT ← the trivial parse with all pixels parsed as T
3:   while the energy of any production of the start symbol S has not converged do
4:     // record the instances (i.e., regions) of each symbol in t̂ and initialize instance-less symbols
5:     for each node in t̂ with production u : X → Y Z, region R_X, and subregions R_Y, R_Z do
6:       append R_Y, R_Z to region lists R[Y], R[Z] and set as the child regions of u for R_X
7:     for each symbol X ∈ N with no regions in R[X] do append R_X = I to R[X]
8:     // perform the upward pass to parse with the SFG at this iteration
9:     for each symbol X ∈ N, in reverse topological order do
10:      for each production v : X → Y Z of symbol X do
11:        for each region R_X in region list R[X] do // each region is an instance (token) of X
12:          R_Y, R_Z ← the child regions of v for R_X if they exist, else choose heuristically
13:          t_v, e_v ← fuse t_RY and t_RZ as production v over region R_X
14:          t̄_v, ē_v ← fuse t_RY and t_RZ as production v over region R̄_X = I \ R_X given t_v
15:          t_RX, e_RX ← the full parse t_v ∪ t̄_v with lowest energy // choose best parse of R_X
16:      t̂, ê ← t_RS, e_RS // S only ever has a single region, which contains all of the pixels
17:   return t̂, ê

If a symbol X does have an instance in t̂, SFG-PARSE first parses only that instance's region R_X into tree t_v (line 13) and then parses the remainder of the image R̄_X as v given the partial parse t_v (line 14). The union of these gives a full parse of the entire image as v for this instance. Parsing an instance in two parts is necessary to ensure that SFG-PARSE never returns a worse parse. Figure 2b shows an inefficient version of the process for re-parsing an instance of X, where first the subregions labeled as Y and Z are re-parsed (steps 1-2), then the remaining pixels are re-parsed given the other parses (steps 3-4), and finally the unions of these parses are fused to get a parse of the region as X (step 5). For efficiency reasons, SFG-PARSE does not actually reparse Y and Z for each production that produces them; instead, their parses are cached and re-used. We define parsing a region R̄_X given a parse t_v of another region R_X to mean that each pairwise term with a pixel in each region already has the label of the pixel in R_X set to its value in t_v (i.e., like conditioning in a probabilistic model). Finally, the parse of the production u with the lowest energy over R_X is then chosen as its parse (line 15). 
At the end of the upward pass, the parse of the full image t̂ is simply the parse of the start symbol's region, which always contains all pixels (line 16).
4.3 Analysis
In this section, we analyze the convergence and computational complexity of SFG-PARSE.
Theorem 1. Given a parse t̂ of S over the entire image with energy E(t̂), each iteration of SFG-PARSE constructs a parse t of S over the entire image with energy E(t) ≤ E(t̂) and, since the minimum energy of an image parse is finite, SFG-PARSE will always converge.

Theorem 1 shows that SFG-PARSE always converges to a local minimum of the energy function. Like other move-making algorithms, SFG-PARSE explores an exponentially large set of moves at each step, so the returned local minimum is generally much better than those returned by more local procedures [13]. Further, we typically observe convergence in fewer than ten iterations, with the majority of the energy improvement occurring in the first iteration.
Proposition 2. Let c(n) be the time complexity of computing a graph cut on n pixels and |G| be the size of the grammar defining the SFG. Then each iteration of SFG-PARSE takes time O(|G| c(n) n).

Proposition 2 shows that each iteration of SFG-PARSE has complexity O(|G| c(n) n), where n is the number of pixels and c(n) is the complexity of the graph-cut algorithm, which is low-order polynomial in n in the worst case but nearly linear-time in practice [12, 13]. The additional factor of n is due to the number of regions (i.e., instances) of each symbol, which in the worst case is O(n) but in practice is almost always a small constant (often one).
Thus, SFG-PARSE typically runs in time O(|G| c(n)). Note that directly applying α-expansion to parsing an SFG requires optimizing an MRF with one label for each path in the grammar, which would take time exponential in the height of the grammar.
SFG-PARSE can be extended to productions with more than two subconstituents by replacing the internal graph cut used to fuse subtrees with a multi-label algorithm such as α-expansion. SFG-PARSE would still converge because each subtree would still never increase in energy. Alternatively, an algorithm such as QPBO [22] could be used, which would obviate the submodularity requirement.
5 Experiments
We evaluated our model and inference algorithm in two experiments, both using unary features from DeepLab [23, 24], a state-of-the-art convolutional semantic segmentation network. First, to evaluate the performance of SFG-PARSE, we programmatically generated SFGs and compared the runtime of, and minimum energy returned by, SFG-PARSE to those of α-expansion and max-product belief propagation (BP), two standard MRF inference algorithms. Second, to evaluate SFGs as a model of natural scenes, we segmented images at multiple levels of granularity, used these segmentations to generate SFGs over the DeepLab features (in place of the raw pixel intensities), and compared the segmentation accuracy resulting from parsing the generated SFGs using SFG-PARSE to that of using (a) the DeepLab features alone and (b) a planar submodular MRF on the DeepLab features.
The DeepLab features are trained on the Stanford Background Dataset (SBD) [5] training set. Evaluations are performed on images from the SBD test set.
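The exponential blowup of the flat encoding is easy to quantify on a toy grammar family. The sketch below is an illustrative construction of our own, not the grammars used in the experiments: if every nonterminal has k binary productions, then a root-to-terminal path picks one production and one of its two subconstituents at each of h levels, so the flat MRF needs one label for each of (2k)^h paths.

```python
def flat_label_count(height, prods_per_nonterminal):
    """Labels needed by a flat pairwise-MRF encoding of a toy grammar in which
    every nonterminal has `prods_per_nonterminal` binary productions: each of
    the `height` levels chooses one production and one of its two
    subconstituents, giving one label per root-to-terminal path."""
    return (2 * prods_per_nonterminal) ** height

# SFG-PARSE instead handles one production at a time, so its work grows
# linearly with grammar size rather than with the number of paths.
counts = [flat_label_count(h, 3) for h in range(1, 7)]
```

For k = 3 the label count multiplies by 6 with every extra level of height, which is why the flat α-expansion baseline runs out of time or memory on taller grammars.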
In all MRFs (planar and in each SFG), the pairwise terms are standard contrast-dependent boundary terms [25] multiplied by a single weight, wBF.
5.1 Inference evaluation
To evaluate the performance of SFG-PARSE, we programmatically generated SFGs while varying their height, number of productions per nonterminal (#prods), and the strength of the pairwise (boundary) terms. Each algorithm was evaluated using the same grammars, DeepLab features, and randomly selected images. We compared the performance of SFG-PARSE to that of running α-expansion on a flat pairwise MRF containing one label for each possible parse path in the grammar, and to that of running BP on a multi-level (3-D) pairwise MRF with the same height as the grammar. These are the natural comparisons, as existing hierarchical MRF algorithms do not support the DAG structure that makes SFGs so powerful. Details of these models and additional figures are provided in the supplement.
Increasing the boundary strength, grammar height, and #prods each make inference more challenging: individual pixels cannot be flipped easily with stronger boundary terms, while grammar height and #prods both determine the number of paths in the grammar. Figure 4a plots the average minimum energy of the parses found by each algorithm versus the boundary factor wBF (the x-axis is log scale), and Figure 4d plots inference time versus boundary factor. As shown, SFG-PARSE returns parses comparable to or better than those of both BP and α-expansion, and in less time. In Figure 4e, we set wBF to 20 and plot inference time versus grammar height; the corresponding energies are shown in Figure 4b. As expected, inference time for SFG-PARSE scales linearly with height, whereas it scales exponentially for both α-expansion and BP. Again, the energies and accuracies of the parses returned by SFG-PARSE are nearly identical to those of α-expansion.
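The contrast-dependent boundary terms can be sketched as follows. This is a minimal sketch assuming the common w · exp(−β‖f_i − f_j‖²) form used in [25]-style models; the exact variant and normalization used in our experiments may differ.

```python
import math

def boundary_weight(f_i, f_j, w_bf, beta=2.0):
    """Contrast-dependent pairwise weight between neighboring pixels with
    feature vectors f_i and f_j: strong where the features are similar (making
    a segmentation boundary there expensive) and weak across high-contrast
    edges. Uses the common w * exp(-beta * ||f_i - f_j||^2) form scaled by the
    single weight w_bf; an assumed variant, not the paper's exact formula."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(f_i, f_j))
    return w_bf * math.exp(-beta * sq_dist)

# Identical features pay the full penalty for a boundary; a strong edge pays
# almost nothing, so label changes gravitate toward image edges.
w_same = boundary_weight((0.4, 0.4, 0.4), (0.4, 0.4, 0.4), w_bf=20.0)
w_edge = boundary_weight((0.0, 0.0, 0.0), (1.0, 1.0, 1.0), w_bf=20.0)
```

Scaling the entire term by the single weight wBF is what the boundary-strength sweep in Figure 4a,d varies.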
Finally, we set wBF to 20 and plot inference time versus #prods in Figure 4f, and energy versus #prods in Figure 4c. Once again, SFG-PARSE returns parses equivalent to those of α-expansion and BP, but in much less time.
5.2 Model evaluation
To evaluate whether natural scenes exhibit the compositional part-subpart structure over arbitrarily-shaped regions that SFGs can capture but previous methods cannot, we generated grammars on SBD images where the semantic labels were the terminals. We then computed the mean pixel accuracy of the terminal labeling from the parse tree returned by SFG-PARSE.
Grammars were generated (not learned) as follows. We first over-segmented each of the 143 test images at 4 different levels of granularity and intersected the most fine-grained of these with the label regions. We created a unique grammar for each image by taking that image's over-segmentations and the over-segmentations of four other randomly chosen images and adding a symbol for each contiguous region in each segmentation. We then added productions between overlapping segments for each subsequent pair of granularity levels, both within each image and across images. Finally, we added terminal productions from the symbols in the most granular level, where each terminal production can produce only those labels that occur in its head symbol's corresponding segment (note that we similarly restricted the possible labels produced by the other models to ensure a fair comparison). On average, each induced grammar had 860 symbols and 1250 productions, with 5 subconstituents each.
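The overlap-based production induction described above can be sketched as follows. The segment representation ({name: set of pixel indices}) and the overlap threshold are illustrative choices of ours, not the paper's exact construction.

```python
def induce_productions(coarse_segments, fine_segments, min_overlap=1):
    """For each segment at the coarser granularity, record a production whose
    subconstituents are the finer-granularity segments it overlaps. Segments
    are given as {name: set of pixel indices}; this representation and the
    overlap threshold are illustrative, not the paper's exact construction."""
    productions = {}
    for cname, cpix in coarse_segments.items():
        children = sorted(fname for fname, fpix in fine_segments.items()
                          if len(cpix & fpix) >= min_overlap)
        if children:
            productions[cname] = children
    return productions

# Two coarse segments over six pixels, refined into four finer segments;
# segment G overlaps neither coarse segment and induces nothing.
coarse = {"A": {1, 2, 3, 4}, "B": {5, 6}}
fine = {"C": {1, 2}, "D": {3, 4}, "E": {5}, "F": {6}, "G": {9}}
prods = induce_productions(coarse, fine)
```

Running this for each consecutive pair of granularity levels, within and across images, yields the DAG of symbols and productions that the induced grammars are built from.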
The features output by DeepLab were used as the unaries in the MRFs of the terminal productions. All productions had uniform probability, and the same MRF parameters were used across all images. This ensured that any improvement in performance was due solely to the structure of the underlying grammar. Further details about the induced grammars are provided in the supplement.

Figure 4: The energy of the returned parse (a,b,c) and total running time (d,e,f) when evaluating MAP inference using belief propagation, α-expansion, and SFG-PARSE while varying (a,d) boundary strength, (b,e) grammar height, and (c,f) number of productions. In all figures, lower is better. Each data point is the average over the same 10 randomly-selected images. Missing data points for BP indicate that it returned an inconsistent parse with infinite energy. Missing data points for α-expansion indicate that it ran out of time or memory. Figures S1, S2, and S3 in the supplement show the mean pixel accuracies for each experiment.

Table 1: Mean pixel accuracy on 143 SBD test images.

    DeepLab    DeepLab+MRF    DeepLab+SFG
     87.77        87.93          90.03

After parsing each image with respect to its grammar, we computed the mean pixel accuracy of the terminal labeling of the parse. We compared this to the accuracy of the DeepLab features alone and to the accuracy of a standard flat submodular MRF over the DeepLab features, with pairwise terms set in the same way as in the SFGs. These results are shown in Table 1, which shows a 20% relative decrease in error for SFGs. This is quite remarkable given how well the DeepLab features do on their own and how little the flat MRF helps. While this does not constitute a full evaluation of SFGs for semantic segmentation, as we did not learn the SFGs, it provides evidence that SFGs are a compelling model class.
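The metric reported in Table 1 can be computed as in the minimal sketch below; averaging per image rather than pooling all pixels is our assumption here, as either convention is common.

```python
def mean_pixel_accuracy(predictions, ground_truths):
    """Fraction of correctly labeled pixels per image, averaged over images.
    Each prediction and ground truth is a flat sequence of per-pixel labels.
    Per-image averaging (rather than pooling all pixels across the test set)
    is an assumed convention, not necessarily the one used in the paper."""
    per_image = []
    for pred, gt in zip(predictions, ground_truths):
        correct = sum(p == g for p, g in zip(pred, gt))
        per_image.append(correct / len(gt))
    return sum(per_image) / len(per_image)

# Toy test set: one image with 3 of 4 pixels correct, one fully correct.
preds = [["sky", "sky", "tree", "tree"], ["grass", "grass"]]
gts = [["sky", "sky", "tree", "grass"], ["grass", "grass"]]
acc = mean_pixel_accuracy(preds, gts)
```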
In the supplement, we propose an approach for learning SFGs, but we leave its implementation and evaluation for future work as it requires the creation of new datasets of parsed images, which is outside the scope of this paper. Even without learning, however, this experiment demonstrates that natural scenes do exhibit high-level compositional structure and that SFGs are able to efficiently exploit this structure to improve scene understanding and image parsing.
6 Conclusion
This paper proposed submodular field grammars (SFGs), a novel stochastic image grammar formulation that combines the expressivity of image grammars with the efficient combinatorial optimization capabilities of submodular MRFs. SFGs are the first image grammars to enable efficient parsing of objects with arbitrary region shapes. To achieve this, we presented SFG-PARSE, a move-making algorithm that exploits submodularity to find the (approximate) MAP parse of an SFG. Analytically, we showed that SFG-PARSE is both convergent and fast. Empirically, we showed (i) that SFG-PARSE achieves accuracies and energies comparable to α-expansion – which returns optima within a constant factor of the global optimum – while taking exponentially less time to do so, and (ii) that SFGs are able to represent the compositional structure of images to better parse and understand natural scenes. In future work, we plan to focus on learning the parameters and structure of SFGs, as we believe that their unique combination of tractability and expressivity will lead to better understanding of natural scenes.
We also plan to apply SFGs to other domains, such as activity recognition, social network modeling, and probabilistic knowledge bases.

Acknowledgements
AF would like to thank Robert Gens, Rahul Kidambi, and Gena Barnabee for useful discussions, insights, and assistance with this document. The DGX-1 used for this research was donated by NVIDIA. This research was partly funded by ONR grant N00014-16-1-2697 and AFRL contract FA8750-13-2-0019. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ONR, AFRL, or the United States Government.

References
[1] Song-Chun Zhu and David Mumford. A stochastic grammar of images. Foundations and Trends in Computer Graphics and Vision, 2(4):259–362, 2006.
[2] Hoifung Poon and Pedro Domingos. Sum-product networks: A new deep architecture. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, pages 337–346. AUAI Press, 2011.
[3] Yibiao Zhao and Song-Chun Zhu. Image parsing via stochastic scene grammar. In Advances in Neural Information Processing Systems, 2011.
[4] Jamie Shotton, John Winn, Carsten Rother, and Antonio Criminisi. TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. International Journal of Computer Vision, 81(1):2–23, 2009.
[5] Stephen Gould, Richard Fulton, and Daphne Koller.
Decomposing a scene into geometric and semantically consistent regions. In Proceedings of the IEEE International Conference on Computer Vision, 2009.
[6] M. Pawan Kumar and Daphne Koller. MAP estimation of semi-metric MRFs via hierarchical graph cuts. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, pages 313–320, 2009.
[7] Andrew Delong, Lena Gorelick, Olga Veksler, and Yuri Boykov. Minimizing energies with hierarchical costs. International Journal of Computer Vision, 100(1):38–58, 2012.
[8] V. Chandrasekaran, N. Srebro, and P. Harsha. Complexity of inference in graphical models. In Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence, pages 70–78, 2008.
[9] Vladimir Kolmogorov and Ramin Zabih. What energy functions can be minimized via graph cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):147–159, 2004.
[10] P. L. Hammer. Some network flow problems solved with pseudo-Boolean programming. Operations Research, 13:388–399, 1965.
[11] D. M. Greig, B. T. Porteous, and A. H. Seheult. Exact maximum a posteriori estimation for binary images. Journal of the Royal Statistical Society, Series B (Methodological), 51(2):271–279, 1989.
[12] Yuri Boykov and Vladimir Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9):1124–1137, 2004.
[13] Yuri Boykov, Olga Veksler, and Ramin Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1222–1239, 2001.
[14] Chris Russell, Lubor Ladický, Pushmeet Kohli, and Philip H. S. Torr. Exact and approximate inference in associative hierarchical networks using graph cuts.
In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, 2010.
[15] Victor Lempitsky, Andrea Vedaldi, and Andrew Zisserman. A pylon model for semantic segmentation. In Advances in Neural Information Processing Systems, 2011.
[16] Richard Socher, Cliff C. Lin, Chris Manning, and Andrew Y. Ng. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning, pages 129–136, 2011.
[17] Abhishek Sharma, Oncel Tuzel, and Ming-Yu Liu. Recursive context propagation network for semantic scene labeling. In Advances in Neural Information Processing Systems, pages 2447–2455, 2014.
[18] Daniel S. Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2000.
[19] Robert Gens and Pedro Domingos. Learning the structure of sum-product networks. In Proceedings of the 30th International Conference on Machine Learning, pages 873–880. Omnipress, 2013.
[20] John Hopcroft and Jeffrey Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading, MA, 1979.
[21] Victor Lempitsky, Carsten Rother, Stefan Roth, and Andrew Blake. Fusion moves for Markov random field optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(8):1392–1405, 2010.
[22] Vladimir Kolmogorov and Carsten Rother. Minimizing nonsubmodular functions with graph cuts – a review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(7):1274–1279, 2007.
[23] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille.
Semantic image segmentation with deep convolutional nets and fully connected CRFs. In Proceedings of the International Conference on Learning Representations, 2015.
[24] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv preprint arXiv:1606.00915 [cs.CV], 2016.
[25] Jamie Shotton, John Winn, Carsten Rother, and Antonio Criminisi. TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), 2006.