{"title": "Message Passing Inference for Large Scale Graphical Models with High Order Potentials", "book": "Advances in Neural Information Processing Systems", "page_first": 1134, "page_last": 1142, "abstract": "To keep up with the Big Data challenge, parallelized algorithms based on dual decomposition have been proposed to perform inference in Markov random fields. Despite this parallelization, current algorithms struggle when the energy has high order terms and the graph is densely connected. In this paper we propose a partitioning strategy followed by a message passing algorithm which is able to exploit pre-computations. It only updates the high-order factors when passing messages across machines. We demonstrate the effectiveness of our approach on the task of joint layout and semantic segmentation estimation from single images, and show that our approach is orders of magnitude faster than current methods.", "full_text": "Message Passing Inference for Large Scale Graphical\n\nModels with High Order Potentials\n\nJian Zhang\nETH Zurich\n\nAlexander G. Schwing\nUniversity of Toronto\n\nRaquel Urtasun\n\nUniversity of Toronto\n\njizhang@ethz.ch\n\naschwing@cs.toronto.edu\n\nurtasun@cs.toronto.edu\n\nAbstract\n\nTo keep up with the Big Data challenge, parallelized algorithms based on dual de-\ncomposition have been proposed to perform inference in Markov random \ufb01elds.\nDespite this parallelization, current algorithms struggle when the energy has high\norder terms and the graph is densely connected. In this paper we propose a parti-\ntioning strategy followed by a message passing algorithm which is able to exploit\npre-computations. It only updates the high-order factors when passing messages\nacross machines. 
We demonstrate the effectiveness of our approach on the task of joint layout and semantic segmentation estimation from single images, and show that our approach is orders of magnitude faster than current methods.

1 Introduction

Graphical models are a very useful tool to capture the dependencies between the variables of interest. In domains such as computer vision, natural language processing and computational biology they have been widely used to solve problems such as semantic segmentation [37], depth reconstruction [21], dependency parsing [4, 25] and protein folding [36].

Despite decades of research, finding the maximum a-posteriori (MAP) assignment or the minimum energy configuration remains an open problem, as it is NP-hard in general. Notable exceptions are specialized solvers such as graph-cuts [7, 3] and dynamic programming [19, 1], which retrieve the global optima for sub-modular energies and tree-shaped graphs. Algorithms based on message passing [18, 9], a series of graph-cut moves [16] or branch-and-bound techniques [5] are common choices to perform approximate inference in the more general case. A task closely related to MAP inference, but typically harder, is computing the probability of a given configuration. It requires computing the partition function, which is typically done via message passing [18], sampling, or by repeatedly using MAP inference to solve tasks perturbed via Gumbel distributions [8].

Of particular difficulty is the case where the involved potentials depend on more than two variables, i.e., they are high-order, or the graph is densely connected.
Several techniques have been developed to allow current algorithms to handle high-order potentials, but they are typically restricted to potentials of a specific form, e.g., a function of the cardinality [17] or piece-wise linear potentials [11, 10]. For densely connected graphs with Gaussian potentials, efficient inference methods based on filtering have been proposed [14, 33].

Alternating minimization approaches, which iterate between solving for subsets of variables, have also been studied [32, 38, 29]. However, most approaches lose their guarantees since related sub-problems are solved independently. Another method to improve computational efficiency is to divide the model into smaller tasks, which are solved in parallel using dual decomposition techniques [13, 20, 22]. Contrasting alternating minimization, convergence properties are ensured. However, these techniques are computationally expensive despite the division of computation, since global and dense interactions are still present.

In this work we show that for many graphical models it is possible to devise a partitioning strategy followed by a message passing algorithm such that efficiency can be improved significantly. In particular, our approach adds additional terms to the energy function (i.e., regions to the Hasse diagram) such that the high-order factors can be pre-computed and remain constant during local message passing within each machine. As a consequence, high-order factors are only accessed once before sending messages across machines. This contrasts tightening approaches [27, 28, 2, 26], where additional regions are added to better approximate the marginal polytope at the cost of additional computations, while we are mainly interested in computational efficiency.
In contrast to re-scheduling strategies [6, 30, 2], our rescheduling is fixed and does not require additional computation.

Our experimental evaluations show that state-of-the-art techniques [9, 22] have difficulties optimizing energy functions that correspond to densely connected graphs with high-order factors. In contrast, our approach is able to achieve more than one order of magnitude speed-ups while retrieving the same solution in the complex task of jointly estimating 3D room layout and image segmentation from a single RGB-D image.

2 Background: Dual Decomposition for Message Passing

We start by reviewing dual-decomposition approaches for inference in graphical models with high-order factors. To this end, we consider distributions defined over a discrete domain S = ∏_{i=1}^N S_i, which is composed of a product of N smaller discrete spaces S_i = {1, ..., |S_i|}. We model our distribution to depend log-linearly on a scoring function θ(s) defined over the aforementioned discrete product space S, i.e., p(s) = (1/Z) exp θ(s), with Z the partition function. Given the scoring function θ(s) of a configuration s, it is unfortunately generally #P-complete to compute its probability since the partition function Z is required. Its logarithm equals the following variational program [12]:

    log Z = max_{p ∈ Δ} Σ_s p(s)θ(s) + H(p),   (1)

where H denotes the entropy and Δ indicates the probability simplex.

The variational program in Eq. (1) is challenging as it operates on the exponentially sized domain S. However, we can make use of the fact that for many relevant applications the scoring function θ(s) is additively composed of local terms, i.e., θ(s) = Σ_{r∈R} θ_r(s_r). These local scoring functions θ_r depend on a subset of variables s_r = (s_i)_{i∈r}, defined on a domain S_r ⊆ S, which is specified by the restriction often referred to as region r ⊆ {1, ..., N}, i.e., S_r = ∏_{i∈r} S_i.
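For intuition, Eq. (1) can be checked numerically on a toy model: for a small domain S we can enumerate all states, compute log Z directly, and verify that the Gibbs distribution p(s) ∝ exp θ(s) attains the maximum of the variational objective. A minimal sketch (the pairwise chain scores below are made up purely for illustration):

```python
import itertools
import math

# Toy model: N = 3 binary variables with illustrative pairwise chain scores.
def theta(s):
    pair = {(0, 0): 0.5, (0, 1): -0.2, (1, 0): 0.1, (1, 1): 0.8}
    return pair[(s[0], s[1])] + pair[(s[1], s[2])]

states = list(itertools.product([0, 1], repeat=3))

# Direct computation: log Z = log sum_s exp theta(s).
log_z = math.log(sum(math.exp(theta(s)) for s in states))

# Variational objective of Eq. (1) at a distribution p over all states.
def objective(p):
    score = sum(p[s] * theta(s) for s in states)
    entropy = -sum(p[s] * math.log(p[s]) for s in states if p[s] > 0)
    return score + entropy

# The Gibbs distribution p(s) = exp(theta(s)) / Z attains the maximum ...
gibbs = {s: math.exp(theta(s) - log_z) for s in states}
assert abs(objective(gibbs) - log_z) < 1e-9

# ... while any other distribution, e.g. the uniform one, scores lower.
uniform = {s: 1.0 / len(states) for s in states}
assert objective(uniform) <= log_z + 1e-9
```

The equality for the Gibbs distribution is exact: substituting p(s) = exp(θ(s) − log Z) into the objective makes the score and entropy terms cancel up to log Z.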
We refer to R as the set of all restrictions required to compute the scoring function θ.

Locality of the scoring function allows us to equivalently rewrite the expected score via Σ_s p(s)θ(s) = Σ_{r,s_r} p_r(s_r)θ_r(s_r) by employing marginals p_r(s_r) = Σ_{s\s_r} p(s). Unfortunately an exact decomposition of the entropy H(p) using marginals is not possible. Instead, the entropy is typically approximated by a weighted sum of local entropies H(p) ≈ Σ_r c_r H(p_r), with c_r the counting numbers. The task remains intractable despite the entropy approximation since the marginals p_r(s_r) are required to arise from a valid joint distribution p(s). However, if we require the marginals to be consistent only locally, we obtain a tractable approximation [34]. We thus introduce local beliefs b_r(s_r) to denote the approximation, not to be confused with the true marginals p_r. The beliefs are required to fulfill local marginalization constraints, i.e., Σ_{s_p\s_r} b_p(s_p) = b_r(s_r) ∀r, s_r, p ∈ P(r), where the set P(r) subsumes the set of all parents of region r for which we want marginalization to hold.

Putting all this together, we obtain the following approximation:

    max_b Σ_{r,s_r} b_r(s_r)θ_r(s_r) + Σ_r c_r H(b_r)
    s.t. b_r ∈ C = { b_r : b_r ∈ Δ ∀r, Σ_{s_p\s_r} b_p(s_p) = b_r(s_r) ∀s_r, p ∈ P(r) }.   (2)

The computation and memory requirements can be too demanding when dealing with large graphical models. To address this issue, [13, 22] showed that this task can be distributed onto multiple computers κ by employing dual decomposition techniques. More specifically, the task is partitioned into multiple independent tasks with constraints at the boundary ensuring consistency of the parts upon convergence. Hence, an additional constraint is added to make sure that all beliefs b^κ_r that are assigned to multiple computers, i.e., those at the boundary of the parts, are consistent upon convergence and equal a single region belief b_r.

Algorithm: Distributed Message Passing Inference
Let a = 1/|M(r)| and repeat until convergence
1. For every κ in parallel: iterate T times over r ∈ R(κ):
   ∀p ∈ P(r), s_r:
     μ_{p→r}(s_r) = εĉ_p ln Σ_{s_p\s_r} exp[ ( θ̂_p(s_p) − Σ_{p′∈P(p)} λ_{p→p′}(s_p) + Σ_{r′∈C(p)∩κ\r} λ_{r′→p}(s_{r′}) + ν_{κ→p}(s_p) ) / (εĉ_p) ]   (3)
   ∀p ∈ P(r), s_r:
     λ_{r→p}(s_r) ∝ ĉ_p / (ĉ_r + Σ_{p′∈P(r)} ĉ_{p′}) · ( θ̂_r(s_r) + Σ_{c∈C(r)∩κ} λ_{c→r}(s_c) + ν_{κ→r}(s_r) + Σ_{p′∈P(r)} μ_{p′→r}(s_r) ) − μ_{p→r}(s_r)   (4)
2. Exchange information by iterating once over r ∈ G, ∀κ ∈ M(r):
     ν_{κ→r}(s_r) = a ( Σ_{c∈C(r)} λ_{c→r}(s_c) + Σ_{κ∈M(r),p∈P(r)} λ_{r→p}(s_r) ) − Σ_{c∈C(r)∩κ} λ_{c→r}(s_c) − Σ_{p∈P(r)} λ_{r→p}(s_r)   (5)

Figure 1: A block-coordinate descent algorithm for the distributed inference task.
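The bookkeeping of the split is simple: a region shared by several machines receives a uniform fraction of its score and counting number on each machine, so summing the per-machine copies recovers the original model. A minimal sketch with made-up numbers (the machine names and score table are illustrative only):

```python
# Sketch: a region r shared by the machines in M(r) receives
# theta_hat = theta / |M(r)| and c_hat = c / |M(r)| on each machine.
theta_r = {0: 1.2, 1: -0.4, 2: 0.0}   # illustrative score table of region r
c_r = 1.0                              # counting number of region r
machines_of_r = ["kappa1", "kappa2"]   # M(r): machines region r is assigned to

theta_hat = {m: {s: v / len(machines_of_r) for s, v in theta_r.items()}
             for m in machines_of_r}
c_hat = {m: c_r / len(machines_of_r) for m in machines_of_r}

# The decomposition is exact: summing over machines recovers the original.
for s, v in theta_r.items():
    assert abs(sum(theta_hat[m][s] for m in machines_of_r) - v) < 1e-12
assert abs(sum(c_hat.values()) - c_r) < 1e-12
```

This uniform split is the θ̂_r = θ_r/|M(r)|, ĉ_r = c_r/|M(r)| convention used in the program below.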
The distributed program is then:

    max_{b_r, b^κ_r ∈ Δ} Σ_{κ,r,s_r} b^κ_r(s_r)θ̂_r(s_r) + Σ_{κ,r} ĉ_r H(b^κ_r)
    s.t. ∀κ, r ∈ R_κ, s_r, p ∈ P(r):  Σ_{s_p\s_r} b^κ_p(s_p) = b^κ_r(s_r)
         ∀κ, r ∈ R_κ, s_r:            b^κ_r(s_r) = b_r(s_r),

where R_κ refers to regions on computer κ. We uniformly distribute the scores θ_r(s_r) and the counting numbers c_r of a region r to all overlapping machines. Thus θ̂_r = θ_r/|M(r)| and ĉ_r = c_r/|M(r)|, with M(r) the set of machines that are assigned to region r.

Note that this program operates on the regions defined by the energy decomposition. To derive an efficient algorithm making use of the structure incorporated in the constraints we follow [22] and change to the dual domain. For the marginalization constraints we introduce Lagrange multipliers λ^κ_{r→p}(s_r) for every computer κ, all regions r ∈ R_κ assigned to that computer, all its states s_r and all its parents p. For the consistency constraint we introduce Lagrange multipliers ν_{κ→r}(s_r) for all computers, regions and states. The arrows indicate that the Lagrange multipliers can be interpreted as messages sent between different nodes in a Hasse diagram with nodes corresponding to the regions.

The resulting distributed inference algorithm [22] is summarized in Fig. 1. It consists of two parts, the first of which is a standard message passing on the Hasse diagram defined locally on each computer κ. The second operation interrupts message passing occasionally to exchange information between computers. This second task of exchanging messages is often visualized on a graph G with nodes corresponding to computers and additional vertices denoting shared regions.

Fig.
2(a) depicts a region graph with four unary regions and two high-order ones, i.e., R = {{1}, {2}, {3}, {4}, {1,2,3}, {1,2,3,4}}. We partition this region graph onto two computers κ1, κ2 as indicated via the dashed rectangles. The graph G containing as nodes both computers and the shared region is provided as well. The connections between all regions are labeled with the corresponding message, i.e., λ, μ and ν. We emphasize that the consistency messages ν are only modified when sending information between computers κ. Investigating the provided example in Fig. 2(a) more carefully, we observe that the computation of μ as defined in Eq. (3) in Fig. 1 involves summing over the state-space of the third-order region {1,2,3} and the fourth-order region {1,2,3,4}. The presence of those high-order regions makes dual decomposition approaches [22] impractical. In the next section we show how message passing algorithms can become orders of magnitude faster when adding additional regions.

Figure 2: Standard distributed message passing operating on an inference task partitioned to two computers (left) is compared to the proposed approach (right), where newly introduced regions (yellow) ensure constant messages μ from the high-order regions.

3 Efficient Message Passing for High-order Models

The distributed message passing procedure described in the previous section involves summations over large state-spaces when computing the messages μ. In this section we derive an approach that can significantly reduce the computation by adding additional regions and performing message-passing with a specific message scheduling. Our key observation is that computation can be greatly reduced if the high-order regions are singly-connected, since their outgoing message μ remains constant.
Generally, singly-connected high-order regions do not occur in graphical models. However, in many cases we can use dual decomposition to distribute the computation in a way that the high-order regions become singly-connected if we introduce additional intermediate regions located between the high-order regions and the low-order ones (e.g., unary regions).

At first sight, adding regions increases computational complexity since we have to iterate over additional terms. However, we add regions only if they result in constant messages from regions with even larger state space. By pre-computing those constant messages rather than re-evaluating them at every iteration, we hence decrease computation time despite augmenting the graph with additional regions, i.e., additional marginal beliefs b_r.

Specifically, we observe that there are no marginalization constraints for the singly-connected high-order regions, subsumed in the set H_κ = {r ∈ R̂_κ : P(r) = ∅, |C(r)| = 1}, since their set of parents is empty. An important observation made precise in Claim 1 is that the corresponding messages μ are constant for high-order regions unless ν_{κ→r} changes. Therefore we can improve the message passing algorithm discussed in the previous section by introducing additional regions to increase the size of the set |H_κ| as much as possible while not changing the cost function. The latter is ensured by requiring the additional counting numbers and potentials to equal zero. However, we note that the program will change since the constraint set is augmented.

More formally, let R̂_κ be the set of all regions, i.e., the regions R_κ of the original task on computer κ in addition to the newly added regions r̂ ∈ R̂_κ\R_κ.
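The pre-computation itself is a scaled log-sum-exp: for a singly-connected high-order region p with only child r, the message μ_{p→r} depends only on θ̂_p and the inter-machine message ν, so it can be tabulated once per exchange. A minimal sketch for a third-order region p = {1,2,3} with pairwise child r = {1,2}; the score table, temperature ε and counting number ĉ_p are made up for illustration:

```python
import itertools
import math
import random

random.seed(0)
n_states = 3                       # |S_i| per variable, for illustration
eps, c_hat_p = 1.0, 0.5            # temperature and counting number (made up)

# theta_hat_p: score table of the high-order region p = {1, 2, 3}.
theta_hat_p = {s: random.uniform(-1, 1)
               for s in itertools.product(range(n_states), repeat=3)}

def mu_p_to_r(nu_p):
    """Message from p = {1,2,3} to its only child r = {1,2}: since P(p) is
    empty and C(p) \\ {r} is empty, only theta_hat_p and nu enter the sum."""
    msg = {}
    for s_r in itertools.product(range(n_states), repeat=2):
        total = sum(math.exp((theta_hat_p[s_r + (s3,)] + nu_p[s_r + (s3,)])
                             / (eps * c_hat_p))
                    for s3 in range(n_states))
        msg[s_r] = eps * c_hat_p * math.log(total)
    return msg

# nu is fixed between exchanges of information, so mu is constant and can be
# cached once rather than re-evaluated in every local iteration.
nu_p = {s: 0.0 for s in theta_hat_p}
cached = mu_p_to_r(nu_p)
assert cached == mu_p_to_r(nu_p)
```

The cached table has only |S_1||S_2| entries; the sum over the third variable is paid once per ν exchange instead of once per local iteration.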
Let H_κ = {r ∈ R̂_κ : P(r) = ∅, |C(r)| = 1} be the set of high-order regions on computer κ that are singly connected and have no parent. Further, let its complement H̄_κ = R̂_κ \ H_κ denote all remaining regions. The inference task is given by

    max_{b_r, b^κ_r ∈ Δ} Σ_{κ,r,s_r} b^κ_r(s_r)θ̂_r(s_r) + Σ_{κ,r} ĉ_r H(b^κ_r)
    s.t. ∀κ, r ∈ H̄_κ, s_r, p ∈ P(r):  Σ_{s_p\s_r} b^κ_p(s_p) = b^κ_r(s_r)
         ∀κ, r ∈ R̂_κ, s_r:            b^κ_r(s_r) = b_r(s_r).   (9)

Even though we set θ_r(s_r) ≡ 0 for all states s_r, and ĉ_r = 0 for all newly added regions r ∈ R̂_κ\R_κ, the inference task is not identical to the original problem since the constraint set is not the same. Note that new regions introduce new marginalization constraints. Next we show that messages leaving singly-connected high-order regions are constant.

Algorithm: Message Passing for Large Scale Graphical Models with High Order Potentials
Let a = 1/|M(r)| and repeat until convergence
1. For every κ in parallel: update singly-connected regions p ∈ H_κ: let r = C(p), ∀s_r:
     μ_{p→r}(s_r) = εĉ_p ln Σ_{s_p\s_r} exp[ ( θ̂_p(s_p) − Σ_{p′∈P(p)} λ_{p→p′}(s_p) + Σ_{r′∈C(p)∩κ\r} λ_{r′→p}(s_{r′}) + ν_{κ→p}(s_p) ) / (εĉ_p) ]
2. For every κ in parallel: iterate T times over r ∈ R̂_κ:
   ∀p ∈ P(r) \ H_κ, s_r:
     μ_{p→r}(s_r) = εĉ_p ln Σ_{s_p\s_r} exp[ ( θ̂_p(s_p) − Σ_{p′∈P(p)} λ_{p→p′}(s_p) + Σ_{r′∈C(p)∩κ\r} λ_{r′→p}(s_{r′}) + ν_{κ→p}(s_p) ) / (εĉ_p) ]   (6)
   ∀p ∈ P(r), s_r:
     λ_{r→p}(s_r) ∝ ĉ_p / (ĉ_r + Σ_{p′∈P(r)} ĉ_{p′}) · ( θ̂_r(s_r) + Σ_{c∈C(r)∩κ} λ_{c→r}(s_c) + ν_{κ→r}(s_r) + Σ_{p′∈P(r)} μ_{p′→r}(s_r) ) − μ_{p→r}(s_r)   (7)
3. Exchange information by iterating once over r ∈ G, ∀κ ∈ M(r):
     ν_{κ→r}(s_r) = a ( Σ_{c∈C(r)} λ_{c→r}(s_c) + Σ_{κ∈M(r),p∈P(r)} λ_{r→p}(s_r) ) − Σ_{c∈C(r)∩κ} λ_{c→r}(s_c) − Σ_{p∈P(r)} λ_{r→p}(s_r)   (8)

Figure 3: A block-coordinate descent algorithm for the distributed inference task.

Claim 1. During the message passing updates defined in Fig. 1 the multiplier μ_{p→r}(s_r) is constant for singly-connected high-order regions p.

Proof: More carefully investigating Eq. (3), which defines μ, it follows that Σ_{p′∈P(p)} λ_{p→p′}(s_p) = 0 because P(p) = ∅ since p is assumed singly-connected. For the same reason we obtain Σ_{r′∈C(p)∩κ\r} λ_{r′→p}(s_{r′}) = 0 because r′ ∈ C(p) ∩ κ \ r = ∅, and ν_{κ→p}(s_p) is constant upon each exchange of information.
Therefore, μ_{p→r}(s_r) is constant irrespective of all other messages and can be pre-computed upon exchange of information. ∎

We can thus pre-compute the constant messages before performing message passing. Our approach is summarized in Fig. 3. We now provide its convergence properties in the following claim.

Claim 2. The algorithm outlined in Fig. 3 is guaranteed to converge to the global optimum of the program given in Eq. (9) for εc_r > 0 ∀r, and is guaranteed to converge in case εc_r ≥ 0 ∀r.

Proof: The message passing algorithm is derived as a block-coordinate descent algorithm in the dual domain. Hence it inherits the properties of block-coordinate descent algorithms [31], which are guaranteed to converge to a single global optimum in case of strict concavity (εc_r > 0 ∀r) and which are guaranteed to converge in case of concavity only (εc_r ≥ 0 ∀r), which proves the claim. ∎

We note that Claim 1 nicely illustrates the benefits of working with region graphs rather than factor graphs. A bi-partite factor graph contains variable nodes connected to possibly high-order factors. Assume that we distributed the task at hand such that every high-order region of size larger than two is connected to at most two local variables. By adding a pairwise region in between the original high-order factor node and the variable nodes we are able to reduce computational complexity since the high-order factors are now singly connected. Therefore, we can guarantee that the complexity of the local message-passing steps run in each machine reduces from the state-space size of the largest factor to the size of the largest newly introduced region in each computer. This is summarized in the following claim.

Claim 3. Assume we are given a high-order factor-graph representation of a graphical model.
By distributing the model onto multiple computers and by introducing additional regions, we reduce the complexity of the message passing iterations on every computer, generally dominated by the state-space size of the largest region s_max = max_{r∈R_κ} |S_r|, from O(s_max) to O(s′_max) with s′_max = max_{r∈R̂_κ∩H̄_κ} |S_r|.

Proof: The complexity of standard message passing on a region graph is linear in the largest state-space region, i.e., O(s_max). Since some operations can be pre-computed as per Claim 1, we emphasize that the largest newly introduced region on computer κ is of state-space size s′_max, which concludes the proof. ∎

Figure 4: (a) Layout parameterization. (b) Compatibility. (c) Joint model. Parameterization of the layout task is visualized in (a). Compatibility of a superpixel labeling with a wall parameterization using third-order functions is outlined in (b) and the graphical model for the joint layout-segmentation task is depicted in (c).

Table 1: Average time to achieve the specified relative duality gap for ε = 0 (left) and ε = 1 (right).

                 ε = 0                      ε = 1
rel. duality gap  1      0.1     0.01       1       0.1     0.01
Ours [s]          0.78   5.92    51.59      15.58   448.26  1150.1
cBP [s]           31.60  986.54  1736.6     411.81  4357.9  4479.9
dcBP [s]          19.48  1042.8  1772.6     451.71  4506.6  4585.3

Claim 3 indicates that distributing computation in addition to message rescheduling is a powerful tool to cope with high-order potentials. To gain some insight, we illustrate our idea with a specific example. Suppose we distribute the inference computation on two computers κ1, κ2 as shown in Fig. 2(a).
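For the example of Fig. 2, Claim 3 can be made concrete by counting state-space sizes. Assuming each variable has k states, the largest region of Fig. 2(a) is {1,2,3,4} with k^4 states, while after adding σ = {1,2} and π = {3,4} the per-iteration work is dominated by pairwise tables with k^2 states. A sketch (k = 25 is chosen for illustration):

```python
# Count state-space sizes for the example of Fig. 2, assuming k states per
# variable (k = 25 chosen for illustration).
k = 25
regions_before = [{1}, {2}, {3}, {4}, {1, 2, 3}, {1, 2, 3, 4}]
added = [{1, 2}, {3, 4}]           # sigma and pi from Fig. 2(b)

def size(region):
    return k ** len(region)

s_max = max(size(r) for r in regions_before)   # dominates each iteration

# With the added regions, {1,2,3} and {1,2,3,4} become singly connected and
# their outgoing messages are pre-computed once per exchange; per-iteration
# work is dominated by the largest remaining (here: newly added) region.
s_max_prime = max(size(r) for r in added)

assert s_max == k ** 4 and s_max_prime == k ** 2
saving = s_max // s_max_prime      # k**2 = 625-fold saving per iteration
```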
We compare it to a task on R̂ regions, i.e., we introduce additional regions r̂ ∈ R̂\R. The messages required in the augmented task are visualized in Fig. 2(b). Each computer (box highlighted with dashed lines) is assigned a task specified by the contained region graph. As before we also visualize the messages ν occasionally sent between the computers in a graph containing as nodes the shared factors and the computers (boxes drawn with dashed lines). The algorithm proceeds by passing messages λ, μ on each computer independently for T rounds. Afterwards messages ν are exchanged between computers. Importantly, we note that messages for singly-connected high-order regions within dashed boxes are only required to be computed once upon exchanging messages ν. This is the case for all high-order regions in Fig. 2(b) and for no high-order region in Fig. 2(a), highlighting the obtained computational benefits.

4 Experimental Evaluation

We demonstrate the effectiveness of our approach in the task of jointly estimating the layout and semantic labels of indoor scenes from a single RGB-D image. We use the dataset of [38], which is a subset of the NYU v2 dataset [24]. Following [38], we utilize 202 images for training and 101 for testing. Given the vanishing points (points where parallel lines meet at infinity), the layout task can be formulated with four random variables s_1, ..., s_4, each of which corresponds to angles for rays originating from two distinct vanishing points [15]. We discretize each ray into |S_i| = 25 states. To define the segmentation task, we partition each image into superpixels. We then define a random variable with six states for each superpixel, s_i ∈ S_i = {left, front, right, ceiling, floor, clutter} with i > 4. We refer the reader to Fig. 4(a) and Fig. 4(b) for an illustration of the parameterization of the problem.
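The sizes involved in this joint model can be sketched as follows; the superpixel count M = 250 is the typical per-image value mentioned below, all other numbers come from the setup above:

```python
from itertools import combinations

n_layout_vars = 4          # s_1..s_4: ray angles from two vanishing points
layout_states = 25         # |S_i| = 25 discretized angles per ray
seg_states = 6             # {left, front, right, ceiling, floor, clutter}
n_superpixels = 250        # M, typical per image

# Each fifth-order compatibility score theta_comp,i(s_1..s_4, s_i) decomposes
# into one third-order term per pair of layout rays:
ray_pairs = list(combinations(range(1, n_layout_vars + 1), 2))
assert len(ray_pairs) == 6                 # 4-choose-2

# Each third-order term ranges over (s_i, s_j, superpixel label):
third_order_size = seg_states * layout_states ** 2
assert third_order_size == 3750            # 6 * 25^2

# The decomposed tables are far smaller than the naive fifth-order table:
fifth_order_size = seg_states * layout_states ** 4
assert third_order_size * len(ray_pairs) < fifth_order_size

# On the order of 10^3 third-order factors per image:
n_third_order_factors = n_superpixels * len(ray_pairs)
```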
The graphical model for the joint problem is depicted in Fig. 4(c).

Figure 5: Average normalized primal/dual and factor agreement for ε = 1 and ε = 0 (panels: normalized primal/dual for ε = 0 and ε = 1; factor agreement for ε = 0 and ε = 1).

The score of the joint model is given by a sum of scores

    θ(s) = θ_lay(s_1, ..., s_4) + θ_label(s_5, ..., s_{M+4}) + θ_comp(s),

where θ_lay is defined as the sum of scores over the layout faces, which can be decomposed into a sum of pairwise functions using integral geometry [23]. The labeling score θ_label contains unary potentials and pairwise regularization between neighboring superpixels. The third function, θ_comp, couples the two tasks and encourages the layout and the segmentation to agree in their labels, e.g., a superpixel on the left wall of the layout is more likely to be assigned the left-wall or the object label. The compatibility score decomposes into a sum of fifth-order scores, one for each superpixel, i.e., θ_comp(s) = Σ_{i>4} θ_comp,i(s_1, ..., s_4, s_i). Using integral geometry [23], we can further decompose each superpixel score θ_comp,i into a sum of third-order energies. As illustrated in Fig. 4(c), every superpixel variable s_i, i > 4 is therefore linked to 4-choose-2 third order functions of state-space size 6 · 25². These functions measure the overlap of each superpixel with a region specified by two layout ray angles s_i, s_j with i, j ∈ {1, ..., 4}, i ≠ j. This is illustrated in Fig. 4(b) for the area highlighted in purple and the blue region defined by s_2 and s_3.
Since a typical image has around 250 superpixels, there are approximately 1000 third-order factors.

Following Claim 3, we recognize that the third-order functions are connected to at most two variables if we distribute the inference such that the layout task is assigned to one computer while the segmentation task is divided onto other machines. Importantly, this corresponds to a roughly equal split of the problem when using our approach, since all tasks are pairwise and the state-space of the layout task is larger than that of the semantic segmentation. Despite the third-order regions involved in the original model, every local inference task contains at most pairwise factors.

We use convex BP [35, 18, 9] and distributed convex BP [22] as baselines. For our method, we assign layout nodes to the first machine and segmentation nodes to the second one. Without introducing additional regions and pre-computations, the workload of this split is highly unbalanced. This makes distributed convex BP even slower than convex BP since many messages are exchanged over the network. To be more fair to distributed convex BP, we split the nodes into two parts, each with 2 layout variables and half of the segmentation variables. For all experiments, we set c_r = 1 and evaluate the settings ε = 1 and ε = 0. For a fair comparison we employ a single core for our approach and convex BP, and two cores for distributed convex BP. Note that our approach can be run in parallel to achieve even faster convergence.

We compare our method to the baselines using two metrics: Normalized primal/dual is a rescaled version of the original primal and dual normalized by the absolute value of the optimal score. This allows us to compare different images that might have fairly different energies. In case none of the algorithms converged we normalize all energies using the mean of the maximal primal and the minimum dual.
The second metric is the factor agreement, which is defined as the proportion of factors that agree with the connected node marginals.

Fig. 5 depicts the normalized primal/dual as well as the factor agreement for ε = 0 (i.e., MAP) and ε = 1 (i.e., marginals). We observe that our proposed approach converges significantly faster than the baselines. We additionally observe that for densely coupled tasks, the performance of dcBP degrades when exchanging messages every other iteration (yellow curves).

Figure 6: Qualitative results (ε = 0): The first column illustrates the inferred layout (blue) and layout ground truth (red). The second and third columns are the estimated and ground truth segmentations, respectively. Failure modes are shown in the last row. They are due to bad vanishing point estimation. (Panel annotations list per-image layout and segmentation errors.)
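One reading of the factor agreement metric can be sketched as follows: a factor counts as agreeing when the argmax configuration of its belief matches the argmax of every connected node marginal. The belief tables below are random stand-ins; in practice they would come from the inference:

```python
import itertools
import random

random.seed(1)

def factor_agreement(factor_beliefs, node_beliefs):
    """Proportion of factors whose belief's argmax configuration matches the
    argmax of every connected node marginal (one reading of the metric)."""
    agree = 0
    for nodes, table in factor_beliefs:
        best = max(table, key=table.get)            # argmax configuration
        if all(best[k] == max(node_beliefs[i], key=node_beliefs[i].get)
               for k, i in enumerate(nodes)):
            agree += 1
    return agree / len(factor_beliefs)

# Random stand-in beliefs: 3 binary nodes, 2 pairwise factors.
node_beliefs = {i: {0: random.random(), 1: random.random()} for i in range(3)}
factor_beliefs = [((0, 1), {s: random.random()
                            for s in itertools.product((0, 1), repeat=2)}),
                  ((1, 2), {s: random.random()
                            for s in itertools.product((0, 1), repeat=2)})]

score = factor_agreement(factor_beliefs, node_beliefs)
assert 0.0 <= score <= 1.0
```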
Importantly, in our experiments we never observed any of the other approaches to converge when our approach did not converge. Tab. 1 depicts the time in seconds required to achieve a certain relative duality gap. We observe that our proposed approach outperforms all baselines by more than one order of magnitude. Fig. 6 shows qualitative results for ε = 0. Note that our approach manages to accurately predict layouts and corresponding segmentations. Some failure cases are illustrated in the bottom row. They are largely due to failures in the vanishing point detection, which our approach cannot recover from.

5 Conclusions

We have proposed a partitioning strategy followed by a message passing algorithm which is able to significantly speed up dual decomposition methods for parallel inference in Markov random fields with high-order terms and dense connections. We demonstrated the effectiveness of our approach on the task of joint layout and semantic segmentation estimation from single images, and showed that our approach is orders of magnitude faster than existing methods. In the future, we plan to investigate the applicability of our approach to other scene understanding tasks.

References
[1] A. Amini, T. Weymouth, and R. Jain. Using Dynamic Programming for Solving Variational Problems in Vision. PAMI, 1990.
[2] D. Batra, S. Nowozin, and P. Kohli. Tighter Relaxations for MAP-MRF Inference: A Local Primal-Dual Gap based Separation Algorithm. In Proc. AISTATS, 2011.
[3] Y. Boykov, O. Veksler, and R. Zabih. Fast Approximate Energy Minimization via Graph Cuts. PAMI, 2001.
[4] M. Collins. Head-Driven Statistical Models for Natural Language Parsing. Computational Linguistics, 2003.
[5] R. Dechter. Reasoning with Probabilistic and Deterministic Graphical Models: Exact Algorithms. Morgan & Claypool, 2013.
[6] G. Elidan, I. McGraw, and D. Koller.
Residual Belief Propagation: Informed Scheduling for Asynchronous Message Passing. In Proc. UAI, 2006.
[7] L. R. Ford and D. R. Fulkerson. Maximal Flow Through a Network. Canadian Journal of Mathematics, 1956.
[8] T. Hazan and T. Jaakkola. On the Partition Function and Random Maximum A-Posteriori Perturbations. In Proc. ICML, 2012.
[9] T. Hazan and A. Shashua. Norm-Product Belief Propagation: Primal-Dual Message-Passing for LP-Relaxation and Approximate-Inference. Trans. Information Theory, 2010.
[10] P. Kohli and P. Kumar. Energy Minimization for Linear Envelope MRFs. In Proc. CVPR, 2010.
[11] P. Kohli, L. Ladický, and P. H. S. Torr. Robust Higher Order Potentials for Enforcing Label Consistency. IJCV, 2009.
[12] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[13] N. Komodakis, N. Paragios, and G. Tziritas. MRF Optimization via Dual Decomposition: Message-Passing Revisited. In Proc. ICCV, 2007.
[14] P. Krähenbühl and V. Koltun. Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials. In Proc. NIPS, 2011.
[15] D. C. Lee, A. Gupta, M. Hebert, and T. Kanade. Estimating Spatial Layout of Rooms using Volumetric Reasoning about Objects and Surfaces. In Proc. NIPS, 2010.
[16] V. Lempitsky, C. Rother, S. Roth, and A. Blake. Fusion Moves for Markov Random Field Optimization. PAMI, 2010.
[17] Y. Li, D. Tarlow, and R. Zemel. Exploring Compositional High Order Pattern Potentials for Structured Output Learning. In Proc. CVPR, 2013.
[18] T. Meltzer, A. Globerson, and Y. Weiss. Convergent Message Passing Algorithms: A Unifying View. In Proc. UAI, 2009.
[19] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
[20] M. Salzmann. Continuous Inference in Graphical Models with Polynomial Energies. In Proc. CVPR, 2013.
[21] M.
Salzmann and R. Urtasun. Beyond Feature Points: Structured Prediction for Monocular Non-Rigid 3D Reconstruction. In Proc. ECCV, 2012.
[22] A. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun. Distributed Message Passing for Large Scale Graphical Models. In Proc. CVPR, 2011.
[23] A. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun. Efficient Structured Prediction for 3D Indoor Scene Understanding. In Proc. CVPR, 2012.
[24] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor Segmentation and Support Inference from RGBD Images. In Proc. ECCV, 2012.
[25] D. A. Smith and J. Eisner. Dependency Parsing by Belief Propagation. In Proc. EMNLP, 2008.
[26] D. Sontag, D. K. Choe, and Y. Li. Efficiently Searching for Frustrated Cycles in MAP Inference. In Proc. UAI, 2012.
[27] D. Sontag and T. Jaakkola. New Outer Bounds on the Marginal Polytope. In Proc. NIPS, 2007.
[28] D. Sontag, T. Meltzer, A. Globerson, and T. Jaakkola. Tightening LP Relaxations for MAP using Message Passing. In Proc. NIPS, 2008.
[29] D. Sun, C. Liu, and H. Pfister. Local Layering for Joint Motion Estimation and Occlusion Detection. In Proc. CVPR, 2014.
[30] C. Sutton and A. McCallum. Improved Dynamic Schedules for Belief Propagation. In Proc. UAI, 2007.
[31] P. Tseng and D. P. Bertsekas. Relaxation Methods for Problems with Strictly Convex Separable Costs and Linear Constraints. Mathematical Programming, 1987.
[32] L. Valgaerts, A. Bruhn, H. Zimmer, J. Weickert, C. Stoll, and C. Theobalt. Joint Estimation of Motion, Structure and Geometry from Stereo Sequences. In Proc. ECCV, 2010.
[33] V. Vineet, J. Warrell, and P. H. S. Torr. Filter-based Mean-Field Inference for Random Fields with Higher Order Terms and Product Label-Spaces. In Proc. ECCV, 2012.
[34] M. J. Wainwright and M. I. Jordan. Graphical Models, Exponential Families and Variational Inference. Foundations and Trends in Machine Learning, 2008.
[35] Y. Weiss, C.
Yanover, and T. Meltzer. MAP Estimation, Linear Programming and Belief Propagation with Convex Free Energies. In Proc. UAI, 2007.
[36] C. Yanover, O. Schueler-Furman, and Y. Weiss. Minimizing and Learning Energy Functions for Side-Chain Prediction. J. of Computational Biology, 2008.
[37] J. Yao, S. Fidler, and R. Urtasun. Describing the Scene as a Whole: Joint Object Detection, Scene Classification and Semantic Segmentation. In Proc. CVPR, 2012.
[38] J. Zhang, K. Chen, A. G. Schwing, and R. Urtasun. Estimating the 3D Layout of Indoor Scenes and its Clutter from Depth Sensors. In Proc. ICCV, 2013.