{"title": "Barrier Frank-Wolfe for Marginal Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 532, "page_last": 540, "abstract": "We introduce a globally-convergent algorithm for optimizing the tree-reweighted (TRW) variational objective over the marginal polytope. The algorithm is based on the conditional gradient method (Frank-Wolfe) and moves pseudomarginals within the marginal polytope through repeated maximum a posteriori (MAP) calls. This modular structure enables us to leverage black-box MAP solvers (both exact and approximate) for variational inference, and obtains more accurate results than tree-reweighted algorithms that optimize over the local consistency relaxation. Theoretically, we bound the sub-optimality for the proposed algorithm despite the TRW objective having unbounded gradients at the boundary of the marginal polytope. Empirically, we demonstrate the increased quality of results found by tightening the relaxation over the marginal polytope as well as the spanning tree polytope on synthetic and real-world instances.", "full_text": "Barrier Frank-Wolfe for Marginal Inference\n\nRahul G. Krishnan\nCourant Institute\n\nNew York University\n\nSimon Lacoste-Julien\n\nINRIA - Sierra Project-Team\n\n\u00b4Ecole Normale Sup\u00b4erieure, Paris\n\nDavid Sontag\nCourant Institute\n\nNew York University\n\nAbstract\n\nWe introduce a globally-convergent algorithm for optimizing the tree-reweighted\n(TRW) variational objective over the marginal polytope. 
The algorithm is based on the conditional gradient method (Frank-Wolfe) and moves pseudomarginals within the marginal polytope through repeated maximum a posteriori (MAP) calls. This modular structure enables us to leverage black-box MAP solvers (both exact and approximate) for variational inference, and obtains more accurate results than tree-reweighted algorithms that optimize over the local consistency relaxation. Theoretically, we bound the sub-optimality for the proposed algorithm despite the TRW objective having unbounded gradients at the boundary of the marginal polytope. Empirically, we demonstrate the increased quality of results found by tightening the relaxation over the marginal polytope as well as the spanning tree polytope on synthetic and real-world instances.
1 Introduction
Markov random fields (MRFs) are used in many areas of computer science such as vision and speech. Inference in these undirected graphical models is generally intractable. Our work focuses on performing approximate marginal inference by optimizing the Tree Re-Weighted (TRW) objective (Wainwright et al., 2005). The TRW objective is concave, is exact for tree-structured MRFs, and provides an upper bound on the log-partition function.
Fast combinatorial solvers for the TRW objective exist, including Tree-Reweighted Belief Propagation (TRBP) (Wainwright et al., 2005), convergent message-passing based on geometric programming (Globerson and Jaakkola, 2007), and dual decomposition (Jancsary and Matz, 2011). These methods optimize over the set of pairwise consistency constraints, also called the local polytope. Sontag and Jaakkola (2007) showed that significantly better results could be obtained by optimizing over tighter relaxations of the marginal polytope. However, deriving a message-passing algorithm for the TRW objective over tighter relaxations of the marginal polytope is challenging. Instead, Sontag and Jaakkola (2007) use the conditional gradient method (also called Frank-Wolfe) and off-the-shelf linear programming solvers to optimize TRW over the cycle consistency relaxation. Rather than optimizing over the cycle relaxation, Belanger et al. (2013) optimize the TRW objective over the exact marginal polytope. Then, using Frank-Wolfe, the linear minimization performed in the inner loop can be shown to correspond to MAP inference.
The Frank-Wolfe optimization algorithm has seen increasing use in machine learning, thanks in part to its efficient handling of complex constraint sets appearing with structured data (Jaggi, 2013; Lacoste-Julien and Jaggi, 2015). However, applying Frank-Wolfe to variational inference presents challenges that were never resolved in previous work. First, the linear minimization performed in the inner loop is computationally expensive, either requiring repeatedly solving a large linear program, as in Sontag and Jaakkola (2007), or performing MAP inference, as in Belanger et al. (2013). Second, the TRW objective involves entropy terms whose gradients go to infinity near the boundary of the feasible set, therefore existing convergence guarantees for Frank-Wolfe do not apply. Third, variational inference using TRW involves both an outer and inner loop of Frank-Wolfe, where the outer loop optimizes the edge appearance probabilities in the TRW entropy bound to tighten it. Neither Sontag and Jaakkola (2007) nor Belanger et al. (2013) explore the effect of optimizing over the edge appearance probabilities.
Although MAP inference is in general NP-hard (Shimony, 1994), it is often possible to find exact solutions to large real-world instances within reasonable running times (Sontag et al., 2008; Allouche et al., 2010; Kappes et al., 2013). Moreover, as we show in our experiments, even approximate MAP solvers can be successfully used within our variational inference algorithm. As MAP solvers improve in their runtime and performance, their iterative use could become feasible and as a byproduct enable more efficient and accurate marginal inference. Our work provides a fast deterministic alternative to recently proposed Perturb-and-MAP algorithms (Papandreou and Yuille, 2011; Hazan and Jaakkola, 2012; Ermon et al., 2013).
Contributions. This paper makes several theoretical and practical innovations. We propose a modification to the Frank-Wolfe algorithm that optimizes over adaptively chosen contractions of the domain and prove its rate of convergence for functions whose gradients can be unbounded at the boundary. Our algorithm does not require a different oracle than standard Frank-Wolfe and could be useful for other convex optimization problems where the gradient is ill-behaved at the boundary.
We instantiate the algorithm for approximate marginal inference over the marginal polytope with the TRW objective. With an exact MAP oracle, we obtain the first provably convergent algorithm for the optimization of the TRW objective over the marginal polytope, which had remained an open problem to the best of our knowledge. Traditional proof techniques of convergence for first order methods fail as the gradient of the TRW objective is not Lipschitz continuous.
We develop several heuristics to make the algorithm practical: a fully-corrective variant of Frank-Wolfe that reuses previously found integer assignments thereby reducing the need for new (approximate) MAP calls, the use of local search between MAP calls, and significant re-use of computations between subsequent steps of optimizing over the spanning tree polytope. We perform an extensive experimental evaluation on both synthetic and real-world inference tasks.
2 Background
Markov Random Fields: MRFs are undirected probabilistic graphical models where the probability distribution factorizes over cliques in the graph.
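As a toy illustration of this factorization (with invented potentials and variable names; not part of the paper's experiments), the unnormalized score of a joint assignment and the brute-force log-partition function of a two-variable binary MRF can be computed as:

```python
import itertools
import math

# Toy pairwise MRF over two binary variables; the numbers are invented for illustration.
theta_i = {0: [0.0, 1.0], 1: [0.0, -1.0]}        # node log-potentials theta_i(x_i)
theta_ij = {(0, 1): [[2.0, 0.0], [0.0, 2.0]]}    # edge log-potentials theta_ij(x_i, x_j)

def score(x):
    """Unnormalized log-probability: sum_i theta_i(x_i) + sum_{ij} theta_ij(x_i, x_j)."""
    s = sum(th[x[i]] for i, th in theta_i.items())
    s += sum(th[x[i]][x[j]] for (i, j), th in theta_ij.items())
    return s

# Brute-force log-partition function; only feasible for tiny models.
log_Z = math.log(sum(math.exp(score(x)) for x in itertools.product([0, 1], repeat=2)))
```

Exact enumeration like this is exponential in the number of variables, which is why one resorts to variational bounds on log Z such as the TRW objective below.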
We consider marginal inference on pairwise MRFs with $N$ random variables $X_1, X_2, \ldots, X_N$ where each variable takes discrete states $x_i \in \mathrm{VAL}_i$. Let $G = (V, E)$ be the Markov network with an undirected edge $\{i, j\} \in E$ for every two variables $X_i$ and $X_j$ that are connected together. Let $N(i)$ refer to the set of neighbors of variable $X_i$. We organize the edge log-potentials $\theta_{ij}(x_i, x_j)$ for all possible values of $x_i \in \mathrm{VAL}_i$, $x_j \in \mathrm{VAL}_j$ in the vector $\vec\theta_{ij}$, and similarly for the node log-potential vector $\vec\theta_i$. We regroup these in the overall vector $\vec\theta$. We introduce a similar grouping for the marginal vector $\vec\mu$: for example, $\mu_i(x_i)$ gives the coordinate of the marginal vector corresponding to the assignment $x_i$ to variable $X_i$.
Tree Re-weighted Objective (Wainwright et al., 2005): Let $Z(\vec\theta)$ be the partition function for the MRF and M be the set of all valid marginal vectors (the marginal polytope). The maximization of the TRW objective gives the following upper bound on the log partition function:
\[
\log Z(\vec\theta) \le \min_{\rho \in T} \max_{\vec\mu \in M} \underbrace{\langle \vec\theta, \vec\mu \rangle + H(\vec\mu; \rho)}_{\mathrm{TRW}(\vec\mu;\, \vec\theta,\, \rho)}, \tag{1}
\]
where the TRW entropy is:
\[
H(\vec\mu; \rho) := \sum_{i \in V} \Big(1 - \sum_{j \in N(i)} \rho_{ij}\Big) H(\mu_i) + \sum_{(ij) \in E} \rho_{ij} H(\mu_{ij}), \qquad H(\mu_i) := -\sum_{x_i} \mu_i(x_i) \log \mu_i(x_i). \tag{2}
\]
T is the spanning tree polytope, the convex hull of edge indicator vectors of all possible spanning trees of the graph. Elements of $\rho \in T$ specify the probability of an edge being present under a specific distribution over spanning trees.
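For concreteness, the TRW entropy in (2) can be evaluated directly from given pseudomarginals. The following sketch (illustrative data structures, not the paper's implementation) does so for a toy model with node marginals, edge marginals, and edge-appearance probabilities stored in dictionaries:

```python
import math

def entropy(p):
    """Shannon entropy of a discrete distribution given as a list of probabilities."""
    return -sum(q * math.log(q) for q in p if q > 0)

def trw_entropy(mu_node, mu_edge, rho):
    """H(mu; rho) = sum_i (1 - sum_{j in N(i)} rho_ij) H(mu_i) + sum_{ij} rho_ij H(mu_ij)."""
    rho_sum = {i: 0.0 for i in mu_node}          # sum_{j in N(i)} rho_ij for each node i
    for (i, j), r in rho.items():
        rho_sum[i] += r
        rho_sum[j] += r
    h = sum((1.0 - rho_sum[i]) * entropy(p) for i, p in mu_node.items())
    h += sum(rho[e] * entropy([q for row in mu_edge[e] for q in row]) for e in mu_edge)
    return h
```

For two uniform binary nodes joined by one edge with $\rho_{01} = 1$ (a single spanning tree), the node terms vanish and the result equals the joint entropy log 4, matching the exact entropy of a tree-structured model.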
M is difficult to optimize over, and most TRW algorithms optimize over a relaxation called the local consistency polytope $L \supseteq M$:
\[
L := \Big\{ \vec\mu \ge 0,\ \sum_{x_i} \mu_i(x_i) = 1\ \forall i \in V,\ \sum_{x_i} \mu_{ij}(x_i, x_j) = \mu_j(x_j),\ \sum_{x_j} \mu_{ij}(x_i, x_j) = \mu_i(x_i)\ \forall \{i, j\} \in E \Big\}.
\]
The TRW objective $\mathrm{TRW}(\vec\mu; \vec\theta, \rho)$ is a globally concave function of $\vec\mu$ over $L$, assuming that $\rho$ is obtained from a valid distribution over spanning trees of the graph (i.e. $\rho \in T$).
Frank-Wolfe (FW) Algorithm: In recent years, the Frank-Wolfe (aka conditional gradient) algorithm has gained popularity in machine learning (Jaggi, 2013) for the optimization of convex functions over compact domains (denoted $D$). The algorithm is used to solve $\min_{x \in D} f(x)$ by iteratively finding a good descent vertex by solving the linear subproblem:
\[
s^{(k)} = \arg\min_{s \in D} \langle \nabla f(x^{(k)}), s \rangle \quad \text{(FW oracle)}, \tag{3}
\]
and then taking a convex step towards this vertex: $x^{(k+1)} = (1 - \gamma)x^{(k)} + \gamma s^{(k)}$ for a suitably chosen step-size $\gamma \in [0, 1]$. The algorithm remains within the feasible set (is projection free), is invariant to affine transformations of the domain, and can be implemented in a memory efficient manner. Moreover, the FW gap $g(x^{(k)}) := \langle -\nabla f(x^{(k)}), s^{(k)} - x^{(k)} \rangle$ provides an upper bound on the suboptimality of the iterate $x^{(k)}$. The primal convergence of the Frank-Wolfe algorithm is given by Thm. 1 in Jaggi (2013), restated here for convenience: for $k \ge 1$, the iterates $x^{(k)}$ satisfy:
\[
f(x^{(k)}) - f(x^*) \le \frac{2 C_f}{k + 2}, \tag{4}
\]
where $C_f$ is called the "curvature constant".
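To make the iteration concrete, here is a minimal Frank-Wolfe sketch for a toy quadratic objective over the probability simplex, where the linear oracle (3) reduces to picking the coordinate with the smallest gradient entry (illustrative code, not the paper's solver):

```python
def frank_wolfe_simplex(grad_f, dim, iters=500):
    """Minimize a smooth convex f over the probability simplex with Frank-Wolfe."""
    x = [0.0] * dim
    x[0] = 1.0                                   # start at a vertex of the simplex
    for k in range(iters):
        g = grad_f(x)
        # LMO: the minimizing simplex vertex is the coordinate with smallest gradient
        i = min(range(dim), key=lambda j: g[j])
        s = [1.0 if j == i else 0.0 for j in range(dim)]
        # FW duality gap; upper bounds the suboptimality of x
        gap = sum(-g[j] * (s[j] - x[j]) for j in range(dim))
        if gap <= 1e-9:
            break
        gamma = 2.0 / (k + 2.0)                  # standard step-size schedule
        x = [(1.0 - gamma) * x[j] + gamma * s[j] for j in range(dim)]
    return x
```

E.g. minimizing $f(x) = \|x - c\|^2$ for $c = (0.2, 0.3, 0.5)$: the iterates stay feasible (they are convex combinations of vertices) and approach $c$ at the $O(1/k)$ rate of (4).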
Under the assumption that $\nabla f$ is $L$-Lipschitz continuous¹ on $D$, we can bound it as $C_f \le L\,\mathrm{diam}_{\|\cdot\|}(D)^2$.
Marginal Inference with Frank-Wolfe: To optimize $\max_{\vec\mu \in M} \mathrm{TRW}(\vec\mu; \vec\theta, \rho)$ with Frank-Wolfe, the linear subproblem (3) becomes $\arg\max_{\vec\mu \in M} \langle \tilde\theta, \vec\mu \rangle$, where the perturbed potentials $\tilde\theta$ correspond to the gradient of $\mathrm{TRW}(\vec\mu; \vec\theta, \rho)$ with respect to $\vec\mu$. Elements of $\tilde\theta$ are of the form $\theta_c(x_c) + K_c(1 + \log \mu_c(x_c))$, evaluated at the pseudomarginals' current location in M, where $K_c$ is the coefficient of the entropy for the node/edge term in (2). The FW linear subproblem here is thus equivalent to performing MAP inference in a graphical model with potentials $\tilde\theta$ (Belanger et al., 2013), as the vertices of the marginal polytope are in 1-1 correspondence with valid joint assignments to the random variables of the MRF, and the solution of a linear program is always achieved at a vertex of the polytope. The TRW objective does not have a Lipschitz continuous gradient over M, and so standard convergence proofs for Frank-Wolfe do not hold.
3 Optimizing over Contractions of the Marginal Polytope
Motivation: We wish to (1) use the fewest possible MAP calls, and (2) avoid regions near the boundary where the unbounded curvature of the function slows down convergence. A viable option to address (1) is through the use of correction steps, where after a Frank-Wolfe step, one optimizes over the polytope defined by previously visited vertices of M (called the fully-corrective Frank-Wolfe (FCFW) algorithm and proven to be linearly convergent for strongly convex objectives (Lacoste-Julien and Jaggi, 2015)). This does not require additional MAP calls. However, we found (see Sec. 
5) that when optimizing the TRW objective over M, performing correction steps can surprisingly hurt performance. This leaves us in a dilemma: correction steps enable decreasing the objective without additional MAP calls, but they can also slow global progress since iterates after correction sometimes lie close to the boundary of the polytope (where the FW directions become less informative). In a manner akin to barrier methods and to Garber and Hazan (2013)'s local linear oracle, our proposed solution maintains the iterates within a contraction of the polytope. This gives us most of the mileage obtained from performing the correction steps without suffering the consequences of venturing too close to the boundary of the polytope. We prove a global convergence rate for the iterates with respect to the true solution over the full polytope.
We describe convergent algorithms to optimize $\mathrm{TRW}(\vec\mu; \vec\theta, \rho)$ for $\vec\mu \in M$. The approach we adopt to deal with the issue of unbounded gradients at the boundary is to perform Frank-Wolfe within a contraction of the marginal polytope given by $M_\delta$ for $\delta \in [0, 1]$, with either a fixed $\delta$ or an adaptive $\delta$.
Definition 3.1 (Contraction polytope). $M_\delta := (1 - \delta)M + \delta u_0$, where $u_0 \in M$ is the vector representing the uniform distribution.
Marginal vectors that lie within $M_\delta$ are bounded away from zero as all the components of $u_0$ are strictly positive. Denoting $V^{(\delta)}$ as the set of vertices of $M_\delta$, $V$ as the set of vertices of M and $f(\vec\mu) := -\mathrm{TRW}(\vec\mu; \vec\theta, \rho)$, the key insight that enables our novel approach is that:
\[
\underbrace{\arg\min_{v^{(\delta)} \in V^{(\delta)}} \big\langle \nabla f, v^{(\delta)} \big\rangle}_{\text{Linear Minimization over } M_\delta}
\;\equiv\;
\underbrace{\arg\min_{v \in V} \big\langle \nabla f, (1 - \delta)v + \delta u_0 \big\rangle}_{\text{Definition of } v^{(\delta)}}
\;\equiv\;
\underbrace{(1 - \delta) \arg\min_{v \in V} \langle \nabla f, v \rangle + \delta u_0}_{\text{Run MAP solver and shift vertex}}.
\]
¹I.e. $\|\nabla f(x) - \nabla f(x')\|_* \le L \|x - x'\|$ for $x, x' \in D$. Notice that the dual norm $\|\cdot\|_*$ is needed here.
Algorithm 1: Updates to $\delta$ after a MAP call (Adaptive-$\delta$ variant)
1: At iteration $k$. Assuming $x^{(k)}$, $u_0$, $\delta^{(k-1)}$, $f$ are defined and $s^{(k)}$ has been computed
2: Compute $g(x^{(k)}) = \langle -\nabla f(x^{(k)}), s^{(k)} - x^{(k)} \rangle$  (Compute FW gap)
3: Compute $g_u(x^{(k)}) = \langle -\nabla f(x^{(k)}), u_0 - x^{(k)} \rangle$  (Compute "uniform gap")
4: if $g_u(x^{(k)}) < 0$ then
5:   Let $\tilde\delta = \frac{g(x^{(k)})}{-4 g_u(x^{(k)})}$  (Compute new proposal for $\delta$)
6:   if $\tilde\delta < \delta^{(k-1)}$ then
7:     $\delta^{(k)} = \min\big(\tilde\delta, \frac{\delta^{(k-1)}}{2}\big)$  (Shrink by at least a factor of two if proposal is smaller)
8:   end if
9: end if  (and set $\delta^{(k)} = \delta^{(k-1)}$ if it was not updated)
Therefore, to solve the FW subproblem (3) over $M_\delta$, we can run as usual a MAP solver and simply shift the resulting vertex of M towards $u_0$ to obtain a vertex of $M_\delta$. Our solution to optimize over restrictions of the polytope is more broadly applicable to the optimization problem defined below, with $f$ satisfying Prop. 3.3 (satisfied by the TRW objective) in order to get convergence rates.
Problem 3.2. Solve $\min_{x \in D} f(x)$ where $D$ is a compact convex set and $f$ is convex and continuously differentiable on the relative interior of $D$.
Property 3.3 (Controlled growth of Lipschitz constant over $D_\delta$). 
We define $D_\delta := (1 - \delta)D + \delta u_0$ for a fixed $u_0$ in the relative interior of $D$. We suppose that there exists a fixed $p \ge 0$ and $L$ such that for any $\delta > 0$, $\nabla f(x)$ has a bounded Lipschitz constant $L_\delta \le L\delta^{-p}$ $\forall x \in D_\delta$.
Fixed $\delta$: The first algorithm fixes a value for $\delta$ a priori and performs the optimization over $D_\delta$. The following theorem bounds the sub-optimality of the iterates with respect to the optimum over $D$.
Theorem 3.4 (Suboptimality bound for fixed-$\delta$ algorithm). Let $f$ satisfy the properties in Prob. 3.2 and Prop. 3.3, and suppose further that $f$ is finite on the boundary of $D$. Then the use of Frank-Wolfe for $\min_{x \in D_\delta} f(x)$ realizes a sub-optimality over $D$ bounded as:
\[
f(x^{(k)}) - f(x^*) \le \frac{2 C_\delta}{k + 2} + \omega(\delta \,\mathrm{diam}(D)),
\]
where $x^*$ is the optimal solution in $D$, $C_\delta \le L_\delta \,\mathrm{diam}_{\|\cdot\|}(D_\delta)^2$, and $\omega$ is the modulus of continuity function of the (uniformly) continuous $f$ (in particular, $\omega(\delta) \downarrow 0$ as $\delta \downarrow 0$).
The full proof is given in App. C. The first term of the bound comes from the standard Frank-Wolfe convergence analysis of the sub-optimality of $x^{(k)}$ relative to $x^{*(\delta)}$, the optimum over $D_\delta$, as in (4) and using Prop. 3.3. The second term arises by bounding $f(x^{*(\delta)}) - f(x^*) \le f(\tilde{x}) - f(x^*)$ with a cleverly chosen $\tilde{x} \in D_\delta$ (as $x^{*(\delta)}$ is optimal in $D_\delta$). We pick $\tilde{x} := (1 - \delta)x^* + \delta u_0$ and note that $\|\tilde{x} - x^*\| \le \delta \,\mathrm{diam}(D)$. 
As $f$ is continuous on a compact set, it is uniformly continuous and we thus have $f(\tilde{x}) - f(x^*) \le \omega(\delta \,\mathrm{diam}(D))$ with $\omega$ its modulus of continuity function.
Adaptive $\delta$: The second variant to solve $\min_{x \in D} f(x)$ iteratively performs FW steps over $D_\delta$, but also decreases $\delta$ adaptively. The update schedule for $\delta$ is given in Alg. 1 and is motivated by the convergence proof. The idea is to ensure that the FW gap over $D_\delta$ is always at least half the FW gap over $D$, relating the progress over $D_\delta$ with the one over $D$. It turns out that $\text{FW-gap-}D_\delta = (1 - \delta)\,\text{FW-gap-}D + \delta \cdot g_u(x^{(k)})$, where the "uniform gap" $g_u(x^{(k)})$ quantifies the decrease of the function when contracting towards $u_0$. When $g_u(x^{(k)})$ is negative and large compared to the FW gap, we need to shrink $\delta$ (see step 5 in Alg. 1) to ensure that the $\delta$-modified direction is a sufficient descent direction. We can show that the algorithm converges to the global solution as follows:
Theorem 3.5 (Global convergence for adaptive-$\delta$ variant over $D$). For a function $f$ satisfying the properties in Prob. 3.2 and Prop. 3.3, the sub-optimality of the iterates obtained by running the FW updates over $D_\delta$ with $\delta$ updated according to Alg. 1 is bounded as:
\[
f(x^{(k)}) - f(x^*) \le O\big(k^{-\frac{1}{p+1}}\big).
\]
A full proof with a precise rate and constants is given in App. D. The sub-optimality $h_k := f(x^{(k)}) - f(x^*)$ traverses three stages with an overall rate as above. The updates to $\delta^{(k)}$ as in Alg. 1 enable us to (1) upper bound the duality gap over $D$ as a function of the duality gap in $D_\delta$ and (2) lower bound the value of $\delta^{(k)}$ as a function of $h_k$. Applying the standard Descent Lemma with the Lipschitz constant on the gradient of the form $L\delta^{-p}$ (Prop. 3.3), and replacing $\delta^{(k)}$ by its bound in $h_k$, we get the recurrence: $h_{k+1} \le h_k - C h_k^{p+2}$. Solving this gives us the desired bound.
Algorithm 2: Approximate marginal inference over M (solving (1)). Here $f$ is the negative TRW objective.
1: Function TRW-Barrier-FW($\rho^{(0)}$, $\epsilon$, $\delta^{(\mathrm{init})}$, $u_0$):
2: Inputs: Edge-appearance probabilities $\rho^{(0)}$, $\delta^{(\mathrm{init})} \le \frac{1}{4}$ initial contraction of polytope, inner loop stopping criterion $\epsilon$, fixed reference point $u_0$ in the interior of M. Let $\delta^{(-1)} = \delta^{(\mathrm{init})}$.
3: Let $V := \{u_0\}$ (visited vertices), $x^{(0)} = u_0$  (Initialize the algorithm at the uniform distribution)
4: for $i = 0 \ldots \mathrm{MAX\_RHO\_ITS}$ do  {FW outer loop to optimize $\rho$ over T}
5:   for $k = 0 \ldots \mathrm{MAXITS}$ do  {FCFW inner loop to optimize $x$ over M}
6:     Let $\tilde\theta = \nabla f(x^{(k)}; \vec\theta, \rho^{(i)})$  (Compute gradient)
7:     Let $s^{(k)} \in \arg\min_{v \in M} \langle \tilde\theta, v \rangle$  (Run MAP solver to compute FW vertex)
8:     Compute $g(x^{(k)}) = \langle -\tilde\theta, s^{(k)} - x^{(k)} \rangle$  (Inner loop FW duality gap)
9:     if $g(x^{(k)}) \le \epsilon$ then
10:      break FCFW inner loop  ($x^{(k)}$ is $\epsilon$-optimal)
11:    end if
12:    $\delta^{(k)} = \delta^{(k-1)}$  (For Adaptive-$\delta$: Run Alg. 1 to modify $\delta$)
13:    Let $s^{(k)}_{(\delta)} = (1 - \delta^{(k)})s^{(k)} + \delta^{(k)}u_0$ and $d^{(k)}_{(\delta)} = s^{(k)}_{(\delta)} - x^{(k)}$  ($\delta$-contracted quantities)
14:    $x^{(k+1)} = \arg\min\{f(x^{(k)} + \gamma\, d^{(k)}_{(\delta)}) : \gamma \in [0, 1]\}$  (FW step with line search)
15:    Update correction polytope: $V := V \cup \{s^{(k)}\}$
16:    $x^{(k+1)} :=$ CORRECTION($x^{(k+1)}$, $V$, $\delta^{(k)}$, $\rho^{(i)}$)  (optional: correction step)
17:    $x^{(k+1)}$, $V_{\mathrm{search}}$ := LOCALSEARCH($x^{(k+1)}$, $s^{(k)}$, $\delta^{(k)}$, $\rho^{(i)}$)  (optional: fast MAP solver)
18:    Update correction polytope (with vertices from LOCALSEARCH): $V := V \cup \{V_{\mathrm{search}}\}$
19:  end for
20:  $\rho_v \leftarrow$ minSpanTree(edgesMI($x^{(k)}$))  (FW vertex of the spanning tree polytope)
21:  $\rho^{(i+1)} \leftarrow \rho^{(i)} + (\frac{i}{i+2})(\rho_v - \rho^{(i)})$  (Fixed step-size schedule FW update for $\rho$ kept in relint(T))
22:  $x^{(0)} \leftarrow x^{(k)}$, $\delta^{(-1)} \leftarrow \delta^{(k-1)}$  (Re-initialize for FCFW inner loop)
23:  If $i < \mathrm{MAX\_RHO\_ITS}$ then $x^{(0)} =$ CORRECTION($x^{(0)}$, $V$, $\delta^{(-1)}$, $\rho^{(i+1)}$)
24: end for
25: return $x^{(0)}$ and $\rho^{(i)}$
Application to the TRW Objective: $\min_{\vec\mu \in M} -\mathrm{TRW}(\vec\mu; \vec\theta, \rho)$ is akin to $\min_{x \in D} f(x)$ and the (strong) convexity of $-\mathrm{TRW}(\vec\mu; \vec\theta, \rho)$ has been previously shown (Wainwright et al., 2005; London et al., 2015). The gradient of the TRW objective is Lipschitz continuous over $M_\delta$ since all marginals are strictly positive. Its growth for Prop. 3.3 can be bounded with $p = 1$ as we show in App. E.1. 
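The $\delta$-update of Alg. 1 is simple enough to state in a few lines. The following sketch mirrors its gap computations on flat vectors (the inputs in the usage below are invented toy quantities, not from the paper):

```python
def update_delta(grad, x, s, u0, delta_prev):
    """One update of the contraction level after a MAP call, following Alg. 1."""
    g = sum(-gi * (si - xi) for gi, si, xi in zip(grad, s, x))     # FW gap g(x)
    g_u = sum(-gi * (ui - xi) for gi, ui, xi in zip(grad, u0, x))  # "uniform gap" g_u(x)
    if g_u < 0:
        proposal = g / (-4.0 * g_u)              # delta-tilde = g / (-4 g_u)
        if proposal < delta_prev:
            # shrink by at least a factor of two whenever the proposal is smaller
            return min(proposal, delta_prev / 2.0)
    return delta_prev                            # otherwise keep the previous delta
```

With toy quantities grad = (1, 0), x = (0.5, 0.5), s = (0, 1) and reference point u0 = (0.9, 0.1), the FW gap is 0.5 and the uniform gap is $-0.4$, so the proposal $0.5/1.6 \approx 0.31$ triggers a shrink of $\delta$ from 0.5 down to 0.25.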
This gives a rate of convergence of $O(k^{-1/2})$ for the adaptive-$\delta$ variant, which interestingly is a typical rate for non-smooth convex optimization. The hidden constant is of the order $O(\|\theta\| \cdot |V|)$. The modulus of continuity $\omega$ for the TRW objective is close to linear (it is almost a Lipschitz function), and its constant is instead of the order $O(\|\theta\| + |V|)$.
4 Algorithm
Alg. 2 describes the pseudocode for our proposed algorithm to do marginal inference with $\mathrm{TRW}(\vec\mu; \vec\theta, \rho)$. minSpanTree finds the minimum spanning tree of a weighted graph, and edgesMI($\vec\mu$) computes the mutual information of edges of $G$ from the pseudomarginals in $\vec\mu$² (to perform FW updates over $\rho$ as in Alg. 2 in Wainwright et al. (2005)). It is worthwhile to note that our approach uses three levels of Frank-Wolfe: (1) for the (tightening) optimization of $\rho$ over T, (2) to perform approximate marginal inference, i.e. for the optimization of $\vec\mu$ over M, and (3) to perform the correction steps (lines 16 and 23). We detail a few heuristics that aid practicality.
Fast Local Search: Fast methods for MAP inference such as Iterated Conditional Modes (Besag, 1986) offer a cheap, low cost alternative to a more expensive combinatorial MAP solver. We warm start the ICM solver with the last found vertex $s^{(k)}$ of the marginal polytope. The subroutine LOCALSEARCH (Alg. 6 in Appendix) performs a fixed number of FW updates to the pseudomarginals using ICM as the (approximate) MAP solver.
²The component $ij$ has value $H(\mu_i) + H(\mu_j) - H(\mu_{ij})$.
Re-optimizing over the Vertices of M (FCFW algorithm): As the iterations of FW progress, we keep track of the vertices of the marginal polytope found by Alg. 2 in the set $V$. We make use of these vertices in the CORRECTION subroutine (Alg. 
5 in Appendix) which re-optimizes the objective function over (a contraction of) the convex hull of the elements of $V$ (called the correction polytope). $x^{(0)}$ in Alg. 2 is initialized to the uniform distribution which is guaranteed to be in M (and $M_\delta$). After updating $\rho$, we set $x^{(0)}$ to the approximate minimizer in the correction polytope. The intuition is that changing $\rho$ by a small amount may not substantially modify the optimal $x^*$ (for the new $\rho$) and that the new optimum might be in the convex hull of the vertices found thus far. If so, CORRECTION will be able to find it without resorting to any additional MAP calls. This encourages the MAP solver to search for new, unique vertices instead of rediscovering old ones.
Approximate MAP Solvers: We can swap out the exact MAP solver with an approximate MAP solver. The primal objective plus the (approximate) duality gap may no longer be an upper bound on the log-partition function (black-box MAP solvers could be considered to optimize over an inner bound to the marginal polytope). Furthermore, the gap over $D$ may be negative if the approximate MAP solver fails to find a direction of descent. Since adaptive-$\delta$ requires that the gap be positive in Alg. 1, we take the max over the last gap obtained over the correction polytope (which is always non-negative) and the computed gap over $D$ as a heuristic.
Theoretically, one could get similar convergence rates as in Thm. 3.4 and 3.5 using an approximate MAP solver that has a multiplicative guarantee on the gap (line 8 of Alg. 2), as was done previously for FW-like algorithms (see, e.g., Thm. C.1 in Lacoste-Julien et al. (2013)). 
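The idea behind the correction step, re-optimizing over the convex hull of previously found vertices without issuing new MAP calls, can be sketched as follows (a toy quadratic stands in for the negative TRW objective; names are illustrative, not the paper's implementation):

```python
def correction_step(grad_f, vertices, iters=500):
    """Re-optimize f over conv(vertices) by Frank-Wolfe on the convex weights.

    The linear subproblem only scans the stored vertices, so no new MAP call is made.
    """
    n, dim = len(vertices), len(vertices[0])
    alpha = [1.0 / n] * n                        # start from the barycenter
    def combine(a):
        return [sum(a[v] * vertices[v][d] for v in range(n)) for d in range(dim)]
    for k in range(iters):
        g = grad_f(combine(alpha))
        # best stored vertex under the current linearization
        best = min(range(n), key=lambda v: sum(g[d] * vertices[v][d] for d in range(dim)))
        gamma = 2.0 / (k + 2.0)
        alpha = [(1.0 - gamma) * a for a in alpha]
        alpha[best] += gamma
    return combine(alpha)
```

For instance, with the four corners of the unit square as stored vertices and $f(x) = \|x - (0.3, 0.7)\|^2$, the weights converge to a combination whose image approaches $(0.3, 0.7)$, without any call to an external oracle.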
With an $\epsilon$-additive error guarantee on the MAP solution, one can prove similar rates up to a suboptimality error of $\epsilon$. Even if the approximate MAP solver does not provide an approximation guarantee, if it returns an upper bound on the value of the MAP assignment (as do branch-and-cut solvers for integer linear programs, or Sontag et al. (2008)), one can use this to obtain an upper bound on log Z (see App. J).
5 Experimental Results
Setup: The L1 error in marginals is computed as: $\zeta_\mu := \frac{1}{N}\sum_{i=1}^{N} |\mu_i(1) - \mu_i^*(1)|$. When using exact MAP inference, the error in log Z (denoted $\zeta_{\log Z}$) is computed by adding the duality gap to the primal (since this guarantees us an upper bound). For approximate MAP inference, we plot the primal objective. We use a non-uniform initialization of $\rho$ computed with the Matrix Tree Theorem (Sontag and Jaakkola, 2007; Koo et al., 2007). We perform 10 updates to $\rho$, optimize $\vec\mu$ to a duality gap of 0.5 on M, and always perform correction steps. We use LOCALSEARCH only for the real-world instances. We use the implementation of TRBP and the Junction Tree Algorithm (to compute exact marginals) in libDAI (Mooij, 2010). Unless specified, we compute marginals by optimizing the TRW objective using the adaptive-$\delta$ variant of the algorithm (denoted in the figures as $M_\delta$).
MAP Solvers: For approximate MAP, we run three solvers in parallel: QPBO (Kolmogorov and Rother, 2007; Boykov and Kolmogorov, 2004), TRW-S (Kolmogorov, 2006) and ICM (Besag, 1986) using OpenGM (Andres et al., 2012) and use the result that realizes the highest energy. For exact inference, we use Gurobi Optimization (2015) or toulbar2 (Allouche et al., 2010).
Test Cases: All of our test cases are on binary pairwise MRFs. (1) Synthetic 10 nodes cliques: Same setup as Sontag and Jaakkola (2007, Fig. 
2), with 9 sets of 100 instances each with coupling strength drawn from $U[-\theta, \theta]$ for $\theta \in \{0.5, 1, 2, \ldots, 8\}$. (2) Synthetic Grids: 15 trials with $5 \times 5$ grids. We sample $\theta_i \sim U[-1, 1]$ and $\theta_{ij} \in [-4, 4]$ for nodes and edges. The potentials were $(-\theta_i, \theta_i)$ for nodes and $(\theta_{ij}, -\theta_{ij}; -\theta_{ij}, \theta_{ij})$ for edges. (3) Restricted Boltzmann Machines (RBMs): From the Probabilistic Inference Challenge 2011.³ (4) Horses: Large ($N \approx 12000$) MRFs representing images from the Weizmann Horse Data (Borenstein and Ullman, 2002) with potentials learned by Domke (2013). (5) Chinese Characters: An image completion task from the KAIST Hanja2 database, compiled in OpenGM by Andres et al. (2012). The potentials were learned using Decision Tree Fields (Nowozin et al., 2011). The MRF is not a grid due to skip edges that tie nodes at various offsets. The potentials are a combination of submodular and supermodular and therefore a harder task for inference algorithms.
³http://www.cs.huji.ac.il/project/PASCAL/index.php
Figure 1: Synthetic Experiments. Panels: (a) $\zeta_{\log Z}$: $5 \times 5$ grids, M vs $M_\delta$; (b) $\zeta_{\log Z}$: 10 node cliques, M vs $M_\delta$; (c) $\zeta_\mu$: $5 \times 5$ grids, Approx. vs. Exact MAP; (d) $\zeta_{\log Z}$: 40 node RBM, Approx. vs. Exact MAP; (e) $\zeta_\mu$: 10 node cliques, Optimization over T; (f) $\zeta_{\log Z}$: 10 node cliques, Optimization over T. In Fig. 1(c) & 1(d), we unravel MAP calls across updates to $\rho$. Fig. 1(d) corresponds to a single RBM (not an aggregate over trials) where for "Approx MAP" we plot the absolute error between the primal objective and log Z (not guaranteed to be an upper bound).
On the Optimization of M versus $M_\delta$
We compare the performance of Alg. 
2 on optimizing over M (with and without correction), optimizing over $M_\delta$ with fixed-$\delta = 0.0001$ (denoted $M_{0.0001}$) and optimizing over $M_\delta$ using the adaptive-$\delta$ variant. These plots are averaged across all the trials for the first iteration of optimizing over T. We show error as a function of the number of MAP calls since this is the bottleneck for large MRFs. Fig. 1(a), 1(b) depict the results of this optimization aggregated across trials. We find that all variants settle on the same average error. The adaptive-$\delta$ variant converges faster on average followed by the fixed-$\delta$ variant. Despite relatively quick convergence for M with no correction on the grids, we found that correction was crucial to reducing the number of MAP calls in subsequent steps of inference after updates to $\rho$. As highlighted earlier, correction steps on M (in blue) worsen convergence, an effect brought about by iterates wandering too close to the boundary of M.
On the Applicability of Approximate MAP Solvers
Synthetic Grids: Fig. 1(c) depicts the accuracy of approximate MAP solvers versus exact MAP solvers aggregated across trials for $5 \times 5$ grids. The results using approximate MAP inference are competitive with those of exact inference, even as the optimization is tightened over T. This is an encouraging and non-intuitive result since it indicates that one can achieve high quality marginals through the use of relatively cheaper approximate MAP oracles.
RBMs: As in Salakhutdinov (2008), we observe for RBMs that the bound provided by $\mathrm{TRW}(\vec\mu; \vec\theta, \rho)$ over $L_\delta$ is loose and does not get better when optimizing over T. As Fig. 1(d) depicts for a single RBM, optimizing over $M_\delta$ realizes significant gains in the upper bound on log Z which improves with updates to $\rho$. 
The gains are preserved with the use of the approximate MAP solvers. Note that there are also fast approximate MAP solvers designed specifically for RBMs (Wang et al., 2014).
Horses: See Fig. 2 (right). The models are close to submodular, and the local relaxation is a good approximation to the marginal polytope. Our marginals are visually similar to those obtained by TRBP, and our algorithm is able to scale to large instances by using approximate MAP solvers.

Figure 2: Results on real-world test cases (columns: Ground Truth, MAP, TRBP, FW(1), FW(10)). FW(i) corresponds to the final marginals at the ith iteration of optimizing ρ. The area highlighted on the Chinese Characters depicts the region of uncertainty.

On the Importance of Optimizing over T

Synthetic Cliques: In Fig. 1(e), 1(f), we study the effect of tightening over T against coupling strength θ. We consider the ζµ and ζlog Z obtained for the final marginals before updating ρ (step 19) and compare them to the values obtained after optimizing over T (marked with ρopt). The optimization over T has little effect on TRW optimized over Lδ. For optimization over Mδ, updating ρ yields better marginals and a better bound on log Z (over and above those obtained in Sontag and Jaakkola (2007)).
Chinese Characters: Fig.
2 (left) displays marginals across iterations of optimizing over T. The mixture of submodular and supermodular potentials leads to frustrated models for which Lδ is very loose, causing TRBP to obtain poor results.4 Our method produces reasonable marginals even before the first update to ρ, and these improve with tightening over T.

Related Work for Marginal Inference with MAP Calls

Hazan and Jaakkola (2012) estimate log Z by averaging MAP estimates obtained on randomly perturbed inflated graphs. Our implementation of their method performed well in approximating log Z, but the marginals (estimated by fixing the value of each random variable and estimating log Z for the resulting graph) were less accurate than ours (Fig. 1(e), 1(f)).

6 Discussion

We introduce the first provably convergent algorithm for the TRW objective over the marginal polytope, under the assumption of exact MAP oracles. We quantify the gains obtained both from marginal inference over M and from tightening over the spanning tree polytope. We give heuristics that improve the scalability of Frank-Wolfe when used for marginal inference. The runtime cost of iterative MAP calls (a reasonable rule of thumb is that an approximate MAP call takes roughly the same time as a run of TRBP) is worthwhile, particularly in cases such as the Chinese Characters where L is loose. Specifically, our algorithm is appropriate for domains where marginal inference is hard but efficient MAP solvers capable of handling non-submodular potentials exist. Code is available at https://github.com/clinicalml/fw-inference.
Our work creates a flexible, modular framework for optimizing a broad class of variational objectives, not simply TRW, with guarantees of convergence. We hope that this will encourage more research on building better entropy approximations.
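Entropy terms are where the numerical difficulty of such objectives lives: the gradient of x log x diverges as x → 0, i.e., at the facets of the polytope. A small numeric check (a toy illustration, not from the paper; the values of δ and n are arbitrary) shows why keeping iterates at least δ/n away from each facet bounds the gradient:

```python
import math

# The gradient of the negative-entropy term x * log(x) is 1 + log(x):
# it diverges to -infinity as x approaches the boundary x = 0.
for x in (1e-2, 1e-6, 1e-12):
    print(f"x = {x:g}, gradient = {1.0 + math.log(x):.2f}")

# On a delta-contracted simplex in n dimensions, every coordinate is
# at least delta / n, so the entropy gradient is bounded in magnitude.
delta, n = 1e-4, 25  # illustrative values, e.g. one node of a 5 x 5 grid
bound = abs(1.0 + math.log(delta / n))
print(f"worst-case |gradient| on the contracted set: {bound:.2f}")
```

The bound grows only logarithmically as δ shrinks, which is what makes iteratively decreasing δ viable.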
The framework we adopt is more generally applicable to optimizing functions whose gradients tend to infinity at the boundary of the domain. Our method for handling such diverging gradients bears resemblance to the barrier functions used in interior-point methods, insofar as both bound the solution away from the constraints. Iteratively decreasing δ in our framework can be compared to decreasing the strength of the barrier, enabling the iterates to get closer to the facets of the polytope, although it is worth noting that we do so adaptively.

Acknowledgements

RK and DS gratefully acknowledge the support of the DARPA Probabilistic Programming for Advancing Machine Learning (PPAML) Program under AFRL prime contract no. FA8750-14-C-0005.

4We run TRBP for 1000 iterations using damping = 0.9; the algorithm converges with a max-norm difference between consecutive iterates of 0.002. Tightening over T did not significantly change the results of TRBP.

References

D. Allouche, S. de Givry, and T. Schiex. Toulbar2, an open source exact cost function network solver, 2010.
B. Andres, B.
T., and J. H. Kappes. OpenGM: A C++ library for discrete graphical models, June 2012.
D. Belanger, D. Sheldon, and A. McCallum. Marginal inference in MRFs using Frank-Wolfe. NIPS Workshop on Greedy Optimization, Frank-Wolfe and Friends, 2013.
J. Besag. On the statistical analysis of dirty pictures. J R Stat Soc Series B, 1986.
E. Borenstein and S. Ullman. Class-specific, top-down segmentation. In ECCV, 2002.
Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. TPAMI, 2004.
J. Domke. Learning graphical model parameters with approximate marginal inference. TPAMI, 2013.
S. Ermon, C. P. Gomes, A. Sabharwal, and B. Selman. Taming the curse of dimensionality: Discrete integration by hashing and optimization. In ICML, 2013.
D. Garber and E. Hazan. A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. arXiv preprint arXiv:1301.4666, 2013.
A. Globerson and T. Jaakkola. Convergent propagation algorithms via oriented trees. In UAI, 2007.
Gurobi Optimization, Inc. Gurobi optimizer reference manual, 2015.
T. Hazan and T. Jaakkola. On the partition function and random maximum a-posteriori perturbations. In ICML, 2012.
M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML, 2013.
J. Jancsary and G. Matz. Convergent decomposition solvers for tree-reweighted free energies. In AISTATS, 2011.
J. Kappes et al. A comparative study of modern inference techniques for discrete energy minimization problems. In CVPR, 2013.
V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. TPAMI, 2006.
V. Kolmogorov and C. Rother. Minimizing nonsubmodular functions with graph cuts: a review. TPAMI, 2007.
T. Koo, A. Globerson, X. Carreras, and M. Collins. Structured prediction models via the matrix-tree theorem. In EMNLP-CoNLL, 2007.
S. Lacoste-Julien and M.
Jaggi. On the global linear convergence of Frank-Wolfe optimization variants. In NIPS, 2015.
S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In ICML, 2013.
B. London, B. Huang, and L. Getoor. The benefits of learning with strongly convex approximate inference. In ICML, 2015.
J. M. Mooij. libDAI: A free and open source C++ library for discrete approximate inference in graphical models. JMLR, 2010.
S. Nowozin, C. Rother, S. Bagon, T. Sharp, B. Yao, and P. Kohli. Decision tree fields. In ICCV, 2011.
G. Papandreou and A. Yuille. Perturb-and-MAP random fields: Using discrete optimization to learn and sample from energy models. In ICCV, 2011.
R. Salakhutdinov. Learning and evaluating Boltzmann machines. Technical report, 2008.
S. Shimony. Finding MAPs for belief networks is NP-hard. Artificial Intelligence, 1994.
D. Sontag and T. Jaakkola. New outer bounds on the marginal polytope. In NIPS, 2007.
D. Sontag, T. Meltzer, A. Globerson, Y. Weiss, and T. Jaakkola. Tightening LP relaxations for MAP using message-passing. In UAI, 2008.
M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 2005.
S. Wang, R. Frostig, P. Liang, and C. Manning. Relaxations for inference in restricted Boltzmann machines. In ICLR Workshop, 2014.
", "award": [], "sourceid": 369, "authors": [{"given_name": "Rahul", "family_name": "Krishnan", "institution": "New York University"}, {"given_name": "Simon", "family_name": "Lacoste-Julien", "institution": "INRIA"}, {"given_name": "David", "family_name": "Sontag", "institution": "NYU"}]}