{"title": "Smooth and Strong: MAP Inference with Linear Convergence", "book": "Advances in Neural Information Processing Systems", "page_first": 298, "page_last": 306, "abstract": "Maximum a-posteriori (MAP) inference is an important task for many applications. Although the standard formulation gives rise to a hard combinatorial optimization problem, several effective approximations have been proposed and studied in recent years. We focus on linear programming (LP) relaxations, which have achieved state-of-the-art performance in many applications. However, optimization of the resulting program is in general challenging due to non-smoothness and complex non-separable constraints.Therefore, in this work we study the benefits of augmenting the objective function of the relaxation with strong convexity. Specifically, we introduce strong convexity by adding a quadratic term to the LP relaxation objective. We provide theoretical guarantees for the resulting programs, bounding the difference between their optimal value and the original optimum. Further, we propose suitable optimization algorithms and analyze their convergence.", "full_text": "Smooth and Strong:\n\nMAP Inference with Linear Convergence\n\nOfer Meshi\nTTI Chicago\n\nMehrdad Mahdavi\n\nTTI Chicago\n\nAlexander G. Schwing\nUniversity of Toronto\n\nAbstract\n\nMaximum a-posteriori (MAP) inference is an important task for many applica-\ntions. Although the standard formulation gives rise to a hard combinatorial opti-\nmization problem, several effective approximations have been proposed and stud-\nied in recent years. We focus on linear programming (LP) relaxations, which have\nachieved state-of-the-art performance in many applications. However, optimiza-\ntion of the resulting program is in general challenging due to non-smoothness and\ncomplex non-separable constraints.\nTherefore, in this work we study the bene\ufb01ts of augmenting the objective function\nof the relaxation with strong convexity. 
Specifically, we introduce strong convexity by adding a quadratic term to the LP relaxation objective. We provide theoretical guarantees for the resulting programs, bounding the difference between their optimal value and the original optimum. Further, we propose suitable optimization algorithms and analyze their convergence.

1 Introduction

Probabilistic graphical models are an elegant framework for reasoning about multiple variables with structured dependencies. They have been applied in a variety of domains, including computer vision, natural language processing, computational biology, and many more. Throughout, finding the maximum a-posteriori (MAP) configuration, i.e., the most probable assignment, is one of the central tasks for these models. Unfortunately, in general the MAP inference problem is NP-hard. Despite this theoretical barrier, in recent years it has been shown that approximate inference methods based on linear programming (LP) relaxations often provide high quality MAP solutions in practice. Although tractable in principle, LP relaxations pose a real computational challenge. In particular, for many applications, standard LP solvers perform poorly due to the large number of variables and constraints [33]. Therefore, significant research effort has been put into designing efficient solvers that exploit the special structure of the MAP inference problem.

Some of the proposed algorithms optimize the primal LP directly; however, this is hard due to complex coupling constraints between the variables. Therefore, most of the specialized MAP solvers optimize the dual function, which is often easier since it preserves the structure of the underlying model and facilitates elegant message-passing algorithms. 
Nevertheless, the resulting optimization problem is still challenging since the dual function is piecewise linear and therefore non-smooth. In fact, it was recently shown that LP relaxations for MAP inference are not easier than general LPs [22]. This result implies that there exists an inherent trade-off between the approximation error (accuracy) of the relaxation and its optimization error (efficiency).

In this paper we propose new ways to explore this trade-off. Specifically, we study the benefits of adding strong convexity in the form of a quadratic term to the MAP LP relaxation objective. We show that adding strong convexity to the primal LP results in a new smooth dual objective, which serves as an alternative to soft-max. This smooth objective can be computed efficiently and optimized via gradient-based methods, including accelerated gradient. On the other hand, introducing strong convexity in the dual leads to a new primal formulation in which the coupling constraints are enforced softly, through a penalty term in the objective. This allows us to derive an efficient conditional gradient algorithm, also known as the Frank-Wolfe (FW) algorithm. We can then add strong convexity to both primal and dual to obtain a smooth and strongly convex objective, for which various algorithms enjoy a linear convergence rate. We provide theoretical guarantees for the new objective functions, analyze the convergence rate of the proposed algorithms, and compare them to existing approaches. All of our algorithms are guaranteed to globally converge to the optimal value of the modified objective function. Finally, we show empirically that our methods are competitive with other state-of-the-art algorithms for MAP LP relaxation.

2 Related Work

Several authors proposed efficient approximations for MAP inference based on LP relaxations [e.g., 30]. Kumar et al. [12] show that the LP relaxation dominates other convex relaxations for MAP inference. Due to the complex non-separable constraints, only few of the existing algorithms optimize the primal LP directly. Ravikumar et al. [23] present a proximal point method that requires iterative projections onto the constraints in the inner loop. Inexactness of these iterative projections complicates the convergence analysis of this scheme. In Section 4.1 we show that adding a quadratic term to the dual problem corresponds to a much easier primal in which agreement constraints are enforced softly through a penalty term that accounts for constraint violation. This enables us to derive a simpler projection-free algorithm based on conditional gradient for the primal relaxed program [4, 13]. Recently, Belanger et al. [1] used a different non-smooth penalty term for constraint violation, and showed that it corresponds to box-constraints on dual variables. In contrast, our penalty terms are smooth, which leads to a different objective function and faster convergence guarantees.

Most of the popular algorithms for MAP LP relaxations focus on the dual program and optimize it in various ways. The subgradient algorithm can be applied to the non-smooth objective [11]; however, its convergence rate is rather slow, both in theory and in practice. In particular, the algorithm requires O(1/ε²) iterations to obtain an ε-accurate solution to the dual problem. Algorithms based on coordinate minimization can also be applied [e.g., 6, 10, 31], and often converge fast, but they might get stuck in suboptimal fixed points due to the non-smoothness of the objective. To overcome this limitation it has been proposed to smooth the dual objective using a soft-max function [7, 8]. Coordinate minimization methods are then guaranteed to converge to the optimum of the smoothed objective. Meshi et al. 
[17] have shown that the convergence rate of such algorithms is O(1/γε), where γ is the smoothing parameter. Accelerated gradient algorithms have also been successfully applied to the smooth dual, obtaining an improved convergence rate of O(1/√(γε)), which can be used to obtain an O(1/ε) rate w.r.t. the original objective [24]. In Section 4.2 we propose an alternative smoothing technique, based on adding a quadratic term to the primal objective. We then show how gradient-based algorithms can be applied efficiently to optimize the new objective function.

Other globally convergent methods that have been proposed include augmented Lagrangian [15, 16], bundle methods [9], and a steepest descent approach [25, 26]. However, the convergence rate of these methods in the context of MAP inference has not been analyzed yet, making them hard to compare to other algorithms.

3 Problem Formulation

In this section we formalize MAP inference in graphical models. Consider a set of n discrete variables X_1, ..., X_n, and denote by x_i a particular assignment to variable X_i. We refer to subsets of these variables by r ⊆ {1, ..., n}, also known as regions, and the total number of regions is referred to as q. Each subset is associated with a local score function, or factor, θ_r(x_r). The MAP problem is to find an assignment x which maximizes a global score function that decomposes over the factors:

    max_x Σ_r θ_r(x_r) .

The above combinatorial optimization problem is hard in general, and tractable only in several special cases. Most notably, for tree-structured graphs or super-modular pairwise score functions, efficient dynamic programming algorithms can be applied. Here we do not make such simplifying assumptions and instead focus on approximate inference. In particular, we are interested in approximations based on the LP relaxation, taking the following form:

    max_{μ∈M_L} f(μ) := Σ_r Σ_{x_r} μ_r(x_r) θ_r(x_r) = μ^⊤θ     (1)

    where: M_L = { μ ≥ 0 : Σ_{x_r} μ_r(x_r) = 1 ∀r ;  Σ_{x_p∖x_r} μ_p(x_p) = μ_r(x_r) ∀r, x_r, p : r ∈ p } ,

where 'r ∈ p' represents a containment relationship between the regions p and r. The dual program of the above LP is formulated as minimizing the re-parameterization of factors [32]:

    min_δ g(δ) := Σ_r max_{x_r} ( θ_r(x_r) + Σ_{p:r∈p} δ_{pr}(x_r) − Σ_{c:c∈r} δ_{rc}(x_c) ) ≡ Σ_r max_{x_r} θ̂_r^δ(x_r) .     (2)

This is a piecewise linear function in the dual variables δ. Hence, it is convex (but not strongly) and non-smooth. Two commonly used optimization schemes for this objective are subgradient descent and block coordinate minimization. While the convergence rate of the former can be upper bounded by O(1/ε²), the latter is non-convergent due to the non-smoothness of the objective function.

To remedy this shortcoming, it has been proposed to smooth the objective by replacing the local maximization with a soft-max [7, 8]. The resulting unconstrained program is:

    min_δ g_γ(δ) := Σ_r γ log Σ_{x_r} exp( θ̂_r^δ(x_r) / γ ) .     (3)

This dual form corresponds to adding local entropy terms to the primal given in Eq. (1), obtaining:

    max_{μ∈M_L} Σ_r ( Σ_{x_r} μ_r(x_r) θ_r(x_r) + γ H(μ_r) ) ,     (4)

where H(μ_r) = −Σ_{x_r} μ_r(x_r) log μ_r(x_r) denotes the entropy. The following guarantee holds for the smooth optimal value g_γ*:

    g* ≤ g_γ* ≤ g* + γ Σ_r log V_r ,     (5)

where g* is the optimal value of the dual program given in Eq. (2), and V_r = |r| denotes the number of variables in region r.

The dual given in Eq. (3) is a smooth function with Lipschitz constant L = (1/γ) Σ_r V_r [see 24]. In this case coordinate minimization algorithms are globally convergent (to the smooth optimum), and their convergence rate can be bounded by O(1/γε) [17]. Gradient-based algorithms can also be applied to the smooth dual and have a similar convergence rate of O(1/γε). This can be improved using Nesterov's acceleration scheme to obtain an O(1/√(γε)) rate [24]. The gradient of Eq. (3) takes the simple form:

    ∇_{δ_{pr}(x_r)} g_γ = ( b_r(x_r) − Σ_{x_p∖x_r} b_p(x_p) ) ,  where  b_r(x_r) ∝ exp( θ̂_r^δ(x_r) / γ ) .     (6)

4 Introducing Strong Convexity

In this section we study the effect of adding strong convexity to the objective function. Specifically, we add the Euclidean norm of the variables to either the dual (Section 4.1) or primal (Section 4.2) function. We study the properties of the objectives, and propose appropriate optimization schemes.

4.1 Strong Convexity in the Dual

As mentioned above, the dual given in Eq. (2) is a piecewise linear function, hence not smooth. Introducing strong convexity to control the convergence rate is an alternative to smoothing. We propose to introduce strong convexity by simply adding the L2 norm of the variables to the dual program given in Eq. (2), i.e.,

    min_δ ğ(δ) := g(δ) + (λ/2) ‖δ‖² .     (7)

The corresponding primal objective is then (see Appendix A):

    max_{μ∈Θ} f_λ(μ) := μ^⊤θ − (1/2λ) Σ_{r,x_r,p:r∈p} ( Σ_{x_p∖x_r} μ_p(x_p) − μ_r(x_r) )² = μ^⊤θ − (λ/2) ‖Aμ‖² ,     (8)

where Θ preserves only the separable per-region simplex constraints in M_L, and for convenience we define (Aμ)_{r,x_r,p} = (1/λ) ( Σ_{x_p∖x_r} μ_p(x_p) − μ_r(x_r) ). Importantly, this primal program is similar to the original primal given in Eq. (1), but the non-separable marginalization constraints in M_L are enforced softly, via a penalty term in the objective. Interestingly, the primal in Eq. (8) is somewhat similar to the objective function obtained by the steepest descent approach proposed by Schwing et al. [25], despite being motivated from different perspectives. Similar to Schwing et al. [25], our algorithm below is also based on conditional gradient; however, ours is a single-loop algorithm, whereas theirs employs a double-loop procedure.

We obtain the following guarantee for the optimum of the strongly convex dual (see Appendix C):

    g* ≤ ğ* ≤ g* + (λ/2) h ,     (9)

where h is chosen such that ‖δ*‖² ≤ h. It can be shown that h = (4 M q ‖θ‖_∞)², where M = max_r W_r, and W_r is the number of configurations of region r (see Appendix C). Notice that this bound is worse than the soft-max bound stated in Eq. (5) due to the dependence on the magnitude of the parameters θ and the number of configurations W_r.

Algorithm 1 Block-coordinate Frank-Wolfe for soft-constrained primal
1: Initialize: μ_r(x_r) = 1{x_r = argmax_{x'_r} θ̂_r^{δ(μ)}(x'_r)} for all r, x_r
2: while not converged do
3:   Pick r at random
4:   Let s_r(x_r) = 1{x_r = argmax_{x'_r} θ̂_r^{δ(μ)}(x'_r)} for all x_r
5:   Let η = (θ̂_r^{δ(μ)})^⊤(s_r − μ_r) / ( (1/λ) ( P_r ‖s_r − μ_r‖² + Σ_{c:c∈r} ‖A_rc(s_r − μ_r)‖² ) ), and clip to [0, 1]
6:   Update μ_r ← (1 − η) μ_r + η s_r
7: end while

Optimization It is easy to modify the subgradient algorithm to optimize the strongly convex dual given in Eq. (7). It only requires adding the term λδ to the subgradient. Since the objective is non-smooth and strongly convex, we obtain a convergence rate of O(1/λε) [19]. 
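The block-coordinate Frank-Wolfe step of Algorithm 1 can be illustrated on a generic soft-constrained quadratic over a product of simplices. The sketch below is ours, not the paper's exact construction: it maximizes θ·μ − (λ/2)‖Aμ − b‖² one simplex block at a time, with the exact line search clipped to [0, 1]; the block matrices `A`, the target `b`, and all sizes are synthetic stand-ins for the structured marginalization operator of Eq. (8).

```python
import numpy as np

def objective(mu, theta, A, b, lam):
    """f(mu) = sum_r theta_r . mu_r - (lam/2) ||sum_r A_r mu_r - b||^2."""
    resid = sum(A[i] @ mu[i] for i in range(len(mu))) - b
    return sum(np.dot(theta[i], mu[i]) for i in range(len(mu))) \
        - 0.5 * lam * np.dot(resid, resid)

def bcfw_step(mu, r, theta, A, b, lam):
    """One block-coordinate Frank-Wolfe step on simplex block r."""
    resid = sum(A[i] @ mu[i] for i in range(len(mu))) - b
    grad = theta[r] - lam * (A[r].T @ resid)   # block gradient of f
    # The linear maximizer over a simplex is a vertex: indicator of the argmax.
    s = np.zeros_like(mu[r])
    s[np.argmax(grad)] = 1.0
    d = s - mu[r]
    # Exact line search for the quadratic along d, clipped to [0, 1].
    denom = lam * np.dot(A[r] @ d, A[r] @ d)
    eta = 1.0 if denom <= 1e-12 else float(np.clip(np.dot(grad, d) / denom, 0.0, 1.0))
    mu[r] = mu[r] + eta * d
    return mu
```

As in Algorithm 1, the step size needs no tuning, each update touches a single block, and the iterate stays a convex combination of simplex vertices, so feasibility is maintained for free.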
We note that coordinate descent algorithms for the dual objective are still non-convergent, since the program is still non-smooth. Instead, we propose to optimize the primal given in Eq. (8) via a conditional gradient algorithm [4]. Specifically, in Algorithm 1 we implement the block-coordinate Frank-Wolfe algorithm proposed by Lacoste-Julien et al. [13]. In Algorithm 1 we denote P_r = |{p : r ∈ p}|, we define δ(μ) as δ_{pr}(x_r) = (1/λ) ( Σ_{x_p∖x_r} μ_p(x_p) − μ_r(x_r) ), and A_rc μ_r = Σ_{x_r∖x_c} μ_r(x_r).

In Appendix D we show that the convergence rate of Algorithm 1 is O(1/λε), similar to subgradient in the dual. However, Algorithm 1 has several advantages over subgradient. First, the step-size requires no tuning since the optimal step η is computed analytically. Second, it is easy to monitor the sub-optimality of the current solution by keeping track of the duality gap Σ_r (θ̂_r^{δ(μ)})^⊤(s_r − μ_r), which provides a sound stopping condition.¹ Notice that the basic operation for the update is maximization over the re-parameterization (max_{x_r} θ̂_r^{δ(μ)}(x_r)), which is similar to a subgradient computation. This operation is sometimes cheaper than coordinate minimization, which requires computing max-marginals [see 28]. We also point out that, similar to Lacoste-Julien et al. [13], it is possible to execute Algorithm 1 in terms of dual variables, without storing primal variables μ_r(x_r) for large parent regions (see Appendix E for details). As we demonstrate in Section 5, this can be important when using global factors.

We note that Algorithm 1 can be used with minor modifications in the inner loop of an augmented Lagrangian algorithm [15]. But we show later that this double-loop procedure is not necessary to obtain good results for some applications. Finally, Meshi et al. [18] show how to use the objective in Eq. (8) to obtain an efficient training algorithm for learning the score functions θ from data.

¹ Similar rate guarantees can be derived for the duality gap.

4.2 Strong Convexity in the Primal

We next consider appending the primal given in Eq. (1) with a similar L2 norm, obtaining:

    max_{μ∈M_L} f_γ(μ) := μ^⊤θ − (γ/2) ‖μ‖² .     (10)

It turns out that the corresponding dual function takes the form (see Appendix B):

    min_δ g̃(δ) := Σ_r max_{u∈Δ} ( u^⊤ θ̂_r^δ − (γ/2) ‖u‖² ) = Σ_r ( (1/2γ) ‖θ̂_r^δ‖² − (γ/2) min_{u∈Δ} ‖ θ̂_r^δ/γ − u ‖² ) .     (11)

Thus the dual objective involves scaling the factor reparameterization θ̂_r^δ by 1/γ, and then projecting the resulting vector onto the probability simplex. We denote the result of this projection by u_r (or just u when clear from context). The L2 norm in Eq. (10) has the same role as the entropy terms in Eq. (4), and serves to smooth the dual function. This is a consequence of the well known duality between strong convexity and smoothness [e.g., 21]. In particular, the dual stated in Eq. (11) is smooth with Lipschitz constant L = q/γ.

To calculate the objective value we need to compute the projection u_r onto the simplex for all factors. This can be done by sorting the elements of the scaled reparameterization θ̂_r^δ/γ, and then shifting all elements by the same value such that all positive elements sum to 1. The negative elements are then set to 0 [see, e.g., 3, for details]. Intuitively, we can think of u_r as a max-marginal which does not place weight 1 on the maximum element, but instead spreads the weight among the top scoring elements, if their score is close enough to the maximum. The effect is similar to the soft-max case, where b_r can also be thought of as a soft max-marginal (see Eq. (6)). 
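The sort-and-shift procedure described above is the standard Euclidean projection onto the probability simplex; a minimal sketch (function name ours):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex:
    sort in decreasing order, find the common shift tau, clip negatives to 0."""
    u = np.sort(v)[::-1]                       # decreasing order
    cssv = np.cumsum(u) - 1.0                  # cumulative sum minus target mass
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - cssv / k > 0)[0][-1]  # last coordinate kept positive
    tau = cssv[rho] / (rho + 1.0)              # the shift value
    return np.maximum(v - tau, 0.0)
```

Entries far below the maximum are zeroed out by the shift, which produces the sparse max-marginals discussed in the text; replacing the full sort by pivot-based partitioning lowers the expected cost from O(W log W) to O(W) [3].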
On the other hand, unlike b_r, our max-marginal u_r will most likely be sparse, since only a few elements tend to have scores close to the maximum and hence non-zero value in u_r.

Another interesting property of the dual in Eq. (11) is invariance to shifting, which is also the case for the non-smooth dual provided in Eq. (2) and the soft-max dual given in Eq. (3). Specifically, shifting all elements of δ_{pr}(·) by the same value does not change the objective value, since the projection onto the simplex is shift-invariant.

We next bound the difference between the smooth optimum and the original one. The bound follows easily from the bounded norm of μ_r in the probability simplex:

    f* − (γ/2) q ≤ f_γ* ≤ f* ,   or equivalently:   f* ≤ ( f_γ* + (γ/2) q ) ≤ f* + (γ/2) q .

We actually use the equivalent form on the right in order to get an upper bound rather than a lower bound.² From strong duality we immediately get a similar guarantee for the dual optimum:

    g* ≤ ( g̃* + (γ/2) q ) ≤ g* + (γ/2) q .

Notice that this bound is better than the corresponding soft-max bound stated in Eq. (5), since it does not depend on the scope size of regions, i.e., V_r.

² In our experiments we show the shifted objective value.

Table 1: Summary of objective functions, algorithms and rates. Row and column headers pertain to the dual objective. Previously known approaches are shaded.

Non-smooth, convex:
  Dual: min_δ g(δ) := Σ_r max_{x_r} θ̂_r^δ(x_r) — Subgradient O(1/ε²); CD (non-convergent).
  Primal: max_{μ∈M_L} μ^⊤θ — Proximal projections.
Non-smooth, strongly-convex (Section 4.1):
  Dual: min_δ g(δ) + (λ/2)‖δ‖² — Subgradient O(1/λε).
  Primal: max_{μ∈Θ} μ^⊤θ − (λ/2)‖Aμ‖² — FW O(1/λε).
L2-max, convex (Section 4.2):
  Dual: min_δ g̃(δ) := Σ_r max_{u∈Δ} ( u^⊤θ̂_r^δ − (γ/2)‖u‖² ) — Gradient O(1/γε); Accelerated O(1/√(γε)); CD?
  Primal: max_{μ∈M_L} μ^⊤θ − (γ/2)‖μ‖².
L2-max, strongly-convex (Section 4.3):
  Dual: min_δ g̃(δ) + (λ/2)‖δ‖² — Gradient O((1/λγ) log(1/ε)); Accelerated O((1/√(λγ)) log(1/ε)).
  Primal: max_{μ∈Θ} μ^⊤θ − (λ/2)‖Aμ‖² − (γ/2)‖μ‖² — SDCA O((1 + 1/λγ) log(1/ε)).
Soft-max, convex:
  Dual: min_δ g_γ(δ) := Σ_r γ log Σ_{x_r} exp( θ̂_r^δ(x_r)/γ ) — Gradient O(1/γε); Accelerated O(1/√(γε)); CD O(1/γε).
  Primal: max_{μ∈M_L} μ^⊤θ + γ Σ_r H(μ_r).
Soft-max, strongly-convex (Section 4.3):
  Dual: min_δ g_γ(δ) + (λ/2)‖δ‖² — Gradient O((1/λγ) log(1/ε)); Accelerated O((1/√(λγ)) log(1/ε)).
  Primal: max_{μ∈Θ} μ^⊤θ − (λ/2)‖Aμ‖² + γ Σ_r H(μ_r).

Optimization To solve the dual program given in Eq. (11) we can use gradient-based algorithms. The gradient takes the form:

    ∇_{δ_{pr}(x_r)} g̃ = ( u_r(x_r) − Σ_{x_p∖x_r} u_p(x_p) ) ,

which only requires computing the projection u_r, as in the objective function. Notice that this form is very similar to the soft-max gradient (Eq. (6)), with projections u taking the role of beliefs b. The gradient descent algorithm applies the updates δ ← δ − (1/L) ∇g̃ iteratively. The convergence rate of this scheme for our smooth dual is O(1/γε), which is similar to the soft-max rate [20]. As in the soft-max case, Nesterov's accelerated gradient method achieves a better O(1/√(γε)) rate [see 24]. Unfortunately, it is not clear how to derive efficient coordinate minimization updates for the dual in Eq. (11), since the projection u_r depends on the dual variables in a non-linear manner.

Finally, we point out that the program in Eq. (10) is very similar to the one solved in the inner loop of proximal point methods [23]. Therefore our gradient-based algorithm can be used with minor modifications as a subroutine within such proximal algorithms (this requires mapping the final dual solution to a feasible primal solution [see, e.g., 17]).

4.3 Smooth and Strong

In order to obtain a smooth and strongly convex objective function, we can add an L2 term to the smooth program given in Eq. (11) (similarly possible for the soft-max dual in Eq. (3)). Gradient-based algorithms have a linear convergence rate in this case [20]. Equivalently, we can add an L2 term to the primal in Eq. (8). Although conditional gradient is not guaranteed to converge linearly in this case [5], stochastic dual coordinate ascent (SDCA) does enjoy linear convergence, and can even be accelerated to gain better dependence on the smoothing and convexity parameters [27]. This requires only minor modifications to the algorithms presented above, which are highlighted in Appendix F. To conclude this section, we summarize all objective functions and algorithms in Table 1.

5 Experiments

We now proceed to evaluate the proposed methods on real and synthetic data and compare them to existing state-of-the-art approaches. We begin with a synthetic model adapted from Kolmogorov [10]. This example was designed to show that coordinate descent algorithms might get stuck in suboptimal points due to non-smoothness. We compare the following MAP inference algorithms: non-smooth coordinate descent (CD), non-smooth subgradient descent, smooth CD (for soft-max), gradient descent (GD) and accelerated GD (AGD) with either soft-max or L2 smoothing (Section 4.2), our Frank-Wolfe Algorithm 1 (FW), and the linear convergence variants (Section 4.3).

[Figure 1: Comparison of various inference algorithms on a synthetic model. The objective value as a function of the iterations is plotted. The optimal value is shown in a thin dashed dark line.]

In Fig. 1 we notice that non-smooth CD (light blue, dashed) is indeed stuck at the initial point. Second, we observe that the subgradient algorithm (yellow) is extremely slow to converge. Third, we see that smooth CD algorithms (green) converge nicely to the smooth optimum. Gradient-based algorithms for the same smooth (soft-max) objective (purple) also converge to the same optimum, while AGD is much faster than GD. We can also see that gradient-based algorithms for the L2-smooth objective (red) perform slightly better than their soft-max counterparts. In particular, they have faster convergence and a tighter objective for the same value of the smoothing parameter, as our theoretical analysis suggests. For example, compare the convergence of AGD soft and AGD L2, both with γ = 0.01. For the optimal value, compare CD soft and AGD L2, both with γ = 1. Fourth, we note that the FW algorithm (blue) requires smaller values of the strong-convexity parameter λ in order to achieve high accuracy, as our bound in Eq. (9) predicts. 
We point out that the dependence on the smoothing\nor strong convexity parameter is roughly linear, which is also aligned with our convergence bounds.\nFinally, we see that for this model the smooth and strongly convex algorithms (gray) perform similar\nor even slightly worse than either the smooth-only or strongly-convex-only counterparts.\nIn our experiments we compare the number of iterations rather than runtime of the algorithms since\nthe computational cost per iteration is roughly the same for all algorithms (includes a pass over\nall factors), and the actual runtime greatly depends on the implementation. For example, gradient\ncomputation for L2 smoothing requires sorting factors rather than just maximizing over their values,\nincurring worst-case cost of O(Wr log Wr) per factor instead of just O(Wr) for soft-max gradient.\nHowever, one can use partitioning around a pivot value instead of sorting, yielding O(Wr) cost\nin expectation [3], and caching the pivot can also speed-up the runtime considerably. Moreover,\nlogarithm and exponent operations needed by the soft-max gradient are much slower than the basic\noperations used for computing the L2 smooth gradient. As another example, we point out that AGD\nalgorithms can be further improved by searching for the effective Lipschitz constant rather than\nusing the conservative bound L (see [24] for more details). In order to abstract away these details\nwe compare the iteration cost of the vanilla versions of all algorithms.\nWe next conduct experiments on real data from a protein side-chain prediction problem from\nYanover et al. [33]. This problem can be cast as MAP inference in a model with unary and pairwise\nfactors. Fig. 2 (left) shows the convergence of various MAP algorithms for one of the proteins (sim-\nilar behavior was observed for the other instances). The behavior is similar to the synthetic example\nabove, except for the much better performance of non-smooth coordinate descent. 
In particular, we\nsee that coordinate minimization algorithms perform very well in this setting, better than gradient-\nbased and the FW algorithms (this \ufb01nding is consistent with previous work [e.g., 17]). Only a closer\nlook (Fig. 2, left, bottom) reveals that smoothing actually helps to obtain a slightly better solution\nhere. In particular, the soft-max CD (with = 0.001) and L2-max AGD (with = 0.01), as well\nas the primal (SDCA) and dual (AGD) algorithms for the smooth and strongly convex objective, are\nable to recover the optimal solution within the allowed number of iterations. The non-smooth FW\nalgorithm also \ufb01nds a near-optimal solution.\nFinally, we apply our approach to an image segmentation problem with a global cardinality factor.\nSpeci\ufb01cally, we use the Weizmann Horse dataset for foreground-background segmentation [2]. All\nimages are resized to 150 \u21e5 150 pixels, and we use 50 images to learn the parameters of the model\nand the other 278 images to test inference. Our model consists of unary and pairwise factors along\nwith a single global cardinality factor, that serves to encourage segmentations where the number of\nforeground pixels is not too far from the trainset mean. Speci\ufb01cally, we use the cardinality factor\nfrom Li and Zemel [14], de\ufb01ned as: \u2713c(x) = max{0,|s s0| t}2, where s =Pi xi. 
Here, s0 is a reference cardinality computed from the training set, and t is a tolerance parameter, set to t = s0/5.

[Figure 2 appears here: objective value vs. iterations. The left panel compares non-smooth CD, subgradient, soft-max CD, L2 and soft-max AGD, FW, and the AGD and SDCA variants for the smooth and strongly convex objective, at parameter values 0.01 and 0.001; the right panel compares Subgradient, MPLP, and FW.]

Figure 2: (Left) Comparison of MAP inference algorithms on a protein side-chain prediction problem. In the upper figure the solid lines show the optimized objective for each algorithm, and the dashed lines show the score of the best decoded solution (obtained via simple rounding). The bottom figure shows the value of the decoded solution in more detail. (Right) Comparison of MAP inference algorithms on an image segmentation problem. Again, solid lines show the value of the optimized objective while dashed lines show the score of the best decoded solution so far.

First we notice that not all of the algorithms are efficient in this setting. In particular, algorithms that optimize the smooth dual (with either soft-max or L2 smoothing) need to enumerate factor configurations in order to compute updates, which is prohibitive for the global cardinality factor. We therefore take the non-smooth subgradient and coordinate descent [MPLP, 6] as baselines, and compare their performance to that of our FW Algorithm 1 (with its parameter set to 0.01).
We use the variant that does not store primal variables for the global factor (Appendix E). We point out that MPLP requires calculating max-marginals for the factors, rather than the simple maximization needed by subgradient and FW. In the case of cardinality factors this can be done at similar cost using dynamic programming [29]; however, there are other types of factors for which max-marginal computation might be more expensive than max [28].

In Fig. 2 (right) we show a typical run for a single image, where we limit the number of iterations to 10K. We observe that subgradient descent is again very slow to converge, and coordinate descent is also rather slow here (in fact, it is not even guaranteed to reach the optimum). In contrast, our FW algorithm converges orders of magnitude faster and finds a high quality solution (for a runtime comparison see Appendix G). Over the entire 278 test instances we found that FW obtains the highest-scoring solution for 237 images, while MPLP finds the best solution for only 41 images, and subgradient never wins. To explain this success, recall that our algorithm enforces the agreement constraints between factor marginals only softly. It makes sense that in this setting it is not crucial to reach full agreement between the cardinality factor and the other factors in order to obtain a good solution.

6 Conclusion

In this paper we studied the benefits of strong convexity for MAP inference. We introduced a simple L2 term to make either the dual or primal LP relaxation strongly convex. We analyzed the resulting objective functions and provided theoretical guarantees for their optimal values. We then proposed several optimization algorithms and derived upper bounds on their convergence rates. Using the same machinery, we obtained smooth and strongly convex objective functions, for which our algorithms retain linear convergence guarantees.
Our approach offers new ways to trade off the approximation error of the relaxation against the optimization error. Indeed, we showed empirically that our methods significantly outperform strong baselines on problems involving cardinality potentials. To extend our work, we aim to target natural language processing applications, since they share characteristics with the image segmentation task investigated here. Finally, we were unable to derive closed-form coordinate minimization updates for our L2-smooth dual in Eq. (11). We hope to find alternative smoothing techniques which facilitate even more efficient updates.

References

[1] D. Belanger, A. Passos, S. Riedel, and A. McCallum. Message passing for soft constraint dual decomposition. In UAI, 2014.

[2] E. Borenstein, E. Sharon, and S. Ullman. Combining top-down and bottom-up segmentation. In CVPR, 2004.

[3] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In ICML, pages 272–279, 2008.

[4] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:95–110, 1956.

[5] D. Garber and E. Hazan. A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. arXiv preprint arXiv:1301.4666, 2013.

[6] A. Globerson and T. Jaakkola. Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations. In NIPS. MIT Press, 2008.

[7] T. Hazan and A. Shashua. Norm-product belief propagation: Primal-dual message-passing for approximate inference. IEEE Transactions on Information Theory, 56(12):6294–6316, 2010.

[8] J. Johnson. Convex Relaxation Methods for Graphical Models: Lagrangian and Maximum Entropy Approaches. PhD thesis, EECS, MIT, 2008.

[9] J. H. Kappes, B. Savchynskyy, and C. Schnörr. A bundle approach to efficient MAP-inference by Lagrangian relaxation.
In CVPR, 2012.

[10] V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1568–1583, 2006.

[11] N. Komodakis, N. Paragios, and G. Tziritas. MRF energy minimization and beyond via dual decomposition. IEEE PAMI, 2010.

[12] M. P. Kumar, V. Kolmogorov, and P. H. S. Torr. An analysis of convex relaxations for MAP estimation of discrete MRFs. JMLR, 10:71–106, 2009.

[13] S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In ICML, pages 53–61, 2013.

[14] Y. Li and R. Zemel. High order regularization for semi-supervised learning of structured output problems. In ICML, pages 1368–1376, 2014.

[15] A. L. Martins, M. A. T. Figueiredo, P. M. Q. Aguiar, N. A. Smith, and E. P. Xing. An augmented Lagrangian approach to constrained MAP inference. In ICML, pages 169–176, 2011.

[16] O. Meshi and A. Globerson. An alternating direction method for dual MAP LP relaxation. In ECML, 2011.

[17] O. Meshi, T. Jaakkola, and A. Globerson. Convergence rate analysis of MAP coordinate minimization algorithms. In NIPS, pages 3023–3031, 2012.

[18] O. Meshi, N. Srebro, and T. Hazan. Efficient training of structured SVMs via soft constraints. In AISTATS, 2015.

[19] A. Nemirovski and D. Yudin. Problem complexity and method efficiency in optimization. Wiley, 1983.

[20] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Kluwer Academic Publishers, 2004.

[21] Y. Nesterov. Smooth minimization of non-smooth functions. Math. Prog., 103(1):127–152, 2005.

[22] D. Prusa and T. Werner. Universality of the local marginal polytope. In CVPR, pages 1738–1743. IEEE, 2013.

[23] P. Ravikumar, A. Agarwal, and M. J. Wainwright.
Message-passing for graph-structured linear programs: Proximal methods and rounding schemes. JMLR, 11:1043–1080, 2010.

[24] B. Savchynskyy, S. Schmidt, J. Kappes, and C. Schnörr. A study of Nesterov's scheme for Lagrangian decomposition and MAP labeling. In CVPR, 2011.

[25] A. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun. Globally convergent dual MAP LP relaxation solvers using Fenchel-Young margins. In NIPS, 2012.

[26] A. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun. Globally convergent parallel MAP LP relaxation solver using the Frank-Wolfe algorithm. In ICML, 2014.

[27] S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In ICML, 2014.

[28] D. Sontag, A. Globerson, and T. Jaakkola. Introduction to dual decomposition for inference. In Optimization for Machine Learning, pages 219–254. MIT Press, 2011.

[29] D. Tarlow, I. Givoni, and R. Zemel. HOP-MAP: Efficient message passing with high order potentials. In AISTATS, volume 9, pages 812–819. JMLR: W&CP, 2010.

[30] M. Wainwright, T. Jaakkola, and A. Willsky. MAP estimation via agreement on trees: message-passing and linear programming. IEEE Transactions on Information Theory, 51(11):3697–3717, 2005.

[31] T. Werner. A linear programming approach to max-sum problem: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(7):1165–1179, 2007.

[32] T. Werner. Revisiting the linear programming relaxation approach to Gibbs energy minimization and weighted constraint satisfaction. IEEE PAMI, 32(8):1474–1488, 2010.

[33] C. Yanover, T. Meltzer, and Y. Weiss. Linear programming relaxations and belief propagation – an empirical study.
Journal of Machine Learning Research, 7:1887–1907, 2006.