{"title": "Convergence Rate Analysis of MAP Coordinate Minimization Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 3014, "page_last": 3022, "abstract": "Finding maximum aposteriori (MAP) assignments in graphical models is an important task in many applications. Since the problem is generally hard, linear programming (LP) relaxations are often used. Solving these relaxations efficiently is thus an important practical problem. In recent years, several authors have proposed message passing updates corresponding to coordinate descent in the dual LP. However,these are generally not guaranteed to converge to a global optimum. One approach to remedy this is to smooth the LP, and perform coordinate descent on the smoothed dual. However, little is known about the convergence rate of this procedure. Here we perform a thorough rate analysis of such schemes and derive primal and dual convergence rates. We also provide a simple dual to primal mapping that yields feasible primal solutions with a guaranteed rate of convergence. Empirical evaluation supports our theoretical claims and shows that the method is highly competitive with state of the art approaches that yield global optima.", "full_text": "Convergence Rate Analysis of MAP Coordinate\n\nMinimization Algorithms\n\nOfer Meshi \u2217\n\nTommi Jaakkola \u2020\n\nAmir Globerson \u2217\n\nmeshi@cs.huji.ac.il\n\ntommi@csail.mit.edu\n\ngamir@cs.huji.ac.il\n\nAbstract\n\nFinding maximum a posteriori (MAP) assignments in graphical models is an im-\nportant task in many applications. Since the problem is generally hard, linear pro-\ngramming (LP) relaxations are often used. Solving these relaxations ef\ufb01ciently\nis thus an important practical problem. In recent years, several authors have pro-\nposed message passing updates corresponding to coordinate descent in the dual\nLP. 
However, these are generally not guaranteed to converge to a global optimum. One approach to remedy this is to smooth the LP, and perform coordinate descent on the smoothed dual. However, little is known about the convergence rate of this procedure. Here we perform a thorough rate analysis of such schemes and derive primal and dual convergence rates. We also provide a simple dual-to-primal mapping that yields feasible primal solutions with a guaranteed rate of convergence. Empirical evaluation supports our theoretical claims and shows that the method is highly competitive with state-of-the-art approaches that yield global optima.

1 Introduction

Many applications involve simultaneous prediction of multiple variables. For example, we may seek to label pixels in an image, infer amino acid residues in protein design, or find the semantic role of words in a sentence. These problems can be cast as maximizing a function over a set of labels (or minimizing an energy function). The function typically decomposes into a sum of local functions over overlapping subsets of variables.

Such maximization problems are nevertheless typically hard. Even for simple decompositions (e.g., subsets correspond to pairs of variables), maximizing over the set of labels is often provably NP-hard. One approach would be to reduce the problem to a tractable one, e.g., by constraining the model to a low tree-width graph. However, empirically, using more complex interactions together with approximate inference methods is often advantageous. One popular family of approximate methods is the linear programming (LP) relaxation approach. Although these LPs are generally tractable, general purpose LP solvers typically do not exploit the problem structure [28]. Therefore a great deal of effort has gone into designing solvers that are specifically tailored to typical MAP-LP relaxations.
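To make the setup concrete, the kind of decomposed maximization described in the introduction can be written down exactly for a toy model and solved by enumeration. This is a sketch under our own assumptions: the three-variable model, its factor values, and all names below are invented for illustration, and real instances are far too large for such enumeration, which is what motivates the LP relaxation approach.

```python
import itertools

# A toy instance of the decomposed maximization problems described above:
# three binary variables, singleton factors, and two pairwise factors.
# All values are invented for illustration.
theta_i = {0: [0.0, 0.5], 1: [0.2, 0.0], 2: [0.0, 0.3]}
theta_c = {(0, 1): {(0, 0): 1.0, (0, 1): -1.0, (1, 0): -1.0, (1, 1): 1.0},
           (1, 2): {(0, 0): 1.0, (0, 1): -1.0, (1, 0): -1.0, (1, 1): 1.0}}

def score(x):
    """Sum of singleton and pairwise factors for a full assignment x."""
    return (sum(theta_i[i][x[i]] for i in theta_i)
            + sum(t[(x[c[0]], x[c[1]])] for c, t in theta_c.items()))

# Exhaustive maximization is exponential in the number of variables, which
# is exactly why the LP relaxations discussed here are used instead.
best = max(itertools.product((0, 1), repeat=3), key=score)
```

With these attractive pairwise factors the maximizer agrees across variables, illustrating how the local factors jointly determine the global assignment.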
These include, for example, cut based algorithms [2], accelerated gradient methods [8], and augmented Lagrangian methods [10, 12]. One class of particularly simple algorithms, which we will focus on here, are coordinate minimization based approaches. Examples include max-sum-diffusion [25], MPLP [5] and TRW-S [9]. These work by first taking the dual of the LP and then optimizing the dual in a block coordinate fashion [21]. In many cases, the coordinate block operations can be performed in closed form, resulting in updates quite similar to the max-product message passing algorithm.

*School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel
†CSAIL, MIT, Cambridge, MA

By coordinate minimization we mean that at each step a set of coordinates is chosen, all other coordinates are fixed, and the chosen coordinates are set to their
While asymptotic\nconvergence is relatively well understood [22], \ufb01nite rates have been harder to obtain. Recent work\n[17] provides rates for rather limited settings which do not hold in our case. On the other hand,\nfor coordinate descent methods, some rates have been recently obtained for greedy and stochastic\nupdate schemes [16, 20]. These do not apply directly to the full coordinate minimization case which\nwe study. A related analysis of MAP-LP using smoothing appeared in [3]. However, their approach\nis speci\ufb01c to LDPC codes, and does not apply to general MAP problems as we analyze here.\n\n2 MAP and LP relaxations\nConsider a set of n discrete variables x1, . . . , xn, and a set C of subsets of these variables (i.e., c \u2208 C\nis a subset of {1, . . . , n}). We consider maximization problems over functions that decompose\naccording to these subsets. In particular, each subset c is associated with a local function or factor\n\u03b8c(xc) and we also include factors \u03b8i(xi) for each individual variable.1 The MAP problem is to \ufb01nd\nan assignment x = (x1, . . . , xn) to all the variables which maximizes the sum of these factors:\n\nMAP(\u03b8) = max\n\nx\n\n\u03b8c(xc) +\n\n\u03b8i(xi)\n\n(1)\n\nLinear programming relaxations are a popular approach to approximating combinatorial optimiza-\ntion problems [6, 23, 25]. For example, we obtain a relaxation of the discrete optimization problem\ngiven in Eq. (1) by replacing it with the following linear program:2\n\n(cid:88)\n\nc\u2208C\n\nn(cid:88)\n\ni=1\n\nP M AP : max\n\u00b5\u2208ML\n\nP (\u00b5) = max\n\u00b5\u2208ML\n\n\u03b8c(xc)\u00b5c(xc) +\n\n\u03b8i(xi)\u00b5i(xi)\n\n= max\n\u00b5\u2208ML\n\nwhere P (\u00b5) is the primal (linear) objective and the local marginal polytope ML enforces basic\nconsistency constraints on the marginals {\u00b5i(xi),\u2200xi} and {\u00b5c(xc),\u2200xc}. 
Speci\ufb01cally,\n\nML =\n\n\u00b5 \u2265 0 :\n\n\u00b5c(xc) = \u00b5i(xi) \u2200c, i \u2208 c, xi\n\u00b5i(xi) = 1\n\n\u2200i\n\nxc\\i\n\nxi\n\n\u00b5 \u00b7 \u03b8 (2)\n\n(3)\n\n(cid:88)\n\n(cid:88)\n\ni\n\nxi\n\n(cid:27)\n\n(cid:27)\n\n(cid:26)(cid:88)\n\n(cid:88)\n\nc\n\nxc\n\n(cid:80)\n(cid:80)\n\n(cid:26)\n\nIf the maximizer of P M AP has only integral values (i.e., 0 or 1) it can be used to \ufb01nd the MAP\nassignment (e.g., by taking the xi that maximizes \u00b5i(xi)). However, in the general case the solution\nmay be fractional [24] and the maximum of P M AP is an upper bound on MAP(\u03b8).\n\n2.1 Smoothing the LP\n\nAs mentioned earlier, several authors have considered a smoothed version of the LP in Eq. (2).\nAs we shall see, this offers several advantages over solving the LP directly. Given a smoothing\nparameter \u03c4 > 0, we consider the following smoothed primal problem:\n\nP M AP\u03c4 : max\n\u00b5\u2208ML\n\nP\u03c4 (\u00b5) = max\n\u00b5\u2208ML\n\n\u00b5 \u00b7 \u03b8 +\n\n1\n\u03c4\n\nH(\u00b5c) +\n\n1\n\u03c4\n\nH(\u00b5i)\n\n(4)\n\n(cid:26)\n\n(cid:88)\n\nc\n\n(cid:27)\n\n(cid:88)\n\ni\n\n1Although singleton factors are not needed for generality, we keep them for notational convenience.\n2We use \u00b5 and \u03b8 to denote vectors consisting of all \u00b5 and \u03b8 values respectively.\n\n2\n\n\fwhere H(\u00b5c) and H(\u00b5i) are local entropy terms. Note that as \u03c4 \u2192 \u221e we obtain the original primal\nproblem. In fact, a stronger result can be shown. Namely, that the optimal value of P M AP is O( 1\n\u03c4 )\nclose to the optimal value of P M AP\u03c4 . This justi\ufb01es using the smoothed objective P\u03c4 as a proxy to\nP in Eq. (2). We express this in the following lemma (which appears in similar forms in [7, 15]).\nLemma 2.1. Denote by \u00b5\u2217 the optimum of problem P M AP in Eq. (2) and by \u02c6\u00b5\u2217 the optimum of\nproblem P M AP\u03c4 in Eq. (4). 
Then:

$$\hat\mu^* \cdot \theta \;\le\; \mu^* \cdot \theta \;\le\; \hat\mu^* \cdot \theta + \frac{H_{\max}}{\tau} \qquad (5)$$

where $H_{\max} = \sum_c \log |x_c| + \sum_i \log |x_i|$. In other words, the smoothed optimum is an O(1/τ)-optimal solution of the original non-smoothed problem.

We shall be particularly interested in the dual of P^{MAPτ} since it facilitates simple coordinate minimization updates. Our dual variables will be denoted by δ_ci(x_i), which can be interpreted as the messages from subset c to node i about the value of variable x_i. The dual variables are therefore indexed by (c, i, x_i) and written as δ_ci(x_i). The dual objective can be shown to be:

$$F(\delta) = \sum_c \frac{1}{\tau} \log \sum_{x_c} \exp\Big(\tau \theta_c(x_c) - \tau \sum_{i: i \in c} \delta_{ci}(x_i)\Big) + \sum_i \frac{1}{\tau} \log \sum_{x_i} \exp\Big(\tau \theta_i(x_i) + \tau \sum_{c: i \in c} \delta_{ci}(x_i)\Big) \qquad (6)$$

The dual problem is an unconstrained smooth minimization problem:

$$D^{MAP\tau}: \;\; \min_{\delta} F(\delta) \qquad (7)$$

Convex duality implies that the optima of D^{MAPτ} and P^{MAPτ} coincide.

Finally, we shall be interested in transformations between dual variables δ and primal variables μ (see Section 5). The following are the transformations obtained from the Lagrangian derivation (i.e., they can be used to switch from optimal dual variables to optimal primal variables):

$$\mu_c(x_c; \delta) \propto \exp\Big(\tau \theta_c(x_c) - \tau \sum_{i: i \in c} \delta_{ci}(x_i)\Big), \qquad \mu_i(x_i; \delta) \propto \exp\Big(\tau \theta_i(x_i) + \tau \sum_{c: i \in c} \delta_{ci}(x_i)\Big) \qquad (8)$$

We denote the vector of all such marginals by μ(δ).
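As a sanity check of these definitions, the smoothed dual objective F(δ) of Eq. (6) and the singleton marginals of Eq. (8) can be evaluated directly. The sketch below uses a tiny made-up model (two binary variables, one pairwise factor); the model values and helper names are ours, not the paper's.

```python
import math

# Evaluating the smoothed dual objective F(delta) of Eq. (6) and the singleton
# marginals mu_i(x_i; delta) of Eq. (8) on a tiny made-up model.
tau = 2.0
theta_i = {0: [0.0, 0.5], 1: [0.2, 0.0]}
theta_c = {(0, 1): {(0, 0): 1.0, (0, 1): -1.0, (1, 0): -1.0, (1, 1): 1.0}}
delta = {(0, 1): {0: [0.0, 0.0], 1: [0.0, 0.0]}}  # one message per (c, i, x_i)

def F(delta):
    """Sum of per-factor and per-variable log-sum-exp terms, Eq. (6)."""
    val = 0.0
    for c, table in theta_c.items():
        val += math.log(sum(
            math.exp(tau * (th - sum(delta[c][i][xc[k]] for k, i in enumerate(c))))
            for xc, th in table.items())) / tau
    for i, th in theta_i.items():
        val += math.log(sum(
            math.exp(tau * (th[xi] + sum(d[i][xi] for c, d in delta.items() if i in c)))
            for xi in (0, 1))) / tau
    return val

def mu_i(delta, i):
    """Normalized singleton marginals mu_i(x_i; delta), Eq. (8)."""
    w = [math.exp(tau * (theta_i[i][xi] + sum(d[i][xi] for c, d in delta.items() if i in c)))
         for xi in (0, 1)]
    return [v / sum(w) for v in w]
```

By weak duality, any F(δ), even at δ = 0, upper bounds the smoothed primal optimum and hence (since the entropies are non-negative and the LP relaxes the discrete problem) the MAP value of this toy model.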
For the dual variables δ that minimize F(δ) it holds that μ(δ) are feasible (i.e., μ(δ) ∈ M_L). However, we will also consider μ(δ) for non-optimal δ, and show how to obtain primal feasible approximations from μ(δ). These will be helpful in obtaining primal convergence rates.

It is easy to see that

$$(\nabla F(\delta^t))_{c,i,x_i} = \mu_i(x_i; \delta^t) - \mu_c(x_i; \delta^t)$$

where (with some abuse of notation) we denote $\mu_c(x_i) = \sum_{x_{c \setminus i}} \mu_c(x_{c \setminus i}, x_i)$. The elements of the gradient thus correspond to inconsistency between the marginals μ(δ^t) (i.e., the degree to which they violate the constraints in Eq. (3)). We shall make repeated use of this fact to link primal and dual variables.

3 Coordinate Minimization Algorithms

In this section we propose several coordinate minimization procedures for solving D^{MAPτ} (Eq. (7)). We first set some notation to define block coordinate minimization algorithms. Denote the objective we want to minimize by F(δ), where δ corresponds to a set of N variables. Now define S = {S_1, ..., S_M} as a set of subsets, where each subset S_i ⊆ {1, ..., N} describes a coordinate block. We will assume that S_i ∩ S_j = ∅ for all i, j and that ∪_i S_i = {1, ..., N}.

Block coordinate minimization algorithms work as follows: at each iteration, first set δ^{t+1} = δ^t. Next choose a block S_i and set:

$$\delta^{t+1}_{S_i} = \arg\min_{\delta_{S_i}} F_i(\delta_{S_i}; \delta^t) \qquad (9)$$

where we use F_i(δ_{S_i}; δ^t) to denote the function F restricted to the variables δ_{S_i}, where all other variables are set to their value in δ^t. In other words, at each iteration we fully optimize only over the variables δ_{S_i} while fixing all other variables. We assume that the minimization step in Eq.
(9) can be solved in closed form, which is indeed the case for the updates we consider. Regarding the choice of an update schedule, several options are available:

• Cyclic: Decide on a fixed order (e.g., S_1, ..., S_M) and cycle through it.
• Stochastic: Draw an index i uniformly at random³ at each iteration and use the block S_i.
• Greedy: Denote by ∇_{S_i}F(δ^t) the gradient ∇F(δ^t) evaluated at coordinates S_i only. The greedy scheme is to choose the S_i that maximizes ‖∇_{S_i}F(δ^t)‖_∞. In other words, choose the set of coordinates that corresponds to the maximum gradient of the function F. Intuitively this corresponds to choosing the block that promises the maximal (local) decrease in objective. Note that to find the best coordinate we presumably must process all sets S_i to find the best one. We will show later that this can be done rather efficiently in our case.

In our analysis, we shall focus on the Stochastic and Greedy cases, and analyze their rates of convergence. The cyclic case is typically hard to analyze, with results only under multiple conditions which do not hold here (e.g., see [17]).

Another consideration when designing coordinate minimization algorithms is the choice of block size. One possible choice is all variables δ_ci(·) (for a specific pair c, i). This is the block chosen in the max-sum-diffusion (MSD) algorithm (see [25] and [26] for non-smooth and smooth MSD). A larger block that also facilitates closed form updates is the set of variables δ_{·i}(·), namely, all messages into a variable i from the subsets c such that i ∈ c. We call this a star update.
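The greedy and stochastic schedules above can be sketched generically. The block layout and gradient values in this minimal illustration are invented; it is not the paper's implementation.

```python
import random

# A generic sketch of the update schedules described above (greedy vs.
# stochastic block choice); block layout and gradient values are invented.
def choose_block(blocks, grad, schedule="greedy", rng=random.Random(0)):
    """blocks: list of coordinate-index lists; grad: full gradient of F."""
    if schedule == "stochastic":
        return rng.randrange(len(blocks))  # uniform random block
    # Greedy: the block with the largest infinity-norm of its block gradient.
    return max(range(len(blocks)),
               key=lambda b: max(abs(grad[j]) for j in blocks[b]))

blocks = [[0, 1], [2, 3], [4, 5]]
grad = [0.1, -0.2, 0.05, 0.3, -0.25, 0.0]
picked = choose_block(blocks, grad)  # block [2, 3], since |0.3| is largest
```

The naive greedy scan is linear in the number of blocks; Section 4.4 explains why, for the star update, a priority queue makes this cost negligible in practice.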
The update is used in [13] for the non-smoothed dual (but the possibility of applying it to the smoothed version is mentioned there). For simplicity, we focus here only on the star update, but the derivation is similar for other choices. To derive the star update around variable i, one needs to fix all variables except δ_{·i}(·) and then set the latter to minimize F(δ). Since F(δ) is differentiable this is pretty straightforward. The update turns out to be:⁴

$$\delta^{t+1}_{ci}(x_i) = \delta^t_{ci}(x_i) + \frac{1}{\tau} \log \mu^t_c(x_i) - \frac{1}{N_i + 1} \cdot \frac{1}{\tau} \log \Big( \mu^t_i(x_i) \cdot \prod_{c': i \in c'} \mu^t_{c'}(x_i) \Big) \qquad (10)$$

where N_i = |{c : i ∈ c}|. It is interesting to consider the improvement in F(δ) as a result of the star update. It can be shown to be exactly:

$$F(\delta^t) - F(\delta^{t+1}) = -\frac{1}{\tau} \log \Bigg( \sum_{x_i} \Big( \mu^t_i(x_i) \cdot \prod_{c: i \in c} \mu^t_c(x_i) \Big)^{\frac{1}{N_i + 1}} \Bigg)^{N_i + 1}$$

The RHS is known as Matusita's divergence measure [11], and is a generalization of the Bhattacharyya divergence to several distributions. Thus the improvement can be easily computed before actually applying the update, and is directly related to how consistent the N_i + 1 distributions μ^t_c(x_i), μ^t_i(x_i) are. Recall that at the optimum they all agree, as μ ∈ M_L, and thus the expected improvement is zero.

4 Dual Convergence Rate Analysis

We begin with the convergence rates of the dual F using the greedy and random schemes described in Section 3. In Section 5 we subsequently show how to obtain a primal feasible solution and how the dual rates give rise to primal rates.
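A minimal numerical sketch of the star update of Eq. (10): the two-variable toy model and helper names below are ours, for illustration only. Applying the update should never increase the dual objective, matching the improvement formula above, and afterwards the N_i + 1 distributions around the updated variable agree.

```python
import math
from itertools import product

# A sketch of the star update of Eq. (10) on a made-up model with two binary
# variables and one pairwise factor (values and names are ours).
tau = 2.0
theta_i = [[0.0, 0.5], [0.2, 0.0]]
theta_c = {(0, 1): {xc: (1.0 if xc[0] == xc[1] else -1.0)
                    for xc in product((0, 1), repeat=2)}}
delta = {(0, 1): {0: [0.0, 0.0], 1: [0.0, 0.0]}}

def F(delta):
    """Smoothed dual objective, Eq. (6)."""
    val = 0.0
    for c, table in theta_c.items():
        val += math.log(sum(
            math.exp(tau * (v - sum(delta[c][j][xc[k]] for k, j in enumerate(c))))
            for xc, v in table.items())) / tau
    for i, th in enumerate(theta_i):
        val += math.log(sum(
            math.exp(tau * (th[xi] + sum(d[i][xi] for c, d in delta.items() if i in c)))
            for xi in (0, 1))) / tau
    return val

def factor_marg(c, i, delta):
    """mu_c(x_i; delta): factor marginal of Eq. (8), summed over x_{c\\i}."""
    w = {xc: math.exp(tau * (v - sum(delta[c][j][xc[k]] for k, j in enumerate(c))))
         for xc, v in theta_c[c].items()}
    z, pos, out = sum(w.values()), c.index(i), [0.0, 0.0]
    for xc, v in w.items():
        out[xc[pos]] += v / z
    return out

def var_marg(i, delta):
    """mu_i(x_i; delta) of Eq. (8), normalized."""
    w = [math.exp(tau * (theta_i[i][xi] + sum(d[i][xi] for c, d in delta.items() if i in c)))
         for xi in (0, 1)]
    return [v / sum(w) for v in w]

def star_update(i, delta):
    """Closed-form star update around variable i, Eq. (10), in place."""
    cs = [c for c in delta if i in c]
    mu_c = {c: factor_marg(c, i, delta) for c in cs}
    m_i = var_marg(i, delta)
    for c in cs:
        for xi in (0, 1):
            prod = m_i[xi] * math.prod(mu_c[cc][xi] for cc in cs)
            delta[c][i][xi] += (math.log(mu_c[c][xi]) / tau
                                - math.log(prod) / (tau * (len(cs) + 1)))

before = F(delta)
star_update(0, delta)
after = F(delta)  # the dual objective never increases under the update
```

Note that F is invariant to adding a per-(c, i) constant to δ_ci(·), so the update can use normalized marginals even though Eq. (8) only defines them up to proportionality.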
Our analysis builds on the fact that we can lower bound the improvement at each step, as a function of some norm of the block gradient.

4.1 Greedy block minimization

Theorem 4.1. Define B₁ to be a constant such that ‖δ^t − δ*‖₁ ≤ B₁ for all t. If coordinate minimization of each block S_i satisfies:

$$F(\delta^t) - F(\delta^{t+1}) \ge \frac{1}{k} \|\nabla_{S_i} F(\delta^t)\|_\infty^2 \qquad (11)$$

for all t, then for any ε > 0, after T = kB₁²/ε iterations of the greedy algorithm, F(δ^T) − F(δ*) ≤ ε.

³Non-uniform schedules are also possible. We consider the uniform one for simplicity.
⁴The update is presented here in additive form; there is an equivalent absolute form [21].

Proof. Using Hölder's inequality we obtain the bound:

$$F(\delta^t) - F(\delta^*) \le \nabla F(\delta^t)^\top (\delta^t - \delta^*) \le \|\nabla F(\delta^t)\|_\infty \cdot \|\delta^t - \delta^*\|_1 \qquad (12)$$

implying $\|\nabla F(\delta^t)\|_\infty \ge \frac{1}{B_1}\big(F(\delta^t) - F(\delta^*)\big)$. Now, using the condition on the improvement and the greedy nature of the update, we obtain a bound on the improvement:

$$F(\delta^t) - F(\delta^{t+1}) \;\ge\; \frac{1}{k} \|\nabla_{S_i} F(\delta^t)\|_\infty^2 \;=\; \frac{1}{k} \|\nabla F(\delta^t)\|_\infty^2 \;\ge\; \frac{1}{kB_1^2} \big(F(\delta^t) - F(\delta^*)\big)^2 \;\ge\; \frac{1}{kB_1^2} \big(F(\delta^t) - F(\delta^*)\big)\big(F(\delta^{t+1}) - F(\delta^*)\big)$$

Hence,

$$\frac{1}{kB_1^2} \;\le\; \frac{F(\delta^t) - F(\delta^*) - \big(F(\delta^{t+1}) - F(\delta^*)\big)}{\big(F(\delta^t) - F(\delta^*)\big)\big(F(\delta^{t+1}) - F(\delta^*)\big)} \;=\; \frac{1}{F(\delta^{t+1}) - F(\delta^*)} - \frac{1}{F(\delta^t) - F(\delta^*)} \qquad (13)$$

Summing over t we obtain:

$$\frac{T}{kB_1^2} \;\le\; \frac{1}{F(\delta^T) - F(\delta^*)} - \frac{1}{F(\delta^0) - F(\delta^*)} \;\le\; \frac{1}{F(\delta^T) - F(\delta^*)} \qquad (14)$$

and the desired result follows.

4.2 Stochastic block minimization

Theorem 4.2. Define B₂ to be a constant such that ‖δ^t − δ*‖₂ ≤ B₂ for all t. If coordinate minimization of each block S_i satisfies:

$$F(\delta^t) - F(\delta^{t+1}) \ge \frac{1}{k} \|\nabla_{S_i} F(\delta^t)\|_2^2 \qquad (15)$$

for all t, then for any ε > 0, after T = k|S|B₂²/ε iterations of the stochastic algorithm we have that E[F(δ^T)] − F(δ*) ≤ ε.⁵

The proof is similar to Nesterov's analysis (see Theorem 1 in [16]). The proof in [16] relies on the improvement condition in Eq. (15) and not on the precise nature of the update.
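The telescoping argument behind the greedy rate can be checked numerically: simulating the slowest decrease allowed by the improvement condition (with e_t = F(δ^t) − F(δ*)) keeps the gap below kB₁²/T at every iteration. The constants below are arbitrary illustration values, not from the paper.

```python
# Numerical check of the telescoping argument in the greedy rate proof:
# the condition e_t - e_{t+1} >= e_t^2 / (k * B1^2) makes 1/e_t grow by at
# least 1/(k * B1^2) per step, hence e_T <= k * B1^2 / T.
k_B1_sq = 5.0   # k * B1^2 (arbitrary)
e = 3.0         # initial dual gap e_0 (must be positive and below k_B1_sq)
gaps = [e]
for _ in range(100):
    e -= e * e / k_B1_sq  # slowest decrease the condition allows
    gaps.append(e)
```

Since 1/e_{t+1} − 1/e_t = (e_t − e_{t+1})/(e_t e_{t+1}) ≥ 1/(kB₁²) whenever the gaps stay positive and decreasing, summing over t reproduces the O(kB₁²/ε) iteration bound.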
Note that since the cost of the update is roughly linear in the size of the block, this bound does not tell us which block size is better (the cost of an update times the number of blocks is roughly constant).

4.3 Analysis of D^{MAPτ} block minimization

We can now obtain rates for our coordinate minimization scheme for optimizing D^{MAPτ} by finding the k to be used in the conditions in Eq. (15) and Eq. (11). The result for the star update is given below.

Proposition 4.3. The star update for x_i satisfies the conditions in Eqs. (15) and (11) with k = 4τN_i.

This can be shown using Equation 2.4 in [14], which states that if F_i(δ_{S_i}; δ) (see Eq. (9)) has Lipschitz constant L_i then Eq. (15) is satisfied with k = 2L_i. We can then use the fact that the Lipschitz constant of a star block is at most 2τN_i (this can be calculated as in [18]) to obtain the result.⁶ To complete the analysis, it turns out that B₁ and B₂ can be bounded via a function of θ by bounding ‖δ‖₁ (see supplementary, Lemma 1.2). We proceed to discuss the implications of these bounds.

⁵Expectation is taken with respect to the randomization of blocks.
⁶We also provide a direct proof in the supplementary, Section 2.

4.4 Comparing the different schemes

The results we derived have several implications. First, we see that both stochastic and greedy schemes achieve a rate of O(τ/ε). This matches the known rates for regular (non-accelerated) gradient descent on functions with Lipschitz continuous gradient (e.g., see [14]), although in practice coordinate minimization is often much faster.

The main difference between the greedy and stochastic rates is that the factor |S| (the number of blocks) does not appear in the greedy rate, and does appear in the stochastic one.
This can have a considerable effect, since |S| is either the number of variables n (in the star update) or the number of factors |C| (in MPLP). Both can be significant (e.g., |C| is the number of edges in a pairwise MRF model). The greedy algorithm does pay a price for this advantage, since it has to find the optimal block to update at each iteration. However, for the problem we study here this can be done much more efficiently using a priority queue. To see this, consider the star update. A change in the variables δ_{·i}(·) will only affect the blocks that correspond to variables j that are in c such that i ∈ c. In many cases this is small (e.g., low degree pairwise MRFs) and thus we will only have to change the priority queue a small number of times, and this cost would be negligible when using a Fibonacci heap for example.⁷ Indeed, our empirical results show that the greedy algorithm consistently outperforms the stochastic one (see Section 6).

5 Primal convergence

Thus far we have considered only dual variables. However, it is often important to recover the primal variables. We therefore focus on extracting primal feasible solutions from the current δ, and characterize the degree of primal optimality and the associated rates. The primal variables μ(δ) (see Eq. (8)) need not be feasible in the sense that the consistency constraints in Eq. (3) are not necessarily satisfied. This is true also for other approaches to recovering primal variables from the dual, such as averaging subgradients when using subgradient descent (see, e.g., [21]).

We propose a simple two-step algorithm for transforming any dual variables δ into primal feasible variables μ̃(δ) ∈ M_L. The resulting μ̃(δ) will also be shown to converge to the optimal primal solution in Section 5.1.
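The two-step mapping (stated as Algorithm 1 below) specializes nicely to pairwise factors over binary variables. The following is a hedged sketch of that special case; the helper names are ours and this is an illustration, not the paper's implementation.

```python
# A sketch of the two-step dual-to-primal mapping, specialized to pairwise
# factors over binary variables. Helper names are ours.

def _edge_marg(table, pos):
    """Marginalize a pairwise table {(x_i, x_j): p} onto coordinate pos."""
    out = [0.0, 0.0]
    for xc, p in table.items():
        out[xc[pos]] += p
    return out

def make_feasible(mu_i, mu_c):
    """mu_i: {i: [p0, p1]}, mu_c: {(i, j): {xc: p}} (each normalized but
    possibly inconsistent). Returns marginals in the local polytope."""
    # Step 1: project onto the consistency constraints (|X_{c\i}| = 2 here).
    bar_i = {}
    for i, p in mu_i.items():
        cs = [c for c in mu_c if i in c]
        num = [p[xi] + sum(0.5 * _edge_marg(mu_c[c], c.index(i))[xi] for c in cs)
               for xi in (0, 1)]
        bar_i[i] = [v / (1 + 0.5 * len(cs)) for v in num]
    bar_c = {}
    for c, table in mu_c.items():
        bar_c[c] = {xc: p - sum(0.5 * (_edge_marg(table, k)[xc[k]] - bar_i[i][xc[k]])
                                for k, i in enumerate(c))
                    for xc, p in table.items()}
    # Step 2: smallest mixture with the uniform distribution (u_c = 1/4,
    # u_i = 1/2) that restores 0 <= mu <= 1.
    lam = 0.0
    for table in bar_c.values():
        for p in table.values():
            if p < 0:
                lam = max(lam, -p / (-p + 0.25))
            elif p > 1:
                lam = max(lam, (p - 1) / (p - 0.25))
    tilde_i = {i: [(1 - lam) * v + lam * 0.5 for v in p] for i, p in bar_i.items()}
    tilde_c = {c: {xc: (1 - lam) * p + lam * 0.25 for xc, p in t.items()}
               for c, t in bar_c.items()}
    return tilde_i, tilde_c

# Example: the pair marginal says x0 is uniform, the singleton disagrees.
ti, tc = make_feasible({0: [0.3, 0.7], 1: [0.5, 0.5]},
                       {(0, 1): {(0, 0): 0.4, (0, 1): 0.1,
                                 (1, 0): 0.1, (1, 1): 0.4}})
```

After step 1 the factor marginals agree with the variable marginals by construction, and step 2 only mixes toward uniform, so both consistency and non-negativity hold on the output.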
The procedure is described in Algorithm 1 below.

Algorithm 1 Mapping to a feasible primal solution

Step 1: Make marginals consistent.
For all i do: $\bar\mu_i(x_i) = \frac{1}{1 + \sum_{c: i \in c} \frac{1}{|X_{c \setminus i}|}} \Big( \mu_i(x_i) + \sum_{c: i \in c} \frac{1}{|X_{c \setminus i}|} \mu_c(x_i) \Big)$
For all c do: $\bar\mu_c(x_c) = \mu_c(x_c) - \sum_{i: i \in c} \frac{1}{|X_{c \setminus i}|} \big( \mu_c(x_i) - \bar\mu_i(x_i) \big)$

Step 2: Make marginals non-negative.
λ = 0
for c ∈ C, x_c do
    if $\bar\mu_c(x_c) < 0$ then $\lambda = \max\Big\{ \lambda, \; \frac{-\bar\mu_c(x_c)}{-\bar\mu_c(x_c) + \frac{1}{|X_c|}} \Big\}$
    else if $\bar\mu_c(x_c) > 1$ then $\lambda = \max\Big\{ \lambda, \; \frac{\bar\mu_c(x_c) - 1}{\bar\mu_c(x_c) - \frac{1}{|X_c|}} \Big\}$
    end if
end for
for ℓ = 1, ..., n and c ∈ C do
    $\tilde\mu_\ell(x_\ell) = (1 - \lambda)\,\bar\mu_\ell(x_\ell) + \lambda \frac{1}{|X_\ell|}$
end for

Importantly, all steps consist of cheap elementary local calculations, in contrast to other methods previously proposed for this task (compare to [18, 27]). The first step performs a Euclidean projection of μ(δ) onto consistent marginals μ̄. Specifically, it solves:

$$\min_{\bar\mu} \frac{1}{2} \|\mu(\delta) - \bar\mu\|^2 \quad \text{s.t.} \quad \bar\mu_c(x_i) = \bar\mu_i(x_i) \;\; \forall c,\, i \in c,\, x_i, \qquad \sum_{x_i} \bar\mu_i(x_i) = 1 \;\; \forall i$$

Note that we did not include non-negativity constraints above, so the projection might result in negative μ̄. In the second step we "pull" μ̄ back into the feasible regime by taking a convex combination with the uniform distribution u (see [3] for a related approach).

⁷This was also used in the residual belief propagation approach [4], which however is less theoretically justified than what we propose here.
In particular, this step solves the simple problem of finding the smallest λ ∈ [0, 1] such that 0 ≤ μ̃ ≤ 1 (where μ̃ = (1 − λ)μ̄ + λu). Since this step interpolates between two distributions that satisfy consistency and normalization constraints, μ̃ will be in the local polytope M_L.

5.1 Primal convergence rate

Now that we have a procedure for obtaining a primal solution, we analyze the corresponding convergence rate. First, we show that if we have δ for which ‖∇F(δ)‖_∞ ≤ ε, then μ̃(δ) (after Algorithm 1) is an O(ε) primal optimal solution.

Theorem 5.1. Denote by P*_τ the optimum of the smoothed primal P^{MAPτ}. For any set of dual variables δ, and any ε ∈ R(τ) (see the supplementary for the definition of R(τ)), it holds that if ‖∇F(δ)‖_∞ ≤ ε then P*_τ − P_τ(μ̃(δ)) ≤ C₀ε. The constant C₀ depends only on the parameters θ and is independent of τ.

The proof is given in the supplementary file (Section 1). The key idea is to break F(δ) − P_τ(μ̃(δ)) into components, and show that each component is upper bounded by O(ε). The range R(τ) consists of ε ≥ O(e^{−τ}) and ε ≤ O(1/τ). As we show in the supplementary, this range is large enough to guarantee any desired accuracy in the non-smoothed primal. We can now translate dual rates into primal rates. This can be done via the following well known lemma:

Lemma 5.2.
Any convex function F with Lipschitz continuous gradient and Lipschitz constant L satisfies

$$\|\nabla F(\delta)\|_2^2 \le 2L \big( F(\delta) - F(\delta^*) \big).$$

These results, together with the fact that ‖∇F(δ)‖₂² ≥ ‖∇F(δ)‖∞² and that the Lipschitz constant of F(δ) is O(τ), lead to the following theorem.

Theorem 5.3. Given any algorithm for optimizing D^{MAPτ} and ε ∈ R(τ), if the algorithm is guaranteed to achieve F(δ^t) − F(δ*) ≤ ε after O(g(ε)) iterations, then it is guaranteed to be ε primal optimal, i.e., P*_τ − P_τ(μ̃(δ^t)) ≤ ε, after O(g(ε²/τ)) iterations.⁸

The theorem lets us directly translate dual convergence rates into primal ones. Note that it applies to any algorithm for D^{MAPτ} (not only coordinate minimization), and the only property of the algorithm used in the proof is F(δ^t) ≤ F(0) for all t. Put in the context of our previous results: any algorithm that achieves F(δ^t) − F(δ*) ≤ ε in t = O(τ/ε) iterations is guaranteed to achieve P*_τ − P_τ(μ̃(δ^{t′})) ≤ ε in t′ = O(τ²/ε²) iterations.

6 Experiments

In this section we evaluate coordinate minimization algorithms on a MAP problem, and compare them to state-of-the-art baselines. Specifically, we compare the running time of greedy coordinate minimization, stochastic coordinate minimization, full gradient descent, and FISTA, an accelerated gradient method [1] (details on the gradient-based algorithms are provided in the supplementary, Section 3). Gradient descent is known to converge in O(1/ε) iterations while FISTA converges in O(1/√ε) iterations [1].
We compare the performance of the algorithms on protein side-chain prediction problems from the dataset of Yanover et al. [28]. These problems involve finding the 3D configuration of rotamers given the backbone structure of a protein. The problems are modeled by singleton and pairwise factors and can be posed as finding a MAP assignment for the given model.

Figure 1(a) shows the objective value for each algorithm over time. We first notice that the greedy algorithm converges faster than the stochastic one. This is in agreement with our theoretical analysis. Second, we observe that the coordinate minimization algorithms are competitive with the accelerated gradient method FISTA and are much faster than the gradient method. Third, as Theorem 5.3 predicts, primal convergence is slower than dual convergence (notice the logarithmic timescale). Finally, we can see that better convergence of the dual objective corresponds to better convergence of the primal objective, in both fractional and integral domains. In our experiments the quality of the decoded integral solution (dashed lines) significantly exceeds that of the fractional solution. Although sometimes a fractional solution can be useful in itself, this suggests that if only an integral solution is sought then it could be enough to decode directly from the dual variables.

⁸We omit constants not depending on τ and ε.

(a) [plot: dual objective vs. runtime]        (b) t_alg / t_greedy

    Greedy         1
    Stochastic     8.6 ± 0.6
    FISTA          814.2 ± 38.1
    Gradient       13849.8 ± 6086.5

Figure 1: Comparison of coordinate minimization, gradient descent, and the accelerated gradient algorithms on the protein side-chain prediction task. Figure (a) shows a typical run of the algorithms. For each algorithm the dual objective of Eq. (6) is plotted as a function of execution time. The value (Eq.
(4)) of the feasible primal solution of Algorithm 1 is also shown (lower solid line), as well as the objective (Eq. (1)) of the best decoded integer solution (dashed line; those are decoded directly from the dual variables δ). Table (b) shows the ratio of the runtime of each algorithm w.r.t. the greedy algorithm. The mean ratio over the proteins in the dataset is shown, followed by the standard error.

The table in Figure 1(b) shows overall statistics for the proteins in the dataset. Here we run each algorithm until the duality gap drops below a fixed desired precision (ε = 0.1) and compare the total runtime. The table presents the ratio of the runtime of each algorithm w.r.t. the greedy algorithm (t_alg/t_greedy). These results are consistent with the example in Figure 1(a).

7 Discussion

We presented the first convergence rate analysis of dual coordinate minimization algorithms on MAP-LP relaxations. We also showed how such dual iterates can be turned into primal feasible iterates, and analyzed the rate with which these primal iterates converge to the primal optimum. The primal mapping is of considerable practical value, as it allows us to monitor the distance between the upper (dual) and lower (primal) bounds on the optimum and use this as a stopping criterion. Note that this cannot be done without a primal feasible solution.⁹

The overall rates we obtain are of the order O(τ/ε) for the D^{MAPτ} problem. If one requires an ε accurate solution for P^{MAP}, then τ needs to be set to O(1/ε) (see Eq. (5)) and the overall rate is O(1/ε²) in the dual. As noted in [8, 18], a faster rate of O(1/ε) may be obtained using accelerated methods such as Nesterov's [15] or FISTA [1]. However, these also have an extra factor of N which does not appear in the greedy rate.
This could partially explain the excellent performance of the greedy scheme when compared to FISTA (see Section 6).
Our analysis also highlights the advantage of using greedy block choice for MAP problems. The advantage stems from the fact that selecting the block to update is cheap: its cost is of the same order as the other computations required by the algorithm. This can be viewed as a theoretical reinforcement of selective scheduling algorithms such as Residual Belief Propagation [4].
Many interesting questions still remain to be answered. How should one choose between different block updates (e.g., MSD vs. star)? What are lower bounds on rates? Can we use acceleration as in [15] to obtain better rates? What is the effect of adaptive smoothing (see [19]) on rates? We plan to address these in future work.
Acknowledgments: This work was supported by BSF grant 2008303. Ofer Meshi is a recipient of the Google Europe Fellowship in Machine Learning, and this research is supported in part by that fellowship.

9 An alternative commonly used progress criterion is to decode an integral solution from the dual variables and check whether its value is close to the dual upper bound. However, this will only work if P_MAP has an integral solution and we have managed to decode it.

References
[1] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Img. Sci., 2(1):183–202, Mar. 2009.
[2] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. In Proc. IEEE Conf. Comput. Vision Pattern Recog., 1999.
[3] D. Burshtein. Iterative approximate linear programming decoding of LDPC codes with linear complexity. IEEE Transactions on Information Theory, 55(11):4835–4859, 2009.
[4] G. Elidan, I. McGraw, and D. Koller.
Residual belief propagation: Informed scheduling for asynchronous message passing. In UAI, 2006.
[5] A. Globerson and T. Jaakkola. Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, NIPS 20. MIT Press, 2008.
[6] M. Guignard and S. Kim. Lagrangean decomposition: A model yielding stronger Lagrangean bounds. Mathematical Programming, 39(2):215–228, 1987.
[7] T. Hazan and A. Shashua. Norm-product belief propagation: Primal-dual message-passing for approximate inference. IEEE Transactions on Information Theory, 56(12):6294–6316, 2010.
[8] V. Jojic, S. Gould, and D. Koller. Fast and smooth: Accelerated dual decomposition for MAP inference. In Proceedings of the International Conference on Machine Learning (ICML), 2010.
[9] V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1568–1583, 2006.
[10] A. L. Martins, M. A. T. Figueiredo, P. M. Q. Aguiar, N. A. Smith, and E. P. Xing. An augmented Lagrangian approach to constrained MAP inference. In ICML, pages 169–176, 2011.
[11] K. Matusita. On the notion of affinity of several distributions and some of its applications. Annals of the Institute of Statistical Mathematics, 19:181–192, 1967. doi:10.1007/BF02911675.
[12] O. Meshi and A. Globerson. An alternating direction method for dual MAP LP relaxation. In ECML PKDD, pages 470–483. Springer-Verlag, 2011.
[13] O. Meshi, D. Sontag, T. Jaakkola, and A. Globerson. Learning efficiently with approximate inference via dual losses. In ICML, pages 783–790, New York, NY, USA, 2010. ACM.
[14] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Kluwer Academic Publishers, 2004.
[15] Y. Nesterov. Smooth minimization of non-smooth functions. Math.
Prog., 103(1):127–152, May 2005.
[16] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. CORE Discussion Papers, Université catholique de Louvain, 2010.
[17] A. Saha and A. Tewari. On the finite time convergence of cyclic coordinate descent methods, 2010. Preprint, arXiv:1005.2146.
[18] B. Savchynskyy, S. Schmidt, J. Kappes, and C. Schnörr. A study of Nesterov's scheme for Lagrangian decomposition and MAP labeling. In CVPR, 2011.
[19] B. Savchynskyy, S. Schmidt, J. H. Kappes, and C. Schnörr. Efficient MRF energy minimization via adaptive diminishing smoothing. In UAI, 2012.
[20] S. Shalev-Shwartz and A. Tewari. Stochastic methods for l1-regularized loss minimization. J. Mach. Learn. Res., 12:1865–1892, July 2011.
[21] D. Sontag, A. Globerson, and T. Jaakkola. Introduction to dual decomposition for inference. In Optimization for Machine Learning, pages 219–254. MIT Press, 2011.
[22] P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475–494, 2001.
[23] M. Wainwright, T. Jaakkola, and A. Willsky. MAP estimation via agreement on trees: Message-passing and linear programming. IEEE Transactions on Information Theory, 51(11):3697–3717, 2005.
[24] M. Wainwright and M. I. Jordan. Graphical Models, Exponential Families, and Variational Inference. Now Publishers Inc., Hanover, MA, USA, 2008.
[25] T. Werner. A linear programming approach to max-sum problem: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(7):1165–1179, 2007.
[26] T. Werner. Revisiting the decomposition approach to inference in exponential families and graphical models. Technical Report CTU-CMP-2009-06, Czech Technical University, 2009.
[27] T. Werner. How to compute primal solution from dual one in MAP inference in MRF?
In Control Systems and Computers (special issue on Optimal Labeling Problems in Structural Pattern Recognition), 2011.
[28] C. Yanover, T. Meltzer, and Y. Weiss. Linear programming relaxations and belief propagation – an empirical study. Journal of Machine Learning Research, 7:1887–1907, 2006.