{"title": "MAP Estimation for Graphical Models by Likelihood Maximization", "book": "Advances in Neural Information Processing Systems", "page_first": 1180, "page_last": 1188, "abstract": "Computing a {\\em maximum a posteriori} (MAP) assignment in graphical models is a crucial inference problem for many practical applications. Several provably convergent approaches have been successfully developed using linear programming (LP) relaxation of the MAP problem. We present an alternative approach, which transforms the MAP problem into that of inference in a finite mixture of simple Bayes nets. We then derive the Expectation Maximization (EM) algorithm for this mixture that also monotonically increases a lower bound on the MAP assignment until convergence. The update equations for the EM algorithm are remarkably simple, both conceptually and computationally, and can be implemented using a graph-based message passing paradigm similar to max-product computation. We experiment on the real-world protein design dataset and show that EM's convergence rate is significantly higher than the previous LP relaxation based approach MPLP. EM achieves a solution quality within $95$\\% of optimal for most instances and is often an order-of-magnitude faster than MPLP.", "full_text": "MAP Estimation for Graphical Models by Likelihood Maximization\n\nAkshat Kumar\nDepartment of Computer Science\nUniversity of Massachusetts\nAmherst, MA\nakshat@cs.umass.edu\n\nShlomo Zilberstein\nDepartment of Computer Science\nUniversity of Massachusetts\nAmherst, MA\nshlomo@cs.umass.edu\n\nAbstract\n\nComputing a maximum a posteriori (MAP) assignment in graphical models is a crucial inference problem for many practical applications. Several provably convergent approaches have been successfully developed using linear programming (LP) relaxation of the MAP problem. 
We present an alternative approach, which transforms the MAP problem into that of inference in a mixture of simple Bayes nets. We then derive the Expectation Maximization (EM) algorithm for this mixture that also monotonically increases a lower bound on the MAP assignment until convergence. The update equations for the EM algorithm are remarkably simple, both conceptually and computationally, and can be implemented using a graph-based message passing paradigm similar to max-product computation. Experiments on the real-world protein design dataset show that EM\u2019s convergence rate is significantly higher than the previous LP relaxation based approach MPLP. EM also achieves a solution quality within 95% of optimal for most instances.\n\n1 Introduction\n\nGraphical models provide an effective framework to model complex systems via simpler local interactions and also provide an insight into the structure of the underlying probabilistic model. In particular, we focus on the class of undirected models called Markov random fields (MRFs) for which the joint distribution can be specified as the product of potential functions over the cliques of the graph. For many practical problems modeled using MRFs, finding the maximum a posteriori (MAP) assignment or the most probable assignment to the variables in the graph is a key inference problem. For example, MAP estimation has been applied to image processing in computer vision [17, 11], protein design and protein side-chain prediction problems [17, 11], and natural language processing [7]. Finding the MAP assignment is NP-hard in general except for tree-structured graphs and graphs with bounded treewidth [5, 3]. 
This further underscores the need for developing scalable approximation algorithms that provide good solution quality.\n\nRecently, many algorithms have been proposed for approximating the MAP problem [15, 5, 11, 9, 6]. In particular, linear programming (LP) relaxation of the MAP problem has emerged as a popular technique to solve large-scale problems such as protein design and prediction problems [17, 10, 11]. Such approaches relax the constraint that the solution for the MAP problem be integral. However, for large problems such as protein design, the large size of the LP prohibits the application of standard LP-solvers [17]. To alleviate such scalability issues, convergent message passing algorithms have been introduced, which monotonically decrease the dual objective of the LP relaxation [5, 11, 9]. Convergence to the global optimum is not guaranteed in general, but when the solution is integral, it can be shown to be globally optimal. The main advantage of these approaches lies in their ability to provide an upper bound on the problem and a certificate of optimality when the upper bound is sufficiently close to the decoded solution.\n\nIn our work, we take a different approach to the MAP problem based on mean field methods in variational inference [16]. First, we present an alternate representation of the MAP problem by decomposing the MRF into a finite mixture of simple Bayes nets in which maximizing the likelihood of a special variable is equivalent to solving the MAP problem. Our approach is inspired by recent developments in planning by probabilistic inference and goal-directed planning [1, 13, 14, 12]. Second, using this alternate representation, we derive the EM algorithm for approximate MAP estimation. 
EM increases the lower bound on the MAP assignment monotonically until convergence and lends itself naturally to a graph-based message passing implementation.\n\nThe main advantage of the EM approach lies in settings where a good approximation to MAP needs to be generated quickly. In our experiments on some of the largest protein design problems [17, 11], we show that EM increases the lower bound on MAP rapidly. This attribute of EM combined with the Max-Product LP algorithm (MPLP) [5, 11] that decreases the upper bound rapidly (as observed empirically) yields a new hybrid approach that provides quality-bounded solutions significantly faster than previous approaches. Although convergence to the global optimum is not guaranteed, EM achieves an average solution quality within 95% of optimal for the protein design problems and is significantly faster than both MPLP [5, 11] and max-product (MP) [8]. We show that each iteration of EM is faster than that of max-product or MPLP by a factor related to the average degree of the graph. Empirically, the speedup factor can be as high as 30 for densely connected problems. We also show that EM is an embarrassingly parallel algorithm and can be parallelized easily to further speed up the convergence. Finally, we also discuss potential pitfalls that are inherent in the EM formulation and highlight settings in which EM may not perform well.\n\n2 Markov Random Fields and the MAP Problem\n\nA pairwise Markov random field (MRF) can be described by an undirected graph $G = (V, E)$ consisting of a set of nodes, one per variable in $x = \\{x_1, \\ldots, x_n\\}$, and a set of edges that connect pairs of nodes. A variable can take any value from a set of possible values referred to as the domain of that variable. An edge $(i, j)$ between nodes $x_i$ and $x_j$ specifies a function $\\theta_{ij}$. 
The joint assignment $x$ has the probability:\n\n$$p(x; \\theta) = \\frac{1}{Z} \\exp\\Big(\\sum_{ij \\in E} \\theta_{ij}(x_i, x_j)\\Big)$$\n\nThe MAP problem consists of finding the most probable assignment to all variables under $p(x; \\theta)$. This is equivalent to finding the complete assignment $x$ that maximizes the function $f(x; \\theta) = \\sum_{ij \\in E} \\theta_{ij}(x_i, x_j)$. Before describing our formulation of the MAP problem, we first describe the marginal polytope associated with the MAP problem and its outer bound based on LP relaxation. Then we discuss the relation of our approach with these polytopes. For details, we refer to [16, 11].\n\nLet $\\mu$ denote a vector of marginal probabilities (also called mean parameters) for each node and edge of the MRF. That is, $\\mu$ includes $\\mu_i(x_i)\\ \\forall i \\in V$ and $\\mu_{ij}(x_i, x_j)\\ \\forall (i, j) \\in E$. The set of $\\mu$ that arises from some joint distribution $p$ is referred to as the marginal polytope:\n\n$$M(G) = \\{\\mu \\mid \\exists p(x) \\text{ s.t. } p(x_i, x_j) = \\mu_{ij}(x_i, x_j),\\ p(x_i) = \\mu_i(x_i)\\} \\qquad (1)$$\n\nThe MAP problem is then equivalent to solving the following LP:\n\n$$\\max_x f(x; \\theta) = \\max_{\\mu \\in M(G)} \\mu \\cdot \\theta = \\max_{\\mu \\in M(G)} \\sum_{ij \\in E} \\sum_{x_i x_j} \\theta_{ij}(x_i, x_j) \\mu_{ij}(x_i, x_j) \\qquad (2)$$\n\nIt can be shown that there always exists a maximizing solution $\\mu$ which is integral and gives the optimal $x$. Unfortunately, the number of constraints used to describe this polytope is exponential and thus the LP cannot be solved efficiently. To remedy this, LP relaxations are proposed that outer bound the polytope $M(G)$. The relaxation weakens the global constraint that $\\mu$ arises from some common distribution $p$. 
Instead, only pairwise and singleton consistency is required for mean parameters as given by the following conditions:\n\n$$\\sum_{x_i} \\mu_i(x_i) = 1\\ \\ \\forall i \\in V, \\qquad \\sum_{\\hat{x}_i} \\mu_{ij}(\\hat{x}_i, x_j) = \\mu_j(x_j), \\qquad \\sum_{\\hat{x}_j} \\mu_{ij}(x_i, \\hat{x}_j) = \\mu_i(x_i)\\ \\ \\forall (i, j) \\in E \\qquad (3)$$\n\nThe outer bound polytope is expressed as\n\n$$M_L(G) = \\{\\mu \\ge 0 \\mid \\text{the conditions of Eq. 3 hold}\\} \\qquad (4)$$\n\nFigure 1: a) A pairwise Markov random field; b) Equivalent mixture representation\n\nLP relaxation approaches such as MPLP [5, 11] optimize the function $\\mu \\cdot \\theta$ over this outer bound $M_L(G)$ and consequently yield an upper bound on the MAP. Next we describe our approach for estimating the MAP.\n\nInner bound on the marginal polytope. In the definition of the marginal polytope $M(G)$ (Eq. 1), no restrictions are placed on the probability distribution $p$. Consider the following class of probability distributions that factorize according to the variables of the MRF, $p'(x) = \\prod_{i=1}^{n} p'_i(x_i)$. This is similar to the mean field methods used in variational inference [16]. Our approach is to directly optimize over the following set of mean parameters $\\mu$:\n\n$$M_{lb}(G) = \\{\\mu \\mid \\exists p'(x) \\text{ s.t. } \\mu_i(x_i) = p'_i(x_i),\\ \\mu_{ij}(x_i, x_j) = p'_i(x_i) p'_j(x_j)\\} \\qquad (5)$$\n\nwhere $p'$ is the distribution that factorizes according to the variables in the MRF. Clearly $M_{lb}(G)$ is an inner bound over $M(G)$, because in $M(G)$ there is no such restriction on the class of allowed probability distributions. The optimization criterion for estimating MAP under this set is:\n\n$$\\max_x f_{lb}(x; \\theta) = \\max_{\\mu \\in M_{lb}(G)} \\mu \\cdot \\theta = \\max_{\\mu \\in M_{lb}(G)} \\sum_{ij \\in E} \\sum_{x_i x_j} \\theta_{ij}(x_i, x_j) \\mu_i(x_i) \\mu_j(x_j) \\qquad (6)$$\n\nLet $f^*_{lb}$ denote the optimizing value for the above formulation and $f^*$ for the formulation in Eq. 2. 
Clearly $f^*_{lb} \\le f^*$. A simple observation shows that indeed $f^*_{lb} = f^*$. The reason is that there always exists a maximizing $\\mu \\in M(G)$ that is integral and thus $f^*$ corresponds to an integral assignment $x$ which is also the MAP assignment [16]. Since all the integral assignments are also allowed by the definition of the factored distribution $p'$ and $M_{lb}(G)$, it follows that optimizing over $M_{lb}(G)$ implies $f^*_{lb} = f^*$ and yields the MAP estimate. It is worth noticing that the constraints describing the set $M_{lb}(G)$ are only linear in the number of nodes in the MRF and correspond to normalization constraints as opposed to the exponentially large constraint set for $M(G)$.\n\nIt might appear that we have significantly reduced the space of allowed mean parameters $\\mu$ while still preserving the MAP. But the problem still remains challenging. The reduced set of parameters $M_{lb}(G)$ is non-convex because of the non-linear constraint $\\mu_{ij}(x_i, x_j) = \\mu_i(x_i)\\mu_j(x_j)$ [16]. Thus optimization over $M_{lb}(G)$ cannot be done using linear programming. To alleviate this problem, we next present another reformulation of the optimization problem in Eq. 6. Then we present the Expectation Maximization (EM) algorithm [4] for this reformulation that monotonically increases the lower bound on the MAP assignment using likelihood maximization until convergence.\n\n3 MAP as a Mixture of Bayes Nets\n\nIn this section we reformulate the optimization problem in Eq. 6 and recast it as the problem of likelihood maximization in a finite mixture of simple Bayes nets. The key idea is to decompose the MRF into a mixture of simpler Bayes nets with many hidden variables \u2013 all the variables $x_i$ of the MRF. To incorporate the potential functions $\\theta$ of the MRF and achieve equivalence between the likelihood and MAP value, a special binary reward variable $\\hat{\\theta}$ is introduced with its conditional distribution proportional to potentials $\\theta$. 
The details of the reformulation follow.\n\nFor each edge $(i, j)$ in the graph $G$ corresponding to the MRF, we create a depth-1 Bayes net. It consists of a binary reward variable $\\hat{\\theta}$ with its parents being the variables $x_i$ and $x_j$. The reason for calling it a reward variable will become clear later. Fig. 1(a) shows a pairwise MRF over four variables. Fig. 1(b) shows the equivalent mixture of Bayes nets for each of the four edges in this MRF. The mixture random variable $l$, which is used to identify the Bayes nets, can take values from 1 to $|E|$, the number of edges in the graph, with uniform probability. That is, if $k = |E|$, then $P(l = i) = 1/k$ for any $1 \\le i \\le k$. In what follows, we will also use the variable $l$ to denote the corresponding edge in the MRF.\n\nThe parameters to estimate in this mixture are the marginal probabilities for each node $x_i$. That is, $p = \\langle p_1, \\ldots, p_n \\rangle$. This step directly establishes the connection with the space of factored probability distributions $p'$ of the set $M_{lb}(G)$ (see Eq. 5).\n\nNext we set the conditional probability distribution of the variable $\\hat{\\theta}$ for each of the Bayes nets. This is done as follows:\n\n$$P(\\hat{\\theta} = 1 \\mid x_{l1}, x_{l2}, l) = \\frac{\\theta_l(x_{l1}, x_{l2}) - \\theta_{min}}{\\theta_{max} - \\theta_{min}} \\qquad (7)$$\n\nwhere $l$ indicates a particular Bayes net corresponding to an edge of the MRF, $x_{l1}$ and $x_{l2}$ are the parent variables of $\\hat{\\theta}$ in this Bayes net and $\\theta_l$ is the potential function for this edge. $\\theta_{max}$ is the maximum value of any potential function $\\theta$, and $\\theta_{min}$ the minimum value. For example, for $l = 1$ in Fig. 1(b), $x_{l1} = x_1$, $x_{l2} = x_2$ and $P(\\hat{\\theta} = 1 \\mid x_1, x_2, l = 1) = (\\theta_{12}(x_1, x_2) - \\theta_{min})/(\\theta_{max} - \\theta_{min})$. Note that these probabilities are nothing but the normalized potential functions $\\theta_{ij}$ of the original MRF. 
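The normalization in Eq. 7 can be illustrated with a minimal Python sketch. The toy potentials, the dictionary layout and the helper name normalized_rewards are assumptions made for illustration only; they are not taken from the paper.

```python
# Toy potentials for a 3-node chain MRF with binary variables (made-up values).
# theta[(i, j)][xi][xj] is the potential of edge (i, j) at values (xi, xj).
theta = {
    (0, 1): [[1.0, -2.0], [0.5, 3.0]],
    (1, 2): [[-1.0, 2.0], [0.0, 1.5]],
}


def normalized_rewards(theta):
    # Eq. 7: shift and scale every potential with the global minimum and
    # maximum so that each entry becomes a probability in [0, 1].
    values = [v for table in theta.values() for row in table for v in row]
    t_min, t_max = min(values), max(values)
    span = t_max - t_min  # assumed to be strictly positive
    return {
        edge: [[(v - t_min) / span for v in row] for row in table]
        for edge, table in theta.items()
    }


rewards = normalized_rewards(theta)
for edge, table in sorted(rewards.items()):
    print(edge, table)
```

Every entry of the resulting tables lies in [0, 1], so it can serve as the probability of observing the reward variable equal to 1 for that pair of parent values.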
For this reason, $\\hat{\\theta}$ is also called a reward variable. It is used to establish the equivalence between the MAP value and the likelihood of observing $\\hat{\\theta} = 1$.\n\nThe full joint for a particular Bayes net indicated by the variable $l$ is given by\n\n$$P(\\hat{\\theta}, x_{l1}, x_{l2} \\mid l; p) = P(\\hat{\\theta} \\mid x_{l1}, x_{l2}, l)\\, p_{l1}(x_{l1}; p)\\, p_{l2}(x_{l2}; p) \\qquad (8)$$\n\nwhere $p_{l1}$ is the marginal associated with the variable $x_{l1}$. Let us denote the variables $(x_{l1}, x_{l2})$ by $x_l$ and let $\\hat{\\theta}_{x_l}$ denote the probability $P(\\hat{\\theta} = 1 \\mid x_{l1}, x_{l2}, l)$; then the following theorem establishes the link between the likelihood and the MAP value. $\\theta_l(x_l)$ denotes the corresponding potential function $\\theta_l$ of the MRF; for $l = 1$ in Fig. 1(b), $\\theta_l(x_l) = \\theta_{12}(x_1, x_2)$.\n\nTheorem 1. Let the CPT of the binary reward variable $\\hat{\\theta}$ be selected such that $\\hat{\\theta}_{x_l} \\propto \\theta_l(x_l)$. Then maximizing the likelihood $L^p = P(\\hat{\\theta} = 1; p)$ of observing the reward variable in the mixture of Bayes nets is equivalent to the MAP estimation of the original MRF.\n\nProof. The likelihood for a single Bayes net is given by\n\n$$L^p_l = P(\\hat{\\theta} = 1 \\mid l; p) = \\sum_{x_l} P(\\hat{\\theta} = 1, x_{l1}, x_{l2} \\mid l; p) = \\sum_{x_l} \\hat{\\theta}_{x_l}\\, p_{l1}(x_{l1}; p)\\, p_{l2}(x_{l2}; p) \\qquad (9)$$\n\nFor the complete mixture, it is given by\n\n$$L^p = \\sum_l P(l) L^p_l = \\frac{1}{k} \\sum_l \\sum_{x_l} \\hat{\\theta}_{x_l}\\, p_{l1}(x_{l1}; p)\\, p_{l2}(x_{l2}; p) \\qquad (10)$$\n\nUpon substituting the definition of $\\hat{\\theta}_{x_l}$ from Eq. 7 and using simple algebraic manipulations, we get\n\n$$\\sum_l \\sum_{x_l} \\theta_l(x_l)\\, p_{l1}(x_{l1}; p)\\, p_{l2}(x_{l2}; p) = k(\\theta_{min} + (\\theta_{max} - \\theta_{min}) L^p).$$\n\nNotice that the LHS of the above equation is the same as the optimization objective in Eq. 6. Thus we have shown that maximizing the likelihood $L^p$ provides the MAP estimate.\n\nThe above equation can also be explained intuitively in the context of goal directed planning. 
The RHS can be rewritten as $k(L^p \\theta_{max} + \\bar{L}^p \\theta_{min})$, where $\\bar{L}^p = 1 - L^p$. According to this formulation, there are only two rewards in the system: $\\theta_{min}$ and $\\theta_{max}$. The goal is to achieve the higher reward $\\theta_{max}$ for each edge in the MRF. Thus maximizing the probability $L^p$ of achieving this goal solves the optimization problem.\n\n4 EM Algorithm for MAP Estimation\n\nAlgorithm 1: Graph-based message passing for MAP estimation\ninput: Graph $G = (V, E)$ for the MRF and potentials $\\theta$ for each edge\nrepeat\n  foreach node $i \\in V$ do\n    MPLP: Send message $\\gamma_{i \\to j}$ to each neighbor $j \\in Ne(i)$:\n      $\\gamma_{i \\to j}(x_j) \\leftarrow \\max_{x_i} \\big[ \\theta_{ij}(x_i, x_j) - \\gamma_{j \\to i}(x_i) + \\frac{2}{|Ne(i)| + 1} \\sum_{k \\in Ne(i)} \\gamma_{k \\to i}(x_i) \\big]$\n      Set node belief $b_i(x_i)$ to the sum of incoming messages: $b_i(x_i) = \\sum_{k \\in Ne(i)} \\gamma_{k \\to i}(x_i)$\n    EM: Send message $\\delta_{i \\to j}$ to each neighbor $j \\in Ne(i)$:\n      $\\delta_{i \\to j}(x_j) \\leftarrow \\sum_{x_i} p_i(x_i) \\hat{\\theta}_{x_i x_j}$\n      Set marginal probability to sum of incoming messages: $p^*_i(x_i) = p_i(x_i) \\sum_{k \\in Ne(i)} \\delta_{k \\to i}(x_i) / C_i$\nuntil stopping criterion is satisfied\nMPLP: Return complete assignment $x$ s.t. $x_i = \\arg\\max_{\\hat{x}_i} b_i(\\hat{x}_i)$\nEM: Return complete assignment $x$ s.t. $x_i = \\arg\\max_{\\hat{x}_i} p_i(\\hat{x}_i)$\n\nWe now derive the EM algorithm [4] for maximizing the likelihood of the reward variable in the mixture of Bayes nets. In this mixture, only the reward variable is treated as observed ($\\hat{\\theta} = 1$); all other variables are latent. We note that EM is not guaranteed to converge to a global optimum. However, our experiments show that EM achieves an average solution quality within 95% of optimal for the standard MAP benchmark of protein design problems. 
We also show that the update equations for EM can be implemented efficiently using graph-based message passing and are computationally much faster than other message-passing algorithms such as max-product [8] and MPLP [5]. Below, we derive the update equations for the M-step. The E-step can be directly inferred from that. The parameters $p$ to estimate are the marginal probabilities $p_i$ for each variable $x_i$.\n\nM-step: EM maximizes the following expected complete log-likelihood for the mixture of Bayes nets. The variable $p$ denotes the previous parameters and $p^*$ denotes the new parameters.\n\n$$Q(p, p^*) = \\sum_l \\sum_{x_l} P(\\hat{\\theta} = 1, x_l, l; p) \\log P(\\hat{\\theta} = 1, x_l, l; p^*) \\qquad (11)$$\n\nThe full joint is given by:\n\n$$P(\\hat{\\theta} = 1, x_l, l; p) = P(\\hat{\\theta} = 1 \\mid x_l, l) P(x_l \\mid l; p) P(l) = \\frac{1}{k} \\hat{\\theta}_{x_l}\\, p_{l1}(x_{l1}; p)\\, p_{l2}(x_{l2}; p)$$\n\nWe will omit the parameter $p$ whenever the expression is unambiguous. Taking the log, we get:\n\n$$\\log P(\\hat{\\theta} = 1, x_l, l; p) = \\langle \\text{terms independent of } p \\rangle + \\log p_{l1}(x_{l1}) + \\log p_{l2}(x_{l2}) \\qquad (12)$$\n\nSubstituting the above equation into the definition of $Q(p, p^*)$ (Eq. 11) and discarding the terms which are independent of $p^*$, we get:\n\n$$Q(p, p^*) = \\frac{1}{k} \\sum_l \\sum_{x_l} \\hat{\\theta}_{x_l}\\, p_{l1}(x_{l1})\\, p_{l2}(x_{l2}) \\{\\log p^*_{l1}(x_{l1}) + \\log p^*_{l2}(x_{l2})\\} \\qquad (13)$$\n\nUpon simplifying the above equation by grouping together the terms associated with the variables $x_i$ of the MRF, we get:\n\n$$Q(p, p^*) = \\frac{1}{k} \\sum_{i=1}^{n} \\sum_{x_i} p_i(x_i) \\log p^*_i(x_i) \\sum_{j \\in Ne(i)} \\sum_{x_j} \\hat{\\theta}_{x_i x_j}\\, p_j(x_j) \\qquad (14)$$\n\nwhere $Ne(i)$ denotes the set of immediate neighbors of the node $i$ in the MRF graph. The above expression can be easily maximized by maximizing for each variable $x_i$ individually, subject to the normalization constraint $\\sum_{x_i} p^*_i(x_i) = 1$. 
The final update equation for the marginals is given by:\n\n$$p^*_i(x_i) = \\frac{p_i(x_i) \\sum_{j \\in Ne(i)} \\sum_{x_j} \\hat{\\theta}_{x_i x_j}\\, p_j(x_j)}{C_i} \\qquad (15)$$\n\nwhere $C_i$ is the normalization constant for variable $x_i$, and $\\hat{\\theta}_{x_i x_j}$ is the normalized reward:\n\n$$P(\\hat{\\theta} = 1 \\mid x_i, x_j, l) = (\\theta_{ij}(x_i, x_j) - \\theta_{min})/(\\theta_{max} - \\theta_{min}).$$\n\nAlgorithm 1 shows the graph-based message passing technique for both EM and MPLP. For both EM and MPLP, parameters are initialized randomly. The rest of the steps are self-explanatory.\n\nFigure 2: a) Quality achieved by EM for all protein design instances. b) Quality for the largest instance \u20181fpo\u2019; x-axis denotes time (in sec.), y-axis denotes the quality achieved.\n\nComplexity analysis and implementation. Consider a single message $\\gamma$ sent out by a node in MPLP. The complexity of computing $\\gamma$ is $O(d^2 \\cdot deg)$, where $d$ is the domain size of the variables, and $deg$ represents the average degree of the graph or the average number of neighbors of a node. For EM, this complexity is only $O(d^2)$. Therefore, the computational complexity of each iteration of EM is lower than that of MPLP by a factor of $deg$. The same result holds for max-product, because its message-passing structure is similar to that of MPLP. The average number of neighbors in dense graphs such as the ones encountered in protein-design problems can be as high as 30. This makes EM significantly faster than the previous approaches as we demonstrate empirically in the next section.\n\nEM\u2019s simple message passing scheme also facilitates a very efficient parallel implementation. In particular, all the $\\delta$ messages for the current iteration in Alg. 1 can be computed in parallel for each node $i$, because they depend only on the parameters from the previous iteration. 
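The EM update of Eq. 15 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function name em_map, the dictionary-based data layout, the random initialization range and the toy single-edge problem are all hypothetical, and the sketch uses the plain update of Eq. 15 rather than the authors' own implementation.

```python
import random


def em_map(n, theta, domain, iters=50, seed=0):
    # theta maps an edge (i, j) to a table with theta[(i, j)][xi][xj].
    # Returns the decoded assignment and the likelihood after each iteration.
    rng = random.Random(seed)
    values = [v for t in theta.values() for row in t for v in row]
    t_min, t_max = min(values), max(values)
    span = t_max - t_min  # assumed strictly positive

    # Normalized rewards (Eq. 7) and neighbor lists.
    reward = {e: [[(v - t_min) / span for v in row] for row in t]
              for e, t in theta.items()}
    neighbors = {i: [] for i in range(n)}
    for (i, j) in theta:
        neighbors[i].append((j, (i, j), True))   # i indexes the rows
        neighbors[j].append((i, (i, j), False))  # j indexes the columns

    def r(edge, i_is_row, xi, xj):
        return reward[edge][xi][xj] if i_is_row else reward[edge][xj][xi]

    # Random, strictly positive initial marginals.
    p = []
    for _ in range(n):
        w = [rng.random() + 0.1 for _ in range(domain)]
        p.append([x / sum(w) for x in w])

    def likelihood(p):
        # Eq. 10: average over edges of the expected normalized reward.
        total = 0.0
        for (i, j), table in reward.items():
            for xi in range(domain):
                for xj in range(domain):
                    total += table[xi][xj] * p[i][xi] * p[j][xj]
        return total / len(reward)

    history = [likelihood(p)]
    for _ in range(iters):
        new_p = []
        for i in range(n):
            # Eq. 15: old marginal times the summed incoming messages.
            w = [p[i][xi] * sum(r(e, row, xi, xj) * p[j][xj]
                                for (j, e, row) in neighbors[i]
                                for xj in range(domain))
                 for xi in range(domain)]
            c = sum(w)  # normalization constant C_i
            new_p.append([x / c for x in w] if c > 0 else p[i])
        p = new_p  # synchronous update: all nodes use the previous iterate
        history.append(likelihood(p))

    assignment = [max(range(domain), key=lambda x: p[i][x]) for i in range(n)]
    return assignment, history


# Single edge whose best joint value is (1, 1).
assignment, history = em_map(2, {(0, 1): [[0.0, 1.0], [2.0, 5.0]]}, domain=2)
print(assignment, history[-1])
```

The recorded likelihood values should never decrease from one iteration to the next. Because every new marginal is computed from the previous iterate only, the loop over nodes could be run in parallel.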
In contrast, MPLP follows a block coordinate descent strategy in which optimization is performed over a subset of variables, keeping all the other variables fixed [5]. Therefore, opportunities for parallelism in the current implementations of MPLP are limited.\n\n5 Experiments\n\nOur first set of experiments is on the protein design problems (total of 97 instances), which are described in [17]. In these problems, given a desired backbone structure of the protein, the task is to find a sequence of amino acids that is as stable as possible or has the lowest energy. This problem can be represented as finding the MAP configuration in an MRF. These problems are particularly hard and dense with up to 170 variables, each having a large domain size of up to 150 values. We compare performance with the MPLP algorithm as described in [5, 11] and with max-product [8]. We used the standard setting for MPLP \u2013 first it is run with edge based clusters for 1000 iterations [5] and then clusters of size 3 are added to tighten the LP relaxation [11]. EM was implemented in Java. To speed up the convergence of EM, we used a simple modification of the M-step as described in [12]. All our experiments were done on a Mac Pro with dual quad-core processors and 4GB RAM. All algorithms used only a single processor for computation. We note that another clustering-based improvement of MPLP is presented in [9]. Such clusters can be similarly incorporated into the EM algorithm, which currently does not use any clusters. Therefore, comparisons with such clustering techniques are left for future work.\n\nThe main purpose of our experiments is to show that EM achieves high solution quality much more quickly than MPLP or max-product. Therefore EM provides a good alternative, particularly when fast near-optimal solutions are desired. As reported in [11], for protein design problems solved exactly, mean running time was 9.7 hours. 
For all the problems, instead of running MPLP until the near-optimal solution is found, we used a fixed cutoff of 5000 sec. For all the problems, we ran EM and max-product for 1500 iterations. For EM, different runs were initialized randomly and the best of 5 runs is plotted. Empirically, EM achieves a solution quality within 95% of optimal on average much faster than MPLP. The longest time EM took for any protein design instance was 352 sec. for the \u20181fpo\u2019 instance (Fig. 2(b)).\n\n[Figure 2 plots. (a) x-axis: instance; y-axis: % of optimal; legend: Optimal, UB, AVG-OPT, AVG-UB. (b) \u20181fpo\u2019, N=170, C=3167; legend: EM, MPLP, U.B.]\n\nFigure 3: Quality comparison with MPLP for six of the largest protein design instances. The x-axis denotes time (in sec.) and the y-axis denotes the quality achieved.\n\nFig. 2(a) shows the solution quality EM achieves for all the instances in 1500 iterations. Since a tight upper bound for all the problems except the instance \u20181fpo\u2019 is known [11], we show the percentage of optimal EM achieves. The legend titled \u2018Optimal\u2019 in Fig. 2(a) shows this value. For the unsolved instance \u20181fpo\u2019, we use the best known upper bound MPLP achieved in 10 hours ($\\approx 434$). As is clear from this graph, EM achieves near-optimal solution quality for all the instances, within 95% on average. To show empirically that MPLP decreases the upper bound quickly, we also show the percentage of solution quality EM achieves when instead of using the best known upper bound, we use the upper bound provided by MPLP after 1,000 iterations. The legend in Fig. 2(a) titled \u2018UB\u2019 shows this percentage. Even using this bound, EM achieves a quality within 91% on average (legend \u2018AVG-UB\u2019). 
This further suggests that combining EM\u2019s ability to rapidly increase the lower bound with MPLP\u2019s ability to decrease the upper bound quickly is a good way to create a hybrid approach that can provide provably near-optimal solutions much faster.\n\nFig. 2(b) shows the quality achieved by EM and MPLP as a function of time for the largest instance \u20181fpo\u2019. To show the convergence curve of EM clearly, the plot uses a different scale for time $T \\le 200$ and the rest. This graph also shows that EM provides a much better solution quality, much faster than MPLP. Legend \u2018U.B.\u2019 denotes the best known upper bound. Empirically, we noticed that the main advantage of the EM approach was for problems which were large in size, having many variables. For smaller problems, EM and MPLP were comparable in performance. Fig. 3 shows the quality comparisons with time for some of the largest protein design instances. Each graph title shows the instance name, N denotes the number of variables in the MRF, and C denotes the number of potential functions $\\theta$ or edges in the graph. For all these problems, EM provided near-optimal solution quality and is significantly faster than MPLP.\n\nWe also compared EM with max-product. Table 1 shows this comparison for some of the largest protein design instances. In this table, MP Quality denotes the best quality max-product achieved in 1500 iterations, and Time/Iteration denotes the time and iteration number when it was achieved for the first time. Again, EM outperforms max-product by a significant margin, achieving a higher solution quality much faster. For some of the problems, such as \u20181tmy\u2019 and \u20181or7\u2019, the quality achieved by EM was much higher. Also, max-product did not converge for any of these problems. This may be due to the fact that these are highly constrained problems with many cycles in the graph. 
The average degree of a node for these problems is very high, e.g., $\\approx 37$ for \u20181fpo\u2019. The time required per iteration of max-product was 11 sec. for this instance. Therefore the predicted time per iteration of EM is $11/37 \\approx 0.298$ sec.; the actual time for EM was 0.235 sec. The same result holds for other instances as well. This is consistent with the complexity analysis in Sec. 4.\n\n[Figure 3 plots, each with legend EM, MPLP, U.B.: 1fs1 (N=114, C=2262); 1gef (N=119, C=2408); 1iib (N=103, C=2235); 1on2 (N=135, C=2547); 1tmy (N=117, C=2720); 2phy (N=124, C=2753)]\n\nInstance   MP Quality   MP Time/Iteration   EM Quality   EM Time/Iteration   U.B.\n1fs1       268.3        3628.5/344          267.6        202.3/1220          276.4\n1gef       239.8        13938.9/1331        267.3        71.6/428            276.1\n1bkb       272.4        10928.9/965         288.2        250.1/1462          292.8\n1iib       236.2        11493.1/1099        245.7        78.1/442            251.1\n1on2       314.7        11628.34/1226       317.1        146.6/807           327.23\n1tmy       202.9        99.5/8              264.8        222.4/1067          272.1\n1or7       368.1        234.9/22            410.2        240.8/1087          419.3\n1fpo       406.2        9072.7/791          407.1        263.6/1125          434\n\nTable 1: Solution quality and time (in sec.) comparison between EM and Max-Product (MP). U.B. denotes the best known upper bound.\n\nWe also tested EM on the protein prediction problems [17, 11] which are simpler and sparser than the protein design problems. The LP relaxations in this case can be solved even by the standard LP solvers unlike the protein design problems [17]. Surprisingly, EM does not work well on these problems. 
For the hardest instance \u20181a8i\u2019 (812 variables, 10124 edges, edge density = 0.03), MPLP achieves the near-optimal value of 73, whereas EM could only achieve a value of $-374$. The reason for this lies in the reward structure of the problem, that is, the values $\\theta_{min}$ and $\\theta_{max}$. For this problem $\\theta_{min} = -5770.96$ and $\\theta_{max} = 3.88$. As shown earlier, EM works with the normalized rewards, assigning 0 to the minimum reward $-5770.96$ and 1 to the maximum reward 3.88. This dramatic scaling of the reward is particularly problematic for EM, as shown below.\n\nAccording to Thm. 1, the log-likelihood EM converges to is $-2.949 \\times 10^{-4}$. For EM to achieve the value 73, the log-likelihood should be $-2.913 \\times 10^{-4}$. However, the drastic scaling of the reward causes this minor difference to significantly affect solution quality. In such settings, EM may not work well. In contrast, for the largest protein design instance \u20181fpo\u2019, the minimum reward is $-59.2$ and the maximum is 4.37. We also experimented on a $10 \\times 10$ grid graph with 5 values per variable using the Potts model similar to [5]. We randomly generated 100 instances and found that EM achieved good solution quality, within 95% of optimal on average. The difference between the maximum and minimum reward in these problems was less than 5, with a typical setting: $\\theta_{min} \\approx -2.5$, $\\theta_{max} \\approx 2.5$. This further supports the previous analysis.\n\n6 Conclusion\n\nA number of techniques have been developed recently to find the MAP assignment of Markov random fields. Particularly successful are approaches based on LP relaxation of the MAP problem such as MPLP. Such approaches minimize an upper bound relatively quickly, but take much longer to find a good solution. 
In contrast, our proposed formulation seeks to provide good quality solutions quickly by directly maximizing a lower bound on the MAP value over the inner bound on the marginal polytope. The proposed Expectation Maximization (EM) algorithm increases this lower bound monotonically by likelihood maximization and is guaranteed to converge. Furthermore, EM\u2019s update equations can be efficiently implemented using a graph-based message passing paradigm. Although EM may get stuck at a local optimum, our empirical results on the protein design dataset show that EM performs very well, producing solutions within 95% of optimal on average. EM achieves such high solution quality significantly faster than MPLP or max-product for many large protein design problems. Another significant advantage EM enjoys is the ease of parallelization. Using advanced parallel computing paradigms such as Google\u2019s MapReduce [2] can further speed up the algorithm with little additional effort. Finally, we examined a setting in which EM may not work well due to a large gap between the minimum and maximum reward. Our ongoing efforts include incorporating some of the advanced clustering techniques based on LP relaxation of the MAP problem with the EM method, and designing heuristics that can help EM avoid getting stuck in local optima for problems with large variations in the reward structure.\n\n7 Acknowledgment\n\nWe thank anonymous reviewers for their helpful suggestions. Support for this work was provided in part by the National Science Foundation Grant IIS-0812149 and by the Air Force Office of Scientific Research Grant FA9550-08-1-0181.\n\nReferences\n\n[1] H. Attias. Planning by probabilistic inference. In Proc. of the 9th Int. Workshop on Artificial Intelligence and Statistics, 2003.\n\n[2] J. Dean and S. Ghemawat. MapReduce: a flexible data processing tool. Communications of the ACM, 53(1):72\u201377, 2010.\n\n[3] R. Dechter. 
Constraint Processing. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003.\n[4] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1\u201338, 1977.\n[5] A. Globerson and T. Jaakkola. Fixing Max-Product: Convergent message passing algorithms for MAP LP-relaxations. In Advances in Neural Information Processing Systems, 2007.\n[6] K. Jung, P. Kohli, and D. Shah. Local rules for global MAP: When do they work? In Advances in Neural Information Processing Systems, 2009.\n[7] C. D. Manning and H. Sch\u00fctze. Foundations of statistical natural language processing. MIT Press, Cambridge, MA, USA, 1999.\n[8] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann Publishers Inc., 1988.\n[9] D. Sontag, A. Globerson, and T. Jaakkola. Clusters and coarse partitions in LP relaxations. In Advances in Neural Information Processing Systems, pages 1537\u20131544, 2008.\n[10] D. Sontag and T. Jaakkola. New outer bounds on the marginal polytope. In Advances in Neural Information Processing Systems, 2007.\n[11] D. Sontag, T. Meltzer, A. Globerson, T. Jaakkola, and Y. Weiss. Tightening LP relaxations for MAP using message passing. In Proc. of Uncertainty in Artificial Intelligence, pages 503\u2013510, 2008.\n[12] M. Toussaint, L. Charlin, and P. Poupart. Hierarchical POMDP controller optimization by likelihood maximization. In Proc. of Uncertainty in Artificial Intelligence, pages 562\u2013570, 2008.\n[13] M. Toussaint, S. Harmeling, and A. Storkey. Probabilistic inference for solving (PO)MDPs. Technical Report EDIINF-RR-0934, University of Edinburgh, School of Informatics, 2006.\n[14] M. Toussaint and A. J. Storkey. Probabilistic inference for solving discrete and continuous state Markov decision processes. In Proc. 
of the International Conference on Machine Learning, pages 945\u2013952, 2006.\n[15] M. Wainwright, T. Jaakkola, and A. Willsky. MAP estimation via agreement on (hyper)trees: Message-passing and linear programming approaches. IEEE Transactions on Information Theory, 51:3697\u20133717, 2002.\n[16] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1\u2013305, 2008.\n[17] C. Yanover, T. Meltzer, Y. Weiss, P. Bennett, and E. Parrado-Hern\u00e1ndez. Linear programming relaxations and belief propagation \u2013 an empirical study. Journal of Machine Learning Research, 7:2006, 2006.\n", "award": [], "sourceid": 951, "authors": [{"given_name": "Akshat", "family_name": "Kumar", "institution": null}, {"given_name": "Shlomo", "family_name": "Zilberstein", "institution": null}]}