{"title": "Message-Passing for Approximate MAP Inference with Latent Variables", "book": "Advances in Neural Information Processing Systems", "page_first": 1197, "page_last": 1205, "abstract": "We consider a general inference setting for discrete probabilistic graphical models where we seek maximum a posteriori (MAP) estimates for a subset of the random variables (max nodes), marginalizing over the rest (sum nodes). We present a hybrid message-passing algorithm to accomplish this. The hybrid algorithm passes a mix of sum and max messages depending on the type of source node (sum or max). We derive our algorithm by showing that it falls out as the solution of a particular relaxation of a variational framework. We further show that the Expectation Maximization algorithm can be seen as an approximation to our algorithm. Experimental results on synthetic and real-world datasets, against several baselines, demonstrate the efficacy of our proposed algorithm.", "full_text": "Message-Passing for Approximate MAP Inference\n\nwith Latent Variables\n\nJiarong Jiang\n\nDept. of Computer Science\nUniversity of Maryland, CP\n\nPiyush Rai\n\nSchool of Computing\n\nUniversity of Utah\n\njiarong@umiacs.umd.edu\n\npiyush@cs.utah.edu\n\nHal Daum\u00b4e III\n\nDept. of Computer Science\nUniversity of Maryland, CP\nhal@umiacs.umd.edu\n\nAbstract\n\nWe consider a general inference setting for discrete probabilistic graphical models\nwhere we seek maximum a posteriori (MAP) estimates for a subset of the random\nvariables (max nodes), marginalizing over the rest (sum nodes). We present a hy-\nbrid message-passing algorithm to accomplish this. The hybrid algorithm passes\na mix of sum and max messages depending on the type of source node (sum or\nmax). We derive our algorithm by showing that it falls out as the solution of a par-\nticular relaxation of a variational framework. We further show that the Expectation\nMaximization algorithm can be seen as an approximation to our algorithm. 
Experimental results on synthetic and real-world datasets, against several baselines, demonstrate the efficacy of our proposed algorithm.\n\n1 Introduction\n\nProbabilistic graphical models provide a compact and principled representation for capturing complex statistical dependencies among a set of random variables. In this paper, we consider the general maximum a posteriori (MAP) problem in which we want to maximize over a subset of the variables (max nodes, denoted X), marginalizing over the rest (sum nodes, denoted Z). This problem is termed the Marginal-MAP problem. A typical example is the minimum Bayes risk (MBR) problem [1], where the goal is to find an assignment x̂ which optimizes a loss ℓ(x̂, x) with respect to some usually unknown truth x. Since x is latent, we need to marginalize it out before optimizing with respect to x̂. Although the specific problems of estimating marginals and estimating MAP assignments individually have been studied extensively [2, 3, 4], similar developments for the more general problem of simultaneous marginal and MAP estimation are lacking. Recently, [5] proposed a method based on optimizing a variational objective on specific graph structures; it was developed concurrently with the method we propose in this paper (please refer to the supplementary material for further details and other related work).\n\nThis problem is fundamentally difficult. As mentioned in [6, 7], even for a tree-structured model, we cannot solve the Marginal-MAP problem exactly in polynomial time unless P = NP. Moreover, it has been shown [8] that even if a joint distribution p(x, z) belongs to the exponential family, the corresponding marginal distribution p(x) = Σ_z p(x, z) is in general not in the exponential family (with a very short list of exceptions, such as Gaussian random fields). This means that we cannot directly apply algorithms for MAP inference to our task. 
Motivated by this problem, we propose a hybrid message-passing algorithm which is both intuitive and justified according to variational principles. Our hybrid message-passing algorithm uses a mix of sum and max messages, with the message type depending on the source node type.\n\nExperimental results on chain- and grid-structured synthetic datasets and a real-world dataset show that our hybrid message-passing algorithm compares favorably to standard sum-product, standard max-product, and the Expectation Maximization algorithm, which iteratively provides MAP and marginal estimates. Our estimates can be further improved by a few steps of local search [6]. Consequently, using the solution found by our hybrid algorithm to initialize a local search algorithm substantially improves both accuracy and convergence speed compared to the greedy stochastic search method described in [6]. We also give an example in Sec. 5 of how our algorithm can be used to solve other practical problems that can be cast in the Marginal-MAP framework. In particular, the minimum Bayes risk [9] problem for decomposable loss functions can be readily solved under this framework.\n\n2 Problem Setting\n\nIn our setting, the nodes in a graphical model with discrete random variables are divided into two sets: max and sum nodes. We denote a graph G = (V, E), V = X ∪ Z, where X is the set of nodes for which we want to compute the MAP assignment (max nodes), and Z is the set of nodes for which we need the marginals (sum nodes). Let x = {x_1, . . . , x_m} (x_s ∈ X_s) and z = {z_1, . . . , z_n} (z_s ∈ Z_s) be the random variables associated with the nodes in X and Z respectively. 
The exponential family distribution p over these random variables is defined as follows:\n\np_θ(x, z) = exp[⟨θ, φ(x, z)⟩ − A(θ)]\n\nwhere φ(x, z) is the sufficient statistics of the enumeration of all node assignments, and θ is the vector of canonical or exponential parameters. A(θ) = log Σ_{x,z} exp[⟨θ, φ(x, z)⟩] is the log-partition function. In this paper, we consider only pairwise node interactions and use the standard overcomplete representation of the sufficient statistics [10] (defined by the indicator function I later).\n\nThe general MAP problem can be formalized as the following maximization problem:\n\nx* = argmax_x Σ_z p_θ(x, z)   (1)\n\nwith the corresponding marginal probabilities of the z nodes, given x*:\n\np(z_s | x*) = Σ_{Z\{z_s}} p(z | x*),  s = 1, . . . , n   (2)\n\nBefore proceeding, we introduce some notation for clarity of exposition. Subscripts s, u, t, etc. denote nodes in the graphical model. z_s and x_s are the sum and max random variables, respectively, associated with node s. v_s can be either a sum (z_s) or a max (x_s) random variable associated with node s. N(s) is the set of neighbors of node s. X_s, Z_s, V_s are the state spaces from which x_s, z_s, v_s take values.\n\n2.1 Message Passing Algorithms\n\nThe sum-product and max-product algorithms are standard message-passing algorithms for inferring marginal and MAP estimates, respectively, in probabilistic graphical models. The idea is to store a belief state associated with each node and iteratively pass messages between adjacent nodes, which are used to update the belief states. It is known [11] that these algorithms are guaranteed to converge to the exact solution on trees or polytrees. 
On loopy graphs, they are no longer guaranteed to converge, but they can still provide good estimates when they do converge [12].\n\nIn the standard sum-product algorithm, the message M_ts passed from node t to one of its neighbors s is as follows:\n\nM_ts(v_s) ← κ Σ_{v′_t ∈ V_t} exp[θ_st(v_s, v′_t) + θ_t(v′_t)] Π_{u ∈ N(t)\s} M_ut(v′_t)   (3)\n\nwhere κ is a normalization constant. When the messages converge, i.e. {M_ts, M_st} do not change for any pair of nodes s and t, the belief (pseudomarginal distribution) for node s is given by µ_s(v_s) = κ exp{θ_s(v_s)} Π_{t ∈ N(s)} M_ts(v_s). The outgoing messages for the max-product algorithm have the same form but with a maximization instead of a summation in Eq. (3). After convergence, the MAP assignment for each node is the assignment with the highest max-marginal probability.\n\nOn loopy graphs, tree-reweighted sum and max product [13, 14] can help find upper bounds for the marginal or MAP problem. They decompose the loopy graph into several spanning trees and reweight the messages by the edge appearance probabilities.\n\n2.2 Local Search Algorithm\n\nEq. (1) can be viewed as doing a variable elimination for the z nodes first, followed by a maximization over x. Its maximization step may be performed using heuristic search techniques [7, 6]. Eq. (2) can be computed by running standard sum-product over z, given the MAP assignment x*. In [6], the assignment for the MAP nodes is found by greedily searching the best neighboring assignments, which differ in only one node. 
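This greedy search over single-node changes can be sketched as follows. This is our own illustration, not the code of [6]; the scoring function, which must evaluate or approximate log Σ_z p(x, z), is supplied by the caller, and all names are made up.

```python
def greedy_search(score, x0, k):
    """Greedy local search over max-node assignments (a sketch in the spirit of [6]).

    score : caller-supplied function mapping an assignment tuple x to an
            (estimate of) log sum_z p(x, z); the search is agnostic to how
            this quantity is computed.
    x0    : initial assignment, a tuple of ints in range(k).
    k     : number of states per node.
    Repeatedly moves to the best-scoring neighboring assignment differing in
    exactly one node, until no single-node change improves the score.
    """
    x, best = tuple(x0), score(tuple(x0))
    improved = True
    while improved:
        improved = False
        for i in range(len(x)):
            for v in range(k):
                if v == x[i]:
                    continue
                y = x[:i] + (v,) + x[i + 1:]  # neighbor differing at node i
                s = score(y)
                if s > best:  # greedy: accept any strict improvement
                    x, best, improved = y, s, True
    return x, best
```

The quality of the result depends heavily on the starting point x0, which is why the initialization schemes compared in Sec. 6 matter.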
However, the hybrid algorithm we propose allows simultaneously approximating both Eq. (1) and Eq. (2).\n\n3 HYBRID MESSAGE PASSING\n\nIn our setting, we wish to compute MAP estimates for one set of nodes and marginals for the rest. One possible approach is to run the standard sum/max product algorithms over the graph and find the most likely assignment for each max node according to the maximum of the sum or max marginals¹. These naïve approaches have their own shortcomings; for example, although using standard max-product may perform reasonably when there are many max nodes, it inevitably ignores the effect of the sum nodes, which should ideally be summed over. This is analogous to the difference between EM for Gaussian mixture models and K-means. (See Sec. 6.)\n\n3.1 ALGORITHM\n\nWe now present a hybrid message-passing algorithm which passes sum-style or max-style messages based on the type of the node from which the message originates. In the hybrid message-passing algorithm, a sum node sends sum messages to its neighbors and a max node sends max messages. The type of message passed depends on the type of the source node, not the destination node.\n\nMore specifically, the outgoing messages from a source node are as follows:\n\n• Message from sum node t to any neighbor s:\n\nM_ts(v_s) ← κ1 Σ_{z′_t ∈ Z_t} exp[θ_st(v_s, z′_t) + θ_t(z′_t)] Π_{u ∈ N(t)\s} M_ut(z′_t)   (4)\n\n• Message from max node t to any neighbor s:\n\nM_ts(v_s) ← κ2 max_{x′_t ∈ X_t} exp[θ_st(v_s, x′_t) + θ_t(x′_t)] Π_{u ∈ N(t)\s} M_ut(x′_t)   (5)\n\nwhere κ1, κ2 are normalization constants. Algorithm 1 shows the hybrid message-passing procedure.\n\nAlgorithm 1 Hybrid Message-Passing Algorithm\nInputs: Graph G = (V, E), V = X ∪ Z, potentials θ_s, s ∈ V and θ_st, (s, t) ∈ E.\n\n1. Initialize the messages to some arbitrary value.\n2. 
For each node s ∈ V in G, do the following until the messages converge (or a maximum number of iterations is reached):\n\n• If s ∈ X, update its outgoing messages by Eq. (5).\n• If s ∈ Z, update its outgoing messages by Eq. (4).\n\n3. Compute the local belief for each node s: µ_s(v_s) = κ exp{θ_s(v_s)} Π_{t ∈ N(s)} M_ts(v_s).\n4. For all s ∈ X, return argmax_{x_s ∈ X_s} µ_s(x_s).\n5. For all s ∈ Z, return µ_s(z_s).\n\nWhen there is only a single type of node in the graph, the hybrid algorithm reduces to the standard max- or sum-product algorithm. Otherwise, it passes different messages simultaneously and gives an approximation to the MAP assignment on the max nodes as well as the marginals on the sum nodes. On loopy graphs, we can also apply this scheme to pass hybrid tree-reweighted messages between nodes to obtain marginal and MAP estimates. (See Appendix C of the supplementary material.)\n\n¹Running the standard sum-product algorithm and choosing the maximum likelihood assignment for the max nodes is also called maximum marginal decoding [15, 16].\n\n3.2 VARIATIONAL DERIVATION\n\nIn this section, we show that the Marginal-MAP problem can be framed under a variational framework and that the hybrid message-passing algorithm turns out to be a solution of it (a detailed derivation is in Appendix A of the supplementary material). 
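Before the derivation, the updates of Algorithm 1 can be made concrete with a minimal sketch of hybrid message passing on a chain with tabular log-potentials. This is our own illustrative code, not the authors' implementation; the array layout and all names are assumptions.

```python
import numpy as np

def hybrid_bp(theta_node, theta_edge, max_nodes, iters=50):
    """Hybrid message passing (Algorithm 1 sketch) on a chain 0-1-...-(n-1).

    theta_node : (n, k) node log-potentials theta_s(v_s).
    theta_edge : (n-1, k, k) edge log-potentials; theta_edge[e, a, b] is the
                 potential of edge (e, e+1) with node e in state a, e+1 in b.
    max_nodes  : indices of max (X) nodes; all other nodes are sum (Z) nodes.
    Returns the per-node beliefs; max nodes are decoded by argmax.
    """
    n, k = theta_node.shape
    nbrs = lambda t: [u for u in (t - 1, t + 1) if 0 <= u < n]
    # msgs[(t, s)] is the message from node t to its neighbor s
    msgs = {(t, s): np.full(k, 1.0 / k) for t in range(n) for s in nbrs(t)}
    for _ in range(iters):
        for (t, s) in list(msgs):
            edge = theta_edge[min(s, t)]
            pair = edge if s < t else edge.T        # reorient as [v_s, v_t]
            # psi[v_s, v_t] = exp(theta_st + theta_t) times incoming messages
            psi = np.exp(pair + theta_node[t][None, :])
            for u in nbrs(t):
                if u != s:
                    psi = psi * msgs[(u, t)][None, :]
            # message type follows the SOURCE node t: max-style for a max node
            # (Eq. 5), sum-style for a sum node (Eq. 4)
            m = psi.max(axis=1) if t in max_nodes else psi.sum(axis=1)
            msgs[(t, s)] = m / m.sum()              # normalization (kappa)
    beliefs = []
    for s in range(n):
        b = np.exp(theta_node[s])
        for u in nbrs(s):
            b = b * msgs[(u, s)]
        beliefs.append(b / b.sum())
    return beliefs
```

With max_nodes empty this reduces to sum-product (exact on a chain), and with every node in max_nodes it reduces to max-product; a mixed choice interleaves Eqs. (4) and (5) exactly as the algorithm prescribes.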
To see this, we construct a new graph G_x̄ with the assignments of the x_s fixed to x̄ ∈ X = X_1 × · · · × X_m, so the log-partition function A(θ_x̄) of the graph G_x̄ is\n\nA(θ_x̄) = log Σ_z p(x̄, z) + A(θ) = log p(x̄) + const   (6)\n\nAs the constant only depends on the log-partition function of the original graph and does not vary with different assignments of the MAP nodes, A(θ_x̄) exactly estimates the log-likelihood of the assignment x̄. Therefore argmax_{x̄ ∈ X} log p(x̄) = argmax_{x̄ ∈ X} A(θ_x̄). Moreover, A(θ_x̄) can be approximated by the following [10]:\n\nA(θ_x̄) ≈ sup_{µ ∈ M(G_x̄)} ⟨θ, µ⟩ + H_Bethe(µ)   (7)\n\nwhere M(G_x̄) is the following marginal polytope of the graph G_x̄:\n\nM(G_x̄) = { µ | µ_s(z_s), µ_st(v_s, v_t): marginals with x̄ fixed to its assignment; µ_s(x_s) = 1 if x_s = x̄_s, 0 otherwise }   (8)\n\nRecall that v_s stands for x_s or z_s. H_Bethe(µ) is the Bethe entropy of the graph:\n\nH_Bethe(µ) = Σ_s H_s(µ_s) − Σ_{(s,t) ∈ E} I_st(µ_st),  where  H_s(µ_s) = −Σ_{v_s ∈ V_s} µ_s(v_s) log µ_s(v_s)  and  I_st(µ_st) = Σ_{(v_s,v_t) ∈ V_s × V_t} µ_st(v_s, v_t) log [µ_st(v_s, v_t) / (µ_s(v_s) µ_t(v_t))]   (9)\n\nFor readability, we use µ_sum, µ_max to subsume the node and pairwise marginals for sum/max nodes, and µ_sum→max, µ_max→sum are the pairwise marginals for edges between different types of nodes. The 
The\ndirection here is used to be consistent with the distinction of the constraints as well as the messages.\n\nSolving the Marginal-MAP problem is therefore equivalent to solving the following optimization\nproblem:\n\nmax\n\u00afx\u2208X\n\nsup\n\nh\u03b8, \u00b5i + HBethe(\u00b5) \u2248 sup\n\nsup\n\nh\u03b8, \u00b5i + HBethe(\u00b5)\n\n(10)\n\n\u00b5other\u2208M (G\u00afx)\n\n\u00b5max\u2208M\u00afx\n\n\u00b5other\u2208M (G\u00afx)\n\n\u00b5other contains all other node/pairwise marginals except \u00b5max. The Bethe entropy terms can be\nwritten as (H is the entropy and I is mutual information)\n\nHBethe(\u00b5) = H\u00b5max + H\u00b5sum \u2212 I\u00b5max\u2192\u00b5max \u2212 I\u00b5sum\u2192\u00b5sum \u2212 I\u00b5max\u2192\u00b5sum \u2212 I\u00b5sum\u2192\u00b5max\n\nIf we force to satisfy the second condition in (8), the entropy of max nodes H\u00b5max = Hs(\u00b5s) = 0,\n\u2200s \u2208 X and the mutual information between max nodes I\u00b5max\u2192\u00b5max = Ist(xs, xt) = 0, \u2200s, t \u2208 X.\nFor mutual information between different types of nodes, we can either force xs to have integral so-\nlutions, or relax xs to have non-integral solution, or relax xs on one direction2. In practice, we relax\nthe mutual information on the message from sum nodes to max nodes, so the mutual information\n\u00b5s(xs)\u00b5t(zt) =\n\u00b5s(x\u2217)\u00b5t(zt) = 0, \u2200s \u2208 X, t \u2208 Z, where x\u2217 is the assigned state of x at node\ns. 
Finally, we only require the sum nodes to satisfy the normalization and marginalization conditions; the entropy of the sum nodes, the mutual information between sum nodes, and the mutual information from sum nodes to max nodes can be nonzero. The above process relaxes the polytope M(G_x̄) to M_x̄ × L_z(G_x̄), where\n\nL_z(G_x̄) = { µ ≥ 0 | Σ_{z_s} µ_s(z_s) = 1; µ_s(x_s) = 1 iff x_s = x̄_s; Σ_{z_t} µ_st(v_s, z_t) = µ_s(v_s); Σ_{z_s} µ_st(z_s, v_t) = µ_t(v_t); µ_st(x_s, z_t) = µ_t(z_t) iff x_s = x̄_s; µ_st(x_s, x_t) = 1 iff x_s = x̄_s, x_t = x̄_t }\n\n²This results in four different relaxations for the different combinations of message types; the hybrid algorithm performed empirically the best.\n\nThis analysis results in the following optimization problem:\n\nsup_{µ_max ∈ M_x̄} sup_{µ_other ∈ M(G_x̄)} ⟨θ, µ⟩ + H(µ_sum) − I(µ_sum→sum) − I(µ_sum→max)\n\nFurther relaxing the µ_s(x_s) to have non-integral solutions, define\n\nL(G) = { µ ≥ 0 | Σ_{v_s} µ_s(v_s) = 1, Σ_{v_t} µ_st(v_s, v_t) = µ_s(v_s), Σ_{v_s} µ_st(v_s, v_t) = µ_t(v_t) }\n\nFinally, we get\n\nsup_{µ ∈ L(G)} ⟨µ, θ⟩ + H(µ_sum) − I(µ_sum→sum) − I(µ_sum→max)   (11)\n\nso that M_x̄ × M_z(G_x̄) ⊆ M_x̄ × L_z(G_x̄) ⊆ L(G). 
Unfortunately, M_x̄ × M_z(G_x̄) is not guaranteed to be convex, and we can only obtain an approximate solution to the problem defined in Eq. (11). Taking the Lagrangian formulation, for an x node the partial derivative of the Lagrangian with respect to µ_s(x_s), s ∈ X, keeps the same form as in the max-product derivation [10], and the situation is identical for µ_s(z_s), s ∈ Z, and the pairwise pseudo-marginals, so the hybrid message-passing algorithm provides a solution to Eq. (11) (see Appendix A of the supplementary material for a detailed derivation).\n\n4 Expectation Maximization\n\nAnother plausible approach to solving the Marginal-MAP problem is the Expectation Maximization (EM) algorithm [17], typically used for maximum likelihood parameter estimation in latent variable models. In our setting, the variables Z correspond to the latent variables. We now show one way of approaching this problem by applying the sum-product and max-product algorithms in the E and M steps respectively. To see this, let us first define³\n\nF(p̃, x) = E_p̃[log p(x, z)] + H(p̃(z))   (12)\n\nwhere H(p̃) = −E_p̃[log p̃(z)]. EM can then be interpreted as a joint maximization of the function F [18]: at iteration t, the E-step sets p̃^(t) to be the p̃ that maximizes F(p̃, x^(t−1)), and the M-step sets x^(t) to be the x that maximizes F(p̃^(t), x). Given F, the following two properties⁴ show that jointly maximizing the function F is equivalent to maximizing the objective function p(x) = Σ_z p(x, z):\n\n1. With the value of x fixed in the function F, the unique solution to maximizing F(p̃, x) is given by p̃(z) = p(z|x).\n2. If p̃(z) = p(z|x), then F(p̃, x) = log p(x) = log Σ_z p(x, z).\n\n4.1 Expectation Maximization via Message Passing\n\nNow we can derive the EM algorithm for solving the Marginal-MAP problem by jointly maximizing the function F. 
In the E-step, we need to estimate p̃(z) = p(z|x) given x. This can be done by fixing the x values at their MAP assignments and running the sum-product algorithm over the resulting graph.\n\nThe M-step works by maximizing E_{z∼p_θ(z|x̄)}[log p_θ(x, z)], where x̄ is the assignment given by the previous M-step. This is equivalent to maximizing E_{z∼p_θ(z|x̄)}[log p_θ(x|z)], as the log p_θ(z) term in the maximization is independent of x. Now, max_x E_{z∼p_θ(z|x̄)}[log p_θ(x|z)] = max_x Σ_z p(z|x̄) ⟨θ, φ(x, z)⟩, which in the overcomplete representation [10] can be approximated by\n\nΣ_{s ∈ X, i} [θ_{s;i} + Σ_{t ∈ Z, j} µ_{t;j} θ_{st;ij}] I_{s;i}(x_s) + Σ_{(s,t) ∈ E, s,t ∈ X} Σ_{(i,j)} θ_{st;ij} I_{st;ij}(x_s, x_t) + C\n\nwhere C subsumes the terms irrelevant to the maximization over x, and µ_t is the pseudo-marginal of node t given x̄⁵. The M-step then amounts to running the max-product algorithm with the potentials on the x nodes modified accordingly. Summarizing, the EM algorithm for Marginal-MAP estimation can be interpreted as follows:\n\n• E-step: Fix the x_s at the MAP assignment values from iteration (k − 1) and run sum-product to get the beliefs on the sum nodes z_s, say µ_t, t ∈ Z.\n\n³By directly applying Jensen's inequality to the objective function max_x log Σ_z p(x, z).\n⁴The proofs are straightforward following Lemmas 1 and 2 in [18], pages 4–5. More details are in Appendix B of the supplementary material.\n⁵A detailed derivation is in Appendix B.4 of the supplementary material.\n\n• M-step: Build a new graph G̃ = (Ṽ, Ẽ) containing only the max nodes: Ṽ = X and Ẽ = {(s, t) | (s, t) ∈ E, s, t ∈ X}. For each max node s in the graph, set its potential to θ̃_{s;i} = θ_{s;i} + Σ_j θ_{st;ij} µ_{t;j}, where t ∈ Z and (s, t) ∈ E. 
θ̃_{st;ij} = θ_{st;ij} for all (s, t) ∈ Ẽ. Run max-product over this new graph and update the MAP assignment.\n\n4.2 Relationship with the Hybrid Algorithm\n\nApart from the fact that the hybrid algorithm passes different messages simultaneously while EM does so iteratively, to see the connection with the hybrid algorithm let us first consider the message passed in the E-step at iteration k. The x_s are fixed at the last assignment, which maximizes the message at iteration k − 1, denoted x* here. The M_ut^(k−1) are the messages computed at iteration k − 1:\n\nM_ts^(k)(z_s) = κ1 exp[θ_st(z_s, x*_t) + θ_t(x*_t)] Π_{u ∈ N(t)\s} M_ut^(k−1)(x*_t)   (13)\n\nNow assume there exists an iterative algorithm which, at each iteration, computes the messages used in both steps of the message-passing variant of the EM algorithm, denoted M̃_ts. Eq. (13) then becomes\n\nM̃_ts^(k)(z_s) = κ1 max_{x′_t} exp[θ_st(z_s, x′_t) + θ_t(x′_t)] Π_{u ∈ N(t)\s} M̃_ut^(k−1)(x′_t)\n\nSo the max nodes (the x's) should pass max messages to their neighbors (the z's), which is exactly what the hybrid message-passing algorithm does.\n\nIn the M-step of EM (as discussed in Sec. 4), all the sum nodes t are removed from the graph and the parameters of the adjacent max nodes are modified as θ_{s;i} = θ_{s;i} + Σ_j θ_{st;ij} µ_{t;j}. µ_t is computed by sum-product in the E-step of iteration k, and these sum messages are used (in the form of the marginals µ_t) in the subsequent M-step (with the sum nodes removed). However, a max node may prefer different assignments according to different neighboring nodes. With such uncertainties, especially during the first few iterations, making hard decisions is very likely to lead directly to bad local optima. 
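The E/M alternation of Sec. 4 can be sketched on a toy model with a single max node x and a single sum node z. This is our own made-up example, not the paper's experiments; the potentials and all names are assumptions.

```python
import numpy as np

def em_marginal_map(theta_x, theta_z, theta_xz, iters=20):
    """EM alternation for marginal-MAP on one max node x and one sum node z.

    theta_x : (kx,) log-potentials for x; theta_z : (kz,) log-potentials for z;
    theta_xz: (kx, kz) pairwise log-potentials.
    """
    x_bar = 0  # arbitrary initial MAP guess
    for _ in range(iters):
        # E-step: q(z) = p(z | x_bar); on this tiny graph sum-product is exact
        q = np.exp(theta_z + theta_xz[x_bar])
        q /= q.sum()
        # M-step: fold the expected edge potential into the node potential
        # (theta~_{x;i} = theta_{x;i} + sum_j theta_{xz;ij} q_j), then maximize
        x_bar = int(np.argmax(theta_x + theta_xz @ q))
    return x_bar, q
```

Note that each M-step commits to a hard assignment x̄, so a poor early decision can lock the iteration into a bad local optimum; this is the failure mode discussed above.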
In comparison, the hybrid message-passing algorithm passes mixed messages instead of making deterministic assignments in each iteration.\n\n5 MBR Decoding\n\nMost work on finding “best” solutions in graphical models focuses on the MAP estimation problem: find the x that maximizes p_θ(x). In many practical applications, one instead wishes to find an x that minimizes some risk, parameterized by a given loss function. This is the minimum Bayes risk (MBR) setting, which has proven useful in a number of domains, such as speech recognition [9], natural language parsing [19, 20], and machine translation [1]. We are given a loss function ℓ(x, x̂) which measures the loss of x̂ assuming x is the truth, and we assume losses are non-negative. Given this loss function, the minimum Bayes risk solution is the minimizer of Eq. (14):\n\nMBR_θ = argmin_x̂ E_{x∼p}[ℓ(x, x̂)] = argmin_x̂ Σ_x p(x) ℓ(x, x̂)   (14)\n\nWe now assume that ℓ decomposes over the structure of x. In particular, suppose that ℓ(x, x̂) = Σ_{c ∈ C} ℓ(x_c, x̂_c), where C is some set of cliques in x and x_c denotes the variables associated with clique c. For example, for Hamming loss, the cliques are simply the pairs of vertices of the form (x_i, x̂_i), and the loss simply counts the number of disagreements. Such decomposability is widely assumed in structured prediction algorithms [21, 22]. Assume ℓ_c(x, x′) ≤ L for all c, x, x′; therefore ℓ(x, x′) ≤ |C|L. We can then expand Eq. (14) into the following:\n\nMBR_θ = argmin_x̂ Σ_x p(x) ℓ(x, x̂) = argmax_x̂ Σ_x p(x)(|C|L − ℓ(x, x̂)) = argmax_x̂ Σ_x exp[⟨θ, x⟩ + log Σ_c (L − ℓ(x_c, x̂_c)) − A(θ)]\n\nThe resulting expression has exactly the same form as the Marginal-MAP problem, with x being the variable marginalized out and x̂ the variable being maximized. Fig. 
1 shows a simple example of transforming a MAP lattice problem into an MBR problem under Hamming loss. Therefore, we can apply our hybrid algorithm to solve the MBR problem.\n\nFigure 1: The augmented model for solving the MBR problem under Hamming loss over a 6-node simple lattice.\n\nFigure 2: Comparison of various algorithms for marginals on a 10-node chain graph. [Plot: average KL-divergence on the sum nodes vs. % of sum nodes, for max product, sum product, and hybrid message passing.]\n\n6 EXPERIMENTS\n\nWe perform experiments on synthetic datasets as well as a real-world protein side-chain prediction dataset [23], and compare our hybrid message-passing algorithm (both its standard belief propagation and tree-reweighted belief propagation (TRBP) versions) against a number of baselines, such as the standard sum/max product based MAP estimates, EM, TRBP, and the greedy local search algorithm proposed in [6].\n\n6.1 Synthetic Data\n\nFor synthetic data, we first take a 10-node chain graph with varying splits of sum vs. max nodes and random potentials. Each node can take one of two states (0/1). The node and edge potentials are drawn from UNIFORM(0, 1), and we randomly pick nodes in the graph to be sum or max nodes. For this small graph, the true assignment is computable by explicitly maximizing p(x) = Σ_z p(x, z) = (1/Z) Σ_z Π_{s ∈ V} ψ_s(v_s) Π_{(s,t) ∈ E} ψ_st(v_s, v_t), where Z is a normalization constant and ψ_s(v_s) = exp θ_s(v_s).\n\nFirst, we compare the various algorithms on the MAP assignments. 
Assume that the aforementioned maximization gives the assignment x* = (x*_1, . . . , x*_n) and some algorithm gives the approximate assignment x = (x_1, . . . , x_n). The metrics we use here are the 0/1 loss and the Hamming loss.\n\nFigure 3: Comparison of various algorithms for MAP estimates on a 10-node chain graph: 0/1 loss (left), Hamming loss (right). [Plots: error rate vs. % of sum nodes for max, sum, hybrid, EM, and max/sum/hybrid+local search.]\n\nFig. 3 shows the loss on the assignment of the max nodes. As the number of sum nodes goes up, the accuracy of the standard sum-product based estimation (sum) gets better, whereas the accuracy of the standard max-product based estimation (max) worsens. However, our hybrid message-passing algorithm (hybrid) on average results in the lowest loss compared to the other baselines, with running times similar to the sum/max product algorithms.\n\nWe also compare against the stochastic greedy search approach described in [6], initialized with the results of the sum/max/hybrid algorithms (sum/max/hybrid+local search). As shown in [6], local search with sum-product initialization empirically performs better than with max-product initialization, so from here on we only compare the results with local search using sum-product initialization (LS). 
Best among the three initialization methods, starting from the hybrid algorithm's results the search algorithm finds the local optimum in very few steps, and this local optimum often happened to be the global optimum as well. In particular, it takes only 1 or 2 steps of search in the 10-node chain case and 1 to 3 steps in the 50-node tree case.\n\nFigure 4: Approximate log-partition function scores on a 50-node tree (left) and an 8×10 grid (right), normalized by the result of the hybrid algorithm. [Plots: relative likelihood vs. % of sum nodes for (TR-)max, (TR-)sum, (TR-)hybrid, LS, and (TR-)hybrid+LS.]\n\nNext, we experiment with marginal estimation. Fig. 2 shows the mean KL-divergence of the marginals for the three message-passing algorithms (each averaged over 100 random experiments) compared to the true marginals of p(z|x). The greedy search of [6] is not included since it only provides MAP estimates, not marginals. The x-axis shows the percentage of sum nodes in the graph. Just as in the MAP case, our hybrid method consistently produces the smallest KL-divergence.\n\nWhen the computation of the truth is intractable, the log-likelihood of an assignment can be approximated by the log-partition function with the Bethe approximation, according to Sec. 3.2. Note that this is exact on trees. 
Here, we use a 50-node tree with binary node states and an 8 × 10 grid with varying state-space sizes 1 ≤ |V_s| ≤ 20. On the grid graph, we apply tree-reweighted sum or max product [14, 13] and our hybrid version based on TRBP. For the edge appearance probabilities in TRBP, we apply a common approach that uses a greedy algorithm to find spanning trees with as many uncovered edges as possible until all the edges in the graph are covered at least once. Even though the message-passing algorithms are not guaranteed to converge on loopy graphs, we can still compare the best result they provide after a certain number of iterations.\n\nFig. 4 presents the results. In the tree case, as expected, using the hybrid message-passing algorithm's result to initialize the local search algorithm performs the best. On the grid graph, the local search algorithm initialized by the sum-product results works well when there are few max nodes, but since the search space grows exponentially with the number of max nodes, it takes hundreds of steps to find the optimum. On the other hand, because the hybrid TRBP starts in a good region, it consistently achieves the highest likelihood among all four algorithms with fewer extra steps.\n\n6.2 Real-world Data\n\nWe then experiment with the protein side-chain prediction dataset [23, 24], which consists of a set of protein structures for which we need to find the lowest-energy assignment of rotamer residues. There are two sets of residues: core residues and surface residues. The core residues are those connected to more than 19 other residues; the others are surface residues. Since the MAP results are usually lower on the surface residues than on the core residues [24], we choose the surface residues to be max nodes and the core residues to be sum nodes. 
The ground truth is given by the maximum likelihood assignment of the residues, so we do not expect better results on the core nodes; rather, we hope that any improvement in accuracy on the surface nodes makes up for the loss on the core nodes and thus yields better overall performance. As shown in Table 1, the gains of the hybrid methods on the surface nodes exceed the losses on the core nodes, improving the overall performance.

Table 1: Accuracy on the 1st (χ1) and the 1st & 2nd (χ1 ∧ χ2) Angles

χ1             CORE     SURFACE   ALL
sum product    0.8325   0.7564    0.7900
max product    0.8336   0.7555    0.7900
hybrid         0.8336   0.7573    0.7910
TRBP           0.8364   0.7608    0.7942
hybrid TRBP    0.8359   0.7626    0.7950

χ1 ∧ χ2        CORE     SURFACE   ALL
sum product    0.7005   0.6069    0.6482
max product    0.7078   0.6064    0.6512
hybrid         0.7033   0.6051    0.6485
TRBP           0.7174   0.6112    0.6592
hybrid TRBP    0.7186   0.6140    0.6597

References

[1] Shankar Kumar and William Byrne. Minimum Bayes-risk decoding for statistical machine translation. In HLT-NAACL, 2004.

[2] David Sontag and Tommi Jaakkola. New outer bounds on the marginal polytope. In NIPS, 2007.

[3] Amir Globerson and Tommi Jaakkola. Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations. In NIPS, 2007.

[4] Pradeep Ravikumar, Alekh Agarwal, and Martin J. Wainwright. Message-passing for graph-structured linear programs: proximal projections, convergence and rounding schemes. In ICML, 2008.

[5] Qiang Liu and Alexander Ihler. Variational algorithms for marginal MAP. In UAI, 2011.

[6] James D. Park. MAP complexity results and approximation methods. In UAI, 2002.

[7] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

[8] Shaul K. Bar-Lev, Daoud Bshouty, Peter Enis, Gerard Letac, I-Li Lu, and Donald Richards.
The diagonal multivariate natural exponential families and their classification. Journal of Theoretical Probability, pages 883–929, 1994.

[9] Vaibhava Goel and William J. Byrne. Minimum Bayes-risk automatic speech recognition. Computer Speech and Language, 14(2), 2000.

[10] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 2008.

[11] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988.

[12] Jonathan S. Yedidia, William T. Freeman, and Yair Weiss. Generalized belief propagation. In NIPS, 2000.

[13] Martin J. Wainwright, Tommi S. Jaakkola, and Alan S. Willsky. Exact MAP estimates by tree agreement. In NIPS, 2002.

[14] Martin J. Wainwright, Tommi S. Jaakkola, and Alan S. Willsky. Tree-reweighted belief propagation algorithms and approximate ML estimation by pseudo-moment matching. In AISTATS, 2003.

[15] Mark Johnson. Why doesn't EM find good HMM POS-taggers? In EMNLP, pages 296–305, 2007.

[16] Pradeep Ravikumar, Martin J. Wainwright, and Alekh Agarwal. Message-passing for graph-structured linear programs: Proximal methods and rounding schemes, 2008.

[17] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 1977.

[18] Radford M. Neal and Geoffrey E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355–368, 1999.

[19] Slav Petrov and Dan Klein. Discriminative log-linear grammars with latent variables. In NIPS, 2008.

[20] Ivan Titov and James Henderson. A latent variable model for generative dependency parsing. In IWPT, 2007.

[21] Ben Taskar, Vassil Chatalbashev, Daphne Koller, and Carlos Guestrin.
Learning structured prediction models: a large margin approach. 2004.

[22] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453–1484, 2005.

[23] Chen Yanover, Talya Meltzer, and Yair Weiss. Linear programming relaxations and belief propagation – an empirical study. Journal of Machine Learning Research, 7:1887–1907, 2006.

[24] Chen Yanover, Ora Schueler-Furman, and Yair Weiss. Minimizing and learning energy functions for side-chain prediction. In RECOMB, 2007.