{"title": "MAP Inference in Chains using Column Generation", "book": "Advances in Neural Information Processing Systems", "page_first": 1844, "page_last": 1852, "abstract": "Linear chains and trees are basic building blocks in many applications of graphical models. Although exact inference in these models can be performed by dynamic programming, this computation can still be prohibitively expensive with non-trivial target variable domain sizes due to the quadratic dependence on this size. Standard message-passing algorithms for these problems are inefficient because they compute scores on hypotheses for which there is strong negative local evidence. For this reason there has been significant previous interest in beam search and its variants; however, these methods provide only approximate inference. This paper presents new efficient exact inference algorithms based on the combination of column generation and pre-computed bounds on the model's cost structure. While we do not improve worst-case performance, our method substantially speeds real-world, typical-case inference in chains and trees. Experiments show our method to be twice as fast as exact Viterbi for Wall Street Journal part-of-speech tagging and over thirteen times faster for a joint part-of-speech and named-entity-recognition task. 
Our algorithm is also extendable to new techniques for approximate inference, to faster two-best inference, and new opportunities for connections between inference and learning.", "full_text": "MAP Inference in Chains using Column Generation\n\nDavid Belanger\u2217, Alexandre Passos\u2217, Sebastian Riedel\u2020, Andrew McCallum\n\n{belanger,apassos,mccallum}@cs.umass.edu, s.riedel@cs.ucl.ac.uk\n\nDepartment of Computer Science, University of Massachusetts, Amherst\n\n\u2020 Department of Computer Science, University College London\n\nAbstract\n\nLinear chains and trees are basic building blocks in many applications of graphical models, and they admit simple exact maximum a-posteriori (MAP) inference algorithms based on message passing. However, in many cases this computation is prohibitively expensive, due to quadratic dependence on variables\u2019 domain sizes. The standard algorithms are inefficient because they compute scores for hypotheses for which there is strong negative local evidence. For this reason there has been significant previous interest in beam search and its variants; however, these methods provide only approximate results. This paper presents new exact inference algorithms based on the combination of column generation and pre-computed bounds on terms of the model\u2019s scoring function. While we do not improve worst-case performance, our method substantially speeds real-world, typical-case inference in chains and trees. Experiments show our method to be twice as fast as exact Viterbi for Wall Street Journal part-of-speech tagging and over thirteen times faster for a joint part-of-speech and named-entity-recognition task. Our algorithm is also extendable to new techniques for approximate inference, to faster 0/1 loss oracles, and new opportunities for connections between inference and learning. 
We encourage further exploration of high-level reasoning\nabout the optimization problem implicit in dynamic programs.\n\n1\n\nIntroduction\n\nMany uses of graphical models either directly employ chains or tree structures\u2014as in part-of-speech\ntagging\u2014or employ them to enable inference in more complex models\u2014as in junction trees and tree\nblock coordinate descent [1]. Traditional message-passing inference in these structures requires an\namount of computation dependent on the product of the domain sizes of variables sharing an edge\nin the graph. Even in chains, exact inference is prohibitive in tasks with large domains due to the\nquadratic dependence on domain size. For this reason, many practitioners rely on beam search or\nother approximate inference techniques [2]. However, inference by beam search is approximate.\nThis not only hurts test-time accuracy, but can also interfere with parameter estimation [3].\nWe present a new algorithm for exact MAP inference in chains that is substantially faster than Viterbi\nin the typical case. We draw on four key ideas: (1) it is wasteful to compute and store messages to\nand from low-scoring states, (2) it is possible to compute bounds on data-independent (not varying\nwith the input data) scores of the model of\ufb02ine, (3) inference should make decisions based on local\nevidence for variables\u2019 values and rely on the graph only for disambiguation [4], and (4) runtime\nbehavior should adapt to the cost structure of the model (i.e., the algorithm should be energy-aware\n[5]). The combination of these ideas yields provably exact MAP inference for chains and trees that\ncan be more than an order of magnitude faster than traditional methods. 
Our algorithm has wide-\nranging applicability, and we believe it could bene\ufb01cially replace many traditional uses of Viterbi\nand beam search.\n\n\u2217The \ufb01rst two authors contributed equally to this paper.\n\n1\n\n\fWe exploit the connections between message-passing algorithms and LP relaxations for MAP infer-\nence. Directly solving LP relaxations for MAP using a state-of-the-art solver is inef\ufb01cient because\nit ignores key structure of the problem [6]. However, it is possible to leverage message-passing as a\nfast subroutine to solve smaller LPs, and use high-level techniques to compose these solutions into\na solution to the original problem.\nWith this interplay in mind, we employ column generation [7], a family of approaches to solving\nlinear programs that are dual to cutting planes: they start by solving restricted primal problems,\nwhere many LP variables are set to zero, and slowly add other LP variables until they are able to\nprove that adding no other variable can improve the solution. From these properties of column\ngeneration, we also show how to perform approximate inference that is guaranteed not to be worse\nthan the optimal by a given gap, how to construct an ef\ufb01cient 0/1-loss oracle by running 2-best\ninference in a subset of the graphical model, and how to learn parameters in such a way to make\ninference even faster.\nThe use of column generation has not been widely explored or appreciated in graphical models.\nThis paper is intended to demonstrate its bene\ufb01ts and encourage further work in this direction.\nWe demonstrate experimentally that our method has substantial speed advantages while retaining\nguaranteed exact inference. In Wall Street Journal part-of-speech tagging our method is more than\n2.5 times faster than Viterbi, and also faster than beam search with a width of two. 
In joint POS\ntagging and named entity recognition, our method is thirteen times faster than Viterbi and also faster\nthan beam search with a width of seven.\n\n2 Delayed Column Generation in LPs\n\nIn LPs used for combinatorial optimization problems, we know a priori that there are optimal solu-\ntions in which many variables will be set to zero. This is enforced by the problem\u2019s constraints or it\ncharacterizes optimality (e.g., the solution to a shortest path LP would not include multiple paths).\nColumn generation is a technique for exploiting this sparsity for faster inference. It restricts an LP\nto a subset of its variables (implicitly setting the others to zero) and alternates between solving this\nrestricted linear program and selecting which variables should be added to it, based on whether\nthey could potentially improve the objective. When no candidates remain, the current solution to the\nrestricted problem is guaranteed to be the exact solution of the unrestricted problem.\nThe test to determine whether un-generated variables could potentially improve the objective is\nwhether their reduced cost is positive, which is also the test employed by some pivoting rules in\nthe simplex algorithm [8, 7]. The difference between the algorithms is that simplex enumerates\nprimal variables explicitly, while column generation \u201cgenerates\u201d them only as needed. The key to\nan ef\ufb01cient column generation algorithm is an oracle that can either prove that no variable with\npositive reduced cost exists or produce one.\nConsider the general LP:\n\ncT x s.t. 
Ax \u2264 b,  x \u2265 0  (maximizing the objective cT x)\n\n(1)\n\nwith corresponding Lagrangian\n\nL(x, \u03bb) = cT x + \u03bbT (b \u2212 Ax) = \u03a3i (ci \u2212 AiT \u03bb) xi + \u03bbT b.\n\n(2)\n\nFor a given assignment to the dual variables \u03bb, a variable xi is a candidate for being added to the restricted problem if its reduced cost ri = ci \u2212 AiT \u03bb, the scalar multiplying it in the Lagrangian, is positive. Another way to justify this decision rule is by considering the constraints in the LP dual:\n\nmin. bT \u03bb s.t. AT \u03bb \u2265 c, \u03bb \u2265 0\n\n(3)\n\nHere, the reduced cost of a primal variable equals the degree to which its dual constraint is violated, and thus column generation in the primal is equivalent to cutting planes in the dual [7]. If there is no variable of positive reduced cost, then the current dual variables from the restricted problem are feasible in the unrestricted problem, and thus we have a primal-dual optimal pair, and can terminate column generation. An advantageous property of column generation that we employ later on is that it maintains primal feasibility across iterations, and thus it can be halted to provide approximate, anytime solutions.\n\n3 Connection Between LP Relaxations and Message-Passing in Chains\n\nThis section provides background showing how the LP formulation of the inference problem in chains leads to the known message-passing algorithm. The derivation follows Wainwright and Jordan [9], but is specialized for chains and highlights connections to our contributions.\nThe LP for MAP inference in chains is as follows\n\nmax. \u03a3i,xi \u00b5i(xi)\u03b8i(xi) + \u03a3i,xi,xi+1 \u00b5i(xi, xi+1)\u03c4(xi, xi+1)\ns.t. \u03a3xi \u00b5i(xi) = 1  \u2200i\n\u03a3xi \u00b5i(xi, xi+1) = \u00b5i+1(xi+1)  \u2200i, xi+1\n\u03a3xi+1 \u00b5i(xi, xi+1) = \u00b5i(xi)  \u2200i, xi\n\n(4)\n\nwhere \u03b8i(xi) is the score obtained from assigning the i-th variable to value xi, \u00b5i(xi) is an indicator variable saying whether or not the MAP assignment sets the i-th variable to the value xi, and \u03c4(xi, xi+1) is the score the model assigns to a transition from value xi to value xi+1. It is implicitly assumed that all variables are nonnegative. We assume a static \u03c4, but all statements trivially generalize to position-dependent \u03c4i.\nWe can restructure this LP to only depend on the pairwise assignment variables \u00b5i(xi, xi+1) by creating an edge between the last variable in the chain and an artificial variable and then \u201cbilling\u201d all local scores to the pairwise edge that touches them from the right. Then we restructure the constraints to sum out both sides of each edge, and add indicator variables \u00b5n(xn, \u00b7) and 0-scoring transitions for the artificial edge. This leaves the following LP (with dual variables written after their corresponding constraints).\n\nmax. \u03a3i,xi,xi+1 \u00b5i(xi, xi+1)(\u03c4i(xi, xi+1) + \u03b8i(xi))\ns.t. \u03a3xn \u00b5n(xn, \u00b7) = 1  (N)\n\u03a3xi\u22121 \u00b5i\u22121(xi\u22121, xi) = \u03a3xi+1 \u00b5i(xi, xi+1)  (\u03b1i(xi))\n\n(5)\n\nThe dual of this linear program is\n\nmin. N\ns.t. 
\u03b1i+1(xi+1) \u2212 \u03b1i(xi) \u2265 \u03c4(xi, xi+1) + \u03b8i(xi)  \u2200i, xi, xi+1\nN \u2212 \u03b1n(xn) \u2265 \u03b8n(xn)  \u2200xn\n\n(6)\n\nand setting the \u03b1 dual variables by\n\n\u03b1i+1(xi+1) = max_xi \u03b1i(xi) + \u03b8i(xi) + \u03c4(xi, xi+1)\n\n(7)\n\nand N = max_xn \u03b1n(xn) + \u03b8n(xn) is a sufficient condition for dual feasibility, and, as N will have the value of the primal solution, for optimality. Note that this equation is exactly the forward message-passing equation for max-product belief propagation in chains, i.e. the Viterbi algorithm.\nA setting of the dual variables is optimal if maximization of the problem\u2019s Lagrangian over the primal variables yields a primal-feasible setting. The coefficients on the edge variables \u00b5i(xi, xi+1) are their reduced costs,\n\n\u03b1i(xi) \u2212 \u03b1i+1(xi+1) + \u03b8i(xi) + \u03c4(xi, xi+1).\n\n(8)\n\nFor duals that obey the constraints of (6), it is clear that the maximal reduced cost is zero, when xi is set to the argmax used when constructing \u03b1i+1(xi+1). Therefore, to obtain a primal-optimal solution, we start at the end of the chain and follow the argmax indices to the beginning, which is the same backward sweep of the Viterbi algorithm.\n\n3.1 Improving the reduced cost with information from both ends of the chain\n\nColumn generation adds all variables with positive reduced cost to the restricted LP, but equation (8) leads to an inefficient algorithm because it is positive for many irrelevant edge settings. In (8), the only terms that involve xi+1 are \u03c4(xi, xi+1) and the \u03c4(x\u2032i, xi+1) term that is part of \u03b1i+1(xi+1). These are data-independent. 
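To make the correspondence with Viterbi concrete, the forward recursion (7) and the backward argmax sweep can be sketched as follows. This is a toy sketch under our own naming (not the authors' code), over dense arrays `theta` (n x k local scores) and `tau` (k x k transition scores):

```python
import numpy as np

def viterbi_via_duals(theta, tau):
    """Forward pass of eq. (7), alpha_{i+1}(x_{i+1}) = max_{x_i} alpha_i(x_i)
    + theta_i(x_i) + tau(x_i, x_{i+1}), followed by the backward argmax sweep.
    Illustrative sketch only; names are ours, not the paper's."""
    n, k = theta.shape
    alpha = np.zeros((n, k))
    back = np.zeros((n, k), dtype=int)
    for i in range(n - 1):
        # rows index x_i, columns index x_{i+1}
        scores = alpha[i][:, None] + theta[i][:, None] + tau
        back[i + 1] = scores.argmax(axis=0)
        alpha[i + 1] = scores.max(axis=0)
    # N = max_{x_n} alpha_n(x_n) + theta_n(x_n); then follow argmaxes backward
    path = [int(np.argmax(alpha[-1] + theta[-1]))]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    return path[::-1], float(np.max(alpha[-1] + theta[-1]))
```

Note that the `tau` term inside the max is the same for every input sentence; this is the data-independence just mentioned.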
Therefore, even if there is very strong local evidence against a particular setting xi+1, pairs xi, xi+1 may have positive reduced cost if the global transition factor \u03c4(xi, xi+1) places positive weight on their compatibility.\nWe can improve upon this by exploring different LP formulations than that of Wainwright and Jordan. Note that in equation (5) a local score is \u201cbilled\u201d to its rightmost edge. Instead, if we split it halfway (now using phantom edges on both sides of the chain), we would obtain slightly different message passing rules and the following reduced cost expression:\n\n\u03b1i(xi) \u2212 \u03b1i+1(xi+1) + 1/2 (\u03b8i(xi) + \u03b8i+1(xi+1)) + \u03c4(xi, xi+1).\n\n(9)\n\nThis contains local information for both xi and xi+1, though it halves the magnitude of it. In table 2 we demonstrate that this yields comparable performance to using the reduced cost of (8), which still outperforms Viterbi. An even better reduced cost expression can be obtained by duplicating the marginalization constraints; we have:\n\nmax. \u03a3i,xi,xi+1 \u00b5i(xi, xi+1)(\u03c4(xi, xi+1) + 1/2 \u03b8i(xi) + 1/2 \u03b8i+1(xi+1))\ns.t. \u03a3xn \u00b5n(xn, \u00b7) = 1  (N+)\n\u03a3x1 \u00b50(\u00b7, x1) = 1  (N\u2212)\n\u03a3xi\u22121 \u00b5i\u22121(xi\u22121, xi) = \u03a3xi+1 \u00b5i(xi, xi+1)  (\u03b1i(xi))\n\u03a3xi+1 \u00b5i(xi, xi+1) = \u03a3xi\u22121 \u00b5i\u22121(xi\u22121, xi)  (\u03b2i(xi))\n\n(10)\n\nFollowing similar logic as in the previous section, setting the dual variables according to (7) and\n\n\u03b2i\u22121(xi\u22121) = max_xi \u03b2i(xi) + \u03b8i(xi) + \u03c4(xi\u22121, xi)\n\n(11)\n\nis a sufficient condition for optimality.\nIn effect, we solve the LP of equation (10) in two independent procedures, each solving the one-directional subproblem in (6), and either one of these subroutines is sufficient to construct a primal optimal solution. This redundancy is important, though, because the resulting reduced cost\n\n2Ri(xi, xi+1) = 2\u03c4(xi, xi+1) + \u03b8i(xi) + \u03b8i+1(xi+1) + (\u03b1i(xi) \u2212 \u03b1i+1(xi+1)) + (\u03b2i+1(xi+1) \u2212 \u03b2i(xi))\n\n(12)\n\nincorporates global information from both directions in the chain. In table 2 we show that column generation with (12) is fastest, which is not obvious, given the extra overhead of computing the \u03b2 messages. This is the reduced cost that we use in the following discussion and experiments, unless explicitly stated otherwise.\n\n4 Column Generation Algorithm\n\nWe present an algorithm for exact MAP inference that in practice is usually faster than traditional message passing. Like all column generation algorithms, our technique requires components for three tasks: choosing the initial set of variables in the restricted LP, solving the restricted LP, and finding variables with positive reduced cost. 
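Before detailing each component, the overall loop can be sketched end-to-end. The following toy solver is our own illustrative code, with an exhaustive reduced-cost scan standing in for the bound-based oracle of section 4.3: singleton initialization, restricted message passing for the \u03b1/\u03b2 duals of equations (7) and (11), and domain growth driven by the reduced cost (12):

```python
import itertools
import numpy as np

def cg_map_chain(theta, tau):
    """Toy column-generation MAP solver for a chain: restricted domains grow
    until no edge setting has positive reduced cost (eq. (12)). A sketch under
    our own naming; the paper's oracle prunes candidates with precomputed
    bounds instead of the exhaustive scan used here."""
    n, k = theta.shape
    dom = [{int(np.argmax(theta[i]))} for i in range(n)]   # singleton init
    while True:
        # Duals: maximize over the restricted neighbor domains only
        alpha, beta = np.zeros((n, k)), np.zeros((n, k))
        for i in range(n - 1):
            xs = sorted(dom[i])
            alpha[i + 1] = (alpha[i][xs, None] + theta[i][xs, None]
                            + tau[xs, :]).max(axis=0)
        for i in range(n - 1, 0, -1):
            ys = sorted(dom[i])
            beta[i - 1] = (beta[i][ys] + theta[i][ys] + tau[:, ys]).max(axis=1)
        grew = False
        for i in range(n - 1):
            # 2 R_i(x, y) from eq. (12), for every edge setting (x, y)
            r2 = (2 * tau + theta[i][:, None] + theta[i + 1][None, :]
                  + alpha[i][:, None] - alpha[i + 1][None, :]
                  + beta[i + 1][None, :] - beta[i][:, None])
            for x, y in zip(*np.nonzero(r2 > 1e-9)):
                if x not in dom[i] or y not in dom[i + 1]:
                    grew = True
                dom[i].add(int(x))
                dom[i + 1].add(int(y))
        if not grew:   # no positive reduced cost: restricted optimum is exact
            break
    # Decode within the (small) final domains; the paper back-traces argmaxes.
    def score(p):
        return (sum(theta[i][p[i]] for i in range(n))
                + sum(tau[p[i]][p[i + 1]] for i in range(n - 1)))
    return max(itertools.product(*map(sorted, dom)), key=score)
```

On typical inputs the final domains stay small, which is what makes the restricted solves cheap.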
When no variable of positive reduced cost exists, the current solution to the restricted problem is optimal because we have a primal-feasible, dual-feasible pair.\nPseudocode for our algorithm is provided in Figure 1. In the following description, many concepts will be explained in terms of nodes, despite our LP being defined over edges. The edge quantities can be defined in terms of node quantities, such as the \u03b1 and \u03b2 messages, and it is more efficient to store these than the quadratically-many edge quantities.\n\n4.1 Initialization\n\nTo initialize the LP, we first define a restricted domain for each node in the graphical model consisting of only x^L_i = argmax \u03b8i(xi). Other initialization strategies, such as adding the high-scoring transitions or the k best xi, are also valid. Next, we include in the initial restricted LP all the indicator variables \u00b5i(x^L_i, x^L_{i+1}) corresponding to these size-one domains. Solving the initial restricted LP is very efficient, since all nodes have only one valid setting, and no maximization is needed when passing messages.\n\n4.2 Warm-Starting the Restricted LP\n\nUpdating all messages using the max-product rules of equations (7) and (11) is a valid way to solve the restricted LP, but it doesn\u2019t leverage the messages that were optimal for previous calls to the problem. In practice, the restricted domains of every node are not updated at every iteration, and hence many of the previous messages may still appear in a dual-optimal setting of the current restricted problem. As usual, solving the restricted LP can be decomposed into independently solving each of the one-directional LPs, and thus we update \u03b1 independently of \u03b2.\nTo construct a primal setting from either the \u03b1 or \u03b2 messages, we employ the standard technique of back-tracing the argmaxes used in their update equations. 
In some regions of the chain, we can avoid updating messages because we can guarantee that the proposed message updates would yield the same maximization and thus the same primal setting. Simple rules include, for example, avoiding updating \u03b1 to the left of the first updated domain, and avoiding updating \u03b1i(\u2217) if |Di\u22121| = 1, since maximization over |Di\u22121| is trivial. Furthermore, to the right of the last updated domain, if we compute new messages \u03b1\u2032i(\u2217) and find that the argmax at the current MAP assignment x\u2217i doesn\u2019t change, we can revert to the previous \u03b1i(\u2217) and terminate message passing. An analogous statement can be made about the \u03b2 variables.\nWhen solving the restricted LP, some constraints are trivially satisfied because they only involve variables that are implicitly set to zero, and hence the corresponding dual variables can be set arbitrarily. To prevent extraneous un-generated variables from having a high reduced cost, we choose duals by guessing values that should be feasible in the unrestricted LP, with a smaller computational cost than solving the unrestricted LP directly. We employ the same update equation used for the in-domain messages in (7) and (11), and maximize over the restricted domain of the variable\u2019s neighbor. In our experiments, over 90% of the restricted domains were of size 1, so this dependence on the size of the neighbor domain was not a computational bottleneck in practice, and still allowed the reduced-cost oracle to consider five or fewer candidate edges in each iteration in more than 86% of the calls.\n\n4.3 Reduced-Cost Oracle\n\nExhaustively searching the chain for variables of positive reduced cost by iterating over all settings of all edges would be as expensive as exact max-product message-passing. 
However, our oracle search strategy is efficient because it prunes these away using precomputed bounds on the transitions.\nFirst we decompose equation (12) as follows\n\n2Ri(xi, xi+1) = 2\u03c4(xi, xi+1) + S+i(xi) + S\u2212i(xi+1)\n\n(13)\n\nwhere S+i(xi) = \u03b8i(xi) + \u03b1i(xi) \u2212 \u03b2i(xi) and S\u2212i(xi+1) = \u03b8i+1(xi+1) \u2212 \u03b1i+1(xi+1) + \u03b2i+1(xi+1).\nIf, in practice, most settings for each edge have negative reduced cost, we can efficiently find candidate settings by first upper-bounding S+i(xi) + 2\u03c4(xi, xi+1), finding all possible values xi+1 that could yield a positive reduced cost, and then doing the reverse. Finally, we search over the much smaller set of candidates for xi and xi+1. This strategy is described in Figure 1.\nAfter the first round of column generation, if Ri(xi, xi+1) hasn\u2019t changed for every xi, xi+1, then no variables of positive reduced cost can exist, because they would have been added in the previous iteration, and we can skip the oracle. This condition can be checked while passing messages.\nLastly, a final pruning strategy is that if there are settings xi, x\u2032i such that\n\n\u03b8i(xi) + min_{xi\u22121} \u03c4(xi\u22121, xi) + min_{xi+1} \u03c4(xi, xi+1) > \u03b8i(x\u2032i) + max_{xi\u22121} \u03c4(xi\u22121, x\u2032i) + max_{xi+1} \u03c4(x\u2032i, xi+1),\n\n(14)\n\nthen we know with certainty that setting x\u2032i is suboptimal. This helps prune the oracle\u2019s search space efficiently because the above maxima and minima are data-independent offline computations. 
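Concretely, the shortlist-then-score search of Figure 1 can be sketched for a single edge as follows. This is our own illustrative code; `splus` and `sminus` are the vectors S+i(\u00b7) and S\u2212i(\u00b7) from eq. (13):

```python
import numpy as np

def reduced_cost_candidates(splus, sminus, tau):
    """Bound-based oracle sketch for one edge: upper-bound one side of
    2R = 2*tau + S+ + S-, shortlist the other side, then score only the
    surviving pairs. Names are ours; the paper's version is Figure 1."""
    u_col = tau.max(axis=0)          # U_tau(., x_j): best incoming transition
    u_row = tau.max(axis=1)          # U_tau(x_i, .): best outgoing transition
    u_plus = splus.max()             # U_i = max_{x_i} S+(x_i)
    cand_j = np.nonzero(sminus + u_plus + 2 * u_col > 0)[0]
    if cand_j.size == 0:
        return []                    # no setting can have positive reduced cost
    u_minus = sminus[cand_j].max()   # U'_i, taken over the shortlist only
    cand_i = np.nonzero(splus + u_minus + 2 * u_row > 0)[0]
    # exact reduced-cost check on the small candidate product
    return [(int(x), int(y)) for x in cand_i for y in cand_j
            if 2 * tau[x, y] + splus[x] + sminus[y] > 0]
```

Because the bounds only discard pairs whose reduced cost is provably non-positive, the returned set matches an exhaustive scan while typically scoring far fewer pairs.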
We can do so by first linearly searching through the labels for a node for the one with highest local score and then using precomputed bounds on the transition scores to linearly discard states whose upper bound on the score is smaller than the lower bound of the best state.\n\nAlgorithm: CG-Infer\nbegin\n  for i = 1 \u2192 n do\n    Di = {argmax \u03b8i(xi)}\n  end\n  while domains haven\u2019t converged do\n    (\u03b1, \u03b2) \u2190 GetMessages(D, \u03b8)\n    for i = 1 \u2192 n do\n      D\u2217i, D\u2217i+1 \u2190 ReducedCostOracle(i)\n      Di \u2190 Di \u222a D\u2217i\n      Di+1 \u2190 Di+1 \u222a D\u2217i+1\n    end\n  end\nend\n\nAlgorithm: ReducedCostOracle(i)\nbegin\n  U\u03c4(\u00b7, xj) \u2190 max_xi \u03c4(xi, xj)\n  U\u03c4(xi, \u00b7) \u2190 max_xj \u03c4(xi, xj)\n  Ui \u2190 max_xi S+i(xi)\n  C\u2032i \u2190 {xj | S\u2212i(xj) + Ui + 2U\u03c4(\u00b7, xj) > 0}\n  U\u2032i \u2190 max_{xj \u2208 C\u2032i} S\u2212i(xj)\n  Ci \u2190 {xi | S+i(xi) + U\u2032i + 2U\u03c4(xi, \u00b7) > 0}\n  D \u00d7 D\u2032 \u2190 {xi, xj \u2208 Ci, C\u2032i | R(xi, xj) > 0}\n  return D, D\u2032\nend\n\nFigure 1: Column Generation Algorithm and Pruning Strategy for Reduced Cost Oracle\n\n5 Extensions of the Algorithm\n\nThe column generation algorithm is fairly general, and can be easily extended to allow for many interesting use cases. In section 7 we provide experiments supporting the usefulness of these extensions, and they are described in more detail in appendix A.\nFirst of all, our algorithm generalizes easily to MAP inference in trees by using a similar structure but a different reduced cost expression that considers messages flowing in both directions across each edge (appendix A.1). The reduced cost oracle can also be used to compute the duality gap of an approximate solution. 
This allows early stopping of our algorithm if the gap is small and also provides analysis of the sub-optimality of the output of beam search (appendix A.2). Furthermore, margin violation queries when doing structured SVM training with a 0/1 loss can be done efficiently using a small modification of our algorithm, in which we also add variables of small negative reduced cost and do 2-best inference within the restricted domains (appendix A.3). Lastly, regularizing the transition weights more strongly allows one to train models that will decode more quickly (appendix A.4). Most standard inference algorithms, such as Viterbi, do not have this behavior where the inference time is affected by the actual model scores. By coupling inference and learning, practitioners have more freedom to trade off test-time speed vs. accuracy.\n\n6 Related Work\n\nColumn generation has been employed as a way of dramatically speeding up MAP inference problems in Riedel et al. [10], which applies it directly to the LP relaxation for dependency parsing with grandparent edges.\nThere has been substantial prior work on improving the speed of max-product inference in chains by pruning the search process. CarpeDiem [11] relies on an expression similar to the oriented, left-to-right reduced cost equation of (8), also with a similar pruning strategy to the one described in section 4.3. Following up, Kaji et al. [12] presented a staggered decoding strategy that similarly attempts to bound the best achievable score using uninstantiated domains, but only used local scores when searching for new candidates. The dual variables obtained in earlier runs were then used to warm-start the inference in later runs, similarly to what is done in section 4.2. Their techniques obtained similar speed-ups as ours over Viterbi inference. 
However, their algorithms do not pro-\nvide extensions to inference in trees, a margin-violation oracle, and approximate inference using\na duality gap. Furthermore, Kaji et al. use data-dependent transition scores. This may improve\nour performance as well, if the transition scores are more sharply peaked. Similarly, Raphael [13]\nalso presents a staggered decoding strategy, but does so in a way that applies to many dynamic\nprogramming algorithms.\nThe strategy of preprocessing data-independent factors to speed up max-product has been previously\nexplored by McAuley and Caetano [14], who showed that if the transition weights are large, savings\ncan be obtained by sorting them of\ufb02ine. Our contributions, on the other hand, are more effective\n\n6\n\n\fwhen the transitions are small. The same authors have also explored strategies to reduce the worst-\ncase complexity of message-passing by exploiting faster matrix multiplication algorithms [15].\nAlternative methods of leveraging the interplay be-\ntween fast dynamic programming algorithms and\nhigher-level LP techniques have been explored else-\nwhere. For example, in dual decomposition [16], in-\nference in joint models is reduced to repeated infer-\nence in independent models. Tree block-coordinate\ndescent performs approximate inference in loopy\nmodels using exact inference in trees as a subrou-\ntine [1]. Column generation is cutting planes in the\ndual, and cutting planes have been used successfully\nin various machine learning contexts. See, for exam-\nple, Sontag et al [17] and Riedel et al [18].\nThere is a mapping between dynamic programs and\nshortest path problems [19]. Our reduced cost is an\nestimate of the desirability of an edge setting, and\nthus our algorithm is heuristic search in the space of\nedge settings. 
With dual feasibility, this heuristic is consistent, and thus our algorithm is iteratively constructing a heuristic such that it can perform A\u2217 search for the final restricted LP [20].\n\nFigure 2: Training-time manipulation of accuracy vs. test throughput for our algorithm\n\n7 Experiments\n\nWe compare the performance of column generation with exact and approximate inference on Wall Street Journal [21] part-of-speech (POS) tagging and joint POS tagging and named-entity-recognition (POS/NER). The output variable domain size is 45 for POS and 360 for POS/NER. The test set contains 5463 sentences. The POS model was trained with a 0/1 loss structured SVM and the POS/NER model was trained using SampleRank [22].\nTable 1 compares the inference times and accuracies of column generation (CG), Viterbi, Viterbi with the final pruning technique described in section 4.3 (Viterbi+P), CG with duality gap termination condition 0.15% (CG+DG), and beam search. For POS, CG is more than twice as fast as Viterbi, with speed comparable to a beam of size 3. Whereas CG is exact, Beam-3 loses 1.6% accuracy. Exact inference in the model obtains a tagging accuracy of 95.3%.\nFor joint POS and NER tagging, the speedups are even more dramatic. We observe a 13x speedup over Viterbi and are comparable in speed with a beam of size 7, while being exact. As in POS, CG+DG provides a mild speedup.\nOver 90% of tokens in the POS task had a domain of size one, and over 99% had a domain of size 3 or smaller. Column generation always finished in at most three iterations, and 22% of the time it terminated after one. 86% of the time, the reduced-cost oracle iterated over at most 5 candidate edge settings, which is a significant reduction from the worst-case behavior of 45\u00b2. 
The pruning strategy\nin Viterbi+P manages to restrict the number of possible labels for each token to at most 5 for over\n65% of the tokens, and prunes the size of each domain by half over 95% of the time.\nTable 2.A presents results for a 0/1 loss oracle described in section 5. Baselines are a standard Viterbi\n2-best search1 and Viterbi 2-best with the pruning technique of 4.3 (Viterbi+P). CG outperforms\nViterbi 2-best on both POS and POS/NER. Though Viterbi+P presents an effective speedup, we\nare still 19x faster on POS/NER. In terms of absolute throughput, POS/NER is faster than POS\nbecause the POS/NER model wasn\u2019t trained with a regularized structured SVM, and thus there are\nfewer margin violations. Our 0/1 oracle is quite ef\ufb01cient when determining that there isn\u2019t a margin\nviolation, but requires extra work when required to actually produce the 2-best setting.\nTable 2.B shows column generation with two other reduced-cost formulations on the same POS\ntagging task. CG-\u03b1 uses the reduced-cost from equation (8) while CG-\u03b1+\u03b8i+1 uses the reduced-\ncost from equation (9). 
The full CG is clearly beneficial, despite requiring computation of \u03b2.\n\n1 Implemented by replacing all maximizations in the Viterbi code with two-best maximizations.\n\nPOS tagging:\nAlgorithm  % Exact  Sent./sec.\nViterbi    100      3144.6\nViterbi+P  100      4515.3\nCG         100      8227.6\nCG+DG      98.9     9355.6\nBeam-1     57.7     12117.6\nBeam-2     92.6     7519.3\nBeam-3     98.4     6802.5\nBeam-4     99.5     5731.2\n\nJoint POS/NER:\nAlgorithm  % Exact  Sent./sec.\nViterbi    100      56.9\nViterbi+P  100      498.9\nCG         100      779.9\nCG+DG      98.4     804\nBeam-1     66.6     3717.0\nBeam-5     98.5     994.97\nBeam-7     99.2     772.8\nBeam-10    99.5     575.1\n\nTable 1: Comparing inference time and exactness of Column Generation (CG), Viterbi, Viterbi with the final pruning technique of section 4.3 (Viterbi+P), CG with duality gap termination condition 0.15% (CG+DG), and beam search on POS tagging (left) and joint POS/NER (right).\n\n(A)\nMethod            POS Sent./sec.  POS/NER Sent./sec.\nCG                85.0            299.9\nViterbi 2-best    56.0            0.06\nViterbi+P 2-best  119.6           11.7\n\n(B)\nReduced Cost    POS Sent./sec.\nCG              8227.6\nCG-\u03b1           5125.8\nCG-\u03b1+\u03b8i+1    4532.1\n\nTable 2: (A) speedups for a 0/1 loss oracle; (B) comparing reduced cost formulations.\n\nIn Figure 2, we explore the ability to manipulate training-time regularization to trade off test accuracy and test speed, as discussed in section 5. We train a structured SVM with L2 regularization (coefficient 0.1) on the emission weights, and vary the L2 coefficient on the transition weights from 0.1 to 10. A 4x gain in speed can be obtained at the expense of an 8% relative decrease in accuracy.\n\n8 Conclusions and future work\n\nIn this paper we presented an efficient family of algorithms based on column generation for MAP inference in chains and trees. This algorithm exploits the fact that inference can often rule out many possible values, and we can efficiently expand the set of values on the fly. 
Depending on the parameter settings, it can be twice as fast as Viterbi on WSJ POS tagging and 13x faster on a joint POS/NER task.

One avenue of further work is to extend the bounding strategies in this algorithm to inference in cluster graphs or junction trees, allowing faster inference in higher-order chains or even loopy graphical models. The connection between inference and learning shown in section 5 also bears further study, since it would be helpful to have more prescriptive advice on regularization strategies for achieving desired accuracy/time tradeoffs.

Acknowledgments

This work was supported in part by the Center for Intelligent Information Retrieval. The University of Massachusetts gratefully acknowledges the support of the Defense Advanced Research Projects Agency (DARPA) Machine Reading Program under Air Force Research Laboratory (AFRL) prime contract no. FA8750-09-C-0181, in part by IARPA via DoI/NBC contract #D11PC20152, in part by Army prime contract number W911NF-07-1-0216 and University of Pennsylvania subaward number 103-548106, and in part by UPenn NSF medium IIS-0803847. Any opinions, findings and conclusions or recommendations expressed in this material are the authors' and do not necessarily reflect those of the sponsors. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.
", "award": [], "sourceid": 914, "authors": [{"given_name": "David", "family_name": "Belanger", "institution": null}, {"given_name": "Alexandre", "family_name": "Passos", "institution": null}, {"given_name": "Sebastian", "family_name": "Riedel", "institution": null}, {"given_name": "Andrew", "family_name": "McCallum", "institution": null}]}