{"title": "Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 2044, "page_last": 2052, "abstract": "Large, relational factor graphs with structure defined by first-order logic or other languages give rise to notoriously difficult inference problems. Because unrolling the structure necessary to represent distributions over all hypotheses has exponential blow-up, solutions are often derived from MCMC. However, because of limitations in the design and parameterization of the jump function, these sampling-based methods suffer from local minima\u2014the system must transition through lower-scoring configurations before arriving at a better MAP solution. This paper presents a new method of explicitly selecting fruitful downward jumps by leveraging reinforcement learning (RL). Rather than setting parameters to maximize the likelihood of the training data, parameters of the factor graph are treated as a log-linear function approximator and learned with temporal difference (TD); MAP inference is performed by executing the resulting policy on held out test data. Our method allows efficient gradient updates since only factors in the neighborhood of variables affected by an action need to be computed\u2014we bypass the need to compute marginals entirely. 
Our method provides dramatic empirical success, producing new state-of-the-art results on a complex joint model of ontology alignment, with a 48\\% reduction in error over state-of-the-art in that domain.", "full_text": "Training Factor Graphs with Reinforcement\n\nLearning for Ef\ufb01cient MAP Inference\n\nMichael Wick, Khashayar Rohanimanesh, Sameer Singh, Andrew McCallum\n\nDepartment of Computer Science\n\nUniversity of Massachusetts Amherst\n\n{mwick,khash,sameer,mccallum}@cs.umass.edu\n\nAmherst, MA 01003\n\nAbstract\n\nLarge, relational factor graphs with structure de\ufb01ned by \ufb01rst-order logic or other\nlanguages give rise to notoriously dif\ufb01cult inference problems. Because unrolling\nthe structure necessary to represent distributions over all hypotheses has exponen-\ntial blow-up, solutions are often derived from MCMC. However, because of lim-\nitations in the design and parameterization of the jump function, these sampling-\nbased methods suffer from local minima\u2014the system must transition through\nlower-scoring con\ufb01gurations before arriving at a better MAP solution. This pa-\nper presents a new method of explicitly selecting fruitful downward jumps by\nleveraging reinforcement learning (RL). Rather than setting parameters to maxi-\nmize the likelihood of the training data, parameters of the factor graph are treated\nas a log-linear function approximator and learned with methods of temporal dif-\nference (TD); MAP inference is performed by executing the resulting policy on\nheld out test data. Our method allows ef\ufb01cient gradient updates since only factors\nin the neighborhood of variables affected by an action need to be computed\u2014we\nbypass the need to compute marginals entirely. 
Our method yields dramatic empirical success, producing new state-of-the-art results on a complex joint model of ontology alignment, with a 48% reduction in error over the previous state of the art in that domain.\n\n1 Introduction\n\nFactor graphs are a widely used representation for modeling complex dependencies amongst hidden variables in structured prediction problems. There are two common inference problems: learning (setting model parameters) and decoding (maximum a posteriori (MAP) inference). MAP inference is the problem of \ufb01nding the most probable setting of the graph\u2019s hidden variables conditioned on some observed variables.\nFor certain types of graphs, such as chains and trees, exact inference and learning are polynomial-time [1, 2, 3]. Unfortunately, many interesting problems require more complicated structure, rendering exact inference intractable [4, 5, 6, 7]. In such cases we must rely on approximate techniques; in particular, stochastic methods such as Markov chain Monte Carlo (e.g., Metropolis-Hastings) have been applied to problems such as MAP inference in these graphs [8, 9, 10, 11, 6]. However, for many real-world structured prediction tasks, MCMC (and other local stochastic methods) are likely to struggle as they transition through lower-scoring regions of the con\ufb01guration space.\nFor example, consider the structured prediction task of clustering, where the MAP inference problem is to group data points into equivalence classes according to some model. Assume for a moment that\n\nFigure 1: The \ufb01gure on the left shows the sequence of states along an optimal path beginning at a single-cluster con\ufb01guration and ending at the MAP con\ufb01guration (F1 scores for each state are shown). 
The \ufb01gure on the right plots the F1 scores along the optimal path to the goal for the case where the MAP clustering has forty instances (twenty per cluster) instead of \ufb01ve.\n\nthis model is perfect and exactly re\ufb02ects the pairwise F1 score. Even in these ideal conditions MCMC must make many downhill jumps to reach the MAP con\ufb01guration. For example, Figure 1 shows the F1 scores of each state along the optimal path to the MAP clustering (assuming each MCMC jump can reposition one data point at a time). We can see that several consecutive downhill transitions must be realized before model scores begin to improve.\nThe above discussion, with its emphasis on the delayed-feedback nature of the MAP inference problem, immediately suggests employing reinforcement learning (RL) [12]. RL is a framework for solving sequential decision making problems with delayed reward. This problem has been studied extensively in many areas of machine learning, planning, and robotics. Our approach is to directly learn the parameters of the log-linear factor graph with reinforcement learning during a training phase; MAP inference is performed by executing the resulting policy. Because we design the reward structure to assign the most mass to the goal con\ufb01guration, the parameters of the model can also be interpreted as a regularized version of maximum likelihood that is smoothed over neighboring states in the proposal manifold.\nThe rest of this paper is organized as follows: in \u00a72 we brie\ufb02y review background material. In \u00a73 we describe the details of our algorithm and discuss a number of ideas for coping with the combinatorial complexity of both state and action spaces. 
In \u00a74.3 we present our empirical results, and \ufb01nally in \u00a76 we conclude and lay out a number of ideas for future work.\n\n2 Preliminaries\n\n2.1 Factor Graphs\n\nA factor graph is an undirected bipartite graphical representation of a probability distribution with random variables and factors as nodes. Let X be a set of observed variables and Y be a set of hidden variables. The factor graph expresses the conditional probability of Y = y given X = x discriminatively:\n\nP (y|x) = (1/Z_X) \u220f_{\u03c8_i \u2208 \u03a8} \u03c8_i(x, y^i) = (1/Z_X) exp( \u2211_k \u03b8_k \u03c6_k(x, y^k) ) (1)\n\nwhere Z_X is an input-dependent normalizing constant ensuring that the distribution sums to one, \u03a8 is the set of factors, and \u03c8_i(x, y^i) are factors over the observed variables x and a set of hidden variables y^i that are the neighbors of the factor (we use a superscript to denote a set). Factors are log-linear combinations of features \u03c6(x, y^i) and parameters \u03b8 = {\u03b8_j}. The problem of learning is to \ufb01nd a setting of the parameters \u03b8 that explains the data. For example, maximum likelihood sets the parameters so that the model\u2019s feature expectations match the data\u2019s expectations.\n\n[Figure 1, left: per-state scores P=.44/R=1.0/F1=.61; P=.34/R=.80/F1=.48; P=.48/R=.70/F1=.57; P=1.0/R=1.0/F1=1.00]\n\n2.2 Reinforcement Learning\n\nMost of the discussion here is based on [12]. Reinforcement learning (RL) refers to a class of problems in which an agent interacts with the environment and the objective is to learn a course of actions that optimizes a long-term measure of a delayed reward signal. The most popular realization of RL has been in the context of Markov decision processes (MDPs).\nAn MDP is the tuple M = \u27e8S, A, R, P\u27e9, where S is the set of states, A is the set of actions, R : S \u00d7 A \u00d7 S \u2192 IR is the reward function, i.e. 
R(s, a, s\u2032) is the expected reward when action a is taken in state s and the system transitions to state s\u2032, and P : S \u00d7 A \u00d7 S \u2192 [0, 1] is the transition probability function, i.e. P^a(s, s\u2032) is the probability of reaching state s\u2032 if action a is taken in state s.\nA stochastic policy \u03c0 is de\ufb01ned as \u03c0 : S \u00d7 A \u2192 [0, 1] such that \u2211_a \u03c0(a|s) = 1, where \u03c0(s, a) is the probability of choosing action a (as the next action) when in state s. Following a policy on an MDP results in an expected discounted reward R^\u03c0_t accumulated over the course of the run, where R^\u03c0_t = \u2211_{k=0}^{T} \u03b3^k r_{t+k+1}. An optimal policy \u03c0\u22c6 is a policy that maximizes this reward.\nGiven a Q-function (Q : S \u00d7 A \u2192 IR) that represents the expected discounted reward for taking action a in state s, the optimal policy \u03c0\u22c6 can be found by locally maximizing Q at each step. Methods of temporal difference (TD) [13] can be used to learn the optimal policy in MDPs, and even have convergence guarantees when the Q-function is in tabular form. However, in practice, tabular representations do not scale to large or continuous domains; a problem that function approximation techniques address [12]. Although the convergence properties of these approaches have not yet been established, the methods have been applied successfully to many problems [14, 15, 16, 17].\nWhen linear function approximation is used, the state-action pair \u27e8s, a\u27e9 is represented by a feature vector \u03c6(s, a) and the Q value is represented using a vector of parameters \u03b8, i.e.\n\nQ(s, a) = \u2211_{\u03c6_k \u2208 \u03c6(s,a)} \u03b8_k \u03c6_k (2)\n\nInstead of updating the Q values directly, the updates are made to the parameters \u03b8:\n\n\u03b8 \u2190 \u03b8 + \u03b1 ( r_{t+1} \u2212 Q(s_t, a_t) + \u03b3 max_a Q(s_{t+1}, a) ) \u03c6(s_t, a_t) (3)\n\nNotice the similarity between the linear function approximator (Equation 2) and the log-linear factors (right-hand side of Equation 1); namely, the approximator has the same form as the unnormalized log probabilities of the distribution. This enables us to share the parameters \u03b8 from Equation 1.\n\n3 Our Approach\n\nIn our RL treatment of learning factor graphs, each state in the system represents a complete assignment to the hidden variables Y = y. Given a particular state, an action modi\ufb01es the setting of a subset of the hidden variables; therefore, an action can also be de\ufb01ned as a setting of all the hidden variables Y = y\u2032. However, in order to cope with the complexity of the action space, we introduce a proposer (as in Metropolis-Hastings) B : Y \u2192 Y that constrains the space by limiting the number of possible actions from each state. The reward function R can be de\ufb01ned as the residual performance improvement when the system transitions from a current state y to a neighboring state y\u2032 on the manifold induced by B. In our approach, we use a performance measure based on the ground truth labels (for example, F1, accuracy, or normalized mutual information) as the reward. 
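As a concrete illustration, a residual reward built from pairwise F1 against a gold clustering might be sketched as follows. This is a minimal sketch under our own assumptions (a state maps each point to a cluster id; all helper names are illustrative, not from the paper's implementation):

```python
from itertools import combinations

# Hedged sketch of the residual reward R(s, s') = F(s') - F(s), with
# pairwise F1 against a gold clustering as the performance metric F.

def same_cluster_pairs(state):
    # unordered pairs of points assigned to the same cluster
    return {frozenset(p) for p in combinations(sorted(state), 2)
            if state[p[0]] == state[p[1]]}

def pairwise_f1(state, gold):
    pred = same_cluster_pairs(state)
    true = same_cluster_pairs(gold)
    if not pred or not true:
        return 0.0
    precision = len(pred & true) / len(pred)
    recall = len(pred & true) / len(true)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def reward(state, next_state, gold):
    # positive when the jump improves F1, negative when it hurts it
    return pairwise_f1(next_state, gold) - pairwise_f1(state, gold)
```

Moving a point into its gold cluster yields a positive reward; moving it out yields a negative one, regardless of what the model score does along the way.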
These rewards ensure that the ground truth con\ufb01guration is the goal.\n\n3.1 Model\n\nRecall that an MDP is de\ufb01ned as M = \u27e8S, A, R, P\u27e9 with a set of states S, a set of actions A, a reward function R and a transition probability function P; we can now reformulate MAP inference and learning in factor graphs as follows:\n\u2022 States: we require the state space to encompass the entire feasible region of the factor graph. Therefore, a natural de\ufb01nition for a state is a complete assignment to the hidden variables Y = y, and the state space itself is de\ufb01ned as the set S = {y | y \u2208 DOM(Y )}, where DOM(Y ) is the domain of Y ; we omit the \ufb01xed observables x for clarity since only y is required to uniquely identify a state. Note that unless the hidden variables are highly constrained, the feasible region will be combinatorial in |Y |; we discuss how to cope with this in the following sections.\n\u2022 Actions: Given a state s (e.g., an assignment to the Y variables), an action may be de\ufb01ned as a constrained set of modi\ufb01cations to a subset of the hidden variable assignments. We constrain the action space to a manageable size by using a proposer, or a behavior policy from which actions are sampled. A proposer de\ufb01nes the set of reachable states by describing the distribution over neighboring states s\u2032 given a state s. In the context of the action space of an MDP, the proposer can be viewed in two ways. First, each possible neighbor state s\u2032 can be considered the result of an action a, leading to a large number of deterministic actions. Second, it can be regarded as a single highly stochastic action, whose next state s\u2032 is a sample from the distribution given by the proposer. 
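For the clustering task used as a running example, such a single-variable proposer and its deterministic transition function might be sketched as follows (our own illustrative naming, not the paper's code):

```python
import random

# Illustrative single-variable proposer B for a clustering state: a state
# maps each point to a cluster id, and one proposal moves a single point
# to a different (possibly brand-new) cluster.

def propose(state, rng=random):
    point = rng.choice(sorted(state))
    current = state[point]
    candidates = sorted(set(state.values()) - {current})
    candidates.append(max(state.values()) + 1)  # allow opening a new cluster
    return (point, rng.choice(candidates))

def simulate(state, action):
    # Deterministic transition: P^a(s, s') = 1 exactly when s' is this state
    point, target = action
    next_state = dict(state)
    next_state[point] = target
    return next_state
```

Treating each sampled (point, target) pair as its own deterministic action gives the first of the two readings discussed here; sampling from propose directly gives the second.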
Both of these views are equivalent; the former view is used for notational simplicity.\n\u2022 Reward Function: The reward function is designed so that the policy learned through delayed reward reaches the MAP con\ufb01guration. Rewards are shaped to facilitate ef\ufb01cient learning in this combinatorial space. Let F be some performance metric (for example, for information extraction tasks, it could be the F1 score based on the ground truth labels). The reward function used is the residual improvement in the performance metric F when the system transitions between states s and s\u2032:\n\nR(s, s\u2032) = F(s\u2032) \u2212 F(s) (4)\n\nThis reward can be viewed as learning to minimize the geodesic distance between a current state and the MAP con\ufb01guration on the proposal manifold. Alternatively, we could de\ufb01ne a Euclidean reward as F(s\u22c6) \u2212 F(s\u2032), where s\u22c6 is the ground truth. We choose an F such that the ground truth scores the highest, that is, s\u22c6 = arg max_s F(s).\n\u2022 Transition Probability Function: Recall that the actions in our system are samples generated from a proposer B, and that each action uniquely identi\ufb01es a next state in the system. The function that returns this next state deterministically is called simulate(s, a). Thus, given the state s and the action a, the next state s\u2032 has probability P^a(s, s\u2032) = 1 if s\u2032 = simulate(s, a), and 0 otherwise.\n\n3.2 Ef\ufb01cient Q Value Computations\n\nWe use linear function approximation to obtain Q values over the state/action space. That is, Q(s, a) = \u03b8 \u00b7 \u03c6(s, a), where \u03c6(s, a) are features over the state-action pair s, a. 
We show below how Q values can be derived from the factor graph (Equation 1) in a manner that enables ef\ufb01cient computation.\nAs mentioned previously, a state is an assignment to the hidden variables Y = y and an action is another assignment to the hidden variables Y = y\u2032 (that results from changing the values of a subset of the variables \u2206Y \u2286 Y ). Let \u03b4y be the setting of those variables in y and \u03b4y\u2032 be the new setting of those variables in y\u2032. For each assignment, the factor graph can compute the conditional probability p(y | x). The residual log-probability resulting from taking action a in state y and reaching y\u2032 is therefore log(p(y\u2032 | x)) \u2212 log(p(y | x)). Plugging in the model from Equation 1 and performing some algebraic manipulation so that redundant factors cancel yields:\n\n\u03b8 \u00b7 ( \u2211_{y\u2032^i \u2208 \u03b4y\u2032} \u03c6(x, y\u2032^i) \u2212 \u2211_{y^i \u2208 \u03b4y} \u03c6(x, y^i) ) (5)\n\nwhere the partition function Z_X and factors outside the neighborhood of \u2206y cancel. In practice an action will modify only a small subset of the variables, so this computation is extremely ef\ufb01cient. We are now justi\ufb01ed in using Equation 5 (derived from the model) to compute the inner product \u03b8 \u00b7 \u03c6(s, a) from Equation 2.\n\n3.3 Algorithm\n\nNow that we have de\ufb01ned MAP inference in a factor graph as an MDP, we can apply a wide variety of RL algorithms to learn the model\u2019s parameters. In particular, we build upon Watkins\u2019s Q(\u03bb) [18, 19], a temporal difference learning algorithm [13]; we augment it with function approximation as described in the previous section. 
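The core numerical step this combination relies on is the TD update of Equation 3 applied to the linear Q-function of Equation 2. A hedged sketch, with feature vectors as sparse dicts as Equation 5 suggests (only features of factors touching the changed variables are present; all names here are our own):

```python
# Sketch of the linear Q-function (Equation 2) and the TD parameter
# update (Equation 3). Feature vectors are sparse dicts index -> value.

def q_value(theta, phi):
    # Equation 2: Q(s, a) = sum of theta_k * phi_k over active features
    return sum(theta.get(k, 0.0) * v for k, v in phi.items())

def td_update(theta, phi, reward, best_next_q, alpha=0.1, gamma=0.9):
    # Equation 3: theta <- theta + alpha * delta * phi(s_t, a_t)
    delta = reward - q_value(theta, phi) + gamma * best_next_q
    for k, v in phi.items():
        theta[k] = theta.get(k, 0.0) + alpha * delta * v
    return theta
```

Because phi contains only the features that differ between s and s', each update costs time proportional to the size of the affected factor neighborhood, not the whole graph.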
Our RL learning method for factor graphs is shown in Algorithm 1.\n\nAlgorithm 1 Modi\ufb01ed Watkins\u2019s Q(\u03bb) for Factor Graphs\n1: Input: performance metric F, proposer B\n2: Initialize \u03b8 and eligibility traces e = 0\n3: repeat {for every episode}\n4: s \u2190 random initial con\ufb01guration\n5: Sample n actions a \u2190 B(s); collect the action samples in A_B(s)\n6: for samples a \u2208 A_B(s) do\n7: s\u2032 \u2190 simulate(s, a)\n8: \u03c6(s, s\u2032) \u2190 set of features between s, s\u2032\n9: Q(s, a) \u2190 \u03b8 \u00b7 \u03c6(s, s\u2032) {Equation 5}\n10: end for\n11: repeat {for every step of the episode}\n12: if with probability (1 \u2212 \u03b5) then\n13: a \u2190 arg max_{a\u2032} Q(s, a\u2032)\n14: e \u2190 \u03b3\u03bb e {accumulate eligibility traces}\n15: else\n16: Sample a random action a \u2190 B(s)\n17: e \u2190 0\n18: end if\n19: s\u2032 \u2190 simulate(s, a)\n20: \u2200\u03c6_i \u2208 \u03c6(s, s\u2032) : e(i) \u2190 e(i) + \u03c6_i\n21: Observe reward r = F(s\u2032) \u2212 F(s) {Equation 4}\n22: \u03b4 \u2190 r \u2212 Q(s, a)\n23: Sample n actions a \u2190 B(s\u2032); collect the action samples in A_B(s\u2032)\n24: for samples a \u2208 A_B(s\u2032) do\n25: s\u2032\u2032 \u2190 simulate(s\u2032, a)\n26: \u03c6(s\u2032, s\u2032\u2032) \u2190 set of features between s\u2032, s\u2032\u2032\n27: Q(s\u2032, a) \u2190 \u03b8 \u00b7 \u03c6(s\u2032, s\u2032\u2032)\n28: end for\n29: a \u2190 arg max_{a\u2032} Q(s\u2032, a\u2032)\n30: \u03b4 \u2190 \u03b4 + \u03b3Q(s\u2032, a)\n31: \u03b8 \u2190 \u03b8 + \u03b1\u03b4e {Equation 3 with eligibility traces}\n32: s \u2190 s\u2032\n33: until end of episode\n34: until end of training\n\nAt the beginning of each episode, the factor graph is initialized to a random initial state s (by assigning Y = y_0). Then, during each step of the episode, the maximum action is obtained by repeatedly sampling from the proposal distribution (s\u2032 = simulate(s, a)). The system transitions to the greedy state s\u2032 with high probability (1 \u2212 \u03b5), or transitions to a random state instead. We also include eligibility traces that have been modi\ufb01ed to handle function approximation [12].\nOnce learning has completed on a training set, MAP inference can be evaluated on test data by executing the resulting policy. Because Q-values encode both the reward and value together, policy execution can be performed by choosing the action that maximizes the Q-function at each state.\n\n4 Experiments\n\nWe evaluate our approach by training a factor graph to solve the ontology alignment problem. Ontology alignment is the problem of mapping concepts from one ontology to semantically equivalent concepts from another ontology; our treatment of the problem involves learning a \ufb01rst-order probabilistic model that clusters concepts into semantically equivalent sets. For our experiments, we use the dataset provided by the Illinois Semantic Integration Archive (ISIA)1. There are two ontology mappings: one between two course catalog hierarchies, and another between two company pro\ufb01le hierarchies. Each ontology is organized as a taxonomy tree. The course catalog contains 104 concepts and 4360 data records, while the company pro\ufb01le domain contains 219 concepts and 23139 records. For our experiments we perform two-fold cross validation with even splits.\nThe conditional random \ufb01eld we use to model the problem factors into binary decisions over sets of concepts, where the binary variable is one if all concepts in the set map to each other, and zero otherwise. 
Each of these hidden variables neighbors a factor that also examines the observed concept data. Since there are variables and factors for each hypothetical cluster, the size of the CRF is combinatorial in the number of concepts in the ontology, and it cannot be fully instantiated even for small amounts of data. Therefore, we believe this is a good dataset for demonstrating the scalability of the approach.\n\n4.1 Features\n\nThe features used to represent the ontology alignment problem are described here. We choose to encode our features in \ufb01rst-order logic, aggregating and quantifying pairwise comparisons of concepts over entire sets. These features are described in more detail in our technical report [17]. The pairwise feature extractors are the following:\n\u2022 TFIDF cosine similarity between the concept names of c_i and c_j\n\u2022 TFIDF cosine similarity between the data records that instantiate c_i and c_j\n\u2022 TFIDF similarity of the children of c_i and c_j\n\u2022 Lexical features for each string in the concept name\n\u2022 True if there is a substring overlap between c_i and c_j\n\u2022 True if both concepts are at the same level in the tree\nThe above pairwise features are used as a basis for features over entire sets with the following \ufb01rst-order quanti\ufb01ers and aggregators:\n\u2022 \u2200: universal \ufb01rst-order logic quanti\ufb01er\n\u2022 \u2203: existential quanti\ufb01er\n\u2022 Average: conditional mean over a cluster\n\u2022 Max: maximum value obtained for a cluster\n\u2022 Min: minimum value obtained for a cluster\n\u2022 Bias: conditional bias, counts the number of pairs where a pairwise feature could potentially \ufb01re.\nThe real-valued aggregators (min, max, average) are also quantized into bins of various sizes corresponding to the number of bins = {2, 4, 20, 100}. 
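A minimal sketch of how such aggregators might lift a real-valued pairwise feature to cluster-level features (our own illustrative encoding, with an assumed pairwise function and threshold, not the paper's implementation):

```python
from itertools import combinations

# Illustrative lifting of a pairwise feature to cluster-level features
# via the aggregators listed above (forall, exists, average, max, min,
# bias), plus equal-width quantization for the real-valued ones.

def lift(pairwise, cluster, threshold=0.5):
    vals = [pairwise(a, b) for a, b in combinations(sorted(cluster), 2)]
    if not vals:
        return {}
    return {
        'forall': float(all(v > threshold for v in vals)),
        'exists': float(any(v > threshold for v in vals)),
        'average': sum(vals) / len(vals),
        'max': max(vals),
        'min': min(vals),
        'bias': float(len(vals)),  # number of pairs that could fire
    }

def quantize(value, bins):
    # place a value in [0, 1] into one of the given equal-width buckets
    return min(int(value * bins), bins - 1)
```

Computing these on demand for just the clusters a proposal touches is what keeps the otherwise combinatorial model tractable.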
Note that our \ufb01rst-order features must be computed on-the-\ufb02y since the model is too large to be grounded in advance.\n\nTable 1: pairwise-matching precision, recall and F1 on the course catalog and company pro\ufb01le datasets\n\nSystem | Course Catalog P / R / F1 | Company Pro\ufb01le P / R / F1\nRL | 92.6 / 96.1 / 94.3 | 84.5 / 84.5 / 84.5\nMH-CD1 | 57.0 / 78.0 / 76.9 | 64.7 / 64.7 / 64.7\nMH-SR | 76.3 / 88.9 / 92.0 | 75.9 / 88.0 / 81.5\nGA-PW | 81.5 / 100 / 89.9 | 75.9 / 88.0 / 81.5\nGLUE | 80 / 80 / 80 | 80 / 80 / 80\n\n4.2 Systems\n\nIn this section we evaluate the performance of our reinforcement learning approach to MAP inference and compare it to current stochastic and greedy alternatives. In particular, we compare against piecewise training [20], contrastive divergence [21], and SampleRank [22, 11, 23]; these are described in more detail below.\n\n1http://pages.cs.wisc.edu/~anhai/wisc-si-archive/\n\n\u2022 Piecewise (GA-PW): the CRF parameters are learned by training independent logistic regression classi\ufb01ers in a piecewise fashion. Inference is performed by greedy agglomerative clustering.\n\u2022 Contrastive Divergence (MH-CD1) with Metropolis-Hastings: the system is trained with contrastive divergence and allowed to wander one step from the ground-truth con\ufb01guration. Once the parameters are learned, MAP inference is performed using Metropolis-Hastings (with a proposal distribution that modi\ufb01es a single variable at a time).\n\u2022 SampleRank with Metropolis-Hastings (MH-SR): this system is the same as above, but trains the CRF using SampleRank rather than CD1. MAP inference is performed with Metropolis-Hastings using a proposal distribution that modi\ufb01es a single variable at a time (the same proposer as in MH-CD1).\n\u2022 Reinforcement Learning (RL): this is the system introduced in this paper, which trains the CRF with delayed reward using Q(\u03bb) to learn state-action returns. 
The actions are derived from the same proposal distribution as used by our Metropolis-Hastings systems (MH-CD1, MH-SR), modifying a single variable at a time; however, it is exhaustively applied to \ufb01nd the maximum action. We set the RL parameters as follows: \u03b1 = 0.00001, \u03bb = 0.9, \u03b3 = 0.9.\n\u2022 GLUE: in order to compare with a well-known system on this dataset, we choose GLUE [24].\nIn these experiments contrastive divergence and SampleRank were run for 10,000 samples each, while reinforcement learning was run for twenty episodes and 200 steps per episode. CD1 and SampleRank were run for more steps to compensate for only observing a single action at each step (recall that RL computes the action with the maximum value at each step by observing a large number of samples).\n\n4.3 Results\n\nIn Table 1 we compare pairwise-matching F1 scores of the various systems on the course catalog and company pro\ufb01le datasets. We also compare to the well-known system GLUE [24]. SampleRank (MH-SR), contrastive divergence (MH-CD1) and reinforcement learning (RL) underwent ten training episodes initialized from random con\ufb01gurations; during MAP inference we initialized the systems to the state predicted by greedy agglomerative clustering. Both SampleRank and reinforcement learning were able to achieve higher scores than greedy; however, reinforcement learning outperformed all systems, with an error reduction of 75.3% over contrastive divergence, 28% over SampleRank, 71% over GLUE and 48% over the previous state of the art (greedy agglomerative inference on a conditional random \ufb01eld). Reinforcement learning also reduces error over each system on the company pro\ufb01le dataset.\nAfter observing the improvements obtained by reinforcement learning, we wished to test how robust the method was at recovering from the local optima problem described in the introduction. 
To gain more insight, we designed a separate experiment to compare Metropolis-Hastings inference (trained with SampleRank) and reinforcement learning more carefully.\nIn the second experiment we evaluate our approach under more dif\ufb01cult conditions. In particular, the MAP inference procedures are initialized to random clusterings (in regions riddled with the type of local optima discussed in the introduction). We then compare greedy MAP inference on a model whose parameters were learned with RL to Metropolis-Hastings on a model whose parameters were learned with SampleRank. More speci\ufb01cally, we generate a set of ten random con\ufb01gurations from the test corpus and run both algorithms, averaging the results over the ten runs. The \ufb01rst two rows of Table 2 summarize this experiment. Even though reinforcement learning\u2019s policy requires it to be greedy with respect to the Q-function, we observe that it is able to escape the random initial con\ufb01guration better than the Metropolis-Hastings method. Although both systems perform worse under these conditions than in the previous experiment, reinforcement learning does much better in this situation, indicating that the Q-function learned is fairly robust and capable of generalizing to random regions of the space.\nAfter observing Metropolis-Hastings\u2019s tendency to get stuck in regions of lower score than reinforcement learning, we test RL to see if it would fall victim to these same optima. In the last two rows of Table 2 we record the results of re-running both reinforcement learning and Metropolis-Hastings (on the SampleRank model) from the con\ufb01gurations in which Metropolis-Hastings became stuck. We notice 
We notice\nthat RL is able to climb out of these optima and achieve a score comparable to our \ufb01rst experiment.\n\n7\n\n\fMH is also able to progress out of the optima, demonstrating that the stochastic method is capable\nof escaping optima, but perhaps not as quickly on this particular problem.\n\nRL on random\n\nF1\n86.4\nMH-SR on random 81.1\n93.0\nMH-SR on MH-SR 84.3\n\nRL on MH-SR\n\nPrecision Recall\n85.6\n79.3\n91.5\n81.5\n\n87.2\n82.9\n94.6\n87.3\n\nTable 2: Average pairwise-matching precision, recall and F1 over ten random initialization points,\nand on the output of MH-SR after 10,000 inference steps.\n\n5 Related Work\n\nThe expanded version of this work is our technical report [17], which provides additional detail and\nmotivation. Our approach is similar in spirit to Zhang and Dietterich who propose a reinforcement\nlearning framework for solving combinatorial optimization problems [25]. Similar to this approach,\nwe also rely on generalization techniques in RL in order to directly approximate a policy over un-\nseen test domains. However, our formulation provides a framework that explicitly targets the MAP\nproblem in large factor graphs and takes advantage of the log-linear representation of such models\nin order to employ a well studied class of generalization techniques in RL known as linear function\napproximation. Learning generalizable function approximators has been also studied for ef\ufb01ciently\nguiding standard search algorithms through experience [26].\nThere are a number of approaches for learning parameters that speci\ufb01cally target the problem of\nMAP inference. For example, the frameworks of LASO [27] and SEARN [28]) formulate MAP\nin the context of search optimization, where a cost function is learned to score partial (incomplete)\ncon\ufb01gurations that lead to a goal state. In this framework, actions incrementally construct a solution,\nrather than explore the solution space itself. 
As shown in [28], these frameworks have connections to learning policies in reinforcement learning. However, the policies are learned over incomplete con\ufb01gurations. In contrast, we formulate parameter learning in factor graphs as an MDP over the space of complete con\ufb01gurations, from which a variety of RL methods can be used to set the parameters.\nAnother approach that targets the problem of MAP inference is SampleRank [11, 23], which computes atomic gradient updates from jumps in the local search space. This method has the advantage of learning over the space of complete con\ufb01gurations, but ignores the issue of delayed reward.\n\n6 Conclusions and Future Work\n\nWe proposed an approach for solving the MAP inference problem in large factor graphs by using reinforcement learning to train model parameters. RL allows us to evaluate jumps in the con\ufb01guration space based on a value function that optimizes the long-term improvement in model scores. Hence \u2013 unlike most search optimization approaches \u2013 the system is able to move out of local optima while aiming for the MAP con\ufb01guration. Bene\ufb01ting from the log-linear nature of factor graphs such as CRFs, we are also able to employ well-studied RL linear function approximation techniques for learning generalizable value functions that provide value estimates on the test set. Our experiments on a real-world domain show substantial error reductions when compared to the other approaches. Future work should investigate additional RL paradigms for training models, such as actor-critic.\n\nAcknowledgments\n\nThis work was supported in part by the CIIR; SRI #27-001338 and ARFL #FA8750-09-C-0181, CIA, NSA and NSF #IIS-0326249; Army #W911NF-07-1-0216 and UPenn subaward #103-548106; and UPenn NSF #IS-0803847. 
Any opinions, findings, and conclusions or recommendations expressed in this material are the authors' and do not necessarily reflect those of the sponsor.

References

[1] Andrew McCallum, Dayne Freitag, and Fernando Pereira. Maximum entropy Markov models for information extraction and segmentation. In International Conference on Machine Learning (ICML), 2000.

[2] John D. Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning (ICML), 2001.

[3] Ben Taskar, Carlos Guestrin, and Daphne Koller. Max-margin Markov networks. In NIPS, 2003.

[4] Ryan McDonald and Fernando Pereira. Online learning of approximate dependency parsing algorithms. In European Chapter of the Association for Computational Linguistics (EACL), pages 81-88, 2006.

[5] Matthew Richardson and Pedro Domingos. Markov logic networks. Machine Learning, 62, 2006.

[6] Brian Milch, Bhaskara Marthi, and Stuart Russell. BLOG: Relational Modeling with Unknown Objects. PhD thesis, University of California, Berkeley, 2006.

[7] Andrew McCallum, Khashayar Rohanimanesh, Michael Wick, Karl Schultz, and Sameer Singh. FACTORIE: Efficient probabilistic programming via imperative declarations of structure, inference and learning. In Neural Information Processing Systems (NIPS) Workshop on Probabilistic Programming, Vancouver, BC, Canada, 2008.

[8] Aria Haghighi and Dan Klein. Unsupervised coreference resolution in a nonparametric Bayesian model. In Association for Computational Linguistics (ACL), 2007.

[9] Hanna Pasula, Bhaskara Marthi, Brian Milch, Stuart Russell, and Ilya Shpitser. Identity uncertainty and citation matching. In Advances in Neural Information Processing Systems 15. MIT Press, 2003.

[10] Sonia Jain and Radford M. Neal. A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Journal of Computational and Graphical Statistics, 13:158-182, 2004.

[11] Aron Culotta. Learning and inference in weighted logic with application to natural language processing. PhD thesis, University of Massachusetts, May 2008.

[12] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, March 1998.

[13] Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, pages 9-44, 1988.

[14] Robert H. Crites and Andrew G. Barto. Improving elevator performance using reinforcement learning. In Advances in Neural Information Processing Systems 8, pages 1017-1023. MIT Press, 1996.

[15] Wei Zhang and Thomas G. Dietterich. Solving combinatorial optimization tasks by reinforcement learning: A general methodology applied to resource-constrained scheduling. Journal of Artificial Intelligence Research, 1, 2000.

[16] Gerald Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58-68, 1995.

[17] Khashayar Rohanimanesh, Michael Wick, Sameer Singh, and Andrew McCallum. Reinforcement learning for MAP inference in large factor graphs. Technical Report #UM-CS-2008-040, University of Massachusetts, Amherst, 2008.

[18] Christopher J. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, 1989.

[19] Christopher J. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3):279-292, May 1992.

[20] Andrew McCallum and Charles Sutton. Piecewise training with parameter independence diagrams: Comparing globally- and locally-trained linear-chain CRFs. In NIPS Workshop on Learning with Structured Outputs, 2004.

[21] Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800, 2002.

[22] Aron Culotta, Michael Wick, and Andrew McCallum. First-order probabilistic models for coreference resolution. In International Joint Conference on Artificial Intelligence, 2007.

[23] Khashayar Rohanimanesh, Michael Wick, and Andrew McCallum. Inference and learning in large factor graphs with adaptive proposal distributions. Technical Report #UM-CS-2009-028, University of Massachusetts, Amherst, 2009.

[24] AnHai Doan, Jayant Madhavan, Pedro Domingos, and Alon Y. Halevy. Learning to map between ontologies on the semantic web. In WWW, page 662, 2002.

[25] Wei Zhang and Thomas G. Dietterich. A reinforcement learning approach to job-shop scheduling. In International Joint Conference on Artificial Intelligence (IJCAI), pages 1114-1120, 1995.

[26] Justin Boyan and Andrew W. Moore. Learning evaluation functions to improve optimization by local search. Journal of Machine Learning Research, 1:77-112, 2001.

[27] Hal Daumé III and Daniel Marcu. Learning as search optimization: Approximate large margin methods for structured prediction. In International Conference on Machine Learning (ICML), 2005.

[28] Hal Daumé III, John Langford, and Daniel Marcu. Search-based structured prediction. Machine Learning, 2009.