{"title": "FACTORIE: Probabilistic Programming via Imperatively Defined Factor Graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 1249, "page_last": 1257, "abstract": "Discriminatively trained undirected graphical models have had wide empirical success, and there has been increasing interest in toolkits that ease their application to complex relational data. The power in relational models is in their repeated structure and tied parameters; at issue is how to define these structures in a powerful and flexible way. Rather than using a declarative language, such as SQL or first-order logic, we advocate using an imperative language to express various aspects of model structure, inference, and learning. By combining the traditional, declarative, statistical semantics of factor graphs with imperative definitions of their construction and operation, we allow the user to mix declarative and procedural domain knowledge, and also gain significant efficiencies. We have implemented such imperatively defined factor graphs in a system we call Factorie, a software library for an object-oriented, strongly-typed, functional language. In experimental comparisons to Markov Logic Networks on joint segmentation and coreference, we find our approach to be 3-15 times faster while reducing error by 20-25%-achieving a new state of the art.", "full_text": "FACTORIE: Probabilistic Programming\nvia Imperatively De\ufb01ned Factor Graphs\n\nAndrew McCallum, Karl Schultz, Sameer Singh\n\nDepartment of Computer Science\n\nUniversity of Massachusetts Amherst\n\nAmherst, MA 01003\n\n{mccallum, kschultz, sameer}@cs.umass.edu\n\nAbstract\n\nDiscriminatively trained undirected graphical models have had wide empirical\nsuccess, and there has been increasing interest in toolkits that ease their applica-\ntion to complex relational data. 
The power in relational models is in their repeated\nstructure and tied parameters; at issue is how to de\ufb01ne these structures in a pow-\nerful and \ufb02exible way. Rather than using a declarative language, such as SQL\nor \ufb01rst-order logic, we advocate using an imperative language to express various\naspects of model structure, inference, and learning. By combining the traditional,\ndeclarative, statistical semantics of factor graphs with imperative de\ufb01nitions of\ntheir construction and operation, we allow the user to mix declarative and proce-\ndural domain knowledge, and also gain signi\ufb01cant ef\ufb01ciencies. We have imple-\nmented such imperatively de\ufb01ned factor graphs in a system we call FACTORIE,\na software library for an object-oriented, strongly-typed, functional language. In\nexperimental comparisons to Markov Logic Networks on joint segmentation and\ncoreference, we \ufb01nd our approach to be 3-15 times faster while reducing error by\n20-25%\u2014achieving a new state of the art.\n\n1\n\nIntroduction\n\nConditional random \ufb01elds [1], or discriminatively trained undirected graphical models, have become\nthe tool of choice for addressing many important tasks across bioinformatics, natural language pro-\ncessing, robotics, and many other \ufb01elds [2, 3, 4]. While relatively simple structures such as linear\nchains, grids, or fully-connected af\ufb01nity graphs have been employed successfully in many contexts,\nthere has been increasing interest in more complex relational structure\u2014capturing more arbitrary\ndependencies among sets of variables, in repeated patterns\u2014and interest in models whose variable-\nfactor structure changes during inference, as in parse trees and identity uncertainty. 
Implementing such complex models from scratch in traditional programming languages is difficult and error-prone, and hence there have been several efforts to provide a high-level language in which models can be specified and run. For generative, directed graphical models these include BLOG [5], IBAL [6], and Church [7]. For conditional, undirected graphical models, these include Relational Markov Networks (RMNs) using SQL [8], and Markov Logic Networks (MLNs) using first-order logic [9]. Regarding logic, for many years there has been considerable effort in integrating first-order logic and probability [9, 10, 11, 12, 13]. However, we contend that in many of these proposed combinations, the 'logic' aspect is not crucial to the ultimate goal of accurate and expressive modeling. The power of relational factor graphs is in their repeated relational structure and tied parameters. First-order logic is one way to specify this repeated structure, but it is less than ideal because of its focus on boolean outcomes and its inability to easily and efficiently express relations such as graph reachability and set size comparison. Logical inference is used in some of these systems, such as PRISM [12], but in others, such as Markov Logic [9], it is largely replaced by probabilistic inference.

This paper proposes an approach to probabilistic programming that preserves the declarative statistical semantics of factor graphs, while at the same time leveraging imperative constructs (pieces of procedural programming) to greatly aid both efficiency and natural intuition in specifying model structure, inference, and learning, as detailed below. Our approach thus supports users in combining both declarative and procedural knowledge. Rather than first-order logic, model authors have access to a Turing-complete language when writing their model specification.
The point, however, is not merely to have greater formal expressiveness; it is ease-of-use and efficiency.

We term our approach imperatively defined factor graphs (IDFs). Below we develop this approach in the context of Markov chain Monte Carlo inference, and define four key imperative constructs—arguing that they provide a natural interface to central operations in factor graph construction and inference. These imperative constructs (1) define the structure connecting variables and factors, (2) coordinate variable values, (3) map the variables neighboring a factor to sufficient statistics, and (4) propose jumps from one possible world to another. A model written as an IDF is a factor graph, with all the traditional semantics of factors, variables, possible worlds, scores, and partition functions; we are simply providing an extremely flexible language for their succinct specification, which also enables efficient inference and learning.

Our first embodiment of the approach is the system we call FACTORIE (loosely named for "Factor graphs, Imperative, Extensible"; see http://factorie.cs.umass.edu), a library for the object-oriented, strongly-typed, functional programming language Scala [14]. The choice of Scala stems from key inherent advantages of the language itself, plus its full interoperability with Java, and recent growing usage in the machine learning community. By providing a library and direct access to a full programming language (as opposed to our own, new "little language"), model authors have familiar and extensive resources for implementing the procedural aspects of the design, as well as the ability to beneficially mix data pre-processing, evaluation, and other book-keeping code in the same files as the probabilistic model specification.
Furthermore, FACTORIE is object-oriented in that variables and factor templates are objects, supporting inheritance, polymorphism, composition, and encapsulation.

The contributions of this paper are introducing the novel IDF methodology for specifying factor graphs, and successfully demonstrating it on a non-trivial task. We present experimental results applying FACTORIE to the substantial task of joint inference in segmentation and coreference of research paper citations, surpassing previous state-of-the-art results. In comparison to Markov Logic (Alchemy) on the same data, we achieve a 20-25% reduction in error, and do so 3-15 times faster.

2 Imperatively Defined Factor Graphs

A factor graph G is a bipartite graph over factors and variables defining a probability distribution over a set of target variables y, optionally conditioned on observed variables x. A factor Ψ_i computes a scalar value over the subset of variables that are its neighbors in the graph. Often this real-valued function is defined as the exponential of the dot product over sufficient statistics {f_ik(x_i, y_i)} and parameters {θ_ik}, where k ∈ {1 . . . K_i} and K_i is the number of parameters for factor Ψ_i. Factor graphs often use parameter tying, i.e., the same parameters for several factors. A factor template T_j consists of parameters {θ_jk}, sufficient statistic functions {f_jk}, and a description of an arbitrary relationship between variables, yielding a set of satisfying tuples {(x_i, y_i)}. For each of these variable tuples (x_i, y_i) ∈ T_j that fulfills the relationship, the factor template instantiates a factor that shares {θ_jk} and {f_jk} with all other instantiations of T_j. Let T be the set of factor templates.
In this case the probability distribution is defined:

    p(y|x) = \frac{1}{Z(x)} \prod_{T_j \in T} \prod_{(x_i, y_i) \in T_j} \exp\left[ \sum_{k=1}^{K_j} \theta_{jk} f_{jk}(x_i, y_i) \right]

As in all relational factor graphs, our language supports variables and factor template definitions. In our case the variables—which can be binary, categorical, ordinal, real, etc.—are typed objects in the object-oriented language, and can be sub-classed. Relations between variables can be represented directly as members (instance variables) in these variable objects, rather than as indices into global tables. In addition we allow for new variable types to be programmed by model authors via polymorphism. For example, the user can easily create new variable types such as a set-valued variable type, representing a group of unique values, as well as traits augmenting variables to represent sequences of elements with left and right neighbors.

Figure 1: Example of variable classes for a linear chain and a coreference model.

class Token(str:String) extends EnumVariable(str)
class Label(str:String, val token:Token) extends EnumVariable(str) with VarInSeq
class Mention(val string:String) extends PrimitiveVariable[Entity]
class Entity extends SetVariable[Mention] {
  var canonical:String = ""
  def add(m:Mention, d:DiffList) = {
    super.add(m,d); m.set(this,d)
    canonical = recomputeCanonical(members)
  }
  def remove(m:Mention, d:DiffList) = {
    super.remove(m,d); m.set(null,d)
    canonical = recomputeCanonical(members)
  }
}

Typically, IDF programming consists of two distinct stages: defining the data representation, then defining the factors for scoring. This separation offers great flexibility.
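To make the template-based scoring defined earlier in this section concrete, the following self-contained Scala sketch computes the unnormalized probability of a configuration. The Template type and integer-pair tuples here are hypothetical illustrations, not the FACTORIE API:

```scala
// Illustrative sketch of template-based scoring (hypothetical types, not the
// FACTORIE API): each template carries tied weights `theta` and a
// sufficient-statistics function `f`, applied to every satisfying tuple.
final case class Template(theta: Vector[Double], f: ((Int, Int)) => Vector[Double]) {
  // exp( sum_k theta_k * f_k(x_i, y_i) ) for a single instantiated factor
  def factorScore(tuple: (Int, Int)): Double =
    math.exp(theta.lazyZip(f(tuple)).map(_ * _).sum)
}

// Unnormalized p(y|x): product over templates T_j and their satisfying tuples.
def unnormalizedScore(model: Seq[(Template, Seq[(Int, Int)])]): Double =
  model.map { case (tmpl, tuples) => tuples.map(tmpl.factorScore).product }.product
```

Dividing this quantity by Z(x), its sum over all configurations y, yields the normalized distribution p(y|x) above.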
In the first stage the model author implements infrastructure for storing a possible world—variables, their relations and values. Somewhat surprisingly, authors can do this with a mind-set and style they would employ for deterministic programming, including usage of standard data structures such as linked lists, hash tables and objects embedded in other objects. In some cases authors must provide API functions for "undoing" and "redoing" changes to variables that will be tracked by MCMC, but in most cases such functionality is already provided by the library's wide variety of variable object implementations. For example, in a linear-chain CRF model, a variable containing a word token can be declared as the Token class shown in Figure 1.¹ A variable for labels can be declared similarly, with the addition that each Label² object has an instance variable that points to its corresponding Token. The second stage of our linear-chain CRF implementation is described in Section 2.2.

Consider also the task of entity resolution in which we have a set of Mentions to be co-referenced into Entities. A Mention contains its string form, but its value as a random variable is the Entity to which it is currently assigned. An Entity is a set-valued variable—the set of Mentions assigned to it; it holds and maintains a canonical string form representative of all its Mentions (see Figure 1³). The add/remove methods are explained in Section 2.3.

2.1 Inference and Imperative Constraint Preservation

For inference, we rely on MCMC to achieve efficiency with models that not only have large tree-width but an exponentially-sized unrolled network, as is common with complex relational data [15, 9, 5]. The key is to avoid unrolling the network over multiple hypotheses, and to represent only one variable-value configuration at a time.
As in BLOG [5], MCMC steps can adjust model structure as necessary, and with each step the FACTORIE library automatically builds a DiffList—a compact object containing the variables changed by the step, as well as undo and redo capabilities. Calculating the factor graph's 'score' for a step only requires DiffList variables, their factors, and neighboring variables, as described in Section 2.4. In fact, unlike BLOG and BLAISE [16], we build inference and learning entirely on DiffList scores and never need to score the entire model. This enables efficient reasoning about observed data larger than memory, or models in which the number of factors is a high-degree polynomial of the number of variables.

A key component of many MCMC inference procedures is the proposal distribution that proposes changes to the current configuration. This is a natural place for injecting prior knowledge about coordination of variable values and various structural changes. In fact, in some cases we can avoid

¹ Objects of class EnumVariable hold variables with a value selected from a finite enumerated set.
² In Scala var/val indicates a variable declaration; trait VarInSeq provides methods for obtaining next and prev labels in a sequence.
³ In Scala def indicates a function definition where the value returned is the last line-of-code in the function; members is the set of variables in the superclass SetVariable.

Figure 2: Examples of FACTORIE factor templates. 
Some error-checking code is elided for brevity.

val crfTemplate = new TemplateWithDotStatistics3[Label,Label,Token] {
  def unroll1 (label:Label) = Factor(label, label.next, label.token)
  def unroll2 (label:Label) = Factor(label.prev, label, label.prev.token)
  def unroll3 (token:Token) = throw new Error("Token values shouldn't change")
}
val depParseTemplate = new Template1[Node] with DotStatistics2[Word,Word] {
  def unroll1(n:Node) = n.selfAndDescendants
  def statistics(n:Node) = Stat(n.word, closestVerb(n).word)
  def closestVerb(n:Node) = if (isVerb(n.word)) n else closestVerb(n.parent)
}
val corefTemplate = new Template2[Mention,Entity] with DotStatistics1[Bool] {
  def unroll1 (m:Mention) = Factor(m, m.entity)
  def unroll2 (e:Entity) = for (mention <- e.mentions) yield Factor(mention, e)
  def statistics(m:Mention,e:Entity) = Bool(distance(m.string,e.canonical)<0.5)
}
val logicTemplate1 = Forany[Person] { p => p.smokes --> p.cancer }
val logicTemplate2 = Forany[Person] { p => p.friends.smokes <--> p.smokes }

expensive deterministic factors altogether with property-preserving proposal functions [17]. For example, coreference transitivity can be efficiently enforced by proper initialization and a transitivity-preserving proposal function; projectivity in dependency parsers can be enforced similarly. We term this imperative constraint preservation. In FACTORIE proposal distributions may be implemented by the model author. Alternatively, the FACTORIE library provides several default inference methods, including Gibbs sampling, as well as default proposers for many variable classes.

2.2 Imperative Structure Definition

At the heart of model structure definition is the pattern of connectivity between variables and factors, and the DiffList must have extremely efficient access to this.
Unlike BLOG, which uses a complex, highly-indexed data structure that must be updated during inference, we instead specify this connectivity imperatively: factor template objects have methods (e.g., unroll1, unroll2, etc., one for each factor argument) that find the factor's other variable neighbors given a single variable from the DiffList. This is typically accomplished using a simple data structure that is already available as part of the natural representation of the data (e.g., as would be used by a non-probabilistic programmer). The unroll method then constructs a Factor with these neighbors as arguments, and returns it. The unroll method may optionally return multiple Factors in response to a single changed variable. Note that this approach also efficiently supports a model structure that varies conditioned on variable values, because the unroll methods can examine and perform calculations on these values.

Thus we now have the second stage of FACTORIE programming, in which the model author implements the factor templates that define the factors which score possible worlds. In our linear-chain CRF example, the factor between two successive Labels and a Token might be declared as crfTemplate in Figure 2. Here unroll1 simply uses the token instance variable of each Label to find the corresponding third argument to the factor. This simple example does not, however, show the true expressive power of imperative structure definition. Consider instead a model for dependency parsing (with similarly defined Word and Node variables). In the same figure, depParseTemplate defines a template for factors that measure compatibility between a word and its closest verb as measured through parse tree connectivity. Such arbitrary-depth graph search is awkward in first-order logic, yet it is a simple one-line recursive method in FACTORIE.
The statistics method is described below in Section 2.4.

Consider also the coreference template measuring the compatibility between a Mention and the canonical representation of its assigned Entity. In response to a moved Mention, unroll1 returns a factor between the Mention and its newly assigned Entity. In response to a changed Entity, unroll2 returns a list of factors between itself and all its member Mentions. It is inherent that sometimes different unroll methods will construct multiple copies of the same factor; they are automatically deduplicated by the FACTORIE library. Syntactic sugar for extended first-order logic primitives is also provided, and these can be mixed with imperative constructs; see the bottom of Figure 2 for two small examples. Specifying templates in FACTORIE can certainly be more verbose when not restricted to first-order logic; in this case we trade off some brevity for flexibility.

2.3 Imperative Variable Coordination

Variables' value-assignment methods can be overridden to automatically change other variable values in coordination with the assignment—an often-desirable encapsulation of domain knowledge we term imperative variable coordination. For example, in response to a named entity label change, a coreference mention can have its string value automatically adjusted, rather than relying on MCMC inference to stumble upon this self-evident coordination. In Figure 1, Entity does a basic form of coordination by re-calculating its canonical string representation whenever a Mention is added or removed from its set.

The ability to use prior knowledge for imperative variable coordination also allows the designer to define the feasible region for sampling. In the proposal function, users make changes by calling value-assignment functions, and any changes made automatically through coordinating variables are appended to the DiffList.
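The DiffList bookkeeping just described can be sketched in a few lines of self-contained Scala. This is an illustration of the idea only, not the FACTORIE implementation; the IntVar class and the coordination rule are hypothetical:

```scala
import scala.collection.mutable.ArrayBuffer

// Minimal sketch of DiffList-style change tracking with imperative variable
// coordination (illustrative only; not the FACTORIE implementation).
trait Diff { def undo(): Unit; def redo(): Unit }

final class DiffList {
  private val diffs = ArrayBuffer.empty[Diff]
  def +=(d: Diff): Unit = diffs += d
  def undoAll(): Unit = diffs.reverseIterator.foreach(_.undo())
  def redoAll(): Unit = diffs.foreach(_.redo())
  def size: Int = diffs.size
}

// A variable whose value assignments are recorded on a DiffList.
class IntVar(initial: Int) {
  private var v = initial
  def value: Int = v
  def set(newValue: Int, d: DiffList): Unit = {
    val old = v
    d += new Diff { def undo(): Unit = v = old; def redo(): Unit = v = newValue }
    v = newValue
  }
}

// Imperative variable coordination: assigning `label` also adjusts `field`,
// and the coordinated change lands on the same DiffList.
// (The coordination rule `value * 10` is a hypothetical placeholder.)
def coordinatedSet(label: IntVar, field: IntVar, value: Int, d: DiffList): Unit = {
  label.set(value, d)
  field.set(value * 10, d)
}
```

Undoing applies the recorded diffs in reverse order, which is what allows a rejected MCMC proposal to be rolled back cheaply without touching the rest of the model.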
Since a factor template's contribution to the overall score will not change unless its neighboring variables have changed, once we know every variable that has changed we can efficiently score the proposal.

2.4 Imperative Variable-Statistics Mapping

In a somewhat unconventional use of functional mapping, we support a separation between factor neighbors and sufficient statistics. Neighbors are variables touching the factor whose changes imply that the factor needs to be re-scored. Sufficient statistics are the minimal set of variable values that determine the score contribution of the factor. These are usually the same; however, by allowing a function to perform the mapping, we provide an extremely powerful yet simple way to allow model designers to represent their data in natural ways, and concern themselves separately with how to parameterize them. For example, the two neighbors of a skip-edge factor [18] may each have cardinality equal to the number of named entity types, but we may only care to have the skip-edge factor enforce whether or not they match. We term this imperative variable-statistics mapping.

Consider corefTemplate in Figure 2: the neighbors of the template are ⟨Mention, Entity⟩ pairs. However, the sufficient statistic is simply a Boolean based on the "distance" of the unrolled Mention from the canonical value of the Entity. This allows the template to separate the natural representation of possible worlds from the sufficient statistics needed to score its factors. Note that these sufficient statistics can be calculated as arbitrary functions of the unrolled Mention and the Entity. The models described in Section 3 use a number of factors whose sufficient statistics derive from the domains of their neighbors, as well as factors with arbitrary feature functions based on their neighbors.

An MCMC proposal is scored as follows.
First, a sample is generated from the proposal distribution, placing an initial set of variables in the DiffList. Next the value-assignment method is called for each of the variables on the DiffList, and via imperative variable coordination other variables may be added to the DiffList. Given the set of variables that have changed, FACTORIE iterates over each one and calls the unroll function for factor templates matching the variable's type. This dynamically provides the relevant structure of the graph via imperative structure definition, resulting in a set of factors that should be re-scored. The neighbors of each returned factor are given to the template's statistics function, and the sufficient statistics are used to generate the factor's score using the template's current parameter vector. These scores are summed, producing the final score for the MCMC step.

2.5 Learning

Maximum likelihood parameter estimation traditionally involves finding the gradient; however, for complex models this can be prohibitively expensive, since it requires the inference of marginal distributions over factors. Alternatively, some have proposed online methods, such as the perceptron, which avoid the need for marginals but still require full decoding, which can also be computationally expensive. We avoid both of these issues by using sample-rank [19]. This is a parameter estimation method that learns a ranking over all possible configurations by observing the difference between scores of proposed MCMC jumps. Parameter changes are made when the model's ranking of a proposed jump disagrees with a ranking determined by labeled truth. When there is such a disagreement, a perceptron-style update to active parameters is performed by finding all factors whose score has changed (i.e., factors with a neighbor in the DiffList). The active parameters are indexed by the
Sample-rank is described in detail in [20]. As with inference,\nlearning is ef\ufb01cient because it uses the DiffList and the imperative constructs described earlier.\n\n3 Joint Segmentation and Coreference\n\nTasks involving multiple information extraction steps are traditionally solved using a pipeline archi-\ntecture, in which the output predictions of one stage are input to the next stage. This architecture\nis susceptible to cascading of errors from one stage to the next. To minimize this error, there has\nbeen signi\ufb01cant interest in joint inference over multiple steps of an information processing pipeline\n[21, 22, 23]. Full joint inference usually results in exponentially large models for which learning and\ninference become intractable. One widely studied joint-inference task in information extraction is\nsegmentation and coreference of research paper citation strings [21, 23, 24]. This involves segment-\ning citation strings into author, title and venue \ufb01elds (segmentation), and clustering the citations\nthat refer to the same underlying paper entity (coreference). Previous results have shown that joint\ninference reduces error [21], and this task provides a good testbed for probabilistic programming.\nWe now describe an IDF for the task. For more details, see [24].\n\n3.1 Variables and Proposal Distribution\n\nAs in the example given in Section 2, a Mention represents a citation and is a random variable that\ntakes a single Entity as its value. An Entity is a set-valued variable containing Mention variables.\nThis representation eliminates the need for an explicit transitivity constraint, since a Mention can\nhold only one Entity value, and this value is coordinated with the Entity\u2019s set-value.\nVariables for segmentation consist of Tokens, Labels and Fields. Each Token represents an ob-\nserved word in a citation. 
Each Token has a corresponding Label which is an unobserved variable\nthat can take one of four values: author, title, venue or none. There are three Field variables asso-\nciated with each Mention, one for each \ufb01eld type (author, venue or title), that store the contiguous\nblock of Tokens representing the Field; Labels and Fields are coordinated. This alternate repre-\nsentation of segmentation provides \ufb02exibility in specifying factor templates over predicted Fields.\nThe proposal function for coreference randomly selects a Mention, and with probability 0.8 moves\nit to a random existing cluster, otherwise to a new singleton cluster. The proposal function for\nsegmentation selects a random Field and grows or shrinks it by a random amount. When jointly\nperforming both tasks, one of the proposal functions is randomly selected. The value-assignment\nfunction for the Field ensures that the Labels corresponding to the affected Tokens are correctly\nset when a Field is changed. This is an example of imperative variable coordination.\n\n3.2 Factor Templates\n\nSegmentation Templates: Segmentation templates examine only Field, Label and Token vari-\nables, i.e. not using information from coreference predictions. These factor templates are IDF trans-\nlations of the Markov logic rules described in [21]. There is a template between every Token and its\nLabel. Markov dependencies are captured by a template that examines successive Labels as well\nas the Token of the earlier Label. The suf\ufb01cient statistics for these factors are the tuples created\nfrom the neighbors of the factor: e.g., the values of two Labels and one Token. We also have a\nfactor template examining every Field with features based on the presence of numbers, dates, and\npunctuation. 
This takes advantage of variable-statistics mapping.

Table 1: Cora coreference and segmentation results

                        Coreference                        Segmentation F1
                Prec/Recall     F1    Cluster Rec.   Author   Title   Venue   Total
Fellegi-Sunter   78.0/97.7     86.7       62.7         n/a     n/a     n/a     n/a
Isolated MLN     94.3/97.0     95.6       78.1        99.3    97.3    98.2    98.2
Joint MLN        94.3/97.0     95.6       75.2        99.5    97.6    98.3    98.4
Isolated IDF    97.09/95.42   96.22      86.01       99.35   97.63   98.58   98.51
Joint IDF       95.34/98.25   96.71      94.62       99.42   97.99   98.78   98.72

Coreference Templates: The isolated coreference factor templates use only Mention variables. They consist of two factor templates that share the same sufficient statistics, but have separate weight vectors and different ways of unrolling the graph. An Affinity factor is created for all pairs of Mentions that are coreferent, while a Repulsion factor is created for all pairs that are not coreferent. The features of these templates correspond to the SimilarTitle and SimilarVenue first-order features in [21]. We also add SimilarDate and DissimilarDate features that look at the "date-like" tokens.

Joint Templates: To allow the tasks to influence each other, factor templates are added that are unrolled during both segmentation and coreference sampling. Thus these factor templates neighbor Mentions, Fields, and Labels, and use the segmentation predictions for coreference, and vice-versa. We add templates for the JntInfCandidates rule from [21]. We create this factor template such that (m, m′) are unrolled only if they are in the same Entity. The neighbors include Label and Mention. Affinity and Repulsion factor templates are also created between pairs of Fields of the same type; for Affinity the Fields belong to coreferent mention pairs, and for Repulsion they belong to a pair of mentions that are not coreferent.
The features of these templates denote similarity between field strings, namely: StringMatch, SubString, Prefix/SuffixMatch, TokenIntersectSize, etc. One notable difference between the JntInfCandidate and joint Affinity/Repulsion templates is the possible number of instantiations. JntInfCandidates can be calculated during preprocessing as there are O(nm²) of these (where n is the maximum mention length, and m is the number of mentions). However, preprocessing joint Affinity/Repulsion templates is intractable as the number of such factors is O(m²n⁴). We are able to deal with such a large set of possible factor instantiations due to the interplay of structure definition, variable-statistics mapping, and on-the-fly feature calculation.

Our model also contains a number of factor templates that cannot be easily captured by first-order logic. For example, consider StringMatch and SubString between two fields. For arbitrary-length strings these features require the model designer to specify convoluted logic rules. The rules are even less intuitive when considering a feature based on more complex calculations such as StringEditDistance. It is conceivable to preprocess and store all instantiations of these features, but in practice this is intractable. Thus on-the-fly feature calculation within FACTORIE is employed to remain tractable.

4 Experimental Results

The joint segmentation and coreference model described above is applied to the Cora dataset [25].⁴ The dataset contains 1295 total mentions in 134 clusters, with a total of 36487 tokens. Isolated training consists of 5 loops of 100,000 samples each, and 300,000 samples for inference. For the joint task we run training for 5 loops of 250,000 samples each, with 750,000 samples for inference. We average the results of 10 runs of three-fold cross validation, with the same folds as [21]. Segmentation is evaluated on token precision, recall and F1.
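Token-level precision, recall, and F1 are standard quantities; the following minimal Scala sketch shows the computation for one field label (an illustration only, not the evaluation code used in the experiments; the label strings are hypothetical):

```scala
// Token-level precision/recall/F1 for one field label (illustrative sketch;
// not the evaluation code used in the experiments).
def prf1(predicted: Seq[String], gold: Seq[String], label: String): (Double, Double, Double) = {
  require(predicted.length == gold.length)
  val tp = predicted.zip(gold).count { case (p, g) => p == label && g == label }
  val nPred = predicted.count(_ == label)  // tokens predicted to have this label
  val nGold = gold.count(_ == label)       // gold tokens with this label
  val p = if (nPred == 0) 0.0 else tp.toDouble / nPred
  val r = if (nGold == 0) 0.0 else tp.toDouble / nGold
  val f1 = if (p + r == 0.0) 0.0 else 2.0 * p * r / (p + r)
  (p, r, f1)
}
```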
For coreference, pairwise coreference decisions are evaluated. The fraction of clusters that are correctly predicted (cluster recall) is also calculated.

In Table 1, we see both our isolated and joint models outperform the previous state-of-the-art results of [21] on both tasks. We see a 25.23% error reduction in pairwise coreference F1, and a 20.0% error reduction in tokenwise segmentation F1 when comparing to the joint MLN. The improvements of joint over isolated IDF are statistically significant at 1% using the t-test.

The experiments run very quickly, which can be attributed to sample-rank and the application of variable coordination and structure definition of the models as described earlier. Each of the isolated tasks finishes initialization, training and evaluation within 3 minutes, while the joint task takes 18 minutes. The running times for the MLNs reported in [21] are between 50-90 minutes for learning and inference. Thus we can see that IDFs provide a significant boost in efficiency by avoiding the need to unroll or score the entire graph. Note also that the timing result from [21] is for a model that did not enforce transitivity constraints on the coreference predictions. Adding transitivity constraints dramatically increases running time [26], whereas the IDF supports transitivity implicitly.

⁴ Available at http://alchemy.cs.washington.edu/papers/poon07

5 Related Work

Over the years there have been many efforts to build graphical models toolkits. Many of them are useful as teaching aids, such as the Bayes Net Toolbox and Probabilistic Modeling Toolkit (PMTK) [27] (both in Matlab), but do not scale up to substantial real problems.

There has been growing interest in building systems that can perform as workhorses, doing real work on large data. For example, Infer.NET (CSoft) [28] is intended to be deployed in a number of Microsoft products, and has been applied to problems in computer vision.
Like IDFs it is embedded in a pre-existing programming language, rather than embodying its own new “little language,” and its users have commented positively about this facet. Unlike IDFs it is designed for message-passing inference, and must unroll the graphical model before inference, creating factors to represent all possible worlds, which makes it unsuitable for our applications. The very recent language Figaro [29] is also implemented as a library. Like FACTORIE it is implemented in Scala, and provides an object-oriented framework for models; unlike FACTORIE it tightly intertwines data representation and scoring, it is not designed for changing model structure during inference, and it does not yet support learning.

BLOG [5] and some of its derivatives can also scale to substantial data sets, and, like IDFs, are designed for graphical models that cannot be fully unrolled. Unlike IDFs, BLOG, as well as IBAL [6] and Church [7], are designed for generative models, though Church can also represent conditional, undirected models. We are most interested in supporting advanced discriminative models of the type that have been successful in natural language processing, computer vision, bioinformatics, and elsewhere. Note that FACTORIE also supports generative models; for example, latent Dirichlet allocation can be coded in about 15 lines.

Two systems focusing on discriminatively trained relational models are relational Markov networks (RMNs) [8] and Markov logic networks (MLNs) [9], with Alchemy as the most popular MLN implementation. To define repeated relational structure and parameter tying, both use declarative languages: RMNs use SQL and MLNs use first-order logic.
By contrast, as discussed above, IDFs are in essence an experiment in taking an imperative approach.

There has, however, been both historical and recently growing interest in using imperative programming languages to define learning systems and probabilistic models. For example, work on theory refinement [30] viewed domain theories as “statements in a procedural programming language, rather than the common view of a domain theory being a collection of declarative Prolog statements.” More recently, IBAL [6] and Church [7] are both fundamentally programs that describe the generative storyline for the data. IDFs, of course, share this combination of imperative programming with probabilistic modeling, but IDFs have their semantics defined by undirected factor graphs, and are typically discriminatively trained.

6 Conclusion

In this paper we have described imperatively defined factor graphs (IDFs), a framework to support efficient learning and inference in large factor graphs of changing structure. We preserve the traditional, declarative, statistical semantics of factor graphs while allowing imperative definitions of the model structure and operation. This allows model authors to combine both declarative and procedural domain knowledge, while also obtaining significantly more efficient inference and learning than declarative approaches. We have shown state-of-the-art results in citation matching that highlight the advantages afforded by IDFs in both accuracy and speed.

Acknowledgments

This work was supported in part by NSF medium IIS-0803847; the Central Intelligence Agency, the National Security Agency, and the National Science Foundation under NSF grant IIS-0326249; SRI International subcontract #27-001338 and AFRL prime contract #FA8750-09-C-0181; Army prime contract number W911NF-07-1-0216 and University of Pennsylvania subaward number 103-548106.
Any opinions, findings, and conclusions or recommendations expressed in this material are the authors' and do not necessarily reflect those of the sponsors.

References

[1] John D. Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning (ICML), 2001.
[2] Charles Sutton and Andrew McCallum. An introduction to conditional random fields for relational learning. In Introduction to Statistical Relational Learning, 2007.
[3] A. Bernal, K. Crammer, A. Hatzigeorgiou, and F. Pereira. Global discriminative learning for higher-accuracy computational gene prediction. PLoS Computational Biology, 2007.
[4] A. Quattoni, M. Collins, and T. Darrell. Conditional random fields for object recognition. In NIPS, 2004.
[5] Brian Milch. Probabilistic Models with Unknown Objects. PhD thesis, University of California, Berkeley, 2006.
[6] Avi Pfeffer. IBAL: A probabilistic rational programming language. In IJCAI, pages 733-740, 2001.
[7] Noah D. Goodman, Vikash K. Mansinghka, Daniel Roy, Keith Bonawitz, and Joshua B. Tenenbaum. Church: a language for generative models. In Uncertainty in Artificial Intelligence (UAI), 2008.
[8] Ben Taskar, Pieter Abbeel, and Daphne Koller. Discriminative probabilistic models for relational data. In Uncertainty in Artificial Intelligence (UAI), 2002.
[9] Matthew Richardson and Pedro Domingos. Markov logic networks. Machine Learning, 62(1-2), 2006.
[10] David Poole. Probabilistic Horn abduction and Bayesian networks. Artificial Intelligence, 64, 1993.
[11] Stephen Muggleton and Luc De Raedt. Inductive logic programming: Theory and methods. Journal of Logic Programming, 1994.
[12] Taisuke Sato and Yoshitaka Kameya. PRISM: a language for symbolic-statistical modeling.
In International Joint Conference on Artificial Intelligence (IJCAI), 1997.
[13] Luc De Raedt and Kristian Kersting. Probabilistic logic learning. SIGKDD Explorations: Multi-Relational Data Mining, 2003.
[14] Martin Odersky. An Overview of the Scala Programming Language (second edition). Technical Report IC/2006/001, EPFL Lausanne, Switzerland, 2006.
[15] Aron Culotta and Andrew McCallum. Tractable learning and inference with high-order representations. In ICML Workshop on Open Problems in Statistical Relational Learning, 2006.
[16] Keith A. Bonawitz. Composable Probabilistic Inference with Blaise. PhD thesis, MIT, 2008.
[17] Aron Culotta. Learning and inference in weighted logic with application to natural language processing. PhD thesis, University of Massachusetts, 2008.
[18] Charles Sutton and Andrew McCallum. Collective segmentation and labeling of distant entities in information extraction. Technical Report TR#04-49, University of Massachusetts, July 2004.
[19] Aron Culotta, Michael Wick, and Andrew McCallum. First-order probabilistic models for coreference resolution. In NAACL: Human Language Technologies (NAACL/HLT), 2007.
[20] Khashayar Rohanimanesh, Michael Wick, and Andrew McCallum. Inference and learning in large factor graphs with a rank based objective. Technical Report UM-CS-2009-08, University of Massachusetts, Amherst, 2009.
[21] Hoifung Poon and Pedro Domingos. Joint inference in information extraction. In AAAI, 2007.
[22] Vasin Punyakanok, Dan Roth, and Wen-tau Yih. The necessity of syntactic parsing for semantic role labeling. In International Joint Conference on Artificial Intelligence (IJCAI), pages 1117-1123, 2005.
[23] Ben Wellner, Andrew McCallum, Fuchun Peng, and Michael Hay. An integrated, conditional model of information extraction and coreference with application to citation matching. In AUAI, 2004.
[24] Sameer Singh, Karl Schultz, and Andrew McCallum.
Bi-directional joint inference for entity resolution and segmentation using imperatively-defined factor graphs. In ECML PKDD, pages 414-429, 2009.
[25] Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. A machine learning approach to building domain-specific search engines. In International Joint Conference on Artificial Intelligence (IJCAI), 1999.
[26] Hoifung Poon, Pedro Domingos, and Marc Sumner. A general method for reducing the complexity of relational inference and its application to MCMC. In AAAI, 2008.
[27] Kevin Murphy and Matt Dunham. PMTK: Probabilistic modeling toolkit. In Neural Information Processing Systems (NIPS) Workshop on Probabilistic Programming, 2008.
[28] John Winn and Tom Minka. Infer.NET/CSoft, 2008. http://research.microsoft.com/mlp/ml/Infer/Csoft.htm.
[29] Avi Pfeffer. Figaro: An Object-Oriented Probabilistic Programming Language. Technical report, Charles River Analytics, 2009.
[30] Richard Maclin and Jude W. Shavlik. Creating advice-taking reinforcement learners. Machine Learning, 22, 1996.