{"title": "Conditional Models of Identity Uncertainty with Application to Noun Coreference", "book": "Advances in Neural Information Processing Systems", "page_first": 905, "page_last": 912, "abstract": null, "full_text": " Conditional Models of Identity Uncertainty\n with Application to Noun Coreference\n\n\n\n Andrew McCallum Ben Wellner\n Department of Computer Science The MITRE Corporation\n University of Massachusetts Amherst 202 Burlington Road\n Amherst, MA 01003 USA Bedford, MA 01730 USA\n mccallum@cs.umass.edu wellner@cs.umass.edu\n\n\n\n\n Abstract\n\n Coreference analysis, also known as record linkage or identity uncer-\n tainty, is a difficult and important problem in natural language process-\n ing, databases, citation matching and many other tasks. This paper intro-\n duces several discriminative, conditional-probability models for coref-\n erence analysis, all examples of undirected graphical models. Unlike\n many historical approaches to coreference, the models presented here\n are relational--they do not assume that pairwise coreference decisions\n should be made independently from each other. Unlike other relational\n models of coreference that are generative, the conditional model here can\n incorporate a great variety of features of the input without having to be\n concerned about their dependencies--paralleling the advantages of con-\n ditional random fields over hidden Markov models. We present positive\n results on noun phrase coreference in two standard text data sets.\n\n\n\n1 Introduction\n\nIn many domains--including computer vision, databases and natural language\nprocessing--we find multiple views, descriptions, or names for the same underlying ob-\nject. Correctly resolving these references is a necessary precursor to further processing and\nunderstanding of the data. In computer vision, solving object correspondence is necessary\nfor counting or tracking. 
In databases, performing record linkage or de-duplication creates a clean set of data that can be accurately mined. In natural language processing, coreference analysis finds the nouns, pronouns and phrases that refer to the same entity, enabling the extraction of relations among entities as well as more complex propositions.

Consider, for example, the text in a news article that discusses the entities George Bush, Colin Powell, and Donald Rumsfeld. The article contains multiple mentions of Colin Powell by different strings--\"Secretary of State Colin Powell,\" \"he,\" \"Mr. Powell,\" \"the Secretary\"--and also refers to the other two entities with sometimes overlapping strings. The coreference task is to use the content and context of all the mentions to determine how many entities are in the article, and which mention corresponds to which entity.

This task is most frequently solved by examining individual pair-wise distance measures between mentions independently of each other. For example, database record-linkage and citation reference matching have been performed by learning a pairwise distance metric between records, and setting a distance threshold below which records are merged (Monge & Elkan, 1997; McCallum et al., 2000; Bilenko & Mooney, 2002; Cohen & Richman, 2002). Coreference in NLP has also been performed with distance thresholds or pairwise classifiers (McCarthy & Lehnert, 1995; Ge et al., 1998; Soon et al., 2001; Ng & Cardie, 2002).

But these distance measures are inherently noisy, and the answer to one pair-wise coreference decision may not be independent of another. For example, if we measure the distance between all of the three possible pairs among three mentions, two of the distances may be below the threshold but one above--an inconsistency due to noise and imperfect measurement. For example, \"Mr.
Powell\" may be correctly coresolved with \"Powell,\" but par-\nticular grammatical circumstances may make the model incorrectly believe that \"Powell\"\nis coreferent with a nearby occurrence of \"she.\" Inconsistencies might be better resolved\nif the coreference decisions are made in dependent relation to each other, and in a way\nthat accounts for the values of the multiple distances, instead of a threshold on single pairs\nindependently.\n\nRecently Pasula et al. (2003) have proposed a formal, relational approach to the problem\nof identity uncertainty using a type of Bayesian network called a Relational Probabilistic\nModel (Friedman et al., 1999). A great strength of this model is that it explicitly captures\nthe dependence among multiple coreference decisions.\n\nHowever, it is a generative model of the entities, mentions and all their features, and thus\nhas difficulty using many features that are highly overlapping, non-independent, at varying\nlevels of granularity, and with long-range dependencies. For example, we might wish to\nuse features that capture the phrases, words and character n-grams in the mentions, the\nappearance of keywords anywhere in the document, the parse-tree of the current, preceding\nand following sentences, as well as 2-d layout information. To produce accurate generative\nprobability distributions, the dependencies between these features should be captured in the\nmodel; but doing so can lead to extremely complex models in which parameter estimation\nis nearly impossible.\n\nSimilar issues arise in sequence modeling problems. In this area significant recent suc-\ncess has been achieved by replacing a generative model--hidden Markov models--with a\nconditional model--conditional random fields (CRFs) (Lafferty et al., 2001). 
CRFs have reduced part-of-speech tagging errors by 50% on out-of-vocabulary words in comparison with HMMs (Ibid.), matched champion noun phrase segmentation results (Sha & Pereira, 2003), and significantly improved extraction of named entities (McCallum & Li, 2003), citation data (Peng & McCallum, 2004), and the segmentation of tables in government reports (Pinto et al., 2003). Relational Markov networks (Taskar et al., 2002) are similar models, and have been shown to significantly improve classification of Web pages.

This paper introduces three conditional undirected graphical models for identity uncertainty. The models condition on the mentions, and generate the coreference decisions (and in some cases also generate attributes of the entities). In the first, most general model, the dependency structure is unrestricted, and the number of underlying entities explicitly appears in the model structure. The second and third models have no structural dependence on the number of entities, and fall into a class of Markov random fields in which inference corresponds to graph partitioning (Boykov et al., 1999).

After introducing the first two models as background generalizations, we show experimental results using the third, most specific model on a noun coreference problem in two different standard newswire text domains: broadcast news stories from the DARPA Automatic Content Extraction (ACE) program, and newswire articles from the MUC-6 corpus. In both domains we take advantage of the ability to use arbitrary, overlapping features of the input, including multiple grammatical features, string equality, substring, and acronym matches.
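As a concrete illustration of such overlapping features, the sketch below (our illustration, not the paper's implementation; all function names are hypothetical) computes a handful of binary feature functions over a mention pair:

```python
# Hypothetical sketch of overlapping, non-independent binary features over a
# mention pair, in the spirit of the string-equality, substring, and acronym
# features mentioned above. Normalization choices are illustrative only.

def normalize(mention: str) -> str:
    """Lowercase and strip a mention string for comparison."""
    return mention.lower().strip()

def acronym_of(phrase: str) -> str:
    """Build a naive acronym from the capitalized words of a phrase."""
    return "".join(w[0] for w in phrase.split() if w[:1].isupper())

def mention_pair_features(xi: str, xj: str) -> dict:
    """Return a dictionary of binary features f_l(x_i, x_j)."""
    a, b = normalize(xi), normalize(xj)
    return {
        "string_equal": a == b,
        "substring": (a in b or b in a) and a != b,
        "acronym_match": acronym_of(xi) == xj or acronym_of(xj) == xi,
        "head_word_equal": a.split()[-1] == b.split()[-1],
    }
```

Note how the features overlap (a string match implies a head-word match); a conditional model can use all of them without modeling their dependencies.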
Using the same features, in comparison with an alternative natural language pro-\ncessing technique, we reduce error by 33% and 28% in the two domains on proper nouns\nand by 10% on all nouns in the MUC-6 data.\n\n\f\n2 Three Conditional Models of Identity Uncertainty\n\n\nWe now describe three possible configurations for conditional models of identity uncer-\ntainty, each progressively simpler and more specific than its predecessor. All three are\nbased on conditionally-trained, undirected graphical models.\n\nUndirected graphical models, also known as Markov networks or Markov random fields,\nare a type of probabilistic model that excels at capturing interdependent data in which\ncausality among attributes is not apparent. We begin by introducing notation for mentions,\nentities and attributes of entities, then in the following subsections describe the likelihood,\ninference and estimation procedures for the specific undirected graphical models.\n\nLet E = (E1, ...Em) be a collection of classes or \"entities\". Let X = (X1, ...Xn) be a\ncollection of random variables over observations or \"mentions\"; and let Y = (Y1, ...Yn) be\na collection of random variables over integer identifiers, unique to each entity, specifying to\nwhich entity a mention refers. Thus the y's are integers ranging from 1 to m, and if Yi = Yj,\nthen mention Xi is said to refer to the same underlying entity as Xj. For example, some\nparticular entity e4, U.S. Secretary of State, Colin L. Powell, may be mentioned multiple\ntimes in a news article that also contains mentions of other entities: x6 may be \"Colin\nPowell\"; x9 may be \"he\"; x17 may be \"the Secretary of State.\" In this case, the unique\ninteger identifier for this entity, e4, is 4, and y6 = y9 = y17 = 4.\n\nFurthermore, entities may have attributes. Let A be a random variable over the collection of\nall attributes for all entities. 
Borrowing the notation of Relational Markov Networks (Taskar\net al., 2002), we write the random variable over the attributes of entity Es as Es.A =\n{Es.A1, Es.A2, Es.A3, ...}. For example, these three attributes may be gender, birth year,\nand surname. Continuing the above example, then e4.a1 = MALE, e4.a2 = 1937, and e4.a3\n= Powell. One can interpret the attributes as the values that should appear in the fields of\na database record for the given entity. Attributes such as surname may take on one of the\nfinite number of values that appear in the mentions of the data set.\n\nWe may examine many features of the mentions, x, but since a conditional model doesn't\ngenerate them, we don't need random variable notation for them. Separate measured fea-\ntures of the mentions and entity-assignments, y, are captured in different feature functions,\nf (), over cliques in the graphical model. Although the functions may be real-valued, typ-\nically they are binary. The parameters of the model are associated with these different\nfeature functions. Details and example feature functions and parameterizations are given\nfor the three specific models below.\n\nThe task is then to find the most likely collection of entity-assignments, y, (and optionally\nalso the most likely entity attributes, a), given a collection of mentions and their con-\ntext, x. A generative probabilistic model of identity uncertainty is trained to maximize\nP (Y, A, X). A conditional probabilistic model of identity uncertainty is instead trained to\nmaximize P (Y, A|X), or simply P (Y|X).\n\n\n2.1 Model 1: Groups of nodes for entities\n\nFirst we consider an extremely general undirected graphical model in which there is a node\nfor the mentions, x,1 a node for the entity-assignment of each mention, y, and a node for\neach of the attributes of each entity, e.a. 
These nodes are connected by edges in some unspecified structure, where an edge indicates that the values of the two connected random variables are dependent on each other.

1 Even though there are many mentions in x, because we are not generating them, we can represent them as a single node. This helps show that feature functions can ask arbitrary questions about various large and small subsets of the mentions and their context. We will still use xi to refer to the content and context of the ith mention.

The parameters of the model are defined over cliques in this graph. Typically the parameters on many different cliques would be tied in patterns that reflect the nature of the repeated relational structure in the data. Patterns of tied parameters are common in many graphical models, including HMMs and other finite state machines (Lafferty et al., 2001), where they are tied across different positions in the input sequence, and by more complex patterns based on SQL-like queries, as in relational Markov networks (Taskar et al., 2002). Following the nomenclature of the latter, these parameter-tying patterns are called clique templates; each particular instance of a template in the graph we call a hit.

For example, one clique template may specify a pattern consisting of two mentions, their entity-assignment nodes, and an entity's surname attribute node. The hits would consist of all possible combinations of such nodes. Multiple feature functions could then be run over each hit. One feature function might have value 1 if, for example, both mentions were assigned to the same entity as the surname node, and if the surname value appears as a substring in both mention strings (and value 0 otherwise).

The Hammersley-Clifford theorem stipulates that the probability of a particular set of values on the random variables in an undirected graphical model is a product of potential functions over cliques of the graph.
Our cliques will be the hits, h_t in H_t, resulting from a set of clique templates, t in T. In typical fashion, we will write the probability distribution in exponential form, with each potential function calculated as a dot-product of feature functions, f, and learned parameters, \lambda:

\[
P(y, a \mid x) = \frac{1}{Z_x} \exp\left( \sum_{t \in T} \sum_{h_t \in H_t} \sum_l \lambda_l f_l(y, a, x : h_t) \right),
\]

where (y, a, x : h_t) indicates the subset of the entity-assignment, attribute, and mention nodes selected by the clique template hit h_t; and Z_x is a normalizer that makes the probabilities over all y sum to one (also known as the partition function).

The parameters, \lambda, can be learned by maximum likelihood from labeled training data. Calculating the partition function is problematic because there are a very large number of possible y's and a's. Loopy belief propagation and Gibbs sampling have been used successfully in other similar situations, e.g. (Taskar et al., 2002).

However, note that both loopy belief propagation and Gibbs sampling only work over a graph with fixed structure. But in our problem the number of entities (and thus the number of attribute nodes, and the domain of the entity-assignment nodes) is unknown.
Inference in these models must determine for us the highest-probability number of entities.

In related work on a generative probabilistic model of identity uncertainty, Pasula et al. (2003) solve this problem by alternating rounds of Metropolis-Hastings sampling on a given model structure with rounds of Metropolis-Hastings sampling that explore the space of new graph structures.

2.2 Model 2: Nodes for mention pairs, with attributes on mentions

To avoid the need to change the graphical model structure during inference, we now remove any parts of the graph that depend on the number of entities, m: (1) The per-mention entity-assignment nodes, Yi, are random variables whose domain is over the integers 0 through m; we remove these nodes, replacing them with binary-valued random variables, Yij, over each pair of mentions, (Xi, Xj) (indicating whether or not the two mentions are coreferent); although it is not strictly necessary, we also restrict the clique templates to operate over no more than two mentions (for efficiency). (2) The per-entity attribute nodes A are removed and replaced with attribute nodes associated with each mention; we write xi.a for the set of attributes on mention xi.

Even though the clique templates are now restricted to pairs of mentions, this does not imply that pairwise coreference decisions are made independently of each other--they are still highly dependent. Many pairs will overlap with each other, and constraints will flow through these overlaps. This point is reiterated with an example in the next subsection.

Notice, however, that it is possible for the model as thus far described to assign non-zero probability to an inconsistent set of entity-assignments, y.
For example, we may have an \"inconsistent triangle\" of coreference decisions in which y_ij and y_jk are 1, while y_ik is 0. We can enforce the impossibility of all inconsistent configurations by adding inconsistency-checking functions f'(y_ij, y_jk, y_ik) for all mention triples, with the corresponding \lambda' fixed at negative infinity--thus assigning zero probability to them. (Note that this is simply a notational trick; in practice the inference implementation simply avoids any configurations of y that are inconsistent--a check that is simple to perform.) Thus we have

\[
P(y, a \mid x) = \frac{1}{Z_x} \exp\left( \sum_{i,j,l} \lambda_l f_l(x_i, x_j, y_{ij}, x_i.a, x_j.a) + \sum_{i,j,k} \lambda' f'(y_{ij}, y_{jk}, y_{ik}) \right).
\]

We can also enforce consistency among the attributes of coreferent mentions by similar means. There are many widely-used techniques for efficiently and drastically reducing the number of pair-wise comparisons, e.g. (Monge & Elkan, 1997; McCallum et al., 2000). In this case, we could also restrict f_l(x_i, x_j, y_ij) = 0 for all y_ij = 0.

2.3 Model 3: Nodes for mention pairs, graph partitioning with learned distance

When gathering attributes of entities is not necessary, we can avoid the extra complication of attributes by removing them from the model. What results is a straightforward, yet highly expressive, discriminatively-trained, undirected graphical model that can use rich feature sets and relational inference to solve identity uncertainty tasks. Determining the most likely number of entities falls naturally out of inference. The model is

\[
P(y \mid x) = \frac{1}{Z_x} \exp\left( \sum_{i,j,l} \lambda_l f_l(x_i, x_j, y_{ij}) + \sum_{i,j,k} \lambda' f'(y_{ij}, y_{jk}, y_{ik}) \right). \qquad (1)
\]

Recently there has been increasing interest in the study of the equivalence between graph partitioning algorithms and inference in certain kinds of undirected graphical models, e.g. (Boykov et al., 1999). This graphical model is an example of such a case.
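To make Equation 1 concrete, the following minimal sketch (our illustration, with toy feature functions and weights) scores candidate partitionings. Representing a partitioning as an entity id per mention enforces transitivity by construction, so the lambda' penalty terms never fire:

```python
# Illustrative sketch of the unnormalized log-score in Equation 1. A candidate
# partitioning is an entity id per mention, so y_ij = 1 exactly when the ids
# match, and inconsistent triangles cannot arise. Feature functions and weights
# are toy stand-ins, not the paper's learned model.
from itertools import combinations

def log_score(mentions, entity_ids, feature_fn, weights):
    """Sum over mention pairs: sum_l lambda_l * f_l(x_i, x_j, y_ij)."""
    total = 0.0
    for i, j in combinations(range(len(mentions)), 2):
        y_ij = 1 if entity_ids[i] == entity_ids[j] else 0
        feats = feature_fn(mentions[i], mentions[j], y_ij)
        total += sum(weights.get(name, 0.0) * value for name, value in feats.items())
    return total

def best_partitioning(mentions, feature_fn, weights, candidates):
    """argmax_y of the log-score over a small explicit candidate set."""
    return max(candidates, key=lambda ids: log_score(mentions, ids, feature_fn, weights))
```

For instance, with a weight of +2 on coreferent exact-string matches and -1 on coreferent mismatches, the partitioning that groups the two "Powell" mentions and separates "she" scores highest among a few hand-written candidates.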
With some thought, one can straightforwardly see that finding the highest probability coreference solution, y* = arg max_y P(y|x), exactly corresponds to finding the graph partitioning of a (different) graph in which the mentions are the nodes and the edge weight on the pair of nodes x_i, x_j involved in an edge is the (log) clique potential \sum_l \lambda_l f_l(x_i, x_j, y_{ij}), where f_l(x_i, x_j, 1) = -f_l(x_i, x_j, 0), and edge weights range from negative to positive infinity. Unlike classic mincut/maxflow binary partitioning, here the number of partitions (corresponding to entities) is unknown, but a single optimal number of partitions exists; negative edge weights encourage more partitions.

Graph partitioning with negative edge weights is NP-hard, but it has a history of good approximations, and several efficient algorithms to choose from. Our current experiments use an instantiation of the minimizing-disagreements correlation clustering algorithm of Bansal et al. (2002). This approach is a simple yet effective partitioning scheme. It works by measuring the degree of inconsistency incurred by including a node in a partition, and making repairs. We refer the reader to Bansal et al. (2002) for further details.

The resulting solution does not make pairwise coreference decisions independently of each other. It has a significant \"relational\" nature because the assignment of a node to a partition (or, mention to an entity) depends not just on a single low distance measurement to one other node, but on its low distance measurement to all nodes in the partition (and furthermore on its high distance measurement to all nodes of all other partitions). For example, the \"Mr. Powell\"/\"Powell\"/\"she\" problem discussed in the introduction would be prevented by this model because, although the distance between \"Powell\" and \"she\" might grammatically look low, the distance from \"she\" to another member of the same partition (\"Mr.
Powell\") is very high.\n\nInterestingly, in our model, the distance measure between nodes is learned from labeled\ntraining data. That is, we use data, D, in which the correct coreference partitions are\nknown in order to learn a distance metric such that, when the same data is clustered, the\ncorrect partitions emerge. This is accomplished by maximum likelihood--adjusting the\nweights, , to maximize the product of Equation 1 over all instances x, y in the training\nset. Fortunately this objective function is concave--it has a single global maximum--\nand there are several applicable optimization methods to choose from, including gradient\nascent, stochastic gradient ascent and conjugate gradient; all simply require the derivative\nof the objective function. The derivative of the log-likelihood, L, is\n\n \n L = fl(xi, xj, yij) - P(y |x) fl(xi, xj, y ) ,\n ij \n l x,y D i,j,l y i,j,l\n\n\nwhere P(y |x) is defined by Equation 1, using the current set of parameters, , and\n is a sum over all possible partitionings.\n y\n\nThe number of possible partitionings is exponential in the number of mentions, so for\nany reasonably-sized problem, we obviously must resort to approximate inference for the\nsecond expectation. A simple option is stochastic gradient ascent in the form of a voted\nperceptron (Collins, 2002). Here we calculate the gradient for a single training instance at a\ntime, and rather than use a full expectation in the second line, simply using the single most\nlikely (or nearly most likely) partitioning as found by a graph partitioning algorithm, and\nmake progressively smaller steps in the direction of these gradients while cycling through\nthe instances, x, y in the training data. Neither the full sum, , or the partition func-\n y\ntion, Zx, need to be calculated in this case. 
Further details are given in (Collins, 2002).

3 Experiments with Noun Coreference

We present experimental results on natural language noun phrase coreference using Model 3 applied to two data sets: the DARPA MUC-6 corpus, and a set of 117 stories from the broadcast news portion of the DARPA ACE data set. Both data sets have annotated coreferences. We pre-process both data sets with the Brill part-of-speech tagger.

We compare our Model 3 against two other techniques representing typical approaches to the problem of identity uncertainty. The first is single-link clustering with a threshold (single-link-threshold), which is widely used in database record-linkage and citation reference matching (Monge & Elkan, 1997; Bilenko & Mooney, 2002; McCallum et al., 2000; Cohen & Richman, 2002). It forms partitions by simply collapsing the spanning trees of all mentions with pairwise distances below some threshold. For each experiment, the threshold was selected by cross validation.

The second technique, which we call best-previous-match, has been used in natural language processing applications (Morton, 1997; Ge et al., 1998; Ng & Cardie, 2002). It works by scanning linearly through a document, and associating each mention with its best-matching predecessor--best as measured with a single pairwise distance.

In our experiments, both single-link-threshold and best-previous-match implementations use a distance measure based on a binary maximum entropy classifier--matching the practice of Morton (1997) and Cohen and Richman (2002).

We use an identical feature set for all techniques, including our Model 3. The features, typical of those used in many other NLP coreference systems, are modeled after those in Ng and Cardie (2002).
They include tests for string and substring matches, acronym matches, parse-derived head-word matches, gender, WORDNET subsumption, sentence distance, distance in the parse tree, etc., and are detailed in an accompanying technical report. They are quite non-independent, and operate at multiple levels of granularity.

Table 1 shows standard MUC-style F1 scores for three experiments.

                            ACE (Proper)   MUC-6 (Proper)   MUC-6 (All)
    best-previous-match        90.98           88.83           70.41
    single-link-threshold      91.65           88.90           60.83
    Model 3                    93.96           91.59           73.42

                  Table 1: F1 results on three data sets.

In the first two experiments, we consider only proper nouns, and perform five-fold cross validation. In the third experiment, we perform the standard MUC evaluation, including all nouns--pronouns, common and proper--and use the standard 30/30 document train/test split; furthermore, as in Harabagiu et al. (2001), we consider only mentions that have a coreferent. Model 3 out-performs both the single-link-threshold and the best-previous-match techniques, reducing error by 28% over single-link-threshold on the ACE proper noun data, by 24% on the MUC-6 proper noun data, and by 10% over the best-previous-match technique on the full MUC-6 task. All differences from Model 3 are statistically significant. Historically, these data sets have been heavily studied, and even small gains have been celebrated.

Our overall results on MUC-6 are slightly better (with unknown statistical significance) than the best published results of which we are aware with a matching experimental design, Harabagiu et al. (2001), who reach 72.3% using the same training and test data.

4 Related Work and Conclusions

There has been much related work on identity uncertainty in various specific fields.
Traditional work in de-duplication for databases or reference-matching for citations measures the distance between two records by some metric, and then collapses all records at a distance below a threshold, e.g. (Monge & Elkan, 1997; McCallum et al., 2000). This method is not relational: it does not account for the inter-dependent relations among multiple decisions to collapse. Most recent work in the area has focused on learning the distance metric (Bilenko & Mooney, 2002; Cohen & Richman, 2002), not the clustering method.

Natural language processing has shown a similar emphasis and a similar lack of emphasis, respectively. Pairwise coreference distance measures have been learned with decision trees (McCarthy & Lehnert, 1995; Ng & Cardie, 2002), SVMs (Zelenko et al., 2003), maximum entropy classifiers (Morton, 1997), and generative probabilistic models (Ge et al., 1998). But all use thresholds on a single pairwise distance, or the maximum of a single pairwise distance, to determine if or where a coreferent merge should occur.

Pasula et al. (2003) introduce a generative probability model for identity uncertainty based on Probabilistic Relational Models. Our work is an attempt to gain some of the same advantages that CRFs have over HMMs by creating conditional models of identity uncertainty. The models presented here, as instances of conditionally-trained undirected graphical models, are also instances of relational Markov networks (Taskar et al., 2002) and conditional random fields (Lafferty et al., 2001). Taskar et al. (2002) briefly discuss clustering of dyadic data, such as people and their movie preferences, but not identity uncertainty or inference by graph partitioning.

Identity uncertainty is a significant problem in many fields.
In natural language processing, it is not only especially difficult, but also extremely important, since improved coreference resolution is one of the chief barriers to effective data mining of text data. Natural language data is a domain that has particularly benefited from rich and overlapping feature representations--representations that lend themselves better to conditional probability models than generative ones (Lafferty et al., 2001; Collins, 2002; Morton, 1997). Hence our interest in conditional models of identity uncertainty.

Acknowledgments
We thank Andrew Ng, Jon Kleinberg, David Karger, Avrim Blum and Fernando Pereira for helpful and insightful discussions. This work was supported in part by the Center for Intelligent Information Retrieval, in part by SPAWARSYSCEN-SD grant numbers N66001-99-1-8912 and N66001-02-1-8903, in part by DARPA under contract number F30602-01-2-0566, in part by the National Science Foundation under NSF grant #IIS-0326249, and in part by the Defense Advanced Research Projects Agency (DARPA), through the Department of the Interior, NBC, Acquisition Services Division, under contract number NBCHD030010.

References

Bansal, N., Chawla, S., & Blum, A. (2002). Correlation clustering. The 43rd Annual Symposium on Foundations of Computer Science (FOCS) (pp. 238-247).
Bilenko, M., & Mooney, R. J. (2002). Learning to combine trained distance metrics for duplicate detection in databases (Technical Report AI 02-296). Artificial Intelligence Laboratory, University of Texas at Austin, Austin, TX.
Boykov, Y., Veksler, O., & Zabih, R. (1999). Fast approximate energy minimization via graph cuts. ICCV (1) (pp. 377-384).
Cohen, W., & Richman, J. (2002). Learning to match and cluster entity names. Proceedings of KDD-2002, 8th International Conference on Knowledge Discovery and Data Mining.
Collins, M. (2002).
Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. Proceedings of EMNLP.
Friedman, N., Getoor, L., Koller, D., & Pfeffer, A. (1999). Learning probabilistic relational models. IJCAI (pp. 1300-1309).
Ge, N., Hale, J., & Charniak, E. (1998). A statistical approach to anaphora resolution. Proceedings of the Sixth Workshop on Very Large Corpora (pp. 161-171).
Harabagiu, S., Bunescu, R., & Maiorano, S. (2001). Text and knowledge mining for coreference resolution. Proceedings of the 2nd Meeting of the North American Chapter of the Association of Computational Linguistics (NAACL-2001) (pp. 55-62).
Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proc. ICML (pp. 282-289).
McCallum, A., & Li, W. (2003). Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. Seventh Conference on Natural Language Learning (CoNLL).
McCallum, A., Nigam, K., & Ungar, L. H. (2000). Efficient clustering of high-dimensional data sets with application to reference matching. Knowledge Discovery and Data Mining (pp. 169-178).
McCarthy, J. F., & Lehnert, W. G. (1995). Using decision trees for coreference resolution. IJCAI (pp. 1050-1055).
Monge, A. E., & Elkan, C. (1997). An efficient domain-independent algorithm for detecting approximately duplicate database records. Research Issues on Data Mining and Knowledge Discovery.
Morton, T. (1997). Coreference for NLP applications. Proceedings ACL.
Ng, V., & Cardie, C. (2002). Improving machine learning approaches to coreference resolution. Fortieth Anniversary Meeting of the Association for Computational Linguistics (ACL-02).
Pasula, H., Marthi, B., Milch, B., Russell, S., & Shpitser, I. (2003). Identity uncertainty and citation matching. Advances in Neural Information Processing (NIPS).
Peng, F., & McCallum, A. (2004).
Accurate information extraction from research papers using conditional random fields. Proceedings of Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (HLT-NAACL).
Pinto, D., McCallum, A., Lee, X., & Croft, W. B. (2003). Table extraction using conditional random fields. Proceedings of the 26th ACM SIGIR.
Sha, F., & Pereira, F. (2003). Shallow parsing with conditional random fields (Technical Report CIS TR MS-CIS-02-35). University of Pennsylvania.
Soon, W. M., Ng, H. T., & Lim, D. C. Y. (2001). A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27, 521-544.
Taskar, B., Abbeel, P., & Koller, D. (2002). Discriminative probabilistic models for relational data. Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI02).
Zelenko, D., Aone, C., & Richardella, A. (2003). Kernel methods for relation extraction. Journal of Machine Learning Research (submitted).
", "award": [], "sourceid": 2557, "authors": [{"given_name": "Andrew", "family_name": "McCallum", "institution": null}, {"given_name": "Ben", "family_name": "Wellner", "institution": null}]}