{"title": "Inverting Grice's Maxims to Learn Rules from Natural Language Extractions", "book": "Advances in Neural Information Processing Systems", "page_first": 1053, "page_last": 1061, "abstract": "We consider the problem of learning rules from natural language text sources. These sources, such as news articles and web texts, are created by a writer to communicate information to a reader, where the writer and reader share substantial domain knowledge. Consequently, the texts tend to be concise and mention the minimum information necessary for the reader to draw the correct conclusions. We study the problem of learning domain knowledge from such concise texts, which is an instance of the general problem of learning in the presence of missing data. However, unlike standard approaches to missing data, in this setting we know that facts are more likely to be missing from the text in cases where the reader can infer them from the facts that are mentioned combined with the domain knowledge. Hence, we can explicitly model this \"missingness\" process and invert it via probabilistic inference to learn the underlying domain knowledge. This paper introduces a mention model that models the probability of facts being mentioned in the text based on what other facts have already been mentioned and domain knowledge in the form of Horn clause rules. Learning must simultaneously search the space of rules and learn the parameters of the mention model. We accomplish this via an application of Expectation Maximization within a Markov Logic framework. An experimental evaluation on synthetic and natural text data shows that the method can learn accurate rules and apply them to new texts to make correct inferences. Experiments also show that the method out-performs the standard EM approach that assumes mentions are missing at random.", "full_text": "Inverting Grice\u2019s Maxims to Learn Rules from\n\nNatural Language Extractions\n\nMohammad Shahed Sorower, Thomas G. 
Dietterich, Janardhan Rao Doppa, Walker Orr, Prasad Tadepalli, and Xiaoli Fern\n\nSchool of Electrical Engineering and Computer Science\n\nOregon State University\n\nCorvallis, OR 97331\n\n{sorower,tgd,doppa,orr,tadepall,xfern}@eecs.oregonstate.edu\n\nAbstract\n\nWe consider the problem of learning rules from natural language text sources. These sources, such as news articles and web texts, are created by a writer to communicate information to a reader, where the writer and reader share substantial domain knowledge. Consequently, the texts tend to be concise and mention the minimum information necessary for the reader to draw the correct conclusions. We study the problem of learning domain knowledge from such concise texts, which is an instance of the general problem of learning in the presence of missing data. However, unlike standard approaches to missing data, in this setting we know that facts are more likely to be missing from the text in cases where the reader can infer them from the facts that are mentioned combined with the domain knowledge. Hence, we can explicitly model this \u201cmissingness\u201d process and invert it via probabilistic inference to learn the underlying domain knowledge. This paper introduces a mention model that models the probability of facts being mentioned in the text based on what other facts have already been mentioned and domain knowledge in the form of Horn clause rules. Learning must simultaneously search the space of rules and learn the parameters of the mention model. We accomplish this via an application of Expectation Maximization within a Markov Logic framework. An experimental evaluation on synthetic and natural text data shows that the method can learn accurate rules and apply them to new texts to make correct inferences. 
Experiments also show that the method outperforms the standard EM approach that assumes mentions are missing at random.\n\n1 Introduction\n\nThe immense volume of textual information available on the web provides an important opportunity and challenge for AI: Can we develop methods that learn domain knowledge by reading natural texts such as news articles and web pages? We would like to acquire at least two kinds of domain knowledge: concrete facts and general rules. Concrete facts can be extracted as logical relations or as tuples to populate a database. Systems such as Whirl [3], TextRunner [5], and NELL [1] learn extraction patterns that can be applied to text to extract instances of relations.\n\nGeneral rules can be acquired in two ways. First, they may be stated explicitly in the text\u2014particularly in tutorial texts. Second, they can be acquired by generalizing from the extracted concrete facts. In this paper, we focus on the latter setting: given a database of literals extracted from natural language texts (e.g., newspaper articles), we seek to learn a set of probabilistic Horn clauses that capture general rules.\n\nUnfortunately for rule learning algorithms, natural language texts are incomplete. The writer tends to mention only enough information to allow the reader to easily infer the remaining facts from shared background knowledge. This aspect of economy in language was first pointed out by Grice [7] in his maxims of cooperative conversation (see Table 1).\n\nTable 1: Grice\u2019s Conversational Maxims\n\n1 Be truthful\u2014do not say falsehoods.\n2 Be concise\u2014say as much as necessary, but no more.\n3 Be relevant.\n4 Be clear.\n\n
For example, consider the following sentence that discusses a National Football League (NFL) game: \u201cGiven the commanding lead of Kansas City on the road, Denver Broncos\u2019 14-10 victory surprised many.\u201d This mentions that Kansas City is the away team and that the Denver Broncos won the game, but does not mention that Kansas City lost the game or that the Denver Broncos were the home team. Of course these facts can be inferred from domain knowledge rules such as the rule that \u201cif one team is the winner, the other is the loser (and vice versa)\u201d and the rule \u201cif one team is the home team, the other is the away team (and vice versa)\u201d. This is an instance of the second maxim.\n\nAnother interesting case arises when shared knowledge could lead the reader to an incorrect inference: \u201cAhmed Said Khadr, an Egyptian-born Canadian, was killed last October in Pakistan.\u201d This explicitly mentions that Khadr is Canadian, because otherwise the reader would infer that he was Egyptian based on the domain knowledge rule \u201cif a person is born in a country, then the person is a citizen of that country\u201d. Grice did not discuss this case, but we can state it as a corollary of the first maxim: Do not by omission mislead the reader into believing falsehoods.\n\nThis paper formalizes the first two maxims, including this corollary, and then shows how to apply them to learn probabilistic Horn clause rules from propositions extracted from news stories. We show that rules learned this way are able to correctly infer more information from incomplete texts than a baseline approach that treats propositions in news stories as missing at random.\n\nThe problem of learning rules from extracted texts has been studied previously [11, 2, 17]. 
These systems rely on finding documents in which all of the facts participating in a rule are mentioned. If enough such documents can be found, then standard rule learning algorithms can be applied. A drawback of this approach is that it is difficult to learn rules unless there are many documents that provide such complete training examples. The central hypothesis of our work is that by explicitly modeling the process by which facts are mentioned, we can learn rules from sets of documents that are smaller and less complete.\n\nThe line of work most similar to this paper is that of Michael and Valiant [10, 9] and Doppa et al. [4]. They study learning hard (non-probabilistic) rules from incomplete extractions. In contrast with our approach of learning explicit probabilistic models, they take the simpler approach of implicitly inverting the conversational maxims when counting evidence for a proposed rule. Specifically, they count an example as consistent with a proposed rule unless it explicitly contradicts the rule. Although this approach is much less expensive than the probabilistic approach described in this paper, it has difficulty with soft (probabilistic) rules. To handle these, the authors sort the rules by their scores and keep high-scoring rules even if they have some contradictions. Such an approach can learn \u201calmost hard\u201d rules, but will have difficulty with rules that are highly probabilistic (e.g., that the home team is somewhat more likely to win a game than the away team).\n\nOur method has additional advantages. First, it provides a more general framework that can support alternative sets of conversational maxims, such as mentions based on saliency, recency (prefer to mention a more recent event rather than an older event), and surprise (prefer to mention a less likely event rather than a more likely event). 
Second, when applied to new articles, it assigns probabilities to alternative interpretations, which is important for subsequent processing. Third, it provides an elegant, first-principles account of the process, which can then be compiled to yield more efficient learning and reasoning procedures.\n\n2 Technical Approach\n\nWe begin with a logical formalization of the Gricean maxims. Then we present our implementation of these maxims in Markov Logic [15]. Finally, we describe a method for probabilistically inverting the maxims to learn rules from textual mentions.\n\nFormalizing the Gricean maxims. Consider a writer and a reader who share domain knowledge K. Suppose that when told a fact F, the reader will infer an additional fact G. We will write this as (K, MENTION(F) \u22a2reader G), where \u22a2reader represents the inference procedure of the reader and MENTION is a modal operator that captures the action of mentioning a fact in the text. Note that the reader\u2019s inference procedure is not standard first-order deduction, but instead is likely to be incomplete and non-monotonic or probabilistic.\n\nWith this notation, we can formalize the first two Gricean maxims as follows:\n\n\u2022 Mention true facts/don\u2019t lie:\n\nF \u21d2 MENTION(F) (1)\nMENTION(F) \u21d2 F (2)\n\nThe first formula is overly strong, because it requires the writer to mention all true facts. Below, we will show how to use Markov Logic weights to weaken this. The second formula captures a positive version of \u201cdon\u2019t lie\u201d\u2014if something is mentioned, then it is true. 
For news articles, it does not need to be weakened probabilistically.\n\n\u2022 Don\u2019t mention facts that can be inferred by the reader:\n\nMENTION(F) \u2227 G \u2227 (K, MENTION(F) \u22a2reader G) \u21d2 \u00acMENTION(G)\n\n\u2022 Mention facts needed to prevent incorrect reader inferences:\n\nMENTION(F) \u2227 \u00acG \u2227 (K, MENTION(F) \u22a2reader G) \u2227 H \u2227 (K, MENTION(F \u2227 H) \u22acreader G) \u21d2 MENTION(H)\n\nIn this formula H is a true fact that, when combined with F, is sufficient to prevent the reader from inferring G.\n\nImplementation in Markov Logic. Although this formalization is very general, it is difficult to apply directly because of the embedded invocation of the reader\u2019s inference procedure and the use of the MENTION modality. Consequently, we sidestep this problem by manually \u201ccompiling\u201d the maxims into ordinary first-order Markov Logic as follows. The notation w : indicates that a rule has a weight w in Markov Logic.\n\nThe first maxim is encoded in terms of fact-to-mention and mention-to-fact rules. For each predicate P in the domain of discourse, we write\n\nw1 : FACT P \u21d2 MENTION P\nw2 : MENTION P \u21d2 FACT P.\n\nSuppose that the shared knowledge K contains the Horn clause rule P \u21d2 Q; then we encode the positive form of the second maxim in terms of the mention-to-mention rule:\n\nw3 : MENTION P \u2227 FACT Q \u21d2 \u00acMENTION Q\n\nOne might expect that we could encode the faulty-inference-by-omission corollary as\n\nw4 : MENTION P \u2227 \u00acFACT Q \u21d2 MENTION NOTQ,\n\nwhere we have chosen MENTION NOTQ to play the role of H in axiom 2. However, in news stories, there is a strong preference for H to be a positive assertion, rather than a negative assertion. For example, in the citizenship case, it would be unnatural to say \u201cAhmed Said Khadr, an Egyptian-born non-Egyptian...\u201d. 
In particular, because CITIZENOF(p, c) is generally a function from p to c (i.e., a person is typically a citizen of only one country), it suffices to mention CITIZENOF(Khadr, Canada) to prevent the faulty inference CITIZENOF(Khadr, Egypt). Hence, for rules of the form P(x, y) \u21d2 Q(x, y), where Q is a function from its first to its second argument, we can implement the inference-by-omission maxim as\n\nw5 : MENTION P(x, y) \u2227 FACT Q(x, z) \u2227 (y \u2260 z) \u21d2 MENTION Q(x, z).\n\nFinally, the shared knowledge P \u21d2 Q is represented by the fact-to-fact rule:\n\nw6 : FACT P \u21d2 FACT Q\n\nIn Markov Logic, each of these rules is assigned a (learned) weight which can be viewed as a cost of violating the rule. The probability of a world \u03c9 is proportional to\n\nexp(\u2211j wj I[Rule j is satisfied by \u03c9]),\n\nwhere j iterates over all groundings of the Markov Logic rules in world \u03c9 and I[\u03c6] is 1 if \u03c6 is true and 0 otherwise.\n\nAn advantage of Markov Logic is that it allows us to define a probabilistic model even when there are contradictions and cycles in the logical rules. Hence, we can include both a rule that says \u201cif the home team is mentioned, then the away team is not mentioned\u201d and rules that say \u201cthe home team is always mentioned\u201d and \u201cthe away team is always mentioned\u201d. Obviously a possible world \u03c9 cannot satisfy all of these rules. The relative weights on the rules determine the probability that particular literals are actually mentioned.\n\nLearning. We seek to learn both the rules and their weights. We proceed by first proposing candidate fact-to-fact rules and then automatically generating the other rules (especially the mention-to-mention rules) from the general rule schemata described above. Then we apply EM to learn the weights on all of the rules. 
This has the effect of removing unnecessary rules by driving their weights to zero.\n\nProposing Candidate Fact-to-Fact Rules. For each predicate symbol and its specified arity, we generate a set of candidate Horn clauses with that predicate as the head (consequent). For the rule body (antecedent), we consider all conjunctions of literals involving other predicates (i.e., we do not allow recursive rules) up to a fixed maximum length. Each candidate rule is scored on the mentions in the training documents for support (the number of training examples that mention all facts in the body) and confidence (the conditional probability that the head is mentioned given that the body is satisfied). We discard all rules that do not achieve minimum support \u03c3 and then keep the top \u03c4 most confident rules. The values of \u03c3 and \u03c4 are determined via cross-validation within the training set. The selected rules are then entered into the knowledge base. From each fact-to-fact rule, we derive mention-to-mention rules as described above. For each predicate, we also generate fact-to-mention and mention-to-fact rules.\n\nLearning the Weights. The goal of weight learning is to maximize the likelihood of the observed mentions (in the training set) by adjusting the weights of the rules. Because our training data consists only of mentions and no facts, the facts are latent (hidden variables), and we must apply the EM algorithm to learn the weights.\n\nWe employ the Markov Logic system Alchemy [8] for learning and inference. To implement EM, we applied the MC-SAT algorithm in the E-step and maximum pseudo-log likelihood (\u201cgenerative training\u201d) for the M-step. EM is iterated to convergence, which only requires a few iterations. Table 2 summarizes the pseudo-code of the algorithm. 
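As an illustration, the rule-proposal step just described (support/confidence scoring with thresholds σ and τ) can be sketched in a small propositional form; the data representation and literal names below are our own illustration, not the authors' code:

```python
from itertools import combinations

def propose_rules(documents, predicates, sigma, tau, max_body=2):
    """Sketch of candidate-rule scoring: for each head predicate, score
    every conjunctive body by support (number of documents mentioning all
    body literals) and confidence (fraction of those that also mention
    the head), discard bodies below minimum support sigma, and keep the
    tau most confident rules per head. Documents are modeled simply as
    sets of mentioned literals."""
    rules = []
    for head in predicates:
        candidates = []
        others = [p for p in predicates if p != head]
        for k in range(1, max_body + 1):
            for body in combinations(others, k):
                support = sum(1 for d in documents if set(body) <= d)
                if support < sigma:
                    continue  # fails the minimum-support threshold
                hits = sum(1 for d in documents
                           if set(body) <= d and head in d)
                candidates.append((hits / support, support, body, head))
        candidates.sort(reverse=True)
        rules.extend(candidates[:tau])  # tau most confident rules per head
    return rules
```

In the full system, each kept fact-to-fact rule would then be expanded into the corresponding mention rules before weight learning.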
MAP inference for prediction is achieved using Alchemy\u2019s extension of MaxWalkSat.\n\nTable 2: Learn Gricean Mention Model\n\nInput: DI = incomplete training examples; \u03c4 = number of rules per head; \u03c3 = minimum support per rule\nOutput: M = explicit mention model\n1: LEARN GRICEAN MENTION MODEL:\n2: exhaustively learn rules for each head\n3: discard rules with less than \u03c3 support\n4: select the \u03c4 most confident rules R for each head\n5: R\u2032 := R\n6: for each rule (factP \u21d2 factQ) \u2208 R do\n7: add mentionP \u21d2 \u00acmentionQ to R\u2032\n8: end for\n9: for every factP \u2208 R do\n10: add factP \u21d2 mentionP to R\u2032\n11: add mentionP \u21d2 factP to R\u2032\n12: end for\n13: repeat\n14: E-Step: apply inference to predict weighted facts F\n15: define complete weighted data DC := DI \u222a F\n16: M-Step: learn weights for rules in R\u2032 using data DC\n17: until convergence\n18: return the set of weighted rules R\u2032\n\nTable 3: Synthetic Data Properties\n\nq | 0.17 | 0.33 | 0.50 | 0.67 | 0.83 | 0.97\nMentioned literals (%) | 91.38 | 80.74 | 68.72 | 63.51 | 51.70 | 42.13\nComplete records (%) | 61.70 | 30.64 | 8.51 | 5.53 | 0.43 | 0.00\n\nTreating Missing Mentions as Missing At Random: An alternative to the Gricean mention model described above is to assume that the writer chooses which facts to mention (or omit) at random according to some unknown probability distribution that does not depend on the values of the missing variables\u2014a setting known as Missing-At-Random (MAR). When data are MAR, it is possible to obtain unbiased estimates of the true distribution via imputation using EM [16]. We implemented this approach as follows. We apply the same method of learning rules (requiring minimum support \u03c3 and then taking the \u03c4 most confident rules). Each learned rule has the general form MENTION A \u21d2 MENTION B. 
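The EM loop of Table 2 (steps 13-17), which the Gricean and MAR variants share, can be skeletonized as follows; the two callbacks stand in for Alchemy's MC-SAT inference and generative weight training and are placeholders of our own, not Alchemy's API:

```python
def em_weight_learning(rules, incomplete_data, e_step_infer, m_step_learn,
                       max_iters=10, tol=1e-4):
    """Generic EM skeleton: alternate between predicting the latent facts
    from the current weights (E-step) and re-learning rule weights on the
    completed data (M-step), until the weights stop changing.

    e_step_infer(weights, data) -> set of predicted (weighted) facts F
    m_step_learn(rules, completed_data) -> dict of new rule weights
    """
    weights = {rule: 0.0 for rule in rules}
    for _ in range(max_iters):
        facts = e_step_infer(weights, incomplete_data)    # E-step: predict F
        completed = incomplete_data | facts               # D_C := D_I union F
        new_weights = m_step_learn(rules, completed)      # M-step: refit weights
        delta = max(abs(new_weights[r] - weights[r]) for r in rules)
        weights = new_weights
        if delta < tol:
            break  # converged: weights stable between iterations
    return weights
```

With the paper's components, convergence typically takes only a few iterations.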
The collection of rules is treated as a model of the joint distribution over the mentions. Generative weight learning combined with Alchemy\u2019s builtin EM implementation is then applied to learn the weights on these rules.\n\n3 Experimental Evaluation\n\nWe evaluated our mention model approach using data generated from a known mention model to understand its behavior. Then we compared its performance to the MAR approach on actual extractions from news stories about NFL football games, citizenship, and Somali ship hijackings.\n\nSynthetic Mention Experiment. The goal of this experiment was to evaluate the ability of our method to learn accurate rules from data that match the assumptions of the algorithm. We also sought to understand how performance varies as a function of the amount of information omitted from the text.\n\nThe data were generated using a database of NFL games (from 1998 and 2000-2005) downloaded from www.databasefootball.com. These games were then encoded using the predicates TEAMINGAME(Game, Team), GAMEWINNER(Game, Team), GAMELOSER(Game, Team), HOMETEAM(Game, Team), AWAYTEAM(Game, Team), and TEAMGAMESCORE(Game, Team, Score) and treated as ground truth. Note that these predicates can be divided into two correlated sets: WL = {GAMEWINNER, GAMELOSER, TEAMGAMESCORE} and HA = {HOMETEAM, AWAYTEAM}.\n\nFrom this ground truth, we generate a set of mentions for each game as follows. One literal is chosen uniformly at random from each of WL and HA and mentioned. Then each of the remaining literals is mentioned with probability 1 \u2212 q, where q is a parameter that we varied in the experiments. Table 3 shows the average percentage of literals mentioned in each generated \u201cnews story\u201d and the percentage of generated \u201cnews stories\u201d that mentioned all literals.\n\nFor each q, we generated 5 different datasets, each containing 235 games. 
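The generation process for a single game can be sketched as follows (a deliberate simplification that treats each predicate group as single literals; the set names follow the WL/HA grouping above):

```python
import random

WL = ["GAMEWINNER", "GAMELOSER", "TEAMGAMESCORE"]
HA = ["HOMETEAM", "AWAYTEAM"]

def generate_mentions(q, rng=random):
    """Mention one literal chosen uniformly at random from each of WL and
    HA, then mention every remaining literal independently with
    probability 1 - q (so larger q means more omitted literals)."""
    mentioned = {rng.choice(WL), rng.choice(HA)}
    for literal in WL + HA:
        if literal not in mentioned and rng.random() < 1 - q:
            mentioned.add(literal)
    return mentioned
```

At q = 0 every literal is mentioned; at q = 1 exactly one literal from each group survives, matching the two extremes of Table 3.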
For each value of q, we ran the algorithm five times. In each iteration, one dataset was used for training, another for validation, and the remaining 3 for testing. The training and validation datasets shared the same value of q. The resulting learned rules were evaluated on the test sets for all of the different values of q. The validation set is employed to determine the thresholds \u03c4 and \u03c3 during rule learning and to decide when to terminate EM. The chosen values were \u03c4 = 10, \u03c3 = 0.5 (50% of the total training instances), and between 3 and 8 EM iterations.\n\nTable 4 reports the proportion of complete game records (i.e., all four literals) that were correctly inferred, averaged over the five runs.\n\nTable 4: Gricean Mention Model Performance on Synthetic Data. Each cell indicates % of complete records inferred (rows: test q; columns: training q).\n\nTest q | 0.17 | 0.33 | 0.50 | 0.67 | 0.83 | 0.97\n0.17 | 100 | 100 | 100 | 100 | 100 | 100\n0.33 | 100 | 99 | 97 | 96 | 90 | 85\n0.50 | 100 | 99 | 98 | 97 | 93 | 87\n0.67 | 100 | 98 | 92 | 92 | 81 | 66\n0.83 | 99 | 98 | 72 | 71 | 61 | 54\n0.97 | 91 | 81 | 72 | 68 | 56 | 41\n\nNote that any facts mentioned in the generated articles are automatically correctly inferred, so if no inference was performed at all, the results would match the second row of Table 3. Notice that when trained on data with low missingness (e.g., q = 0.17), the algorithm was able to learn rules that predict well for articles with much higher levels of missing values. This is because q = 0.17 means that only 8.62% of the literals are missing in the training dataset, which results in 61.70% complete records. These are sufficient to allow learning highly accurate rules. However, as the proportion of missing literals in the training data increases, the algorithm starts learning incorrect rules, so performance drops. 
In particular, when q = 0.97, the training documents contain no complete records (Table 3). Nonetheless, the learned rules are still able to completely and correctly reconstruct 41% of the games! The rules learned under such high levels of missingness are not totally correct. Here is an example of one learned rule (for q = 0.97):\n\nFACT HOMETEAM(g, t1) \u2227 FACT TEAMINGAME(g, t1) \u21d2 FACT GAMEWINNER(g, t1).\n\nThis rule says that the home team always wins. When appropriately weighted in Markov Logic, this is a reasonable rule even though it is not perfectly correct (nor was it a rule that we applied during the synthetic data generation process).\n\nIn addition to measuring the fraction of entire games correctly inferred, we can obtain a more fine-grained assessment by measuring the fraction of individual literals correctly inferred. Table 5 shows this for the q = 0.97 training scenario.\n\nTable 5: Percentage of Literals Correctly Predicted (training q = 0.97)\n\nTest q | 0.17 | 0.33 | 0.50 | 0.67 | 0.83 | 0.97\nLiterals correct (%) | 98 | 95 | 93 | 92 | 89 | 85\n\nWe can see that even when the test articles have q = 0.97 (which means only 42.13% of literals are mentioned), the learned rules are able to correctly infer 85% of the literals. By comparison, if the literals had been predicted independently at random, only 6.25% would be correctly predicted.\n\nExperiments with Real Data: We performed experiments on three datasets extracted from news stories: NFL games, citizenship, and Somali ship hijackings.\n\nNFL Games. A state-of-the-art information extraction system from BBN Technologies [6, 14] was applied to a corpus of 1000 documents taken from the Gigaword corpus V4 [13] to extract the same five propositions employed in the synthetic data experiments. The BBN coreference system attempted to detect and combine multiple mentions of the same game within a single article. 
The resulting data set contained 5,850 games. However, the data still contained many coreference errors, which produced games apparently involving more than two teams or where one team achieved multiple scores.\n\nTo address these problems, we took each extracted game and applied a set of integrity constraints. The integrity constraints were learned automatically from 5 complete game records. Examples of the learned constraints include \u201cEvery game has exactly two teams\u201d and \u201cEvery game has exactly one winner.\u201d Each extracted game was then converted into multiple games by deleting literals in all possible ways until all of the integrity constraints were satisfied. The team names were replaced (arbitrarily) with constants A and B. The games were then processed to remove duplicates. The result was a set of 56 distinct extracted games, which we call NFL Train. To develop a test set, NFL Test, we manually extracted 55 games from news stories about the 2010 NFL season (which has no overlap with Gigaword V4). Table 6 summarizes these game records.\n\nTable 6: Statistics on mentions for extracted NFL games (after repairing violations of integrity constraints). Under \u201cHome/Away\u201d, \u201cmen none\u201d gives the percentage of articles in which neither the Home nor the Away team was mentioned; \u201cmen one\u201d, the percentage in which exactly one of Home or Away was mentioned; and \u201cmen both\u201d, the percentage where both were mentioned.\n\nHome/Away: men none, men one, men both | Winner/Loser: men none, men one, men both\nNFL Train: 23.2, 58.9, 17.9 | 25.0, 57.1, 17.9\nNFL Test: 83.6, 19.6, 0.0 | 1.8, 98.2, 0.0\n\nHere is an excerpt from one of the stories that was analyzed during learning: \u201cWilliam Floyd rushed for three touchdowns and Steve Young scored two more, moving the San Francisco 49ers one victory from the Super Bowl with a 44-15 American football rout of Chicago.\u201d The initial set of literals extracted by the BBN system was the following:\n\nMENTION TEAMINGAME(NFLGame9209, SanFrancisco49ers) \u2227\nMENTION TEAMINGAME(NFLGame9209, ChicagoBears) \u2227\nMENTION GAMEWINNER(NFLGame9209, SanFrancisco49ers) \u2227\nMENTION GAMEWINNER(NFLGame9209, ChicagoBears) \u2227\nMENTION GAMELOSER(NFLGame9209, ChicagoBears).\n\nAfter processing with the learned integrity constraints, the extracted interpretation was the following:\n\nMENTION TEAMINGAME(NFLGame9209, SanFrancisco49ers) \u2227\nMENTION TEAMINGAME(NFLGame9209, ChicagoBears) \u2227\nMENTION GAMEWINNER(NFLGame9209, SanFrancisco49ers) \u2227\nMENTION GAMELOSER(NFLGame9209, ChicagoBears).\n\nTable 7: Observed percentage of cases where exactly one literal is mentioned and the percentage predicted if the literals were missing at random.\n\nHome/Away: men one obs, men one pred | Winner/Loser: men one obs, men one pred\nNFL Train: 58.9, 49.9 | 57.1, 49.8\nNFL Test: 19.6, 34.5 | 98.2, 47.9\n\nIt is interesting to ask whether these data are consistent with the explicit mention model versus the missing-at-random model. Let us suppose that under MAR, the probability that a fact will be mentioned is p. 
Then the probability that both literals in a rule (e.g., home/away or winner/loser) will be mentioned is p\u00b2, the probability that both will be missing is (1 \u2212 p)\u00b2, and the probability that exactly one will be mentioned is 2p(1 \u2212 p). We can fit the best value for p to the observed missingness rates to minimize the KL divergence between the predicted and observed distributions. If the explicit mention model is correct, then the MAR fit will be a poor estimate of the fraction of cases where exactly one literal is missing. Table 7 shows the results. On NFL Train, it is clear that the MAR model seriously underestimates the probability that exactly one literal will be mentioned. The NFL Test data is inconsistent with the MAR assumption, because there are no cases where both predicates are mentioned. If we estimate p based only on the cases where both are missing or one is missing, the MAR model seriously underestimates the one-missing probability. Hence, we can see that train and test, though drawn from different corpora and extracted by different methods, are both inconsistent with the MAR assumption.\n\nWe applied both our explicit mention model and the MAR model to the NFL dataset. The cross-validated parameter values for the explicit mention model were \u03c3 = 0.5 and \u03c4 = 50, and the number of EM iterations varied between 2 and 3. We measured performance relative to the performance that could be attained by a system that uses the correct rules. The results are summarized in Table 8. Our method achieves perfect performance, whereas the MAR method only reconstructs half of the reconstructable games. 
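The MAR fit just described can be reproduced with a simple grid search over p; the observed three-category distribution (neither / exactly one / both mentioned) is supplied as fractions, and the function name is illustrative only:

```python
import math

def fit_mar(p_none, p_one, p_both, steps=10000):
    """Fit the MAR mention probability p by minimizing the KL divergence
    from the observed distribution over {neither, exactly one, both}
    mentioned to the predicted ((1-p)^2, 2p(1-p), p^2).
    Returns (best p, predicted percent with exactly one mentioned)."""
    observed = [p_none, p_one, p_both]
    best_p, best_kl = None, float("inf")
    for i in range(1, steps):
        p = i / steps
        predicted = [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]
        # KL(observed || predicted); zero-probability observations drop out
        kl = sum(o * math.log(o / q) for o, q in zip(observed, predicted)
                 if o > 0)
        if kl < best_kl:
            best_p, best_kl = p, kl
    return best_p, 100 * 2 * best_p * (1 - best_p)
```

For example, feeding in the NFL Train winner/loser distribution (17.9% neither, 57.1% one, 17.9% both, from Table 6, renormalized as fractions of all cases with 25.0% both... using 0.179/0.571/0.250) yields a predicted one-mentioned rate of roughly 49.7%, in line with the 49.8% reported in Table 7 and well below the observed 57.1%.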
These results reflect the extreme difficulty of the test set, where none of the articles mentions all literals involved in any rule.\n\nTable 8: NFL test set performance.\n\nGricean Model (%): 100.0 | MAR Model (%): 50.0\n\nHere are a few examples of the rules that are learned:\n\n0.00436 : FACT TEAMINGAME(g, t1) \u2227 FACT GAMELOSER(g, t2) \u2227 (t1 \u2260 t2) \u21d2 FACT GAMEWINNER(g, t1)\n0.17445 : MENTION TEAMINGAME(g, t1) \u2227 MENTION GAMELOSER(g, t2) \u2227 (t1 \u2260 t2) \u21d2 \u00acMENTION GAMEWINNER(g, t1)\n\nThe first rule is a weak form of the \u201cfact\u201d rule that if one team is the loser, the other is the winner. The second rule is the corresponding \u201cmention\u201d rule that if the loser is mentioned then the winner is not. The small weights on these rules are difficult to interpret in isolation, because in Markov Logic, all of the weights are coupled and there are other learned rules that involve the same literals.\n\nBirthplace and Citizenship. We repeated this same experiment on a different set of 182 articles selected from the ACE08 Evaluation corpus [12] and extracted by the same methods. In these articles, the citizenship of a person is mentioned 583 times and birthplace only 25 times. Both are mentioned in the same article only 6 times (and of these, birthplace and citizenship are the same in only 4). Clearly, this is another case where the MAR assumption does not hold. Integrity constraints were applied to force each person to have at most one birthplace and one country of citizenship, and then both methods were applied. The cross-validated parameter values for the explicit mention model were \u03c3 = 0.5 and \u03c4 = 50, and the number of EM iterations varied between 2 and 3. Table 9 shows the two cases of interest and the probability assigned to the missing fact by the two methods. The inverse Gricean approach gives much better results.\n\nSomali Ship Hijacking. 
We collected a set\nof 41 news stories concerning ship hijack-\nings based on ship names taken from the web\nsite coordination-maree-noire.eu.\nFrom these documents, we manually ex-\ntracted all mentions of the ownership coun-\ntry and \ufb02ag country of the hijacked ships.\nTwenty-\ufb01ve stories mentioned only one fact\n(ownership or \ufb02ag), while 16 mentioned both.\nOf the 16, 14 reported the \ufb02ag country as different from the ownership country. The Gricean maxims\npredict that if the two countries are the same, then only one of them will be mentioned. The results\n(Table 10) show that the Gricean model is again much more accurate than the MAR model.\n\nTable 9: Birthplace and Citizenship: Predicted\nprobability assigned to the correct interpretation by\nthe Gricean mention model and the MAR model.\n\nCitizenship missing\nBirthplace missing\n\nGricean Model\n\nCon\ufb01guration\n\nPred. prob.\n\nPred. prob.\n\n1.000\n1.000\n\n0.969\n0.565\n\nMAR\n\n4 Conclusion\n\nMAR\n\nPred. prob.\n\nCon\ufb01guration\n\nGricean Model\n\nOwnership missing\n\nTable 10: Flag and Ownership: Predicted probabil-\nity assigned to the missing fact by the Gricean men-\ntion model and the MAR model. Cross-validated\nparameter values \u0001 = 0.5 and \u03c4 = 50; 2-3 EM iter-\nations.\n\nThis paper has shown how to formalize\nthe Gricean conversational maxims, compile\nthem into Markov Logic, and invert them via\nprobabilistic reasoning to learn Horn clause\nrules from facts extracted from documents.\nExperiments on synthetic mentions showed\nthat our method is able to correctly recon-\nstruct complete records even when neither the\ntraining data nor the test data contain com-\nplete records. Our three studies provide ev-\nidence that news articles obey the maxims\nacross three domains.\nIn all three domains, our method achieves excellent performance that far\nexceeds the performance of standard EM imputation. 
This shows conclusively that rule learning benefits from employing an explicit model of the process that generates the data. Indeed, it allows rules to be learned correctly from only a handful of complete training examples.

An interesting direction for future work is to learn forms of knowledge more complex than Horn clauses. For example, the state of a hijacked ship can change over time from states such as "attacked" and "captured" to states such as "ransom demanded" and "released". The Gricean mention model predicts that if a news story mentions that a ship was released, then it does not need to mention that the ship was "attacked" or "captured". Handling such cases will require extending the methods in this paper to reason about time and about what the author and reader know at each point in time. It will also require better methods for joint inference, because there are more than 10 predicates in this domain, and our current EM implementation scales exponentially in the number of interrelated predicates.

Acknowledgments

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. FA8750-09-C-0179 and by the Army Research Office (ARO). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA, the Air Force Research Laboratory (AFRL), ARO, or the US government.

References

[1] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr., and T. M. Mitchell. Toward an architecture for never-ending language learning. In Proceedings of the Conference on Artificial Intelligence (AAAI), pages 1306-1313. AAAI Press, 2010.

[2] A. Carlson, J. Betteridge, R. C. Wang, E. R. Hruschka, Jr., and T. M. Mitchell. 
Coupled semi-supervised learning for information extraction. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM '10, pages 101-110, New York, NY, USA, 2010. ACM.

[3] W. W. Cohen. WHIRL: A word-based information representation language. Artificial Intelligence, 118(1-2):163-196, 2000.

[4] J. R. Doppa, M. S. Sorower, M. Nasresfahani, J. Irvine, W. Orr, T. G. Dietterich, X. Fern, and P. Tadepalli. Learning rules from incomplete examples via implicit mention models. In Proceedings of the 2011 Asian Conference on Machine Learning, 2011.

[5] O. Etzioni, M. Banko, S. Soderland, and D. S. Weld. Open information extraction from the web. Commun. ACM, 51(12):68-74, 2008.

[6] M. Freedman, E. Loper, E. Boschee, and R. Weischedel. Empirical studies in learning to read. In Proceedings of the Workshop on Formalisms and Methodology for Learning by Reading (NAACL-2010), pages 61-69, 2010.

[7] H. P. Grice. Logic and conversation. In Syntax and Semantics: Speech Acts, volume 3, pages 43-58. Academic Press, New York, 1975.

[8] S. Kok, M. Sumner, M. Richardson, P. Singla, H. Poon, D. Lowd, and P. Domingos. The Alchemy system for statistical relational AI. Technical report, Department of Computer Science and Engineering, University of Washington, Seattle, WA, 2007.

[9] L. Michael. Reading between the lines. In IJCAI, pages 1525-1530, 2009.

[10] L. Michael and L. G. Valiant. A first experimental demonstration of massive knowledge infusion. In KR, pages 378-389, 2008.

[11] U. Y. Nahm and R. J. Mooney. A mutually beneficial integration of data mining and information extraction. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and the Twelfth Conference on Innovative Applications of Artificial Intelligence, pages 627-632. AAAI Press, 2000.

[12] NIST. 
Automatic Content Extraction 2008 Evaluation Plan.

[13] R. Parker, D. Graff, J. Kong, K. Chen, and K. Maeda. English Gigaword Fourth Edition. Linguistic Data Consortium, Philadelphia, 2009.

[14] L. Ramshaw, E. Boschee, M. Freedman, J. MacBride, R. Weischedel, and A. Zamanian. SERIF language processing: effective trainable language understanding. In Joseph Olive, Caitlin Christianson, and John McCary, editors, Handbook of Natural Language Processing and Machine Translation: DARPA Global Autonomous Language Exploitation. Springer, 2011.

[15] M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62:107-136, February 2006.

[16] J. L. Schafer and M. K. Olsen. Multiple imputation for multivariate missing-data problems: a data analyst's perspective. Multivariate Behavioral Research, 33:545-571, 1998.

[17] S. Schoenmackers, O. Etzioni, D. S. Weld, and J. Davis. Learning first-order Horn clauses from web text. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP '10, pages 1088-1098, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.