{"title": "Nonparametric Bayesian Models for Unsupervised Event Coreference Resolution", "book": "Advances in Neural Information Processing Systems", "page_first": 73, "page_last": 81, "abstract": "We present a sequence of unsupervised, nonparametric Bayesian models for clustering complex linguistic objects. In this approach, we consider a potentially infinite number of features and categorical outcomes. We evaluate these models for the task of within- and cross-document event coreference on two corpora. All the models we investigated show significant improvements when compared against an existing baseline for this task.", "full_text": "Nonparametric Bayesian Models for Unsupervised Event Coreference Resolution\n\nCosmin Adrian Bejan1, Matthew Titsworth2, Andrew Hickl2, & Sanda Harabagiu1\n1 Human Language Technology Research Institute, University of Texas at Dallas\n2 Language Computer Corporation, Richardson, Texas\nady@hlt.utdallas.edu\n\nAbstract\n\nWe present a sequence of unsupervised, nonparametric Bayesian models for clustering complex linguistic objects. In this approach, we consider a potentially infinite number of features and categorical outcomes. We evaluated these models for the task of within- and cross-document event coreference on two corpora. All the models we investigated show significant improvements when compared against an existing baseline for this task.\n\n1 Introduction\nIn Natural Language Processing (NLP), the task of event coreference has numerous applications, including question answering, multi-document summarization, and information extraction. Two event mentions are coreferential if they share the same participants and spatio-temporal groundings. Moreover, two event mentions are identical if they have the same causes and effects. 
For example, the three documents listed in Table 1 contain four mentions of identical events, but only the arrested, apprehended, and arrest mentions from documents 1 and 2 are coreferential. These definitions were used in the tasks of Topic Detection and Tracking (TDT), as reported in [24].\n\nPrevious approaches to event coreference resolution [3] used the same lexeme or synonymy of the verb describing the event to decide coreference. Event coreference was also attempted by using the semantic types of an ontology [17]. However, the features used by these approaches are hard to select and require the design of domain-specific constraints. To address these problems, we have explored a sequence of unsupervised, nonparametric Bayesian models that are used to probabilistically infer coreference clusters of event mentions from a collection of unlabeled documents. Our approach is motivated by the recent success of unsupervised approaches for entity coreference resolution [16, 22, 25] and by the advantages of using a large amount of data at no cost.\n\nOne model was inspired by the fully generative Bayesian model proposed by Haghighi and Klein [16] (henceforth, H&K). However, to employ H&K's model for tasks that require clustering objects with rich linguistic features (such as event coreference resolution), or to extend this model in order to include additional observable properties, is a challenging task [22, 25]. To counter this limitation, we make a conditional independence assumption between the observable features and propose a generalized framework (Section 3) that is able to easily incorporate new features.\n\nWhile learning the model described in Section 3, we observed that a large amount of time was required to incorporate and tune new features. This led us to the challenge of creating a framework which considers an unbounded number of features, where the most relevant are selected automatically. 
To accomplish this new goal, we propose two novel approaches (Section 4). The first incorporates a Markov Indian Buffet Process (mIBP) [30] into a Hierarchical Dirichlet Process (HDP) [28]. The second uses an Infinite Hidden Markov Model (iHMM) [5] coupled to an Infinite Factorial Hidden Markov Model (iFHMM) [30].\n\nIn this paper, we focus on event coreference resolution, though adaptations for event identity resolution can be easily made. We evaluated the models on the ACE 2005 event corpus [18] and on a new annotated corpus encoding within- and cross-document event coreference information (Section 5).\n\nDocument 1: San Diego Chargers receiver Vincent Jackson was arrested on suspicion of drunk driving on Tuesday morning, five days before a key NFL playoff game. . . . Police apprehended Jackson in San Diego at 2:30 a.m. and booked him for the misdemeanour before his release.\nDocument 2: Despite his arrest on suspicion of driving under the influence yesterday, Chargers receiver Vincent Jackson will play in Sunday's AFC divisional playoff game at Pittsburgh.\nDocument 3: In another anti-piracy operation, Navy warship on Saturday repulsed an attack on a merchant vessel in the Gulf of Aden and nabbed 23 Somali and Yemeni sea brigands.\nTable 1: Examples of coreferential and identical events.\n\n2 Event Coreference Resolution\nModels for solving event coreference and event identity can lead to the generation of ad-hoc event hierarchies from text. A sample of a hierarchy capturing coreferring and identical events, including those from the example presented in Section 1, is illustrated in Figure 1.\n\nFigure 1: A portion of the event hierarchy. [The figure links generic events, events, and event mentions: one arrest event (Suspect: Vincent Jackson; Authorities: police; Time: Tuesday; Location: San Diego) groups the mentions arrested and apprehended from Document 1 and arrest from Document 2; a second arrest event (Suspect: sea brigands; Authorities: Navy warship; Time: Saturday; Location: Gulf of Aden) groups the mention nabbed from Document 3.]\n\nFirst, we introduce some basic notation.1 Next, to cluster the mentions that share common event properties (as shown in Figure 1), we briefly describe the linguistic features of event mentions.\n2.1 Notation\nAs input for our models, we consider a collection of I documents, each document i having Ji event mentions. Each event mention is characterized by L feature types, FT, and each feature type is represented by a finite number of feature values, fv. Therefore, we can represent the observable properties of an event mention, em, as a vector of pairs ⟨(FT1 : fv1i), . . . , (FTL : fvLi)⟩, where each feature value index i ranges over the feature value space associated with a feature type.\n2.2 Linguistic Features\nWe consider the following set of features associated with an event mention:2\nLexical Features (LF) To capture the lexical context of an event mention, we extract the following features: the head word of the mention (HW), the lemma of the HW (HL), lemmas of the left and right words of the mention (LHL, RHL), and lemmas of the left and right mentions (LHE, RHE).\nClass Features (CF) These features aim to classify mentions into several types of classes: the mention HW's part-of-speech (POS), the word class of the HW (HWC), which can take one of the following values ⟨verb, noun, adjective, other⟩, and the event class of the mention (EC). To extract the event class associated with every event mention, we employed the event identifier described in [6].\nWordNet Features (WF) We build three types of clusters over all the words from WordNet [9] and use them as features for the mention HW. The first cluster type associates a unique id with each (word:HWC) pair (WNW). 
The second cluster type uses the transitive closure of the synonymy relations to group words from WordNet (WNS). Finally, the third cluster type uses as grouping criterion the category from the WordNet lexicographer's files that is associated with each word (WNL). For cases when a new word does not belong to any of these WordNet clusters, we create a new cluster with a new id for each of the three cluster types.\nSemantic Features (SF) To extract features that characterize participants and properties of event mentions, we use a semantic parser [8] trained on the PropBank (PB) [23] and FrameNet (FN) [4] corpora. (For instance, for the apprehended mention from our example, Jackson is the feature value\n\n1 For consistency, we try to preserve the notation of the original models.\n2 In this subsection and the following section, the feature term is used in the context of a feature type.\n\nfor the A0 PB argument3 and the SUSPECT frame element (FEA0) of the ARREST frame.) Another semantic feature is the semantic frame (FR) that is evoked by an event mention. (For instance, all the emphasized mentions from our example evoke the ARREST frame from FN.)\nFeature Combinations (FC) We also explore various combinations of the features presented above. Examples include HW+POS, HL+FR, FE+A1, etc.\n3 Finite Feature Models\nIn this section, we present a sequence of HDP mixture models for solving event coreference. In this type of approach, a Dirichlet Process (DP) [10] is associated with each document, and each mixture component, which in our case corresponds to an event, is shared across documents. 
To describe these models, we consider Z the set of indicator random variables for indices of events, φz the set of parameters associated with an event z, φ a notation for all model parameters, and X a notation for all random variables that represent observable features.\n\nGiven a document collection annotated with event mentions, the goal is to find the best assignment of event indices, Z*, which maximizes the posterior probability P(Z | X). In a Bayesian approach, this probability is computed by integrating out all model parameters:\n\nP(Z | X) = ∫ P(Z, φ | X) dφ = ∫ P(Z | X, φ) P(φ | X) dφ\n\nIn order to describe our modifications, we first revisit a basic model from the set of models described in H&K's paper.\n3.1 The One Feature Model\nThe one feature model, HDP1f, constitutes the simplest representation of an HDP model. In this model, which is depicted graphically in Figure 2(a), the observable components are characterized by only one feature. The distribution over events associated with each document, β, is generated by a Dirichlet process with a concentration parameter α > 0. Since this setting enables a clustering of event mentions at the document level, it is desirable that events are shared across documents and that the number of events K is inferred from the data. To ensure this flexibility, a global nonparametric DP prior with a hyperparameter γ and a global base measure H can be considered for β [28]. The global distribution drawn from this DP prior, denoted as β0 in Figure 2(a), encodes the event mixing weights. Thus, the same global events are used for each document, but each event has a document-specific distribution βi that is drawn from a DP prior centered on β0.\nTo infer the true posterior probability P(Z | X), we follow [28] in using a Gibbs sampling algorithm [12] based on the direct assignment sampling scheme. 
In this sampling scheme, the β and φ parameters are integrated out analytically. The formula for sampling an event index for mention j from document i, Zi,j, is given by:4\n\nP(Zi,j | Z−i,j, HL) ∝ P(Zi,j | Z−i,j) P(HLi,j | Z, HL−i,j)\n\nwhere HLi,j is the head lemma of the event mention j from the document i.\nFirst, in the generative process of an event mention, an event index z is sampled by using a mechanism that facilitates sampling from a prior for infinite mixture models called the Chinese Restaurant Franchise (CRF) representation [28]:\n\nP(Zi,j = z | Z−i,j, β0) ∝ α β0^u, if z = znew; nz + α β0^z, otherwise\n\nHere, nz is the number of event mentions with the event index z, znew is a new event index not already used in Z−i,j, the β0^z are the global mixing proportions associated with the K events, and β0^u is the weight for the unknown mixture component.\n\nThen, to generate the mention head lemma (in this model, X = ⟨HL⟩), the event z is associated with a multinomial emission distribution over the HL feature values with parameters φ = ⟨φ^hl_Z⟩. We assume that this emission distribution is drawn from a symmetric Dirichlet distribution with concentration λHL:\n\n3 A0 annotates in PB a specific type of semantic role which represents the AGENT, the DOER, or the ACTOR of a specific event. 
Another PB argument is A1, which plays the role of the PATIENT, the THEME, or the EXPERIENCER of an event.\n4 Z−i,j is a notation for Z − {Zi,j}.\n\nFigure 2: Graphical representation of four HDP models. Each node corresponds to a random variable. In particular, shaded nodes denote observable variables. Each rectangle captures the replication of the structure it contains. The number of replications is indicated in the bottom-right corner of the rectangle. The model depicted in (a) is an HDP model using one feature; the model in (b) employs the HL and FR features; (c) illustrates a flat representation of a limited number of features in a generalized framework (henceforth, HDPflat); and (d) captures a simple example of a structured network topology of three feature variables (henceforth, HDPstruct). The dependencies involving the parameters φ and θ in models (b), (c), and (d) are omitted for clarity.\n\nP(HLi,j = hl | Z, HL−i,j) ∝ n_{hl,z} + λHL\n\nwhere HLi,j is the head lemma of mention j from document i, and n_{hl,z} is the number of times the feature value hl has been associated with the event index z in (Z, HL−i,j). We also apply Lidstone's smoothing method to this distribution.\n3.2 Adding More Features\nA model in which observable components are represented by only one feature has the tendency to cluster these components based on their feature value. 
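A minimal sketch of this collapsed Gibbs step for HDP1f, combining the CRF prior with the Lidstone-smoothed emission term (the function, its count bookkeeping, and all names are illustrative assumptions, not the paper's implementation):

```python
import random
from collections import defaultdict

def sample_event_index(hl, n_z, n_hl_z, beta0, beta_u, alpha, lam, vocab_size):
    """One collapsed Gibbs draw of Z[i][j] in the one-feature (HL) model.

    n_z[z]        -- mentions currently assigned to event z (current mention excluded)
    n_hl_z[z][hl] -- co-occurrence count of head lemma hl with event z
    beta0[z]      -- global mixing proportion of event z; beta_u is the weight
                     of the unrepresented ("new event") component
    """
    scores = {}
    for z in n_z:
        prior = n_z[z] + alpha * beta0[z]                 # CRF term, existing event
        lik = (n_hl_z[z][hl] + lam) / (sum(n_hl_z[z].values()) + lam * vocab_size)
        scores[z] = prior * lik
    scores["new"] = alpha * beta_u / vocab_size           # new event under the prior
    r = random.uniform(0.0, sum(scores.values()))
    for z, s in scores.items():
        r -= s
        if r <= 0:
            return z
    return "new"
```

In a full sampler this draw would be followed by updating the counts and, when "new" is drawn, resampling the global mixing proportions.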
To address this limitation, H&K proposed a more complex model that is strictly customized for entity coreference resolution. On the other hand, event coreference involves clustering complex objects characterized by richer features than entity coreference (or topic detection), and it is therefore desirable to extend the HDP1f model into a generalized model where additional features can be easily incorporated.\n\nTo facilitate this extension, we assume that the feature variables are conditionally independent given Z. This assumption considerably reduces the complexity of computing P(Z | X). For example, if we want to incorporate another feature (e.g., FR) in the previous model, the formula becomes:\n\nP(Zi,j | HL, FR) ∝ P(Zi,j) P(HLi,j, FRi,j | Z) = P(Zi,j) P(HLi,j | Z) P(FRi,j | Z)\n\nIn this formula, we omit the conditioning components of Z, HL, and FR for clarity. The graphical representation corresponding to this model is illustrated in Figure 2(b). In general, if X consists of L feature variables, the inference formula for the Gibbs sampler is defined as:\n\nP(Zi,j | X) ∝ P(Zi,j) ∏_{FT ∈ X} P(FTi,j | Z)\n\nThe graphical model for this general setting is depicted in Figure 2(c). Drawing an analogy, the graphical representation involving feature variables and Z variables resembles the graphical representation of a Naive Bayes classifier.\n\nWhen dependencies between feature variables exist (e.g., in our case, frame elements are dependent on the semantic frames that define them, and frames are dependent on the words that evoke them), various global distributions are involved in computing P(Z | X). 
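Under the conditional independence assumption, the per-mention score is just the prior times a Naive-Bayes-style product of smoothed per-feature-type likelihoods; a sketch (the count structures and names are ours, for illustration):

```python
def gibbs_score(z, mention, n_z, counts, beta0, alpha, lam, value_space):
    """Unnormalized P(Z_ij = z | X) for one mention with L feature types.

    mention          -- dict feature_type -> observed value,
                        e.g. {"HL": "arrest", "FR": "ARREST", "POS": "VBD"}
    counts[ft][z][v] -- count of value v of feature type ft with event z
    value_space[ft]  -- number of distinct values of feature type ft
    """
    score = n_z.get(z, 0) + alpha * beta0.get(z, 0.0)     # CRF prior term
    for ft, v in mention.items():
        c = counts[ft][z]
        # Lidstone-smoothed multinomial likelihood for feature type ft
        score *= (c.get(v, 0) + lam) / (sum(c.values()) + lam * value_space[ft])
    return score
```

Normalizing these scores over all events (plus the new-event case) recovers the sampling distribution of the Gibbs step.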
For instance, for the model depicted in Figure 2(d) the posterior probability is given by:\n\nP(Zi,j | X) ∝ P(Zi,j) P(FRi,j | HLi,j, θ) ∏_{FT ∈ X} P(FTi,j | Z)\n\nIn this model, P(FRi,j | HLi,j, θ) is a global distribution parameterized by θ, and the feature variables considered are X = ⟨HL, POS, FR⟩.\n\nFor all these extended models, we compute the prior and likelihood factors as described in the one feature model. Also, following H&K, in the inference mechanism we assign soft counts for missing features (e.g., an unspecified PB argument).\n4 Unbounded Feature Models\nFirst, we present a generative model called the Markov Indian Buffet Process (mIBP) that provides a mechanism in which each object can be represented by a sparse subset of a potentially unbounded set of latent features [15, 14, 30].5 Then, to overcome the limitations regarding the number of mixture components and the number of features associated with objects, we combine this mechanism with an HDP model to form an mIBP-HDP hybrid. Finally, to account for temporal dependencies, we employ an mIBP extension, called the Infinite Factorial Hidden Markov Model (iFHMM) [30], in combination with an Infinite Hidden Markov Model (iHMM) to form the iFHMM-iHMM model.\n4.1 The Markov Indian Buffet Process\nAs described in [30], the mIBP defines a distribution over an unbounded set of binary Markov chains, where each chain can be associated with a binary latent feature that evolves over time according to Markov dynamics. Specifically, if we denote by M the total number of feature chains and by T the number of observable components (event mentions), the mIBP defines a probability distribution over a binary matrix F with T rows, which correspond to observations, and an unbounded number of columns (M → ∞), which correspond to features. An observation yt contains a subset of the unbounded set of features {f^1, f^2, . . . , f^M} that is represented in the matrix by a binary vector Ft = ⟨F^1_t, F^2_t, . . . , F^M_t⟩, where F^i_t = 1 indicates that f^i is associated with yt. Therefore, F decomposes the observations and represents them as feature factors, which can then be associated with hidden variables in an iFHMM as depicted in Figure 3(a). The transition matrix of the binary Markov chain associated with a feature f^m is defined as\n\nW(m) = ( 1−a_m  a_m ; 1−b_m  b_m )\n\nwhere W(m)_ij = P(F^m_{t+1} = j | F^m_t = i), the parameters a_m ∼ Beta(α′/M, 1) and b_m ∼ Beta(γ′, δ′), and the initial state F^m_0 = 0. In the generative process, the hidden variable of feature f^m for an object yt is drawn as F^m_t ∼ Bernoulli(a_m^{1−F^m_{t−1}} b_m^{F^m_{t−1}}).\nTo compute the probability of the feature matrix F6, in which the parameters a and b are integrated out analytically, we use the counting variables c^00_m, c^01_m, c^10_m, and c^11_m to record the 0 → 0, 0 → 1, 1 → 0, and 1 → 1 transitions f^m has made in the binary chain m. The stochastic process that derives the probability distribution in terms of these variables is defined as follows. The first component samples a number of Poisson(α′) features. In general, depending on the value that was sampled in the previous step (t − 1), a feature f^m is sampled for the tth component according to the following probabilities:\n\nP(F^m_t = 1 | F^m_{t−1} = 1) = (c^11_m + δ′) / (γ′ + δ′ + c^10_m + c^11_m)\nP(F^m_t = 1 | F^m_{t−1} = 0) = c^01_m / (c^00_m + c^01_m)\n\nThe tth component then repeats the same mechanism for sampling the next features until it finishes the current number of sampled features M. 
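The count-based transition sampling for features already instantiated can be sketched as follows (a toy version under our reading of the counts; the Poisson-distributed brand-new features of step t, described next, are handled separately):

```python
import random

def sample_row_existing(t, F, c, gamma_p, delta_p):
    """Sample F[t][m] for the M features seen so far in an mIBP-style process.

    F    -- feature matrix built so far, F[t][m] in {0, 1}
    c[m] -- transition counts [c00, c01, c10, c11] of binary chain m
    """
    row = []
    for m, (c00, c01, c10, c11) in enumerate(c):
        prev = F[t - 1][m] if t > 0 else 0          # chains start in state 0
        if prev == 1:
            # stay on: smoothed 1 -> 1 transition frequency
            p_on = (c11 + delta_p) / (gamma_p + delta_p + c10 + c11)
        else:
            # switch on: empirical 0 -> 1 transition frequency
            p_on = c01 / (c00 + c01) if (c00 + c01) > 0 else 0.0
        row.append(1 if random.random() < p_on else 0)
    return row
```

After this loop, a real implementation would also draw the Poisson number of new chains and update the counts for the sampled row.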
After all features are sampled for the tth component, a number of Poisson(α′/t) new features are assigned to this component and M is incremented accordingly.\n4.2 The mIBP-HDP Model\nOne direct application of the mIBP is to integrate it into the HDP models proposed in Section 3. In this way, the new nonparametric extension has the benefit of capturing uncertainty regarding the number of mixture components, which are characterized by a potentially infinite number of features. Since one observable component is associated with an unbounded countable set of features, we have to provide a mechanism in which only a finite set of features represents the component in the HDP inference process.\n\n5 In this section, a feature is represented by a (feature type:feature value) pair.\n6 Technical details for computing this probability are described in [30].\n\nFigure 3: (a) The Infinite Factorial Hidden Markov Model. (b) The iFHMM-iHMM model. (M → ∞)\n\nThe idea behind this mechanism is to use slice sampling7 [21] in order to derive a finite set of features for yt. Letting qm be the number of times feature f^m was sampled in the mIBP, and vt an auxiliary variable for yt such that vt ∼ Uniform(1, max{qm | F^m_t = 1}), we define the finite feature set Bt for the observation yt as:\n\nBt = {f^m | F^m_t = 1 ∧ qm ≥ vt}\n\nThe finiteness of this feature set is based on the observation that, in the generative process of the mIBP, only a finite set of features is sampled for a component. 
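The slice-based selection of Bt can be sketched directly from its definition (variable names are illustrative):

```python
import random

def representative_features(t, F, q):
    """B_t: active features of mention t whose usage count q[m] clears a
    uniformly drawn threshold v_t, yielding a finite, salient subset."""
    active = [m for m, on in enumerate(F[t]) if on == 1]
    if not active:
        return set()
    v_t = random.uniform(1, max(q[m] for m in active))
    return {m for m in active if q[m] >= v_t}
```

Because v_t is at most the largest usage count among t's active features, the most frequently sampled feature always survives the cut.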
Another observation worth mentioning regarding the way this set is constructed is that only the most representative features of yt get selected in Bt.\n4.3 The iFHMM-iHMM Model\nThe iFHMM is a nonparametric Bayesian factor model that extends the Factorial Hidden Markov Model (FHMM) [13] by letting the number of parallel Markov chains M be learned from the data. Although the iFHMM allows a more flexible representation of the latent structure, it cannot be used as a framework where the number of clustering components K is infinite. On the other hand, the iHMM represents a nonparametric extension of the Hidden Markov Model (HMM) [27] that allows performing inference on an infinite number of states K. In order to further increase the representational power for modeling discrete time series data, we propose a nonparametric extension that combines the best of the two models and lets the parameters M and K be learned from the data. Each step in the new generative process, whose graphical representation is depicted in Figure 3(b), is performed in two phases: (i) the latent feature variables from the iFHMM framework are sampled using the mIBP mechanism; and (ii) the features sampled so far, which become observable during this second phase, are used in an adapted beam sampling algorithm [29] to infer the clustering components (or, in our case, latent events).\n\nTo describe the beam sampler for event coreference resolution, we introduce additional notation. We denote by (s1, . . . , sT) the sequence of hidden states corresponding to the sequence of event mentions (y1, . . . , yT), where each state st belongs to one of the K events, st ∈ {1, . . . , K}, and each mention yt is represented by a sequence of latent features ⟨F^1_t, F^2_t, . . . , F^M_t⟩. One element of the transition probability π is defined as πij = P(st = j | st−1 = i), and a mention yt is generated according to a likelihood model F that is parameterized by a state-dependent parameter φ_{st} (yt | st ∼ F(φ_{st})). The observation parameters φ are drawn iid from a prior base distribution H.\nThe beam sampling algorithm combines the ideas of slice sampling and dynamic programming for an efficient sampling of state trajectories. Since in time series models the transition probabilities have independent priors [5], Van Gael and colleagues [29] also used the HDP mechanism to allow couplings across transitions. For sampling the whole hidden state trajectory s, this algorithm employs a forward filtering-backward sampling technique.\n\nIn the forward step of our implementation, we sample the feature variables using the mIBP as described in Section 4.1, and the auxiliary variable ut ∼ Uniform(0, π_{st−1 st}) for each mention yt. As explained in [29], the auxiliary variables u are used to filter only those trajectories s for which\n\n7 The idea of using this procedure is inspired from [29], where a slice variable was used to sample a finite number of state trajectories in the iHMM.\n\n
Also, in this step, we compute the probabilities P (st | y1:t, u1:t) for all t as\ndescribed in [29]:\n\nP (st | y1:t, u1:t) \u221d P (yt | st) Xst\u22121:ut<\u03c0st\u22121 st\n\nP (st\u22121 | y1:t\u22121, u1:t\u22121)\n\nHere, the dependencies involving parameters \u03c0 and \u03c6 are omitted for clarity.\nIn the backward step, we \ufb01rst sample the event for the last state sT directly from P (sT | y1:T , u1:T )\nand then, for all t : T \u2212 1, 1, we sample each state st given st+1 by using the formula P (st |\nst+1, y1:T , u1:T) \u221d P (st|y1:t, u1:t)P (st+1|st, ut+1).\nTo sample the emission distribution \u03c6 ef\ufb01ciently, and to ensure that each mention is characterized\nby a \ufb01nite set of representative features, we set the base distribution H to be conjugate with the\ndata distribution F in a Dirichlet-multinomial model with the suf\ufb01cient statistics of the multinomial\ndistribution (o1, . . . , oK ) de\ufb01ned as:\n\nok =\n\nT\n\nXt=1 Xf m\u2208Bt\n\nnmk\n\nwhere nmk counts how many times feature f m was sampled for event k, and Bt stores a \ufb01nite set\nof features for yt as it is de\ufb01ned in Section 4.2.\n5 Evaluation\nEvent Coreference Data One corpus used for evaluation is ACE 2005 [18]. This corpus annotates\nwithin-document coreference information of speci\ufb01c types of events (such as Con\ufb02ict, Justice, and\nLife). After an initial processing phase, we extracted from ACE 6553 event mentions and 4946\nevents. To increase the diversity of events and to evaluate the models for both within- and cross-\ndocument event coreference, we created the EventCorefBank corpus (ECB).8 This new corpus con-\ntains 43 topics, 1744 event mentions, 1302 within-document events, and 339 cross-document events.\n\nFor a more realistic approach, we trained the models on all the event mentions from the two corpora\nand not only on the mentions manually annotated for event coreference (the true event mentions). 
In this regard, we ran the event identifier described in [6] on the ACE and ECB corpora, and extracted 45289 and 21175 system mentions, respectively.\nThe Experimental Setup Table 2 lists the recall (R), precision (P), and F-score (F) of our experiments averaged over 5 runs of the generative models. Since there is no agreement on the best coreference resolution metric, we employed four metrics for our evaluation: the link-based MUC metric [31], the mention-based B3 metric [2], the entity-based CEAF metric [19], and the pairwise F1 (PW) metric. In the evaluation process, we considered only the true mentions of the ACE test dataset and of the test sets of a 5-fold cross validation scheme on the ECB dataset. For evaluating the cross-document coreference annotations, we adopted the same approach as described in [3]: merging all the documents from the same topic into a meta-document and then scoring this document as performed for within-document evaluation. Also, for both corpora, we considered a set of 132 feature types, where each feature type consists on average of 3900 distinct feature values.\nThe Baseline A simple baseline for event coreference consists of grouping events by their event classes [1]. To extract event classes, we employed the event identifier described in [6]. Therefore, this baseline will categorize events into a small number of clusters, since the event identifier is trained to predict the five event classes annotated in TimeBank [26]. As has already been observed [20, 11], considering very few categories for coreference resolution tasks will result in overestimates of the MUC scorer. For instance, a baseline that groups all entity mentions into the same entity achieves a higher MUC score than any published system for the task of entity coreference. Similar behaviour of the MUC metric is observed for event coreference resolution. 
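This leniency of the link-based metric is easy to verify with a toy re-implementation of MUC scoring (a sketch in the style of Vilain et al.; the mention ids are made up): a response that lumps six mentions from three gold events into a single cluster still scores 0.75 MUC F1.

```python
def muc_f1(key, response):
    """MUC F1 for clusterings given as lists of mention-id sets (toy sketch)."""
    def link_score(gold, sys):
        num = den = 0
        for cluster in gold:
            touched = sum(1 for c in sys if cluster & c)   # partitions induced
            num += len(cluster) - touched
            den += len(cluster) - 1
        return num / den if den else 0.0
    r, p = link_score(key, response), link_score(response, key)
    return 2 * p * r / (p + r) if p + r else 0.0

key = [{"m1", "m2"}, {"m3", "m4"}, {"m5", "m6"}]          # three gold events
one_cluster = [{"m1", "m2", "m3", "m4", "m5", "m6"}]      # degenerate response
print(round(muc_f1(key, one_cluster), 2))  # → 0.75
```

The degenerate response misses no links (recall 1.0) and is only mildly penalized in precision, which is exactly the overestimation discussed above.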
For example, for cross-document evaluation on ECB, a baseline that clusters all mentions into one event achieves a 73.2% MUC F-score, while the baseline listed in Table 2 achieves a 72.9% MUC F-score.\n\n8 This resource is available at http://www.hlt.utdallas.edu/~ady. The annotation process is described in [7].\n\nModel | MUC (R/P/F) | B3 (R/P/F) | CEAF (R/P/F) | PW (R/P/F)\nACE (within-document event coreference)\nBaseline | 94.3/33.1/49.0 | 97.9/25.0/39.9 | 14.7/64.4/24.0 | 93.5/8.2/15.2\nHDP1f (HL) | 62.2/43.1/50.9 | 86.0/70.6/77.5 | 62.3/76.4/68.6 | 50.5/27.7/35.8\nHDPflat | 53.5/54.2/53.9 | 83.4/84.2/83.8 | 76.9/76.5/76.7 | 43.3/47.1/45.1\nHDPstruct | 61.9/49.0/54.7 | 86.2/76.9/81.3 | 69.0/77.5/73.0 | 53.2/38.1/44.4\nmIBP-HDP | 48.7/41.9/45.1 | 81.7/76.4/79.0 | 68.8/73.8/71.2 | 37.4/28.9/32.6\niFHMM-iHMM | 48.7/48.8/48.7 | 81.9/82.2/82.1 | 74.6/74.5/74.5 | 37.2/39.0/38.1\nECB (within-document event coreference)\nBaseline | 92.2/39.8/55.6 | 97.7/55.8/71.0 | 44.5/80.1/57.2 | 93.7/25.4/39.8\nHDP1f (HL) | 46.9/54.8/50.4 | 84.3/89.0/86.5 | 83.4/79.6/81.4 | 36.6/53.4/42.6\nHDPflat | 37.8/92.9/53.4 | 82.1/99.2/89.8 | 93.9/78.2/85.3 | 27.0/92.4/41.3\nHDPstruct | 47.4/82.7/60.1 | 84.3/97.1/90.2 | 92.7/81.1/86.5 | 34.4/83.0/48.6\nmIBP-HDP | 38.2/68.8/48.9 | 82.1/95.3/88.2 | 90.3/78.5/84.0 | 26.5/67.9/37.7\niFHMM-iHMM | 39.5/85.2/53.9 | 82.5/98.1/89.6 | 93.1/78.8/85.3 | 29.4/86.6/43.7\nECB (cross-document event coreference)\nBaseline | 90.5/61.1/72.9 | 93.8/49.6/64.9 | 36.6/72.7/48.7 | 90.7/28.6/43.3\nHDP1f (HL) | 47.7/70.5/56.8 | 67.0/86.2/75.3 | 76.2/57.1/65.2 | 34.9/58.9/43.5\nHDPflat | 44.4/95.3/60.5 | 65.0/98.7/78.3 | 86.9/56.0/68.0 | 29.2/95.1/44.4\nHDPstruct | 51.9/89.5/65.7 | 69.3/95.8/80.4 | 86.2/60.1/70.8 | 37.5/85.6/52.1\nmIBP-HDP | 40.0/79.8/53.2 | 63.1/94.1/75.5 | 82.7/54.6/65.7 | 26.1/77.0/38.9\niFHMM-iHMM | 48.4/89.0/62.7 | 67.0/96.4/79.0 | 85.5/58.0/69.1 | 33.3/88.3/48.2\nTable 2: Evaluation results for within- and cross-document event coreference resolution.\n\nHDP Extensions Due to memory limitations, we evaluated the HDPflat and HDPstruct models only on a restricted subset of manually selected feature types. In general, as shown in Table 2, the HDPflat model achieved the best performance results on the ACE test dataset, whereas the HDPstruct model, which also considers dependencies between feature types, proved to be more effective on the ECB dataset for both within- and cross-document event coreference evaluation. The set of feature types used to achieve these results consists of combinations of types from all feature categories described in Section 2.2. For the results of the HDPstruct model listed in Table 2, we also explored the conditional dependencies between the HL, FR, and FEA types.\n\nAs can be observed from Table 2, the results of the HDPflat and HDPstruct models show an F-score increase of 4-10% over the HDP1f model, and therefore prove that the HDP extensions provide a more flexible representation for clustering objects characterized by rich properties.\nmIBP-HDP In spite of its advantage of working with a potentially infinite number of features in an HDP framework, the mIBP-HDP model did not achieve satisfactory performance in comparison with the other proposed models. However, these results were obtained by automatically selecting only 2% of the distinct feature values from the entire set of values extracted from both corpora. 
When\ncompared with the restricted set of features considered by the HDPf lat and HDPstruct models, the\npercentage of values selected by mIBP-HDP is only 6%. A future research area for improving this\nmodel is to consider other distributions for automatic selection of salient feature values.\niFHMM-iHMM In spite of the automatic feature selection employed for the iFHMM-iHMM model,\nits results remain competitive against the results of the HDP extensions (where the feature types\nwere hand tuned). As shown in Table 2, most of the iFHMM-iHMM results fall in between the\nHDPf lat and HDPstruct models. Also, these results indicate that the iFHMM-iHMM model is a\nbetter framework than HDP in capturing the event mention dependencies simulated by the mIBP\nfeature sampling scheme. Similar to the mIBP-HDP model, to achieve these results, the iFHMM-\niHMM model uses only 2% values from the entire set of distinct feature values. For the experiments\nof the iFHMM-iHMM results reported in Table 2, we set \u03b1\u2032=50, \u03b3 \u2032=0.5, and \u03b4\u2032=0.5.\n\n6 Conclusion\n\nIn this paper, we have described how a sequence of unsupervised, nonparametric Bayesian models\ncan be employed to cluster complex linguistic objects that are characterized by a rich set of features.\nThe experimental results proved that these models are able to solve real data applications in which\nthe feature and cluster numbers are treated as free parameters, and the selection of features is per-\nformed automatically. While the results of event coreference resolution are promising, we believe\nthat the classes of models proposed in this paper have a real utility for a wide range of applications.\n\n8\n\n\fReferences\n\n[1] David Ahn. 2006. The stages of event extraction. In Proceedings of the Workshop on Annotating and\n\nReasoning about Time and Events, pages 1\u20138.\n\n[2] Amit Bagga and Breck Baldwin. 1998. Algorithms for Scoring Coreference Chains. In Proc. 
of LREC.
[3] Amit Bagga and Breck Baldwin. 1999. Cross-Document Event Coreference: Annotations, Experiments, and Observations. In Proceedings of the ACL-99 Workshop on Coreference and its Applications.
[4] Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of COLING-ACL.
[5] Matthew J. Beal, Zoubin Ghahramani, and Carl Edward Rasmussen. 2002. The Infinite Hidden Markov Model. In Proceedings of NIPS.
[6] Cosmin Adrian Bejan. 2007. Deriving Chronological Information from Texts through a Graph-based Algorithm. In Proceedings of FLAIRS-2007.
[7] Cosmin Adrian Bejan and Sanda Harabagiu. 2008. A Linguistic Resource for Discovering Event Structures and Resolving Event Coreference. In Proceedings of LREC-2008.
[8] Cosmin Adrian Bejan and Chris Hathaway. 2007. UTD-SRL: A Pipeline Architecture for Extracting Frame Semantic Structures. In Proceedings of SemEval-2007.
[9] Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press.
[10] Thomas S. Ferguson. 1973. A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics, 1(2):209-230.
[11] Jenny Rose Finkel and Christopher D. Manning. 2008. Enforcing Transitivity in Coreference Resolution. In Proceedings of ACL/HLT-2008, pages 45-48.
[12] Stuart Geman and Donald Geman. 1984. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721-741.
[13] Z. Ghahramani and M. Jordan. 1997. Factorial Hidden Markov Models. Machine Learning, 29:245-273.
[14] Zoubin Ghahramani, T. L. Griffiths, and Peter Sollich. 2007. Bayesian Statistics 8, chapter Bayesian nonparametric latent feature models, pages 201-225. Oxford University Press.
[15] Tom Griffiths and Zoubin Ghahramani. 2006. Infinite Latent Feature Models and the Indian Buffet Process. In Proceedings of NIPS, pages 475-482.
[16] Aria Haghighi and Dan Klein. 2007. Unsupervised Coreference Resolution in a Nonparametric Bayesian Model. In Proceedings of the ACL.
[17] Kevin Humphreys, Robert Gaizauskas, and Saliha Azzam. 1997. Event Coreference for Information Extraction. In Proceedings of the Workshop on Operational Factors in Practical, Robust Anaphora Resolution for Unrestricted Texts, 35th Meeting of ACL, pages 75-81.
[18] LDC-ACE05. 2005. ACE (Automatic Content Extraction) English Annotation Guidelines for Events.
[19] X. Luo. 2005. On Coreference Resolution Performance Metrics. In Proceedings of EMNLP.
[20] X. Luo, A. Ittycheriah, H. Jing, N. Kambhatla, and S. Roukos. 2004. A Mention-Synchronous Coreference Resolution Algorithm Based On the Bell Tree. In Proceedings of ACL-2004.
[21] Radford M. Neal. 2003. Slice Sampling. The Annals of Statistics, 31:705-741.
[22] Vincent Ng. 2008. Unsupervised Models for Coreference Resolution. In Proceedings of EMNLP.
[23] Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An Annotated Corpus of Semantic Roles. Computational Linguistics, 31(1):71-105.
[24] Ron Papka. 1999. On-line New Event Detection, Clustering and Tracking. Ph.D. thesis, Department of Computer Science, University of Massachusetts.
[25] Hoifung Poon and Pedro Domingos. 2008. Joint Unsupervised Coreference Resolution with Markov Logic. In Proceedings of EMNLP.
[26] J. Pustejovsky, P. Hanks, R. Sauri, A. See, R. Gaizauskas, A. Setzer, D. Radev, B. Sundheim, D. Day, L. Ferro, and M. Lazo. 2003. The TimeBank Corpus. In Corpus Linguistics, pages 647-656.
[27] Lawrence R. Rabiner. 1989. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. In Proceedings of the IEEE, pages 257-286.
[28] Yee Whye Teh, Michael Jordan, Matthew Beal, and David Blei. 2006. Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 101(476):1566-1581.
[29] Jurgen Van Gael, Yunus Saatci, Yee Whye Teh, and Zoubin Ghahramani. 2008. Beam Sampling for the Infinite Hidden Markov Model. In Proceedings of ICML, pages 1088-1095.
[30] Jurgen Van Gael, Yee Whye Teh, and Zoubin Ghahramani. 2008. The Infinite Factorial Hidden Markov Model. In Proceedings of NIPS.
[31] Marc Vilain, John Burger, John Aberdeen, Dennis Connolly, and Lynette Hirschman. 1995. A Model-Theoretic Coreference Scoring Scheme. In Proceedings of MUC-6, pages 45-52.