{"title": "Fast Exact Inference with a Factored Model for Natural Language Parsing", "book": "Advances in Neural Information Processing Systems", "page_first": 3, "page_last": 10, "abstract": null, "full_text": "Fast Exact Inference with a Factored Model for Natural Language Parsing\n\nDan Klein\nDepartment of Computer Science\nStanford University\nStanford, CA 94305-9040\nklein@cs.stanford.edu\n\nChristopher D. Manning\nDepartment of Computer Science\nStanford University\nStanford, CA 94305-9040\nmanning@cs.stanford.edu\n\nAbstract\n\nWe present a novel generative model for natural language tree structures in which semantic (lexical dependency) and syntactic (PCFG) structures are scored with separate models. This factorization provides conceptual simplicity, straightforward opportunities for separately improving the component models, and a level of performance comparable to similar, non-factored models. Most importantly, unlike other modern parsing models, the factored model admits an extremely effective A* parsing algorithm, which enables efficient, exact inference.\n\n1 Introduction\n\nSyntactic structure has standardly been described in terms of categories (phrasal labels and word classes), with little mention of particular words. This is possible since, with the exception of certain common function words, the acceptable syntactic configurations of a language are largely independent of the particular words that fill out a sentence. Conversely, for resolving the important attachment ambiguities of modifiers and arguments, lexical preferences are known to be very effective. Additionally, methods based only on key lexical dependencies have been shown to be very effective in choosing between valid syntactic forms [1]. 
Modern statistical parsers [2, 3] standardly use complex joint models over both category labels and lexical items, where \u201ceverything is conditioned on everything\u201d to the extent possible within the limits of data sparseness and finite computer memory. For example, the probability that a verb phrase will take a noun phrase object depends on the head word of the verb phrase. A VP headed by acquired will likely take an object, while a VP headed by agreed will likely not. There are certainly statistical interactions between syntactic and semantic structure, and, if deeper underlying variables of communication are not modeled, everything tends to be dependent on everything else in language [4]. However, the above considerations suggest that there might be considerable value in a factored model, which provides separate models of syntactic configurations and lexical dependencies, and then combines them to determine optimal parses. For example, under this view, we may know that acquired takes right dependents headed by nouns such as company or division, while agreed takes no noun-headed right dependents at all. 
If so, there is no need to explicitly model the phrasal selection on top of the lexical selection. Although we will show that such a model can indeed produce a high-performance parser, we will focus particularly on how a factored model permits efficient, exact inference, rather than the approximate heuristic inference normally used in large statistical parsers.\n\n\f[Figure 1 shows three parse structures for the sentence \u201cFactory payrolls fell in September\u201d: (a) a PCFG phrase structure, (b) a dependency structure, and (c) the combined lexicalized structure.]\n\nFigure 1: Three kinds of parse structures.\n\n2 A Factored Model\n\nGenerative models for parsing typically model one of the kinds of structures shown in figure 1. Figure 1a is a plain phrase-structure tree T, which primarily models syntactic units, figure 1b is a dependency tree D, which primarily models word-to-word selectional affinities [5], and figure 1c is a lexicalized phrase-structure tree L, which carries both category and (part-of-speech tagged) head word information at each node.\n\nA lexicalized tree can be viewed as the pair L = (T, D) of a phrase structure tree T and a dependency tree D. In this view, generative models over lexicalized trees, of the sort standard in lexicalized PCFG parsing [2, 3], can be regarded as assigning mass P(T, D) to such pairs. To the extent that dependency and phrase structure need not be modeled jointly, we can factor our model as P(T, D) = P(T) P(D): this approach is the basis of our proposed models, and its use is, to our knowledge, new. 
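Concretely, the factorization says that a candidate lexicalized tree is scored by adding the two sub-models' log-probabilities. A minimal sketch, with hypothetical toy probability tables standing in for the trained sub-models:

```python
import math

# Hypothetical toy probability tables (not the paper's trained sub-models):
# P(T) decomposes over phrase-structure rules, P(D) over dependency choices.
pcfg_rules = {('S', ('NP', 'VP')): 0.9,
              ('NP', ('NN', 'NNS')): 0.2,
              ('VP', ('VBD', 'PP')): 0.1}
dep_choices = {('fell', 'payrolls'): 0.03,
               ('fell', 'in'): 0.05,
               ('in', 'September'): 0.2}

def factored_log_score(rules, deps):
    # log P(T, D) = log P(T) + log P(D) under the factorization.
    log_t = sum(math.log(pcfg_rules[r]) for r in rules)
    log_d = sum(math.log(dep_choices[d]) for d in deps)
    return log_t + log_d
```

Because the score is a simple sum, either sub-model can be improved, smoothed, or replaced without touching the other.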
This factorization, of course, assigns mass to pairs which are incompatible, either because they do not generate the same terminal string or because they do not embody compatible bracketings. Therefore, the total mass assigned to valid structures will be less than one. We could imagine fixing this by renormalizing. For example, this situation fits into the product-of-experts framework [6], with one semantic expert and one syntactic expert that must agree on a single structure. However, since we are presently only interested in finding most-likely parses, no global renormalization constants need to be calculated.\n\nGiven the factorization P(T, D) = P(T) P(D), rather than engineering a single complex combined model, we can instead build two simpler sub-models. We show that the combination of even quite simple \u201coff the shelf\u201d implementations of the two sub-models can provide decent parsing performance. Further, the modularity afforded by the factorization makes it much easier to extend and optimize the individual components. We illustrate this by building improved versions of both sub-models, but we believe that there is room for further optimization.\n\nConcretely, we used the following sub-models. For P(T), we used successively more accurate PCFGs. The simplest, PCFG-BASIC, used the raw treebank grammar, with nonterminals and rewrites taken directly from the training trees [7]. In this model, nodes rewrite atomically, in a top-down manner, in only the ways observed in the training data. For improved models of P(T), tree nodes\u2019 labels were annotated with various contextual markers. In PCFG-PA, each node was marked with its parent\u2019s label, as in [8]. It is now well known that such annotation improves the accuracy of PCFG parsing by weakening the PCFG independence assumptions. For example, the NP in figure 1a would actually have been labeled NP\u02c6S. 
Since the counts were not fragmented by head word or head tag, we were able to use the MLE parameters directly, without smoothing.1 The best PCFG model, PCFG-LING, involved selective parent splitting, order-2 rule markovization (similar to [2, 3]), and linguistically derived feature splits.2\n\n1 This is not to say that smoothing would not improve performance, but to underscore how the factored model encounters fewer sparsity problems than a joint model.\n\n2 Infinitive VPs, possessive NPs, and gapped Ss were marked, the preposition tag was split into\n\n\f[Figure 2 depicts an edge X(h), spanning positions i to j with head position h, and the edge combination schema: an edge X(h) over (i, j) combines with an adjacent edge Y(h\u2032) over (j, k) to form Z(h) over (i, k), giving O(n^5) schema instantiations.]\n\nFigure 2: Edges and the edge combination schema for an O(n^5) lexicalized tabular parser.\n\nModels of P(D) were lexical dependency models, which deal with tagged words: pairs \u27e8w, t\u27e9. First the head \u27e8w_h, t_h\u27e9 of a constituent is generated, then successive right dependents \u27e8w_d, t_d\u27e9 until a STOP token \u22c4 is generated, then successive left dependents until \u22c4 is generated again. For example, in figure 1, first we choose fell-VBD as the head of the sentence. Then, we generate in-IN to the right, which then generates September-NN to the right, which generates \u22c4 on both sides. We then return to in-IN, generate \u22c4 to the right, and so on.\n\nThe dependency models required smoothing, as word-word dependency data is very sparse. In our basic model, DEP-BASIC, we generate a dependent conditioned on the head and direction, using a mixture of two generation paths: a head can select a specific argument word, or a head can select only an argument tag. For head selection of words, there is a prior distribution over dependents taken by the head\u2019s tag, for example, left dependents taken by past tense verbs: P(w_d, t_d | t_h, dir) = count(w_d, t_d, t_h, dir) / count(t_h, dir). 
Observations of bilexical pairs are taken against this prior, with some prior strength \u03ba:\n\nP(w_d, t_d | w_h, t_h, dir) = [count(w_d, t_d, w_h, t_h, dir) + \u03ba P(w_d, t_d | t_h, dir)] / [count(w_h, t_h, dir) + \u03ba]\n\nThis model can capture bilexical selection, such as the affinity between payrolls and fell. Alternately, the dependent can have only its tag selected, and then the word is generated independently: P(w_d, t_d | w_h, t_h, dir) = P(w_d | t_d) P(t_d | w_h, t_h, dir). The estimates for P(t_d | w_h, t_h, dir) are similar to the above. These two mixture components are then linearly interpolated, giving just two prior strengths and a mixing weight to be estimated on held-out data.\n\nIn the enhanced dependency model, DEP-VAL, we condition not only on direction, but also on distance and valence. The decision of whether to generate \u22c4 is conditioned on one of five values of the distance between the head and the generation point: zero, one, 2\u20135, 6\u201310, and 11+. If we decide to generate a non-\u22c4 dependent, the actual choice of dependent is sensitive only to whether the distance is zero or not. That is, we model only zero/non-zero valence. Note that this is (intentionally) very similar to the generative model of [2] in broad structure, but substantially less complex.\n\nAt this point, one might wonder what has been gained. By factoring the semantic and syntactic models, we have certainly simplified both (and fragmented the data less), but there are always simpler models, and researchers have adopted complex ones because of their parsing accuracy. 
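To make the smoothed head-word generation path above concrete, here is a toy sketch; the tag-generation path and the final interpolation are omitted, and the observations and prior strength \u03ba are illustrative, not the paper's:

```python
from collections import Counter

# Toy observations: (head_word, head_tag, direction, dep_word, dep_tag).
obs = [('fell', 'VBD', 'L', 'payrolls', 'NNS'),
       ('fell', 'VBD', 'R', 'in', 'IN'),
       ('fell', 'VBD', 'R', 'on', 'IN'),
       ('rose', 'VBD', 'L', 'payrolls', 'NNS')]

c_full = Counter((d, t, h, ht, dr) for h, ht, dr, d, t in obs)  # count(w_d, t_d, w_h, t_h, dir)
c_tagd = Counter((d, t, ht, dr) for h, ht, dr, d, t in obs)     # count(w_d, t_d, t_h, dir)
c_head = Counter((h, ht, dr) for h, ht, dr, d, t in obs)        # count(w_h, t_h, dir)
c_htag = Counter((ht, dr) for h, ht, dr, d, t in obs)           # count(t_h, dir)

def p_dep(wd, td, wh, th, dr, kappa=1.0):
    # Prior: the head's tag alone selects the dependent word/tag pair.
    prior = c_tagd[(wd, td, th, dr)] / c_htag[(th, dr)]
    # Bilexical counts are taken against the prior with strength kappa.
    return (c_full[(wd, td, wh, th, dr)] + kappa * prior) / (c_head[(wh, th, dr)] + kappa)
```

For an unseen head like rose taking in to the right, the estimate backs off entirely to the tag-conditioned prior.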
In the remainder of the paper, we demonstrate the three primary benefits of our model: a fast, exact parsing algorithm; parsing accuracy comparable to non-factored models; and useful modularity which permits easy extensibility.\n\nseveral subtypes, conjunctions were split into contrastive and other occurrences, and the word not was given a unique tag. In all models, unknown words were modeled using only the MLE of P(tag | unknown), with ML estimates for the reserved mass per tag. Selective splitting was done using an information-gain-like criterion.\n\n\f3 An A* Parser\n\nIn this section, we outline an efficient algorithm for finding the Viterbi, or most probable, parse for a given terminal sequence in our factored lexicalized model. The naive approach to lexicalized PCFG parsing is to act as if the lexicalized PCFG is simply a large nonlexical PCFG, with many more symbols than its nonlexicalized PCFG backbone. For example, while the original PCFG might have a symbol NP, the lexicalized one has a symbol NP-x for every possible head x in the vocabulary. Further, rules like S \u2192 NP VP become a family of rules S-x \u2192 NP-y VP-x.3 Within a dynamic program, the core parse item in this case is the edge, shown in figure 2, which is specified by its start, end, root symbol, and head position.4 Adjacent edges combine to form larger edges, as in the top of figure 2. There are O(n^3) edges, and two edges are potentially compatible whenever the left one ends where the right one starts. Therefore, there are O(n^5) such combinations to check, giving an O(n^5) dynamic program.5\n\nThe core of our parsing algorithm is a tabular agenda-based parser, using the O(n^5) schema above. The novelty is in the choice of agenda priority, where we exploit the rapid parsing algorithms available for the sub-models to speed up the otherwise impractical combined parse. 
Our choice of priority also guarantees optimality, in the sense that when the goal edge is removed, its most probable parse is known exactly. Other lexicalized parsers accelerate parsing in ways that destroy this optimality guarantee. The top-level procedure is given in figure 3. First, we parse exhaustively with the two sub-models, not to find complete parses, but to find the best outside score for each edge e. An outside score is the score of the best parse structure which starts at the goal and includes e, the words before it, and the words after it, as depicted in figure 3. Outside scores are a Viterbi analog of the standard outside probabilities given by the inside-outside algorithm [11]. For the syntactic model, P(T), well-known cubic PCFG parsing algorithms are easily adapted to find outside scores. For the semantic model, P(D), there are several presentations of cubic dependency parsing algorithms, including [9] and [12]. These can also be adapted to produce outside scores in cubic time, though since their basic data structures are not edges, there is some subtlety. For space reasons, we omit the details of these phases.\n\nAn agenda-based parser tracks all edges that have been constructed at a given time. When an edge is first constructed, it is put on an agenda, which is a priority queue indexed by some score for that edge. The agenda is a holding area for edges which have been built in at least one way, but which have not yet been used in the construction of other edges. The core cycle of the parser is to remove the highest-priority edge from the agenda and act on it according to the edge combination schema, combining it with any previously removed, compatible edges. This much is common to many parsers; agenda-based parsers primarily differ in their choice of edge priority. If the best known inside score for an edge is used as its priority, then the parser will be optimal. 
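Such an agenda loop can be sketched as follows. This toy uses a plain weighted CFG rather than the paper's lexicalized O(n^5) schema, and the grammar, scores, and sentence are illustrative stand-ins, but the removal-by-best-inside-score discipline is the one just described:

```python
import heapq
import math

# Toy lexicon and binarized grammar with log-probability scores.
lexicon = {'Factory': [('NN', 0.0)],
           'payrolls': [('NNS', 0.0)],
           'fell': [('VBD', 0.0)]}
binary = {('NN', 'NNS'): [('NP', math.log(0.9))],
          ('NP', 'VBD'): [('S', math.log(0.8))]}

def agenda_parse(words, goal='S'):
    n = len(words)
    best, done, agenda = {}, [], []

    def push(edge, score):
        # Re-push an edge whenever a better inside score is found.
        if score > best.get(edge, -math.inf):
            best[edge] = score
            heapq.heappush(agenda, (-score, edge))  # max-heap via negation

    for i, w in enumerate(words):
        for tag, lp in lexicon[w]:
            push((tag, i, i + 1), lp)
    while agenda:
        neg, edge = heapq.heappop(agenda)
        if -neg < best[edge]:
            continue                     # stale agenda entry
        if edge == (goal, 0, n):
            return -neg                  # exact Viterbi log-probability
        lab, i, j = edge
        done.append(edge)
        # Combine with previously removed, adjacent edges.
        for lab2, k, m in list(done):
            if m == i:                   # other edge is on the left
                for parent, lp in binary.get((lab2, lab), []):
                    push((parent, k, j), lp + best[(lab2, k, m)] + best[edge])
            if j == k:                   # other edge is on the right
                for parent, lp in binary.get((lab, lab2), []):
                    push((parent, i, m), lp + best[edge] + best[(lab2, k, m)])
    return None
```

Because log-probabilities only decrease as edges grow, each edge is finalized with its best inside score when popped, mirroring the Dijkstra-style argument in the text.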
In particular, when the goal edge is removed, its score will correspond to the most likely parse. The proof is a generalization of the proof of Dijkstra\u2019s algorithm (uniform-cost search), and is omitted for space reasons\n\n3 The score of such a rule in the factored model would be the PCFG score for S \u2192 NP VP, combined with the score for x taking y as a dependent and the left and right STOP scores for y.\n\n4 The head position variable often, as in our case, also specifies the head\u2019s tag.\n\n5 Eisner and Satta [9] propose a clever O(n^4) modification which separates this process into two steps by introducing an intermediate object. However, even the O(n^4) formulation is impractical for exhaustive parsing with broad-coverage, lexicalized treebank grammars. There are several reasons for this: the constant factor due to the grammar is huge (these grammars often contain tens of thousands of rules once binarized), and larger sentences are more likely to contain structures which unlock increasingly large regions of the grammar ([10] describes how this can cause the sentence length to leak into terms which are analyzed as constant, leading to empirical growth far faster than the predicted bounds). We did implement a version of this parser using the O(n^4) formulation of [9], but, because of the effectiveness of the A* estimate, it was only marginally faster; see section 4.\n\n\f1. Extract the PCFG sub-model and set up the PCFG parser.\n2. Use the PCFG parser to find outside scores \u03b1_PCFG(e) for each edge.\n3. Extract the dependency sub-model and set up the dependency parser.\n4. Use the dependency parser to find outside scores \u03b1_DEP(e) for each edge.\n5. Combine the PCFG and dependency sub-models into the lexicalized model.\n6. Form the combined outside estimate a(e) = \u03b1_PCFG(e) + \u03b1_DEP(e).\n7. 
Use the lexicalized A* parser, with a(e) as an A* estimate of \u03b1(e).\n\n[The accompanying diagram shows an edge e over the words, with its outside score \u03b1 above and inside score \u03b2 below.]\n\nFigure 3: The top-level algorithm and an illustration of inside and outside scores.\n\nPCFG Model    Precision  Recall  F1    Exact Match\nPCFG-BASIC    75.3       70.2    72.7  11.0\nPCFG-PA       78.4       76.9    77.7  18.5\nPCFG-LING     83.7       82.1    82.9  25.7\n(a) The PCFG Model\n\nDependency Model  Dependency Acc\nDEP-BASIC         76.3\nDEP-VAL           85.0\n(b) The Dependency Model\n\nFigure 4: Performance of the sub-models alone.\n\n(but given in [13]). However, removing edges by inside score is not practical (see section 4 for an empirical demonstration), because all small edges end up having better scores than any large edges. Luckily, the optimality of the algorithm remains if, rather than removing items from the agenda by their best inside scores, we add to those scores any optimistic (admissible) estimate of the cost to complete a parse using that item. The proof of this is a generalization of the proof of the optimality of A* search.\n\nTo our knowledge, no way of generating effective, admissible A* estimates for lexicalized parsing has previously been proposed.6 However, because of the factored structure of our model, we can use the results of the sub-models\u2019 parses to give us quite sharp A* estimates. Say we want to know the outside score of an edge e. That score will be the score \u03b1(T_e, D_e) (a log-probability) of a certain structure (T_e, D_e) outside of e, where T_e and D_e are a compatible pair. From the initial phases, we know the exact scores of the overall best T\u2032_e and the best D\u2032_e which can occur outside of e, though of course it may well be that T\u2032_e and D\u2032_e are not compatible. However, \u03b1_PCFG(T_e) \u2264 \u03b1_PCFG(T\u2032_e) and \u03b1_DEP(D_e) \u2264 \u03b1_DEP(D\u2032_e), and so \u03b1(T_e, D_e) = \u03b1_PCFG(T_e) + \u03b1_DEP(D_e) \u2264 \u03b1_PCFG(T\u2032_e) + \u03b1_DEP(D\u2032_e). Therefore, we can use the sum of the sub-models\u2019 outside scores, a(e) = \u03b1_PCFG(T\u2032_e) + \u03b1_DEP(D\u2032_e), as an upper bound on the outside score for the combined model. 
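In code, the combined estimate and the resulting agenda priority might look like the following; the outside- and inside-score tables here are hypothetical values for a single edge (log-scores, so larger is better):

```python
# Hypothetical best outside log-scores from the two exhaustive sub-model
# passes, plus a best known inside log-score, for one edge e = (label, i, j).
alpha_pcfg = {('NP', 1, 3): -2.0}   # best PCFG structure outside e
alpha_dep = {('NP', 1, 3): -1.5}    # best dependency structure outside e
inside = {('NP', 1, 3): -4.0}       # best known combined inside score of e

def a_e(edge):
    # a(e) = alpha_PCFG(e) + alpha_DEP(e): each term upper-bounds its
    # model's share of any compatible outside structure, so the sum
    # upper-bounds the true combined outside score alpha(e).
    return alpha_pcfg[edge] + alpha_dep[edge]

def priority(edge):
    # A* agenda priority: best known inside score plus the admissible
    # (never pessimistic) estimate of the outside completion.
    return inside[edge] + a_e(edge)
```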
Since it is reasonable to assume that the two models will be broadly compatible and will generally prefer similar structures, this should create a sharp A* estimate, and greatly reduce the work needed to find the goal parse. We give empirical evidence of this in section 4.\n\n4 Empirical Performance\n\nIn this section, we demonstrate that (i) the factored model\u2019s parsing performance is comparable to non-factored models which use similar features, (ii) there is an advantage to exact inference, and (iii) the A* savings are substantial. First, we give parsing figures on the standard Penn treebank parsing task. We trained the two sub-models, separately, on sections 02\u201321 of the WSJ section of the treebank. The numbers reported here are the result of testing on section 23 (length \u2264 40). The treebank only supplies node labels (like NP) and\n\n6 The basic idea of changing edge priorities to more effectively guide parser work is standardly used, and other authors have made very effective use of inadmissible estimates. [2] uses extensive probabilistic pruning \u2013 this amounts to giving pruned edges infinitely low priority. Absolute pruning can, and does, prevent the most likely parse from being returned at all. [14] removes edges in order of estimates of their correctness. This, too, may result in the first parse found not being the most likely parse, but it has another, more subtle drawback: if we hold back an edge e for too long, we may use e to build another edge f in a new, better way. 
If f has already been used to construct larger edges, we must then propagate its new score upwards (which can trigger still further propagation).\n\n\fPCFG Model   Dependency Model  Precision  Recall  F1    Exact Match  Dependency Acc\nPCFG-BASIC   DEP-BASIC         78.2       80.1    79.1  16.7         87.2\nPCFG-BASIC   DEP-VAL           81.5       82.5    82.0  17.7         89.2\nPCFG-PA      DEP-BASIC         82.2       82.1    82.1  23.7         88.0\nPCFG-PA      DEP-VAL           85.0       84.0    84.5  24.8         89.7\nPCFG-LING    DEP-BASIC         84.8       85.4    85.1  30.4         90.3\nPCFG-LING    DEP-VAL           86.8       86.6    86.7  32.1         91.0\n\nPCFG Model   Dependency Model  Thresholded?  F1    Exact Match  Dependency Acc\nPCFG-LING    DEP-VAL           No            86.7  32.1         91.0\nPCFG-LING    DEP-VAL           Yes           86.5  31.9         90.8\n\nFigure 5: The combined model, with various sub-models, and with/without thresholding.\n\ndoes not contain head information. Heads were calculated for each node according to the deterministic rules given in [2]. These rules are broadly correct, but not perfect.\n\nWe effectively have three parsers: the PCFG (sub-)parser, which produces nonlexical phrase structures like figure 1a, the dependency (sub-)parser, which produces dependency structures like figure 1b, and the combination parser, which produces lexicalized phrase structures like figure 1c. The outputs of the combination parser can also be projected down to either nonlexical phrase structures or dependency structures. We score the output of our parsers in two ways. First, the phrase structure output of the PCFG and combination parsers can be compared to the treebank parses. The parsing measures standardly used for this task are labeled precision and recall.7 We also report F1, the harmonic mean of these two quantities. Second, for the dependency and combination parsers, we can score the dependency structures. 
A dependency structure D is viewed as a set of head-dependent pairs \u27e8h, d\u27e9, with an extra dependency \u27e8root, x\u27e9, where root is a special symbol and x is the head of the sentence. Although the dependency model generates part-of-speech tags as well, these are ignored for dependency accuracy. Punctuation is not scored. Since all dependency structures over n non-punctuation terminals contain n dependencies (n \u2212 1 plus the root dependency), we report only accuracy, which is identical to both precision and recall. It should be stressed that the \u201ccorrect\u201d dependency structures, though generally reasonable, are generated from the PCFG structures by linguistically motivated, but automatic and only heuristic, rules.\n\nFigure 4 shows the relevant scores for the various PCFG and dependency parsers alone.8 The valence model increases the dependency model\u2019s accuracy from 76.3% to 85.0%, and each successive enhancement improves the F1 of the PCFG models, from 72.7% to 77.7% to 82.9%. The combination parser\u2019s performance is given in figure 5. As each individual model is improved, the combination F1 is also improved, from 79.1% with the pair of basic models to 86.7% with the pair of top models. The dependency accuracy also goes up: from 87.2% to 91.0%. Note, however, that even the pair of basic models has a combined dependency accuracy higher than the enhanced dependency model alone, and the top three combinations have combined F1 better than the best PCFG model alone. For the top pair, figure 6c illustrates the F1 of the combination parser relative to the PCFG component alone, showing the unsurprising trend that the addition of the dependency model helps more for longer sentences, which, on average, contain more attachment ambiguity. 
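The dependency scoring just described reduces to set overlap. A small sketch, with toy gold and guessed dependency sets (including the root dependency) for the Figure 1 sentence:

```python
def dep_accuracy(gold, guess):
    # With the root dependency included, both structures over the same n
    # words contain exactly n pairs, so this overlap is simultaneously
    # precision, recall, and accuracy.
    assert len(gold) == len(guess)
    return len(gold & guess) / len(gold)

gold = {('root', 'fell'), ('fell', 'payrolls'), ('fell', 'in'),
        ('in', 'September'), ('payrolls', 'Factory')}
# Hypothetical output that wrongly attaches 'in' to 'payrolls'.
guess = {('root', 'fell'), ('fell', 'payrolls'), ('payrolls', 'in'),
         ('in', 'September'), ('payrolls', 'Factory')}
```

Here four of the five pairs match, so dep_accuracy(gold, guess) is 4/5 = 0.8.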
The top F1 of 86.7% is greater than that of the lexicalized parsers presented in [15, 16], but less than that of the newer, more complex parsers presented in [3, 2], which reach as high as 90.1% F1.\n\n7 A tree T is viewed as a set of constituents c(T). Constituents in the correct and the proposed tree must have the same start, end, and label to be considered identical. For this measure, the lexical heads of nodes are irrelevant. The actual measures used are detailed in [15], and involve minor normalizations like the removal of punctuation in the comparison.\n\n8 The dependency model is sensitive to any preterminal annotation (tag splitting) done by the PCFG model. The actual value of DEP-VAL shown corresponds to PCFG-LING.\n\n\f[Figure 6 plots, against sentence length: (a) edges processed (log scale) for the uniform-cost and A* parsers, (b) time in seconds for the PCFG, dependency, and combined phases, and (c) absolute F1 for the combination parser and the PCFG alone, with their ratio on a second axis.]\n\nFigure 6: (a) A* effectiveness measured by edges expanded, (b) time spent on each phase, and (c) relative F1, all shown as sentence length increases.\n\nHowever, it is worth pointing out that these higher-accuracy parsers incorporate many finely wrought enhancements which could presumably be extracted and applied to benefit our individual models.9\n\nThe primary goal of this paper is not to present a maximally tuned parser, but to demonstrate a method for fast, exact inference usable in parsing. 
Given the impracticality of exact inference for standard parsers, a common strategy is to take a PCFG backbone, extract a set of top parses, either the top k or all parses within a score threshold of the top parse, and rerank them [3, 17]. This pruning is done for efficiency; the question is whether it is hurting accuracy. That is, would exact inference be preferable? Figure 5 shows the result of parsing with our combined model, using the best model pair, but with the A* estimates altered to block parses whose PCFG projection had a score further than a threshold \u03b4 = 2 in log-probability from the best PCFG-only parse. Both bracket F1 and exact-match rate are lower for the thresholded parses, which we take as an argument for exact inference.10\n\nWe conclude with data on the effectiveness of the A* method. Figure 6a shows the average number of edges extracted from the agenda as sentence length increases. Numbers both with and without using the A* estimate are shown. Clearly, the uniform-cost version of the parser is dramatically less efficient; by sentence length 15 it extracts over 800K edges, while even at length 40 the A* heuristics are so effective that only around 2K edges are extracted. At length 10, the average number is less than 80, and the fraction of edges not suppressed is better than 1/10K (and improves as sentence length increases). To explain this effectiveness, we suggest that the combined parsing phase is really only figuring out how to reconcile the two models\u2019 preferences.11 The A* estimates were so effective that even with our object-heavy Java implementation of the combined parser, total parse time was dominated by the initial, array-based PCFG phase (see figure 6b).
12\n\n9 For example, the dependency distance function of [2] registers punctuation and verb counts, and both smooth the PCFG production probabilities, which could improve the PCFG grammar.\n\n10 While pruning typically buys speed at the expense of some accuracy (see also, e.g., [2]), pruning can also sometimes improve F1: Charniak et al. [14] find that pruning based on estimates for P(e|s) raises accuracy slightly, for a non-lexicalized PCFG. As they note, their pruning metric seems to mimic Goodman\u2019s maximum-constituents parsing [18], which maximizes the expected number of correct nodes rather than the likelihood of the entire parse. In any case, we see it as valuable to have an exact parser with which these types of questions can be investigated at all for lexicalized parsing.\n\n11 Note that the uniform-cost parser does enough work to exploit the shared structure of the dynamic program, and therefore its edge counts appear to grow polynomially. However, the A* parser does so little work that there is minimal structure-sharing. Its edge counts therefore appear to grow exponentially over these sentence lengths, just as a non-dynamic-programming parser\u2019s would. With much longer sentences, or a less efficient estimate, the polynomial behavior would reappear.\n\n12 The average time to parse a sentence with the best model on a 750MHz Pentium III with 2GB RAM was: for 20 words, PCFG 13 sec, dependencies 0.6 sec, combination 0.3 sec; for 40 words, PCFG 72 sec, dependencies 18 sec, combination 1.6 sec.\n\n\f5 Conclusion\n\nThe framework of factored models over lexicalized trees has several advantages. It is conceptually simple, and modularizes the model design and estimation problems. The concrete model presented performs comparably to other, more complex, non-exact models, and can be easily extended in the ways that other parser models have been. 
Most importantly, it admits a novel A* parsing approach which allows fast, exact inference of the most probable parse.\n\nAcknowledgements. We would like to thank Lillian Lee, Fernando Pereira, and Joshua Goodman for advice and discussion about this work. This paper is based on work supported by the National Science Foundation (NSF) under Grant No. IIS-0085896, by the Advanced Research and Development Activity (ARDA)\u2019s Advanced Question Answering for Intelligence (AQUAINT) Program, by an NSF Graduate Fellowship to the first author, and by an IBM Faculty Partnership Award to the second author.\n\nReferences\n\n[1] D. Hindle and M. Rooth. Structural ambiguity and lexical relations. Computational Linguistics, 19(1):103\u2013120, 1993.\n\n[2] M. Collins. Head-Driven Statistical Models for Natural Language Parsing. PhD thesis, University of Pennsylvania, 1999.\n\n[3] E. Charniak. A maximum-entropy-inspired parser. NAACL 1, pp. 132\u2013139, 2000.\n\n[4] R. Bod. What is the minimal set of fragments that achieves maximal parse accuracy? ACL 39, pp. 66\u201373, 2001.\n\n[5] I. A. Mel\u2019\u010duk. Dependency Syntax: Theory and Practice. State University of New York Press, Albany, NY, 1988.\n\n[6] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Technical Report GCNU TR 2000-004, GCNU, University College London, 2000.\n\n[7] E. Charniak. Tree-bank grammars. Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI \u201996), pp. 1031\u20131036, 1996.\n\n[8] M. Johnson. PCFG models of linguistic tree representations. Computational Linguistics, 24(4):613\u2013632, 1998.\n\n[9] J. Eisner and G. Satta. Efficient parsing for bilexical context-free grammars and head-automaton grammars. ACL 37, pp. 457\u2013464, 1999.\n\n[10] D. Klein and C. D. Manning. Parsing with treebank grammars: Empirical bounds, theoretical models, and the structure of the Penn treebank. 
ACL 39/EACL 10, pp. 330\u2013337, 2001.\n\n[11] J. K. Baker. Trainable grammars for speech recognition. In D. H. Klatt and J. J. Wolf, editors, Speech Communication Papers for the 97th Meeting of the Acoustical Society of America, pp. 547\u2013550, 1979.\n\n[12] J. Lafferty, D. Sleator, and D. Temperley. Grammatical trigrams: A probabilistic model of link grammar. Proc. AAAI Fall Symposium on Probabilistic Approaches to Natural Language, 1992.\n\n[13] D. Klein and C. D. Manning. Parsing and hypergraphs. Proceedings of the 7th International Workshop on Parsing Technologies (IWPT-2001), 2001.\n\n[14] E. Charniak, S. Goldwater, and M. Johnson. Edge-based best-first chart parsing. Proceedings of the Sixth Workshop on Very Large Corpora, pp. 127\u2013133, 1998.\n\n[15] D. M. Magerman. Statistical decision-tree models for parsing. ACL 33, pp. 276\u2013283, 1995.\n\n[16] M. J. Collins. A new statistical parser based on bigram lexical dependencies. ACL 34, pp. 184\u2013191, 1996.\n\n[17] M. Collins. Discriminative reranking for natural language parsing. ICML 17, pp. 175\u2013182, 2000.\n\n[18] J. Goodman. Parsing algorithms and metrics. ACL 34, pp. 177\u2013183, 1996.\n", "award": [], "sourceid": 2325, "authors": [{"given_name": "Dan", "family_name": "Klein", "institution": null}, {"given_name": "Christopher", "family_name": "Manning", "institution": null}]}