{"title": "Discriminative Log-Linear Grammars with Latent Variables", "book": "Advances in Neural Information Processing Systems", "page_first": 1153, "page_last": 1160, "abstract": null}

Discriminative Log-Linear Grammars with Latent Variables

Slav Petrov and Dan Klein
Computer Science Department, EECS Division
University of California at Berkeley, Berkeley, CA 94720
{petrov, klein}@cs.berkeley.edu

Abstract

We demonstrate that log-linear grammars with latent variables can be practically trained using discriminative methods. Central to efficient discriminative training is a hierarchical pruning procedure which allows feature expectations to be efficiently approximated in a gradient-based procedure. We compare L1 and L2 regularization and show that L1 regularization is superior, requiring fewer iterations to converge and yielding sparser solutions. On full-scale treebank parsing experiments, the discriminative latent models outperform both the comparable generative latent models and the discriminative non-latent baselines.

1 Introduction

In recent years, latent annotation of PCFGs has been shown to perform as well as or better than standard lexicalized methods for treebank parsing [1, 2]. In the latent annotation scenario, we imagine that the observed treebank is a coarse trace of a finer, unobserved grammar. For example, the single treebank category NP (noun phrase) may be better modeled by several finer categories representing subject NPs, object NPs, and so on. At the same time, discriminative methods have consistently provided advantages over their generative counterparts, including fewer restrictions on features and greater accuracy [3, 4, 5]. In this work, we therefore investigate discriminative learning of latent PCFGs, hoping to gain the best from both lines of work.

Discriminative methods for parsing are not new.
However, most discriminative methods, at least those which globally trade off feature weights, require repeated parsing of the training set, which is generally impractical. Previous work on end-to-end discriminative parsing has therefore resorted to "toy setups," considering only sentences of length 15 [6, 7, 8] or extremely small corpora [9]. To get the benefits of discriminative methods, it has therefore become common practice to extract n-best candidate lists from a generative parser and then use a discriminative component to rerank this list. In such an approach, repeated parsing of the training set can be avoided because the discriminative component only needs to select the best tree from a fixed candidate list. While most state-of-the-art parsing systems apply this hybrid approach [10, 11, 12], it has the limitation that the candidate list often does not contain the correct parse tree. For example, 41% of the correct parses were not in the candidate pool of ≈30-best parses in [10].

In this paper we present a hierarchical pruning procedure that exploits the structure of the model and allows feature expectations to be efficiently approximated, making discriminative training of full-scale grammars practical. We present a gradient-based procedure for training a discriminative grammar on the entire WSJ section of the Penn Treebank (roughly 40,000 sentences containing 1 million words). We then compare L1 and L2 regularization and show that L1 regularization is superior, requiring fewer iterations to converge and yielding sparser solutions.
Independent of the regularization, discriminative grammars significantly outperform their generative counterparts in our experiments.

Figure 1: (a) The original tree. (b) The (binarized) X-bar tree. (c) The annotated tree.

2 Grammars with latent annotations

Context-free grammars (CFGs) underlie most high-performance parsers in one way or another [13, 12, 14]. However, a CFG which simply takes the empirical productions and probabilities off of a treebank does not perform well. This naive grammar is a poor one because its context-freedom assumptions are too strong in some places and too weak in others. Therefore, a variety of techniques have been developed to both enrich and generalize the naive grammar. Recently, an automatic state-splitting approach was shown to produce state-of-the-art performance [2, 14]. We extend this line of work by investigating discriminative estimation techniques for automatically refined grammars.

We consider grammars that are automatically derived from a raw treebank. Our experiments are based on a completely unsplit X-bar grammar, obtained directly from the Penn Treebank by the binarization procedure shown in Figure 1. For each local tree rooted at an evaluation category X, we introduce a cascade of new nodes labeled X so that each has two children, in a right-branching fashion. Each node is then refined with a latent variable, splitting each observed category into k unobserved subcategories.
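The binarization and latent-splitting step just described can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tuple-based tree representation, the "@X" label for the intermediate cascade nodes, and the "X-i" names for latent subcategories are all assumptions made for the sketch.

```python
def binarize(tree):
    """Right-branching X-bar binarization: a local tree X -> c1 c2 ... cn
    becomes a cascade of intermediate nodes (labeled "@X" here) so that
    every node has at most two children."""
    if isinstance(tree, str):          # leaf: a word
        return tree
    label, children = tree
    if len(children) <= 2:
        return (label, [binarize(c) for c in children])
    # Keep the first child, fold the remaining children into a new @X node.
    bar = label if label.startswith("@") else "@" + label
    return (label, [binarize(children[0]), binarize((bar, children[1:]))])

def split_categories(labels, k):
    """Refine each observed category into k latent subcategories
    (illustrative naming scheme: X-0 ... X-(k-1))."""
    return {x: [f"{x}-{i}" for i in range(k)] for x in labels}
```

For example, binarizing the FRAG tree of Figure 1 turns the ternary local tree FRAG -> RB NP . into FRAG -> RB @FRAG with @FRAG -> NP ., after which each category can be split into k subcategories.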
We refer to trees over unsplit categories as parse trees and trees over split categories as derivations.

Our log-linear grammars are parametrized by a vector θ which is indexed by productions X → γ. The conditional probability of a derivation tree t given a sentence w can be written as:

    P_θ(t|w) = (1 / Z(θ, w)) ∏_{X→γ ∈ t} e^{θ_{X→γ}} = (1 / Z(θ, w)) e^{θᵀ f(t)}        (1)

where Z(θ, w) is the partition function and f(t) is a vector indicating how many times each production occurs in the derivation t. The inside/outside algorithm [15] gives us an efficient way of summing over an exponential number of derivations. Given a sentence w spanning the words w1, w2, . . . , wn = w1:n, the inside and outside scores of a (split) category A spanning (i, j) are computed by summing over all possible children B and C spanning (i, k) and (k, j) respectively:1

    S_IN(A, i, j) = Σ_{A→BC} Σ_{i<k<j} e^{θ_{A→BC}} S_IN(B, i, k) S_IN(C, k, j)
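The inside recursion above can be sketched as a CKY-style dynamic program. This is a minimal sketch under stated assumptions: the containers `binary_rules` (triples A -> B C) and `lexicon` (word -> preterminals) are illustrative names, not the paper's data structures, and preterminal productions A -> w_i serve as the base case.

```python
import math
from collections import defaultdict

def inside(words, binary_rules, lexicon, theta, root="ROOT"):
    """Inside pass for a log-linear grammar: each production X -> gamma carries
    weight exp(theta[X -> gamma]); S[(A, i, j)] sums the weights of all
    derivations of words[i:j] rooted at split category A.  The score of the
    root over the full span is the partition function Z(theta, w)."""
    n = len(words)
    S = defaultdict(float)
    # Base case: preterminal productions A -> w_i.
    for i, w in enumerate(words):
        for A in lexicon.get(w, ()):
            S[(A, i, i + 1)] += math.exp(theta[(A, w)])
    # Build longer spans bottom-up, summing over children B, C and split point k.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for (A, B, C) in binary_rules:      # rule A -> B C
                for k in range(i + 1, j):
                    S[(A, i, j)] += (math.exp(theta[(A, B, C)])
                                     * S[(B, i, k)] * S[(C, k, j)])
    return S[(root, 0, n)], S                   # Z(theta, w) and the chart
```

With all weights θ = 0 every production has weight 1, so the returned root score simply counts the derivations of the sentence, which is a convenient sanity check for the recursion.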