{"title": "Bayesian Synchronous Grammar Induction", "book": "Advances in Neural Information Processing Systems", "page_first": 161, "page_last": 168, "abstract": "We present a novel method for inducing synchronous context free grammars (SCFGs) from a corpus of parallel string pairs. SCFGs can model equivalence between strings in terms of substitutions, insertions and deletions, and the reordering of sub-strings. We develop a non-parametric Bayesian model and apply it to a machine translation task, using priors to replace the various heuristics commonly used in this field. Using a variational Bayes training procedure, we learn the latent structure of translation equivalence through the induction of synchronous grammar categories for phrasal translations, showing improvements in translation performance over previously proposed maximum likelihood models.", "full_text": "Bayesian Synchronous Grammar Induction\n\nPhil Blunsom, Trevor Cohn, Miles Osborne\nSchool of Informatics, University of Edinburgh\n10 Crichton Street, Edinburgh, EH8 9AB, UK\n\n{pblunsom,tcohn,miles}@inf.ed.ac.uk\n\nAbstract\n\nWe present a novel method for inducing synchronous context free grammars\n(SCFGs) from a corpus of parallel string pairs. SCFGs can model equivalence\nbetween strings in terms of substitutions, insertions and deletions, and the reorder-\ning of sub-strings. We develop a non-parametric Bayesian model and apply it to a\nmachine translation task, using priors to replace the various heuristics commonly\nused in this \ufb01eld. 
Using a variational Bayes training procedure, we learn the latent structure of translation equivalence through the induction of synchronous grammar categories for phrasal translations, showing improvements in translation performance over maximum likelihood models.

1 Introduction

A recent trend in statistical machine translation (SMT) has been the use of synchronous grammar based formalisms, permitting polynomial algorithms for exploring exponential forests of translation options. Current state-of-the-art synchronous grammar translation systems rely upon heuristic relative frequency parameter estimates borrowed from phrase-based machine translation [1, 2]. In this work we draw upon recent Bayesian models of monolingual parsing [3, 4] to develop a generative synchronous grammar model of translation using a hierarchical Dirichlet process (HDP) [5].

There are two main contributions of this work. The first is that we include sparse priors over the model parameters, encoding the intuition that source phrases will have few translations, and also addressing the problem of overfitting when using long multi-word translation pairs. Previous models have relied upon heuristics to implicitly bias models towards such distributions [6]. In addition, we investigate different priors based on standard machine translation models. This allows the performance benefits of these models to be combined with a principled estimation procedure.

Our second contribution is the induction of categories for the synchronous grammar using a HDP prior. Such categories allow the model to learn the latent structure of translational equivalence between strings, such as a preference to reorder adjectives and nouns when translating from French to English, or to encode that a phrase pair should be used at the beginning or end of a sentence.
Automatically induced non-terminal symbols give synchronous grammar models increased power over single non-terminal systems such as [2], while avoiding the problems of relying on noisy domain-specific parsers, as in [7]. As the model is non-parametric, the HDP prior will provide a bias towards parameter distributions using as many, or as few, non-terminals as necessary to model the training data. Following [3] we optimise a truncated variational bound on the true posterior distribution.

We evaluate the model on both synthetic data, and the real task of translating from Chinese to English, showing improvements over a maximum likelihood estimate (MLE) model. We focus on modelling the generation of a translation for a source sentence, putting aside for further work integration with common components of a state-of-the-art translation system, such as a language model and minimum error rate training [6].

While we are not aware of any previous attempts to directly induce synchronous grammars with more than a single category, a number of generatively trained machine translation models have been proposed. [8] described the ITG subclass of SCFGs and performed many experiments using MLE training to induce translation models on small corpora.

[Figure 1 shows a derivation tree with categories A and B over a Chinese source sentence; the Chinese characters are not recoverable from the extraction. The aligned English segments are: "Standing tall", "on Taihang Mountain", "is the Monument", "to the Hundred Regiments Offensive", ".".]

Figure 1: An example SCFG derivation from a Chinese source sentence which yields the English sentence: "Standing tall on Taihang Mountain is the Monument to the Hundred Regiments Offensive." (Cross-bars indicate that the child nodes have been reordered in the English target.)
Most subsequent work with ITG grammars has focused on the sub-task of word alignment [9], rather than actual translation, and has continued to use MLE trained models. A notable recent exception is [10] who used Dirichlet priors to smooth an ITG alignment model. Our results clearly indicate that MLE models considerably overfit when used to estimate synchronous grammars, while the judicious use of priors can alleviate this problem. This result raises the prospect that many MLE trained models of translation (e.g. [7, 11, 12]), previously dismissed for under-performing heuristic approaches, should be revisited.

2 Synchronous context free grammar

A synchronous context free grammar (SCFG, [13]) describes the generation of pairs of strings. A string pair is generated by applying a series of paired context-free rewrite rules of the form X → ⟨γ, φ, ∼⟩, where X is a non-terminal, γ and φ are strings of terminals and non-terminals, and ∼ specifies a one-to-one alignment between non-terminals in γ and φ. In the context of SMT, by assigning the source and target languages to the respective sides of a SCFG it is possible to describe translation as the process of parsing the source sentence, while generating the target translation [2]. In this paper we only consider binary normal-form SCFGs which allow productions to rewrite as either a pair of non-terminals, or a pair of non-empty terminal strings (these may span multiple words). Such grammars are equivalent to the inversion transduction grammars presented in [8]. Note however that our approach is general and could be used with other synchronous grammar transducers (e.g., [7]). The binary non-terminal productions can specify that the order of the child non-terminals is the same in both languages (a monotone production), or is reversed (a reordering production).
Monotone and reordering rules are written:

Z → ⟨X_1 Y_2, X_1 Y_2⟩  and  Z → ⟨X_1 Y_2, Y_2 X_1⟩

respectively, where X, Y and Z are non-terminals and the indices denote the alignment. Without loss of generality, here we add the restriction that non-terminals on the source and target sides of the grammar must have the same category. Although conceptually simple, a binary normal-form SCFG can still represent a wide range of linguistic phenomena required for translation [8]. Figure 1 shows an example derivation for Chinese to English. The grammar in this example has non-terminals A and B which distinguish between translation phrases which permit re-orderings.

3 Generative Model

A sequence of SCFG rule applications which produces both a source and a target sentence is referred to as a derivation, denoted z. The generative process of a derivation in our model is described in Table 1. First a start symbol, z1, is drawn, followed by its rule type. This rule type determines if the symbol will rewrite as a source-target translation pair, or a pair of non-terminals with either monotone or reversed order.
The process then recurses to rewrite each pair of child non-terminals.

HDP-SCFG

π | α ∼ GEM(α)                                  (draw top-level constituent prior distribution)
φS | αS, π ∼ DP(αS, π)                          (draw start-symbol distribution)
φT_z | αY ∼ Dirichlet(αY)                       (draw rule-type parameters)
φM_z | αM, π ∼ DP(αM, ππ⊤)                      (draw monotone binary production parameters)
φR_z | αR, π ∼ DP(αR, ππ⊤)                      (draw reordering binary production parameters)
φE_z | αE, P0 ∼ DP(αE, P0)                      (draw emission production parameters)
z1 | φS ∼ Multinomial(φS)                       (first draw the start symbol)
For each node i in the synchronous derivation z with category zi:
  ti | φT_zi ∼ Multinomial(φT_zi)               (draw a rule type)
  if ti = Emission then:
    ⟨e, f⟩ | φE_zi ∼ Multinomial(φE_zi)         (draw source and target phrases)
  if ti = Monotone Production then:
    ⟨zl_1 zr_2, zl_1 zr_2⟩ | φM_zi ∼ Multinomial(φM_zi)   (draw left and right (source) child constituents)
  if ti = Reordering Production then:
    ⟨zl_1 zr_2, zr_2 zl_1⟩ | φR_zi ∼ Multinomial(φR_zi)   (draw left and right (source) child constituents)

Table 1: Hierarchical Dirichlet process model of the production of a synchronous tree from a SCFG.

This continues until no non-terminals are remaining, at which point the derivation is complete and the source and target sentences can be read off. When expanding a production each decision is drawn from a multinomial distribution specific to the non-terminal, zi. This allows different non-terminals to rewrite in different ways: as an emission, reordering or monotone production.
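The generative story in Table 1 can be sketched procedurally. The following is a minimal, self-contained simulation assuming a small truncation level; the phrase table, rule-type probabilities and category weights (PHRASES, RULE_TYPE, PI) are invented stand-ins for draws from the DP priors, not the paper's learned values.

```python
import random

random.seed(0)

def stick_breaking(alpha, trunc):
    """Truncated GEM(alpha) draw: stick proportions summing to one."""
    pi, rest = [], 1.0
    for _ in range(trunc - 1):
        b = random.betavariate(1.0, alpha)
        pi.append(b * rest)
        rest *= 1.0 - b
    pi.append(rest)
    return pi

def categorical(weights):
    """Draw an index with probability proportional to the weights."""
    r = random.random() * sum(weights)
    for i, w in enumerate(weights):
        r -= w
        if r <= 0.0:
            return i
    return len(weights) - 1

# Invented stand-ins for finite draws from the priors: an emission table
# and a rule-type distribution (emit, monotone, reorder) per category.
PHRASES = {0: [("ni hao", "hello")], 1: [("zai jian", "goodbye")]}
RULE_TYPE = {0: [0.5, 0.3, 0.2], 1: [0.6, 0.2, 0.2]}
PI = stick_breaking(1.0, 2)  # top-level category weights

def expand(z, depth=0, max_depth=4):
    """Rewrite category z; returns (source phrases, target phrases)."""
    t = categorical(RULE_TYPE[z]) if depth < max_depth else 0
    if t == 0:  # emission: rewrite as a terminal phrase pair
        f, e = random.choice(PHRASES[z])
        return [f], [e]
    zl, zr = categorical(PI), categorical(PI)  # children share the inventory
    fl, el = expand(zl, depth + 1, max_depth)
    fr, er = expand(zr, depth + 1, max_depth)
    if t == 1:  # monotone production: same order on both sides
        return fl + fr, el + er
    return fl + fr, er + el  # reordering production: target children swapped

source, target = expand(categorical(PI))  # a derivation from a drawn start symbol
```

The depth cap is only there to guarantee termination in the sketch; in the model, derivation length is governed by the rule-type distribution itself.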
The prior distribution for each binary production is parametrised by π, the top-level stick-breaking weights, thereby ensuring that each production draws its children from a shared inventory of category labels. The parameters for each multinomial distribution are themselves drawn from its corresponding prior. The hyperparameters, α, αS, αY, αM, αR and αE, encode prior knowledge about the sparsity of each distribution. For instance, we can encode a preference towards longer or shorter derivations using αY, and a preference for sparse or dense translation lexicons with αE. To simplify matters we assume a single hyperparameter for productions, i.e. αP ≜ αS = αM = αR. In addition to allowing for the incorporation of prior knowledge about sparsity, the priors have been chosen to be conjugate to the multinomial distribution. In the following sections we describe and motivate our choices for each of these distributions.

3.1 Rule type distribution

The rule type distribution determines the relative likelihood of generating a terminal string pair, a monotone production, or a reordering. Synchronous grammars that allow multiple words to be emitted at the leaves of a derivation are prone to focusing probability mass on only the longest translation pairs: if a training set sentence pair can be explained by many short translation pairs, or a few long ones, the maximum likelihood solution will be to use the longest pairs. This issue is manifested by the rule type distribution assigning a high probability to emissions versus either of the binary productions, resulting in short flat derivations with few productions. We can counter this tendency by assuming a prior distribution that allows us to temper the model's preference for short derivations with large translation pairs.
We do so by setting the concentration parameter, αY, to a number greater than one, which smooths the rule type distribution.

3.2 Emission distribution

The Dirichlet process prior on the terminal emission distribution serves two purposes. Firstly the prior allows us to encode the intuition that our model should have few translation pairs. The translation pairs in our system are induced from noisy data and thus many of them will be of little use. Therefore a sparse prior should lead to these noisy translation pairs being assigned probabilities close to zero. Secondly, the base distribution P0 of the Dirichlet process can be used to include sophisticated prior distributions over translation pairs from other popular models of translation. The two structured priors we investigate in this work are IBM Model 1, and the relative frequency count estimators from phrase based translation:

IBM Model 1 (P0^m1)  IBM Model 1 [14] is a word based generative translation model that assigns a joint probability to a source and target translation pair. The model is based on a noisy channel in which we decompose the probability of f given e from the language model probability of e. The conditional model assumes a latent alignment from words in e to those in f and that the probabilities of word-to-word translations are independent:

P0^m1(f, e) = P^m1(f|e) × P(e) = P(e) × 1/(|e| + 1)^|f| × ∏_{j=1}^{|f|} ∑_{i=0}^{|e|} p(f_j|e_i) ,

where e_0 represents word insertions. We use a unigram language model for the probability P(e), and train the parameters p(f_j|e_i) using a variational approximation, similar to that which is described in Section 3.4.

Model 1 allows us to assign a prior probability to each translation pair in our model. This prior suggests that lexically similar translation pairs should have similar probabilities.
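The Model 1 base distribution above can be sketched as a joint scorer. Everything below is illustrative: model1_joint is a hypothetical helper, and the toy translation table and unigram language model values are invented; a real system would train p(f_j|e_i) as the authors describe.

```python
# Sketch of the joint base distribution
#   P0(f, e) = P(e) * (|e|+1)^{-|f|} * prod_j sum_i p(f_j | e_i),
# where alignment position i = 0 is the NULL (insertion) word.
def model1_joint(f, e, t_table, unigram):
    p_e = 1.0
    for w in e:                      # unigram language model P(e)
        p_e *= unigram.get(w, 1e-6)
    p_f = 1.0
    for fj in f:                     # marginalise the latent alignment
        null = t_table.get((fj, None), 1e-6)
        p_f *= null + sum(t_table.get((fj, ei), 1e-6) for ei in e)
    p_f /= (len(e) + 1) ** len(f)
    return p_e * p_f

# Invented toy parameters for the paper's (chapeau rouge, red cap) example.
t = {("rouge", "red"): 0.8, ("chapeau", "cap"): 0.7}
lm = {"red": 0.05, "cap": 0.05}
score = model1_joint(["chapeau", "rouge"], ["red", "cap"], t, lm)
```

Because the alignment is summed out independently per source word, a phrase pair built from high-probability word pairs inherits a high score, which is exactly the "lexically similar pairs get similar probabilities" behaviour described above.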
For example, if the French-English pairs (chapeau, cap) and (rouge, red) both have high probability, then the pair (chapeau rouge, red cap) should also.

Relative frequency (P0^RF)  Most statistical machine translation models currently in use estimate the probabilities for translation pairs using a simple relative frequency estimator. Under this model the joint probability of a translation pair is simply the number of times the source was observed to be aligned to the target in the word aligned corpus, normalised by the total number of observed pairs:

P0^RF(f, e) = C(f, e) / C(∗, ∗) ,

where C(∗, ∗) is the total number of translation pair alignments observed. Although this estimator doesn't take into account any generative process for how the translation pairs were observed, and by extension of the arguments for tree substitution grammars is biased and inconsistent [15], it has proved effective in many state-of-the-art translation systems.¹

3.3 Non-terminal distributions

We employ a structured prior for binary production rules inspired by similar approaches in monolingual grammar induction [3, 4]. The marginal distribution over non-terminals, π, is drawn from a stick-breaking prior [5]. This generates an infinite vector of scalars which sum to one and whose expected values decrease geometrically, with the rate of decay being controlled by α. The parameters of the start symbol distribution are drawn from a Dirichlet process parametrised by the stick-breaking weights, π. In addition, both the monotone and reordering production parameters are drawn from a Dirichlet process parametrised by the matrix of the expectations for each pair of non-terminals, ππ⊤, assuming independence in the prior.
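The non-terminal prior machinery can be sketched in a few lines: a truncated GEM draw and the ππ⊤ base measure over pairs of child categories. The truncation level and α value here are arbitrary choices for illustration, not the paper's settings.

```python
import random

random.seed(7)

def stick_breaking(alpha, trunc):
    """Truncated GEM(alpha): proportions sum to one, decaying in expectation."""
    pi, rest = [], 1.0
    for _ in range(trunc - 1):
        b = random.betavariate(1.0, alpha)
        pi.append(b * rest)
        rest *= 1.0 - b
    pi.append(rest)
    return pi

def production_base(pi):
    """Base measure for binary productions: the outer product pi pi^T."""
    return [[a * b for b in pi] for a in pi]

# Expected stick lengths decay geometrically; average many draws to see it.
means = [0.0] * 4
for _ in range(2000):
    for k, p in enumerate(stick_breaking(1.0, 4)):
        means[k] += p / 2000
```

Since each entry of ππ⊤ is a product of stick lengths, the base measure concentrates its mass on productions over the first few categories, which is what biases the model towards small grammars.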
This allows the model to prefer grammars with few non-terminal labels and where each non-terminal has a sparse distribution over productions.

3.4 Inference

Previous work with monolingual HDP-CFG grammars has employed either Gibbs sampling [4] or variational Bayes [3] approaches to inference. In this work we follow the mean-field approximation presented in [16, 3], truncating the top-level stick-breaking prior on the non-terminals and optimising a variational bound on the probability of the training sample. The mean-field approach offers better scaling and convergence properties than a Gibbs sampler, at the expense of increased approximation. We start with our objective, the likelihood of the observed string pairs, x = {(e, f)}:

log p(x) = log ∫ dθ ∑_z p(θ) p(x, z|θ) ≥ ∫ dθ ∑_z q(θ, z) log [ p(θ) p(x, z|θ) / q(θ, z) ] ,

¹Current translation systems more commonly use the conditional, rather than joint, estimator.

where θ = (π, φS, φM, φR, φE, φT) are our model parameters and z are the hidden derivations. We bound the above using Jensen's inequality to move the logarithm (a concave function) inside the integral and sum, and introduce the mean-field distribution q(θ, z). Assuming this distribution factorises over the model parameters and latent variables, q(θ, z) = q(θ)q(z),

log p(x) ≥ ∫ dθ q(θ) ( log [p(θ)/q(θ)] + ∑_z q(z) log [p(x, z|θ)/q(z)] ) ≜ F(q(θ), q(z)) .

Upon taking the functional partial derivatives of F(q(θ), q(z)) and equating to zero, we obtain sub-normalised summary weights for each of the factorised variational distributions: W_i ≜ exp{E_q(φ)[log φ_i]}.
For the monotone and reordering distributions these become:

WM_z(zl, zr) = exp{ψ(C(z → ⟨zl_1 zr_2, zl_1 zr_2⟩) + αP π_zl π_zr)} / exp{ψ(C(z → ⟨∗_1 ∗_2, ∗_1 ∗_2⟩) + αP)} ,

WR_z(zl, zr) = exp{ψ(C(z → ⟨zl_1 zr_2, zr_2 zl_1⟩) + αP π_zl π_zr)} / exp{ψ(C(z → ⟨∗_1 ∗_2, ∗_2 ∗_1⟩) + αP)} ,

where ψ is the digamma function and C(z → · · ·) is the expected count of rewriting symbol z using the given production. The starred rewrites in the denominators indicate a sum over any monotone or reordering production, respectively. The weights for the rule-type and emission distributions are defined similarly. The variational training cycles between optimising the q(θ) distribution by re-estimating the weights W and the stick-breaking prior π, then using these estimates, with the inside-outside dynamic programming algorithm, to calculate the q(z) distribution. Optimising the top-level stick-breaking weights has no closed form solution as a dependency is induced between the GEM prior and production distributions. [3] advocate using a gradient projection method to locally optimise this function.
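The summary weights above are simple functions of expected counts. The sketch below approximates the digamma function ψ numerically from log-gamma to stay dependency-free (a library routine such as scipy.special.digamma would normally be used); the counts and priors are toy values, not from the paper.

```python
from math import exp, lgamma

def digamma(x, h=1e-5):
    """Numerical digamma via a central difference of lgamma (sketch-grade)."""
    return (lgamma(x + h) - lgamma(x - h)) / (2.0 * h)

def summary_weights(counts, priors):
    """W_i = exp{psi(c_i + a_i) - psi(sum_j (c_j + a_j))}, sub-normalised."""
    total = digamma(sum(counts) + sum(priors))
    return [exp(digamma(c + a) - total) for c, a in zip(counts, priors)]

# Toy expected production counts for one parent symbol, with a weak prior.
w = summary_weights([3.0, 1.0], [0.5, 0.5])
```

A useful property to observe: the weights sum to less than one when counts are small (the missing mass discounts rare events), and approach a normalised distribution as counts grow.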
As our truncation levels are small, we instead use Monte-Carlo sampling to estimate a global optimum.

3.5 Prediction

The predictive distribution under our Bayesian model is given by:

p(z|x, f) = ∫ dθ p(θ|x) p(z|f, θ) ≈ ∫ dθ q(θ) p(z|f, θ) ≥ exp ∫ dθ q(θ) log p(z|f, θ) ,

where x is the training set of parallel sentence pairs, f is a testing source sentence and z its derivation.² Calculating the predictive probability even under the variational approximation is intractable, therefore we bound the approximation following [16]. The bound can then be maximised to find the best derivation, z, with the Viterbi algorithm, using the sub-normalised W parameters from the last E step of variational Bayes training as the model parameters.

4 Evaluation

We evaluate our HDP-SCFG model on both synthetic and real-world translation tasks.

Recovering a synthetic grammar  This experiment investigates the ability of our model to recover a simple synthetic grammar, using the minimum number of constituent categories. Ten thousand training pairs were generated from the following synthetic grammar, with uniform weights, which includes both reordering and ambiguous terminal distributions:

S → ⟨A_1 A_2, A_1 A_2⟩    A → ⟨a, a⟩ | ⟨b, b⟩ | ⟨c, c⟩
S → ⟨B_1 B_2, B_2 B_1⟩    B → ⟨d, d⟩ | ⟨e, e⟩ | ⟨f, f⟩
S → ⟨C_1 C_2, C_1 C_2⟩    C → ⟨g, g⟩ | ⟨h, h⟩ | ⟨i, i⟩

²The derivation specifies the translation. Alternatively we could bound the likelihood of a translation, marginalising out the derivation. However, this bound cannot be maximised tractably when e is unobserved.

Figure 2: Synthetic grammar experiments.
The HDP model correctly allocates a single binary production non-terminal and three equally weighted emission non-terminals.

                      Training            Development         Test
                      Chinese  English    Chinese  English    Chinese  English
Sentences             33164               500                 506
Segments/Words        253724   279104     3464     3752       3784     3823
Av. Sentence Length   7        8          6        7          7        7
Longest Sentence      41       45         58       62         61       56

Table 2: Chinese to English translation corpus statistics.

Figure 2 shows the emission and production distributions produced by the HDP-SCFG model,³ as well as an EM trained maximum likelihood (MLE) model. The variational inference for the HDP model was truncated at five categories; likewise the MLE model was trained with five categories. The hierarchical model finds the correct grammar. It allocates category 2 to the S category, giving it a 2/3 probability of generating a monotone production (A, C), versus 1/3 for a reordering (B). For the emission distribution the HDP model assigns category 1 to A, 3 to B and 5 to C, each of which has a posterior probability of 1/3. The stick-breaking prior biases the model towards using a small set of categories, and therefore the model correctly uses only four categories, assigning zero posterior probability mass to category 4.

The MLE model has no bias for small grammars and therefore uses all available categories to model the data. For the production distribution it creates two categories with equal posteriors to model the S category, while for emissions the model collapses categories A and C into category 1, and splits category B over 3 and 5. This grammar is more expressive than the target grammar, over-generating but including the target grammar as a subset.
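The synthetic training set for this experiment can be regenerated in a few lines. The sampler below assumes the uniform rule weights stated above; the function and variable names are invented for illustration.

```python
import random

random.seed(42)

# Terminal classes of the synthetic grammar: A, B and C each emit identical
# source/target symbols; only the B production reorders the target side.
TERMS = {"A": "abc", "B": "def", "C": "ghi"}

def sample_pair():
    """Draw one string pair from S using its three equally weighted rules."""
    cat, reorder = random.choice([("A", False), ("B", True), ("C", False)])
    left = random.choice(TERMS[cat])
    right = random.choice(TERMS[cat])
    source = [left, right]
    target = [right, left] if reorder else [left, right]
    return source, target

corpus = [sample_pair() for _ in range(10000)]
```

Note that a reordering is only observable when the two child terminals differ, which is precisely the signal that lets a learner separate B from the monotone classes A and C.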
The particular grammar found by the MLE model is dependent on the (random) initialisation and the fact that the EM algorithm can only find a local maximum; however, it will always use all available categories to model the data.

Chinese-English machine translation  The real-world translation experiment aims to determine whether the model can learn and generalise from a noisy large-scale parallel machine translation corpus, and provide performance benefits on the standard evaluation metrics. We evaluate our model on the IWSLT 2005 Chinese to English translation task [17], using the 2004 test set as development data for tuning the hyperparameters. The statistics for this data are presented in Table 2. The training data made available for this task consisted of 40k pairs of transcribed utterances, drawn from the travel domain. The translation phrase pairs that form the base of our grammar are induced using the standard alignment and translation phrase pair extraction heuristics used in phrase-based translation models [6]. As these heuristics aren't based on a generative model, and don't guarantee that the target translation will be reachable from the source, we discard those sentence pairs for which we cannot produce a derivation, leaving 33,164 sentences for training.
Model performance is evaluated using the standard Bleu4 metric [18], which measures average n-gram precision, n ≤ 4.

³No structured P0 was used in this model; rather a simple Dirichlet prior with uniform αE was employed for the emission distribution.

[The bar charts of Figure 2 (binary production and emission posterior distributions over categories 1-5, HDP vs. MLE) are not recoverable from the extraction.]

Figure 3: Tuning the Dirichlet α parameters for the emission and rule type distributions (development set).

                 MLE    Uniform P0    P0 = M1    P0 = RF
Single Category  32.9   35.5          37.1       38.7

Table 3: Test results for the model with a single non-terminal category and various emission priors (BLEU).

              MLE    P0 = RF
5 Categories  29.9   38.8

Table 4: Test set results for the hierarchical model with the variational distribution truncated at five non-terminal categories (BLEU).

We first evaluate our model using a grammar with a single non-terminal category (rendering the hierarchical prior redundant) and vary the prior P0 used for the emission parameters. For this model we investigate the effect that the emission and rule-type priors have on translation performance. Figure 3 graphs the variation in Bleu score versus the two free hyperparameters for the model with a simple uniform P0, evaluated on the development corpus. Both graphs show a convex relationship, with αY being considerably more peaked. For the αE hyperparameter the optimal value is 0.75, indicating that the emission distribution benefits from a slightly sparse distribution, but not far from the uniform value of 1.0.
The sharp curve for the αY rule-type distribution hyperparameter confirms our earlier hypothesis that the model requires considerable smoothing in order to force it to place probability mass on long derivations rather than simply placing it all on the largest translation pairs.

The optimal hyperparameter values on the development data for the two structured emission distribution priors, Model 1 (M1) and relative frequency (RF), also provide insight into the underlying models. The M1 prior has a heavy bias towards smaller translation pairs, countering the model's inherent bias. Thus the optimal value for the αY parameter is 1.0, suggesting that the two biases balance. Conversely the RF prior is biased towards larger translation pairs, reinforcing the model's bias; thus a very large value (10^6) for the αY parameter gives optimal development set performance.

Table 3 shows the performance of the single category models with each of the priors on the test set.⁴ The results show that all the Bayesian models outperform the MLE, and that non-uniform priors help considerably, with the RF prior obtaining the highest score.

In Table 4 we show the results for taking the best performing RF model from the previous experiment and increasing the variational approximation's truncation limit to five non-terminals. The αP was set to 1.0, corresponding to a sparse distribution over binary productions.⁵ Here we see that the HDP model improves slightly over the single category approximation.
However the baseline MLE model uses the extra categories to overfit the training data significantly, resulting in much poorer generalisation performance.

⁴For comparison, a state-of-the-art SCFG decoder based on the heuristic estimator, incorporating a trigram language model and using minimum error rate training achieves a BLEU score of approximately 46.

⁵As there are five non-terminal categories, an αP = 5² would correspond to a uniform distribution.

[The line plots of Figure 3 (development-set BLEU (%) against αE over 0.1-1.0 and against αY over 10^0-10^6) are not recoverable from the extraction.]

5 Conclusion

We have proposed a Bayesian model for inducing synchronous grammars and demonstrated its efficacy on both synthetic and real machine translation tasks. The sophisticated priors over the model's parameters address limitations of MLE models, most notably overfitting, and effectively model the nature of the translation task. In addition, the incorporation of a hierarchical prior opens the door to the unsupervised induction of grammars capable of representing the latent structure of translation. Our Bayesian model of translation using synchronous grammars provides a basis upon which more sophisticated models can be built, enabling a move away from the current heuristically engineered translation systems.

References

[1] Andreas Zollmann and Ashish Venugopal. Syntax augmented machine translation via chart parsing. In Proc. of the HLT-NAACL 2006 Workshop on Statistical Machine Translation, New York City, June 2006.

[2] David Chiang. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201-228, 2007.

[3] Percy Liang, Slav Petrov, Michael Jordan, and Dan Klein. The infinite PCFG using hierarchical Dirichlet processes. In Proc.
of the 2007 Conference on Empirical Methods in Natural Language Processing (EMNLP-2007), pages 688-697, Prague, Czech Republic, 2007.

[4] Jenny Rose Finkel, Trond Grenager, and Christopher D. Manning. The infinite tree. In Proc. of the 45th Annual Meeting of the ACL (ACL-2007), Prague, Czech Republic, 2007.

[5] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566-1581, 2006.

[6] Philipp Koehn, Franz Josef Och, and Daniel Marcu. Statistical phrase-based translation. In Proc. of the 3rd International Conference on Human Language Technology Research and 4th Annual Meeting of the NAACL (HLT-NAACL 2003), pages 81-88, Edmonton, Canada, May 2003.

[7] Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, and Ignacio Thayer. Scalable inference and training of context-rich syntactic translation models. In Proc. of the 44th Annual Meeting of the ACL and 21st International Conference on Computational Linguistics (COLING/ACL-2006), pages 961-968, Sydney, Australia, July 2006.

[8] Dekai Wu. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377-403, 1997.

[9] Colin Cherry and Dekang Lin. Inversion transduction grammar for joint phrasal translation modeling. In Proc. of the HLT-NAACL Workshop on Syntax and Structure in Statistical Translation (SSST 2007), Rochester, USA, 2007.

[10] Hao Zhang, Chris Quirk, Robert C. Moore, and Daniel Gildea. Bayesian learning of non-compositional phrases with synchronous parsing. In Proc. of the 46th Annual Conference of the Association for Computational Linguistics: Human Language Technologies (ACL-08:HLT), pages 97-105, Columbus, Ohio, June 2008.

[11] Daniel Marcu and William Wong.
A phrase-based, joint probability model for statistical machine translation. In Proc. of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP-2002), pages 133-139, Philadelphia, July 2002. Association for Computational Linguistics.

[12] John DeNero, Dan Gillick, James Zhang, and Dan Klein. Why generative phrase models underperform surface heuristics. In Proc. of the HLT-NAACL 2006 Workshop on Statistical Machine Translation, pages 31-38, New York City, June 2006.

[13] Philip M. Lewis II and Richard E. Stearns. Syntax-directed transduction. J. ACM, 15(3):465-488, 1968.

[14] P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, and R. L. Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263-311, 1993.

[15] Mark Johnson. The DOP estimation method is biased and inconsistent. Computational Linguistics, 28(1):71-76, 2002.

[16] Matthew Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, The Gatsby Computational Neuroscience Unit, University College London, 2003.

[17] Matthias Eck and Chiori Hori. Overview of the IWSLT 2005 evaluation campaign. In Proc. of the International Workshop on Spoken Language Translation, Pittsburgh, October 2005.

[18] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proc. of the 40th Annual Meeting of the ACL and 3rd Annual Meeting of the NAACL (ACL-2002), pages 311-318, Philadelphia, Pennsylvania, 2002.
", "award": [], "sourceid": 238, "authors": [{"given_name": "Phil", "family_name": "Blunsom", "institution": null}, {"given_name": "Trevor", "family_name": "Cohn", "institution": null}, {"given_name": "Miles", "family_name": "Osborne", "institution": null}]}