{"title": "Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs", "book": "Advances in Neural Information Processing Systems", "page_first": 144, "page_last": 152, "abstract": "Pruning can massively accelerate the computation of feature expectations in large models. However, any single pruning mask will introduce bias. We present a novel approach which employs a randomized sequence of pruning masks. Formally, we apply auxiliary variable MCMC sampling to generate this sequence of masks, thereby gaining theoretical guarantees about convergence. Because each mask is generally able to skip large portions of an underlying dynamic program, our approach is particularly compelling for high-degree algorithms. Empirically, we demonstrate our method on bilingual parsing, showing decreasing bias as more masks are incorporated, and outperforming fixed tic-tac-toe pruning.", "full_text": "Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs

Alexandre Bouchard-Côté¹ (bouchard@cs.berkeley.edu), Slav Petrov²,† (slav@google.com), Dan Klein¹ (klein@cs.berkeley.edu)
¹Computer Science Division, University of California at Berkeley, Berkeley, CA 94720
²Google Research, 76 Ninth Ave, New York, NY 10011

Abstract

Pruning can massively accelerate the computation of feature expectations in large models. However, any single pruning mask will introduce bias. We present a novel approach which employs a randomized sequence of pruning masks. Formally, we apply auxiliary variable MCMC sampling to generate this sequence of masks, thereby gaining theoretical guarantees about convergence. Because each mask is generally able to skip large portions of an underlying dynamic program, our approach is particularly compelling for high-degree algorithms. 
Empirically, we demonstrate our method on bilingual parsing, showing decreasing bias as more masks are incorporated, and outperforming fixed tic-tac-toe pruning.

1 Introduction

Many natural language processing applications, from discriminative training [18, 9] to minimum-risk decoding [16, 34], require the computation of expectations over large-scale combinatorial spaces. Problem scale comes either from large constant factors (such as the massive grammar sizes in monolingual parsing) or from high-degree algorithms (such as the many dimensions of bitext parsing). In both cases, the primary mechanism for efficiency has been pruning, wherein large regions of the search space are skipped on the basis of some computation mask. For example, in monolingual parsing, entire labeled spans may be skipped on the basis of posterior probabilities under a coarse grammar [17, 7]. Conditioned on these masks, the underlying dynamic program can be made to run arbitrarily quickly.

Unfortunately, aggressive pruning introduces biases in the resulting expectations. As an extreme example, features with low expectation may be pruned down to zero if their supporting structures are completely skipped. One option is to simply prune less aggressively and spend more time on a single, more exhaustive expectation computation, perhaps by carefully tuning various thresholds [26, 12] and using parallel computing [9, 38]. However, we present a novel alternative: randomized pruning. In randomized pruning, multiple pruning masks are used in sequence. The resulting expectation computations are averaged, so that the errors of any single mask average out across the sequence. As a result, time can be directly traded against approximation quality.

Our approach is based on the idea of auxiliary variable sampling [31], where a set of auxiliary variables formalizes the idea of a pruning mask. 
Resampling the auxiliary variables changes the mask at each iteration, so that the portion of the chart that is unconstrained at a given iteration can improve the mask for subsequent iterations. In other words, pruning decisions are continuously revisited and revised. Since our approach is formally grounded in the framework of block Gibbs sampling [33], it inherits the desirable guarantees of that framework. If one needs successively better approximations, more iterations can be performed, with a guarantee of convergence to the true expectations.

†Work done while at the University of California at Berkeley.

Figure 1: A parse tree, from which the assignment variables are extracted. A linearization into an assignment vector is shown at the right.

In practice, of course, we are only interested in the behavior after a finite number of iterations: the method would be useless if it did not outperform previous heuristics in a time range bounded by the exact computation time. Here, we investigate empirical performance on English-Chinese bitext parsing, showing that bias decreases over time. Moreover, we show that our randomized pruning outperforms standard single-mask tic-tac-toe pruning [40], achieving lower bias over a range of total computation times. Our technique is orthogonal to approaches that use parallel computation [9, 38], and can be additionally parallelized at the sentence level.

In what follows, we explain the method in the context of parsing, both because it makes the exposition more concrete and because our experiments are on similar combinatorial objects (bitext derivations). Note, however, that the applicability of this approach is in no way limited to parsing. 
The settings in which randomized pruning will be most advantageous are those in which high-order dynamic programs can be vastly sped up by masking, yet no single aggressive mask is likely to be adequate.

2 Randomized pruning

2.1 The need for expectations

Algorithms for discriminative training, consensus decoding, and unsupervised learning typically involve repeatedly computing a large number of expectations. In discriminative training of probabilistic parsers, for example [18, 32], one needs to repeatedly parse the entire training set in order to compute the necessary expected feature counts. In this setup (Figure 1), the conditional distribution of a tree-valued random variable T given a yield y(T) = w is modeled using a log-linear model: P_θ(T = t | y(T) = w) = exp{⟨θ, f(t, w)⟩ − log Z(θ, w)}, in which θ ∈ R^K is a parameter vector and f(t, w) ∈ R^K is a feature function. Training such a model involves computing the following gradient between each update of θ (skipping an easy-to-compute regularization term):

∇ log ∏_{i∈I} P_θ(T = t_i | y(T) = w_i) = ∑_{i∈I} { f(t_i, w_i) − E_θ[f(T, w_i) | y(T) = w_i] },

where {w_i : i ∈ I} are the training sentences with corresponding gold trees {t_i}.

The first term in the above equation can be computed in linear time, while the second requires a cubic-time dynamic program (the inside-outside algorithm), which computes constituent posteriors for all possible spans of words (the chart cells in Figure 1). Hence, computing expectations is indeed the bottleneck here. 
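As a toy illustration of this gradient (with an explicitly enumerated set of candidate trees standing in for the inside-outside computation; the feature dictionaries and function names below are our own hypothetical sketch, not the paper's implementation):

```python
import math

# Toy log-linear gradient: grad = f(gold) - E_theta[f], with the expectation
# computed by brute-force enumeration over a tiny candidate set. In the paper,
# this expectation is what the (pruned) inside-outside algorithm computes.
def expected_features(candidates, theta):
    scores = [math.exp(sum(theta.get(k, 0.0) * v for k, v in f.items()))
              for f in candidates]
    Z = sum(scores)  # the partition function Z(theta, w)
    exp_f = {}
    for f, s in zip(candidates, scores):
        for k, v in f.items():
            exp_f[k] = exp_f.get(k, 0.0) + (s / Z) * v
    return exp_f

def gradient(gold_features, candidates, theta):
    e = expected_features(candidates, theta)
    keys = set(gold_features) | set(e)
    return {k: gold_features.get(k, 0.0) - e.get(k, 0.0) for k in keys}
```

With theta = {} and two equally weighted candidates {'a': 1} and {'b': 1}, a gold tree with features {'a': 1} yields the gradient {'a': 0.5, 'b': -0.5}, pushing weight toward features of the gold structure.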
While it is not impossible to calculate these expectations exactly, doing so is computationally very expensive, limiting previous work to toy setups with 15-word sentences [18, 32, 35], or necessitating aggressive pruning [26, 12] whose effects are not well understood.

Figure 2: An example of how a selection vector s and an assignment vector a are turned into a pruning mask m.

2.2 Approximate expectations with a single pruning mask

In the case of monolingual parsing, the computation of feature count expectations is usually approximated with a pruning mask, which allows the omission of low-probability constituents. Formally, a pruning mask is a map from the set M of all possible spans to the set {prune, keep}, indicating whether a given span is to be ignored. It is easy to incorporate such a pruning mask into existing dynamic programming algorithms for computing expectations: whenever a dynamic programming state is considered, we first consult the mask and skip over the pruned states, greatly accelerating the computation (see Algorithm 3 for a schematic description of the pruned inside pass). However, the expected feature counts E_m[f] computed by pruned inside-outside with a single mask m are not exact, introducing a systematic error and biasing the model in undesirable ways.

2.3 Approximate expectations with a sequence of masks

To reduce the bias resulting from the use of a single pruning mask, we propose a novel algorithm that can combine several masks. Given a sequence of masks m^(1), m^(2), . . . , m^(N), we will average the expectations under each of them: (1/N) ∑_{i=1}^N E_{m^(i)}[f]. Our contribution is to show a principled way of computing a sequence of masks such that this average not only has theoretical guarantees, but also has good finite-sample performance. 
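The pruned inside pass (Algorithm 3) can be sketched on a toy span-weight model, in which the score of a binary tree is the product of its span weights; a real implementation would score grammar rules, but the masking logic is the same. This sketch and its names are our own illustration:

```python
# Minimal pruned inside pass over spans (j, k) of a sentence of length n.
# mask maps spans to 'prune'/'keep'; pruned cells contribute zero, so entire
# regions of the chart are skipped, exactly as described for Algorithm 3.
def pruned_inside(n, weight, mask):
    inside = {}
    for length in range(1, n + 1):
        for j in range(0, n - length + 1):
            k = j + length
            if mask.get((j, k), 'keep') == 'prune':
                inside[(j, k)] = 0.0  # consult the mask first, then skip
                continue
            if length == 1:
                inside[(j, k)] = weight(j, k)
            else:
                # Sum over all binary split points l of the span (j, k).
                total = sum(inside[(j, l)] * inside[(l, k)]
                            for l in range(j + 1, k))
                inside[(j, k)] = weight(j, k) * total
    return inside
```

With unit weights and no mask, the root cell (0, 3) of a 3-word chart scores 2.0 (the two binary bracketings); pruning the cell (1, 3) removes one of them, illustrating how a single mask biases the computed quantities.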
The key is to define a set of auxiliary variables, and we present this construction in more detail in the following sections. In this section, we present the algorithm operationally.

The masks are defined via two vector-valued Markov chains: a selection chain with current value denoted by s, and an assignment chain with current value a. Both s and a are vectors with coordinates indexed by spans of the current sentence: ι ∈ M = {⟨j, k⟩ : 0 ≤ j < k ≤ n = |w|}. Each element s_ι specifies whether span ι is selected (s_ι = 1) or excluded (s_ι = 0) in the current iteration (i). The assignment vector a then determines, for each span, whether it would be forbidden (negative, a_ι = −) or required (positive, a_ι = +) to be a constituent if selected.

Our masks m = m(s, a) are generated deterministically from the selection and assignment vectors. The deterministic procedure uses s to pick a few spans whose values are fixed from a, forming a mask m. Note that a single span ι that is both positive and selected implies that all of the spans κ crossing ι should be pruned (i.e. all of the spans such that neither ι ⊆ κ nor κ ⊆ ι holds). This compilation of the pruning constraints is described in Algorithm 2. The return value m of this function is also a vector with coordinates corresponding to spans: m_ι ∈ {prune, keep}. The computation of this mask is illustrated on a concrete example in Figure 2.¹

We can now summarize how randomized pruning works (see Algorithm 1 for pseudocode). At the beginning of every iteration (i), the first step is to sample new values of the selection vector, conditioning on the current selection vector. We will refer to the transition probability of this Markov chain on selection vectors as k∗. 
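A sketch of this mask compilation (the paper's CreateMask, in our own hypothetical Python rendering; span tuples (j, k) are half-open word indices): a selected negative span prunes only its own cell, while a selected positive span prunes every span that crosses it, where crossing here means overlapping with neither span containing the other.

```python
# Does span i cross span k? True iff they overlap but neither is nested
# inside the other (two such spans cannot both be constituents of one tree).
def crosses(i, k):
    (a, b), (c, d) = i, k
    nested = (a <= c and d <= b) or (c <= a and b <= d)
    disjoint = b <= c or d <= a
    return not nested and not disjoint

def create_mask(spans, selected, assign):
    mask = {}
    for span in spans:
        mask[span] = 'keep'
        for sel in selected:
            if assign[sel] == '-' and span == sel:
                mask[span] = 'prune'  # a negative selection prunes one cell
                break
            if assign[sel] == '+' and crosses(span, sel):
                mask[span] = 'prune'  # a positive selection prunes crossers
                break
    return mask
```

For example, selecting (1, 3) as positive prunes the crossing spans (0, 2) and (2, 4) while keeping (1, 3) itself and the enclosing span (0, 4), matching the much larger savings from positive assignments noted later in the analysis.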
Once a new mask m has been precomputed from the current selection vector and assignments, pruned inside-outside scores are computed using this mask.

¹It may seem that Algorithm 2 is also slow, introducing a new bottleneck. However, |s| is small in practice, and the constant factor is much smaller since it does not depend on the grammar, making this algorithm fast in practice.

Figure 3: Pseudo-code for randomized pruning in the case of monolingual parsing (assuming a grammar with no unaries except at pre-terminal positions). We have omitted PrunedOutside because of limited space, but its structure is very similar to PrunedInside.

The inside-outside scores are then used in two ways: first, to calculate the expected feature counts under the pruned model, E_m[f], which are added to a running average; second, to resample new values for the assignment vector.²

Let us describe in more detail how a new assignment vector a′ is obtained from the previous assignment a. This is a two-step update process. First, a tree t is sampled from the chart computed by PrunedInside(w, m) (Figure 1, left). This can be done in quadratic time using a standard algorithm [19, 13]. Next, the assignments are set to new values deterministically as follows: for each span ι, a_ι = + if ι is a constituent in t, and a_ι = − otherwise (Figure 1, right). We will denote this property by [ι ∈ t].

We defer the description of the selection vector updates to Section 3.2; the form of these updates will be easier to motivate after the analysis of the algorithm.

3 Analysis

In this section we show that the procedure described above can be viewed as running an MCMC algorithm. This implies that the guarantees associated with this class of algorithms extend to our procedure. 
In particular, consistency holds: (1/N) ∑_{i=1}^N E_{m^(i)} f → E f almost surely.

3.1 Auxiliary variables and the assignment Markov chain

We start by formally describing the Markov chain over assignments. This is done by defining a collection of Gibbs operators k_s(·, ·) indexed by selection vectors s.

The original state space (the space of trees) does not easily decompose into a graphical model where textbook Gibbs sampling could be applied, so we first augment the state space with auxiliary variables. Broadly speaking, an auxiliary variable is a state augmentation such that the target distribution is a marginal of the expanded distribution. It is called auxiliary because the parts of the samples corresponding to the augmentation are discarded at the end of the computation. At an intermediate stage, however, the state augmentation helps explore the space efficiently.

This technique is best explained with a concrete example in our parsing setup. In this case, the augmentation is a collection of |M| binary-valued random variables, each corresponding to a span of the current sentence w. The auxiliary variable corresponding to span ι ∈ M will be denoted by A_ι. We define the auxiliary variables by specifying their conditional distribution given the tree; this conditional is deterministic: A_ι = [ι ∈ t] with probability one given T = t.

With this augmentation, we can now describe the sampler. It is a block Gibbs sampler, meaning that it resamples a subset of the random variables, conditioning on the other ones. 
Even when the subsets selected across iterations overlap, acceptance probabilities are still guaranteed to be one [33].

²The second operation only needs the inside scores.

Figure 3 (pseudo-code):

Algorithm 1: AuxVar(w, f)
  a, s ← random initialization; E ← 0
  for i ∈ 1, 2, . . . , N do
    s ∼ k∗(s, ·)
    m ← CreateMask(s, a)
    Compute PrunedInside(w, m)
    Compute PrunedOutside(w, m)
    E ← E + E_m f
    a ∼ k_s(a, ·)
  return E/N

Algorithm 2: CreateMask(s, a)
  for ι ∈ M do
    for κ ∈ s do
      if a_κ = − and ι = κ then m_ι ← prune; continue outer loop
      if a_κ = + and ι ⊄ κ and κ ⊄ ι then m_ι ← prune; continue outer loop
    m_ι ← keep
  return m

Algorithm 3: PrunedInside(w, m)
  {Initialize the chart in the standard way}
  for ι = ⟨j, k⟩ ∈ M, bottom-up do
    if m_ι = keep then
      for l : j < l < k do combine the child cells ⟨j, l⟩ and ⟨l, k⟩ as usual

The 
blocks of resampled variables will always contain T as well as a subset of the excluded auxiliary variables. Note that when conditioning on all of the auxiliary variables, the posterior distribution on T is deterministic. We therefore require that P(|s| < |M| infinitely often) = 1 to maintain irreducibility.

We now describe in more detail the effect that each setting of a, s has on the posterior distribution on T. We start by developing the form of the posterior distribution over trees when there is a single selected auxiliary variable, i.e. T | (A_ι = a). If a = −, sampling from T | (A_ι = −) requires the same dynamic program as for exact sampling, except that a single cell in the chart is pruned (the cell ι). The setting where a = + is more interesting: in this case significantly more cells can be pruned. Indeed, all constituents overlapping with ι are pruned. This can lead to a speed-up of up to a multiplicative constant of 8 = 2³, when the span ι has length |ι| = |w|/2. More constraints are maintained during resampling steps in practice (i.e. |s| > 1), leading to a much higher empirical speedup.

Consider now the problem of jointly resampling the block containing T and a collection of excluded auxiliary variables {A_ι : ι ∉ s} given a collection of selected ones. We can write the decomposition:

P(T = t, S | C) = P(T = t | C) ∏_{ι∉s} P(A_ι = a_ι | T = t)
               = P(T = t | C) ∏_{ι∉s} 1{a_ι = [ι ∈ t]},

where S = (A_ι = a_ι : ι ∉ s) is a configuration of the excluded auxiliary variables and C = (A_ι = a_ι : ι ∈ s) is a configuration of the selected ones. The first factor in the second line is again a pruned dynamic program (described in Algorithm 3). 
The product of indicator functions shows that once a tree has been picked, the excluded auxiliary variables can be set to new values deterministically by reading from the sampled tree t whether ι is a constituent, for each ι ∉ s.

Given a selection vector s, we denote the induced block Gibbs kernel described above by k_s(·, ·). Since this kernel depends on the previous state only through the assignments of the auxiliary variables, we can also write it as a transition kernel on the space {+, −}^|M| of auxiliary variable assignments: k_s(a, a′).

3.2 The selection chain

There is a separate mechanism, k∗, that updates at each iteration the selection s of the auxiliary variables. This mechanism corresponds to picking which Gibbs operator k_s will be used to transition in the Markov chain on assignments described above. We will denote the random variable corresponding to the selection vector s at state (i) by S^(i).

In standard treatments of MCMC algorithms [33, 22], the variables S^(i) are restricted to be either independent (a mixture of kernels) or deterministic enumerations (an alternation of kernels). However, this restriction can be relaxed to having S^(i) be itself a Markov chain with kernel k∗ : {0, 1}^|M| × {0, 1}^|M| → [0, 1]. This relaxation can be thought of as allowing stochastic policies for kernel selection.³

The choice of k∗ is important. To understand why, recall that in the situation where A_ι = −, a single cell in the chart is pruned, whereas in the case where A_ι = +, a large fraction of the chart can be ignored. The construction of k∗ is therefore where having a simpler model or heuristic at hand can play a role: as a way to favor the selection of constituents that are likely to be positive, so that a better speedup can be achieved. 
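Putting the pieces of Section 3.1 together, one block Gibbs step k_s can be sketched over an explicitly enumerated toy tree set (each tree represented as a frozenset of spans); in the paper the constrained sampling is done by the pruned dynamic program rather than by enumeration, and the names here are our own:

```python
import random

def gibbs_step(trees, weights, all_spans, selected, rng):
    # Condition on the selected auxiliary variables: a tree is in the support
    # iff it agrees with every selected assignment (span in tree <-> '+').
    def consistent(t):
        return all((span in t) == (val == '+') for span, val in selected.items())
    support = [t for t in trees if consistent(t)]
    t = rng.choices(support, weights=[weights[t] for t in support], k=1)[0]
    # Deterministically reset the excluded auxiliary variables: a_i = [i in t].
    a = {s: ('+' if s in t else '-') for s in all_spans}
    return t, a
```

The two lines after sampling mirror the derivation above: the product of indicators forces each excluded A_ι to the indicator of ι being a constituent of the freshly sampled tree.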
Note that the algorithm can recover from mistakes in the simpler model, since the assignments of the auxiliary variables are also resampled.

Another issue that should be considered when designing k∗ is that it should avoid self-transitions (repeating the same set of selections). To see why, note that if (s, a) = (s′, a′), then m = m(s, a) = m(s′, a′) = m′ and hence (E_m f + E_{m′} f)/2 = E_m f. The estimator is unchanged in this case, even after paying the computational cost of a second iteration.

³There is a short and intuitive argument to justify this relaxation. Let x∗ be a state from k∗, and consider the set of paths P starting at x∗ and extended until they first return to x∗. Many of these paths have infinite length; however, if k∗ is positive recurrent, k∗(·, ·) will assign probability zero to these paths. We then use the following reduction: when the chain is at x∗, first pick a path from P under the distribution induced by k∗ (this is a mixture of kernels). Once a path is selected, deterministically follow the edges in the path until coming back to x∗ (alternation of kernels). Since mixtures and alternations of π-invariant kernels preserve π-invariance, we are done.

The mechanism we used takes both of these issues into consideration. First, it uses a simpler model (for instance a grammar with fewer non-terminal symbols) to pick a subset M0 ⊆ M of the spans that have high posterior probability. Our kernel k∗ is restricted to selection vectors s such that s ⊆ M0. 
Next, in order to avoid repetition, our kernel transitions from a previous selection s to the next one, s′, as follows: after picking a random subset R ⊂ s of size |s|/2, define s′ = (M0 \ s) ∪ R. Provided that the chain is initialized with |s| = 2|M0|/3, this scheme has the property that it changes a large portion of the state at every iteration (more precisely, |s ∩ s′| = |M0|/3), and moreover all subsets of M0 of size 2|M0|/3 are eventually resampled with probability one. Note that this update depends on the previous selection vector, but not on the assignment vector.

Given the asymmetric effect between conditioning on positive versus negative auxiliary variables, it is tempting to let k∗ depend on the current assignment of the auxiliary variables. Unfortunately, such schemes will not converge to the correct distribution in general. Counterexamples are given in the adaptive MCMC literature [2].

3.3 Accelerated averaging

In this section, we justify the way expected sufficient statistics are estimated from the collection of samples, that is, how the variable E is updated in Algorithm 1.

In a generic MCMC situation, once samples X^(1), X^(2), . . . are collected, the traditional way of estimating expected sufficient statistics f is to average “hard counts,” i.e. to use the estimator S_N = (1/N) ∑_{i=1}^N f(X^(i)). In our case, X^(i) contains the current tree and assignments, (T^(i), A^(i)).

For general Metropolis-Hastings chains, this is often the only method available. On the other hand, in our parsing setup, and more generally with any Gibbs sampler, it turns out that there is a more efficient way of combining the samples [23]. The idea behind this alternative is to take “soft counts.” This is what we do when we add E_m f to the running average in Algorithm 1.

Suppose we have extracted samples X^(1), X^(2), . . .
, X^(i), with corresponding selection vectors S^(1), S^(2), . . . , S^(i). In order to transition to the next step, we will have to sample from the probability distribution denoted by k_{S^(i)}(X^(i), ·). In the standard setting, we would extract a single sample X^(i+1) and add f(X^(i+1)) to a running average.

More formally, the accelerated averaging method consists of adding the following soft count instead: ∫ f(x) k_{S^(i)}(X^(i), dx), which can be computed with one extra pruned outside computation in our parsing setup. This quantity was denoted E_m f in the previous section. The final estimator then has the form:⁴

S′_N = (1/(N−1)) ∑_{i=1}^{N−1} ∫ f(x) k_{S^(i)}(X^(i), dx).

4 Experiments

While we used the task of monolingual parsing to illustrate our randomized pruning procedure, the technique is most powerful when the dynamic program is a higher-order polynomial. We therefore demonstrate the utility of randomized pruning on a bitext parsing task. In bitext parsing, we have sentence-aligned corpora from two languages, and we compute expectations over aligned parse trees [6, 28]. The model we use is most similar to [3], but we extend that model to allow rules that mix terminals and non-terminals, as is often done in the context of machine translation [8]. These rules were excluded in [3] for tractability reasons, but our sampler allows efficient inference in this more challenging setup.

In the terminology of adaptor grammars [19], our sampling step involves resampling an adapted derivation given a base measure derivation for each sentence. Concretely, the problem is to sample from a class of isotonic bipartite graphs over the nodes of two trees. 
By isotonic we mean that the edges E of this bipartite graph should have the property that if two non-terminals α, α′ and β, β′ are aligned in the sampled bipartite graph, i.e. (α, α′) ∈ E and (β, β′) ∈ E, then α ≥ β ⇒ α′ ≥ β′, where α ≥ β denotes that α is an ancestor of β. The weight (up to a proportionality constant) of each of these alignments is obtained as follows: first, consider each aligned point as the left-hand side of a rule. Next, multiply the scores of these rules. If we let p, q be the lengths of the two sentences, one can check that this yields a dynamic program of complexity O(p^{b+1} q^{b+1}), where b is the branching factor (we follow [3] and use b = 3).

We picked this particular bitext parsing formalism for two reasons. First, it is relevant to machine translation research. Several researchers have found that state-of-the-art performance can be attained using grammars that mix terminals and non-terminals in their rules [8, 14]. Second, the randomized pruning method is most competitive in cases where the dynamic program has a sufficiently high degree.

4As a side note, this estimator is reminiscent of a structured mean field update. It is different, though, since it is still an asymptotically unbiased estimator, while mean field approximations converge in finite time to a biased estimate.

Figure 4: Because each sampling step is three orders of magnitude faster than the exact computation (a,b), we can afford to average over multiple samples and thereby reduce the L2 bias compared to a fixed pruning scheme (c). Our auxiliary variable sampling scheme also substantially outperforms the tic-tac-toe pruning heuristic (d).
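To make the isotonic constraint concrete, the following minimal Python sketch checks it for a candidate node alignment. The parent-map tree representation and all identifiers here are our own illustrative assumptions; this is not the implementation used in the experiments.

```python
# Illustrative sketch only: trees are hypothetical parent maps
# {node: parent}, with the root mapping to None.

def ancestors(parent, node):
    """Return the set of strict ancestors of `node`."""
    out = set()
    while parent[node] is not None:
        node = parent[node]
        out.add(node)
    return out

def is_isotonic(edges, parent_src, parent_tgt):
    """Check the isotonic property: for aligned pairs (a, a2) and (b, b2),
    a being an ancestor of b must imply a2 is an ancestor of b2."""
    for a, a2 in edges:
        for b, b2 in edges:
            if a in ancestors(parent_src, b) and a2 not in ancestors(parent_tgt, b2):
                return False
    return True

# Two toy one-level trees; the crossing alignment violates the constraint.
parents_src = {'S': None, 'NP': 'S', 'VP': 'S'}
parents_tgt = {'S2': None, 'NP2': 'S2', 'VP2': 'S2'}
print(is_isotonic([('S', 'S2'), ('NP', 'NP2')], parents_src, parents_tgt))  # True
print(is_isotonic([('S', 'NP2'), ('NP', 'S2')], parents_src, parents_tgt))  # False
```

The naive double loop over edge pairs is only meant to pin down the constraint itself; it is unrelated to the O(p^{b+1} q^{b+1}) dynamic program that actually sums over such alignments.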
We did experiments on monolingual parsing that showed that the improvements were not significant for most sentence lengths, and inferior to the coarse-to-fine method of [25].

The bitext parsing version of the randomized pruning algorithm is very similar to the monolingual case. Rather than being over constituent spans, our auxiliary variables in the bitext case are over induced alignments of synchronous derivations. A pair of words is aligned if it is emitted by the same synchronous rule. Note that this includes many-to-many and null alignments, since several or zero lexical elements can be emitted by a single rule. Given two aligned sentences, the auxiliary variables Ai,j are the pq binary random variables indicating whether word i is aligned with word j.

To compare our approximate inference procedure to exact inference, we follow previous work [15, 29] and measure the L2 distance between the pruned expectations and the exact expectations.5

4.1 Results

We ran our experiments on the Chinese Treebank (and its English translation) [39], limiting the product of the sentence lengths of the two sentences to p × q ≤ 130. This was necessary because computing exact expectations (as needed for comparing to our baseline) quickly becomes prohibitive. Note that our pruning method, in contrast, can handle much longer sentences without problem: one pass through all 1493 sentences with a product length of less than 1000 took 28 minutes on one 2.66GHz Xeon CPU.

We used the BerkeleyAligner [21] to obtain high-precision, intersected alignments to construct the high-confidence set M0 of auxiliary variables needed for k∗ (Section 3.2); in other words, to construct the support of the selection chain S^(i).

For randomized pruning to be efficient, we need to be able to extract a large number of samples within the time required for computing the exact expectations.
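For concreteness, the evaluation measure (the averaged squared L2 distance spelled out in footnote 5) can be sketched in a few lines. The list-of-vectors data layout below is our own assumption for illustration, not the authors' code.

```python
# Sketch of the evaluation measure: squared L2 distance between exact and
# approximate feature-expectation vectors, averaged over sentence pairs.
# The nested-list representation is a hypothetical layout.

def mean_l2_bias(exact, approx):
    """exact, approx: one feature-expectation vector per sentence pair."""
    total = 0.0
    for e_vec, a_vec in zip(exact, approx):
        total += sum((e - a) ** 2 for e, a in zip(e_vec, a_vec))
    return total / len(exact)

# Two sentence pairs with two features each; the approximation is off by
# 0.5 on one feature of the first pair, so the bias is 0.25 / 2.
print(mean_l2_bias([[1.0, 0.5], [0.0, 2.0]],
                   [[1.0, 0.0], [0.0, 2.0]]))  # 0.125
```

This matches the definition in footnote 5: an inner sum of squared per-feature errors for each sentence pair, averaged over the set I of sentence pairs.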
5More precisely, we averaged this bias across the sentence pairs: bias(θ) = (1/|I|) Σ_{i∈I} Σ_{k=1}^{K} (E_{θ,i}[f_k] − Ẽ_{θ,i}[f_k])², where E_{θ,i}[f] and Ẽ_{θ,i}[f] are shorthand notation for the exact and approximate expectations.

Figure 4(a) shows the average time required to compute the full dynamic program and the dynamic program required to extract a single sample for varying sentence product lengths. The ratio between the two (explicitly shown in Figure 4(b)) increases with the sentence lengths, and reaches three orders of magnitude, making it possible to average over a large number of samples, while still greatly reducing computation time.

We can compute expectations for many samples very efficiently, but how accurate are the approximated expectations? Figure 4(c) shows that averaging over several masks reduces bias significantly. In particular, the bias increases considerably for longer sentences when only a single sample is used, but remains roughly constant when we average multiple samples. To determine the number of samples in this experiment, we measured the time required for exact inference, and ran the auxiliary variable sampler for half of that time. The main point of Figure 4(c) is to show that under realistic running time conditions, the bias of the auxiliary variable sampler stays roughly constant as a function of sentence length.

Finally, we compared the auxiliary variable algorithm to tic-tac-toe pruning, a heuristic proposed in [40] and improved in [41]. Tic-tac-toe is an algorithm that efficiently precomputes a figure of merit for each bispan.
This \ufb01gure of merit incorporates an inside score and an outside score. To compute\nthis score, we used a product of the two IBM model 1 scores (one for each directionality). When a\nbispan \ufb01gure of merit falls under a threshold, it is pruned away.\nIn Figure 4(d), each curve corresponds to a family of heuristics with varying aggressiveness. With\ntic-tac-toe, aggressiveness is increased via the cut-off threshold, while with the auxiliary variable\nsampler, it is controlled by letting the sampler run for more iterations. For each algorithm, its\ncoordinates correspond to the mean L2 bias and mean time in milliseconds per sentence. The plot\nshows that there is a large regime where the auxiliary variable algorithm dominates tic-tac-toe for\nthis task. Our method is competitive up to a mean running time of about 15 sec/sentence, which is\nwell above the typical running time one needs for realistic, large scale training.\n\n5 Related work\n\nThere is a large body of related work on approximate inference techniques. When the goal is to\nmaximize an objective function, simple beam pruning [10] can be suf\ufb01cient. However, as argued in\n[4], beam pruning is not appropriate for computing expectations because the resulting approximation\nis too concentrated around the mode. To overcome this problem, [5] suggest adding a collection of\nsamples to a beam of k-best estimates. Their approach is quite different to ours as no auxiliary\nvariables are used.\nAuxiliary variables are quite versatile and have been used to create MCMC algorithms that can\nexploit gradient information [11], ef\ufb01cient samplers for regression [1], for unsupervised Bayesian\ninference [31], automatic sampling of generic distribution [24] and non-parametric Bayesian statis-\ntics [37, 20, 36]. 
In computer vision, in particular, an auxiliary variable sampler developed by [30] is widely used for image segmentation [27].

6 Conclusion

Mask-based pruning is an effective way to speed up large dynamic programs for calculating feature expectations. Aggressive masks introduce heavy bias, while conservative ones offer only limited speed-ups. Our results show that, at least for bitext parsing, using many randomized aggressive masks generated with an auxiliary variable sampler is superior in time and bias to using a single, more conservative one. The applicability of this approach is in no way limited to the cases considered here. Randomized pruning will be most advantageous when high-order dynamic programs can be vastly sped up by masking, yet no single aggressive mask is likely to be adequate.

References

[1] J. Albert and S. Chib. Bayesian analysis of binary and polychotomous response data. JASA, 1993.
[2] C. Andrieu and E. Moulines. On the ergodicity properties of some adaptive MCMC algorithms. Ann. Appl. Probab., 2006.
[3] P. Blunsom, T. Cohn, C. Dyer, and M. Osborne. A Gibbs sampler for phrasal synchronous grammar induction. In EMNLP, 2009.
[4] P. Blunsom, T. Cohn, and M. Osborne. A discriminative latent variable model for statistical machine translation. In ACL-HLT, 2008.
[5] P. Blunsom and M. Osborne. Probabilistic inference for machine translation. In EMNLP, 2008.
[6] D. Burkett and D. Klein. Two languages are better than one (for syntactic parsing). In EMNLP, 2008.
[7] E. Charniak and M. Johnson. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In ACL, 2005.
[8] D. Chiang. A hierarchical phrase-based model for statistical machine translation. In ACL, 2005.
[9] S. Clark and J. R. Curran. Parsing the WSJ using CCG and log-linear models. In ACL, 2004.
[10] M. Collins. Head-Driven Statistical Models for Natural Language Parsing. PhD thesis, UPenn, 1999.
[11] S. Duane, A. D.
Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters B, 1987.
[12] J. Finkel, A. Kleeman, and C. Manning. Efficient, feature-based, conditional random field parsing. In ACL, 2008.
[13] J. R. Finkel, C. D. Manning, and A. Y. Ng. Solving the problem of cascading errors: Approximate Bayesian inference for linguistic annotation pipelines. In EMNLP, 2006.
[14] M. Galley, M. Hopkins, K. Knight, and D. Marcu. What's in a translation rule? In HLT-NAACL, 2004.
[15] A. Globerson and T. Jaakkola. Approximate inference using planar graph decomposition. In NIPS, 2006.
[16] J. Goodman. Parsing algorithms and metrics. In ACL, 1996.
[17] J. Goodman. Global thresholding and multiple-pass parsing. In EMNLP, 1997.
[18] M. Johnson. Joint and conditional estimation of tagging and parsing models. In ACL, 2001.
[19] M. Johnson, T. L. Griffiths, and S. Goldwater. Bayesian inference for PCFGs via Markov chain Monte Carlo. In ACL, 2007.
[20] P. Liang, M. I. Jordan, and B. Taskar. A permutation-augmented sampler for Dirichlet process mixture models. In ICML, 2007.
[21] P. Liang, B. Taskar, and D. Klein. Alignment by agreement. In NAACL, 2006.
[22] D. J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge U. Press, 2003.
[23] I. W. McKeague and W. Wefelmeyer. Markov chain Monte Carlo and Rao-Blackwellization. Statistical Planning and Inference, 2000.
[24] R. Neal. Slice sampling. Annals of Statistics, 2000.
[25] S. Petrov and D. Klein. Improved inference for unlexicalized parsing. In HLT-NAACL, 2007.
[26] S. Petrov and D. Klein. Discriminative log-linear grammars with latent variables. In NIPS, 2008.
[27] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000.
[28] D. Smith and N. Smith. Bilingual parsing with factored estimation: Using English to parse Korean. In EMNLP, 2004.
[29] D.
A. Smith and J. Eisner. Dependency parsing by belief propagation. In EMNLP, 2008.
[30] R. H. Swendsen and J. S. Wang. Nonuniversal critical dynamics in Monte Carlo simulations. Phys. Rev. Lett., 1987.
[31] M. A. Tanner and W. H. Wong. The calculation of posterior distributions by data augmentation. JASA, 1987.
[32] B. Taskar, D. Klein, M. Collins, D. Koller, and C. Manning. Max-margin parsing. In EMNLP, 2004.
[33] L. Tierney. Markov chains for exploring posterior distributions. The Annals of Statistics, 1994.
[34] I. Titov and J. Henderson. Loss minimization in parse reranking. In EMNLP, 2006.
[35] J. Turian, B. Wellington, and I. D. Melamed. Scalable discriminative learning for natural language parsing and translation. In NIPS, 2006.
[36] J. Van Gael, Y. Saatci, Y. W. Teh, and Z. Ghahramani. Beam sampling for the infinite hidden Markov model. In ICML, 2008.
[37] S. G. Walker. Sampling the Dirichlet mixture model with slices. Communications in Statistics - Simulation and Computation, 2007.
[38] J. Wolfe, A. Haghighi, and D. Klein. Fully distributed EM for very large datasets. In ICML, 2008.
[39] N. Xue, F.-D. Chiou, and M. Palmer. Building a large-scale annotated Chinese corpus. In COLING, 2002.
[40] H. Zhang and D. Gildea. Stochastic lexicalized inversion transduction grammar for alignment. In ACL, 2005.
[41] H. Zhang, C. Quirk, R. C. Moore, and D. Gildea. Bayesian learning of non-compositional phrases with synchronous parsing. In ACL, 2008.