{"title": "Reward Augmented Maximum Likelihood for Neural Structured Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 1723, "page_last": 1731, "abstract": "A key problem in structured output prediction is enabling direct optimization of the task reward function that matters for test evaluation. This paper presents a simple and computationally efficient method that incorporates task reward into maximum likelihood training. We establish a connection between maximum likelihood and regularized expected reward, showing that they are approximately equivalent in the vicinity of the optimal solution. Then we show how maximum likelihood can be generalized by optimizing the conditional probability of auxiliary outputs that are sampled proportional to their exponentiated scaled rewards. We apply this framework to optimize edit distance in the output space, by sampling from edited targets. Experiments on speech recognition and machine translation for neural sequence to sequence models show notable improvements over maximum likelihood baseline by simply sampling from target output augmentations.", "full_text": "Reward Augmented Maximum Likelihood\n\nfor Neural Structured Prediction\n\nMohammad Norouzi\n\nSamy Bengio\n\nZhifeng Chen\n\nNavdeep Jaitly\n\nMike Schuster\nDale Schuurmans\n{mnorouzi, bengio, zhifengc, ndjaitly}@google.com\n\n{schuster, yonghui, schuurmans}@google.com\n\nYonghui Wu\n\nGoogle Brain\n\nAbstract\n\nA key problem in structured output prediction is direct optimization of the task\nreward function that matters for test evaluation. This paper presents a simple and\ncomputationally ef\ufb01cient approach to incorporate task reward into a maximum like-\nlihood framework. 
By establishing a link between the log-likelihood and expected\nreward objectives, we show that an optimal regularized expected reward is achieved\nwhen the conditional distribution of the outputs given the inputs is proportional\nto their exponentiated scaled rewards. Accordingly, we present a framework to\nsmooth the predictive probability of the outputs using their corresponding rewards.\nWe optimize the conditional log-probability of augmented outputs that are sampled\nproportionally to their exponentiated scaled rewards. Experiments on neural se-\nquence to sequence models for speech recognition and machine translation show\nnotable improvements over a maximum likelihood baseline by using reward aug-\nmented maximum likelihood (RML), where the rewards are de\ufb01ned as the negative\nedit distance between the outputs and the ground truth labels.\n\nIntroduction\n\n1\nStructured output prediction is ubiquitous in machine learning. Recent advances in natural language\nprocessing, machine translation, and speech recognition hinge on the development of better dis-\ncriminative models for structured outputs and sequences. The foundations of learning structured\noutput models were established by the seminal work on conditional random \ufb01elds (CRFs) [17] and\nstructured large margin methods [32], which demonstrate how generalization performance can be\nsigni\ufb01cantly improved when one considers the joint effects of the predictions across multiple output\ncomponents. These models have evolved into their deep neural counterparts [29, 1] through the use\nof recurrent neural networks (RNN) with LSTM [13] cells and attention mechanisms [2].\nA key problem in structured output prediction has always been to enable direct optimization of the\ntask reward (loss) used for test evaluation. For example, in machine translation one seeks better BLEU\nscores, and in speech recognition better word error rates. 
Not surprisingly, almost all task reward metrics are not differentiable, and hence hard to optimize. Neural sequence models (e.g. [29, 2]) optimize conditional log-likelihood, i.e. the conditional log-probability of the ground truth outputs given corresponding inputs. These models do not explicitly consider the task reward during training, in the hope that conditional log-likelihood serves as a good surrogate for the task reward. Such methods make no distinction between alternative incorrect outputs: log-probability is only measured on the ground truth input-output pairs, and all alternative outputs are equally penalized through normalization, whether near or far from the ground truth target. We believe one can improve upon maximum likelihood (ML) sequence models if the difference in the rewards of alternative outputs is taken into account.

Standard ML training, despite its limitations, has enabled the training of deep RNN models, leading to revolutionary advances in machine translation [29, 2, 21] and speech recognition [5-7]. A key property of ML training for locally normalized RNN models is that the objective function factorizes into individual loss terms, which can be efficiently optimized using stochastic gradient descent (SGD). This training procedure does not require any form of inference or sampling from the model during training, leading to computational efficiency and ease of implementation. By contrast, almost all alternative formulations for training structured prediction models require some form of inference or sampling from the model at training time, which slows down training, especially for deep RNNs (e.g. see large margin, search-based [8, 39], and expected risk optimization methods).

Our work is inspired by the use of reinforcement learning (RL) algorithms, such as policy gradient [37], to optimize expected task reward [25].
Even though expected task reward seems like a natural objective, direct policy optimization faces significant challenges: unlike ML, a stochastic gradient given a mini-batch of training examples is extremely noisy and has high variance; gradients need to be estimated via sampling from the model, which is a non-stationary distribution; the reward is often sparse in a high-dimensional output space, which makes it difficult to find any high value predictions, preventing learning from getting off the ground; and, finally, maximizing reward does not explicitly consider the supervised labels, which seems inefficient. In fact, all previous attempts at direct policy optimization for structured output prediction have started by bootstrapping from a previously trained ML solution [25, 27], using several heuristics and tricks to make learning stable.

This paper presents a new approach to task reward optimization that combines the computational efficiency and simplicity of ML with the conceptual advantages of expected reward maximization. Our algorithm, called reward augmented maximum likelihood (RML), simply adds a sampling step on top of the typical likelihood objective. Instead of optimizing conditional log-likelihood on training input-output pairs, given each training input, we first sample an output proportionally to its exponentiated scaled reward. Then, we optimize log-likelihood on such auxiliary output samples given corresponding inputs.
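The two-step procedure just described can be sketched on a toy problem. The sketch below is our own minimal illustration, not the paper's implementation: the output space is all binary strings of length 4, the "model" is a plain categorical distribution with one logit per output, and the reward is the negative Hamming distance to the ground truth output y*.

```python
import numpy as np

# Toy RML loop (illustrative; sizes, model, and variable names are our own).
rng = np.random.default_rng(0)
outputs = np.array([[(i >> k) & 1 for k in range(4)] for i in range(16)])
y_star = np.array([1, 1, 0, 1])
reward = -np.sum(outputs != y_star, axis=1)   # r(y, y*) in {-4, ..., 0}

tau = 0.5
q = np.exp(reward / tau)
q /= q.sum()                                  # exponentiated payoff distribution

theta = np.zeros(16)                          # model logits over the 16 outputs
lr = 0.1
for _ in range(2000):
    y = rng.choice(16, p=q)                   # step 1: sample y ~ q(. | y*; tau)
    p = np.exp(theta - theta.max()); p /= p.sum()
    grad = p - np.eye(16)[y]                  # step 2: gradient of -log p_theta(y)
    theta -= lr * grad

p = np.exp(theta - theta.max()); p /= p.sum()
print(np.argmax(p) == np.argmax(q))           # model concentrates near y*
```

With enough steps the model distribution approaches the sampling distribution q, which is peaked at y*; the real method replaces the categorical model with an RNN and the enumeration with the stratified sampler of Section 2.2.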
When the reward for an output is defined as its similarity to a ground truth output, then the output sampling distribution is peaked at the ground truth output, and its concentration is controlled by a temperature hyper-parameter.

Our theoretical analysis shows that the RML and regularized expected reward objectives optimize a KL divergence between the exponentiated reward and model distributions, but in opposite directions. Further, we show that at non-zero temperatures, the gap between the two criteria can be expressed by a difference of variances measured on interpolating distributions. This observation reveals how entropy regularized expected reward can be estimated by sampling from exponentiated scaled rewards, rather than sampling from the model distribution.

Remarkably, we find that the RML approach achieves significantly improved results over state of the art maximum likelihood RNNs. We show consistent improvement on both speech recognition (TIMIT dataset) and machine translation (WMT'14 dataset), where output sequences are sampled according to their edit distance to the ground truth outputs. Surprisingly, we find that the best performance is achieved with output sampling distributions that shift a lot of the weight away from the ground truth outputs. In fact, in our experiments, the training algorithm rarely sees the original unperturbed outputs. Our results give further evidence that models trained with imperfect outputs and their reward values can improve upon models that are only exposed to a single ground truth output per input [12, 20].

2 Reward augmented maximum likelihood

Given a dataset of input-output pairs, D = {(x(i), y*(i))} for i = 1..N, structured output models learn a parametric score function p_θ(y | x), which scores different output hypotheses, y ∈ Y. We assume that the set of possible outputs, Y, is finite, e.g. English sentences up to a maximum length.
In a probabilistic model, the score function is normalized, while in a large-margin model the score may not be normalized. In either case, once the score function is learned, given an input x, the model predicts an output ŷ achieving maximal score,

  ŷ(x) = argmax_y p_θ(y | x) .   (1)

If this optimization is intractable, approximate inference (e.g. beam search) is used. We use a reward function r(y, y*) to evaluate different proposed outputs against ground-truth outputs. Given a test dataset D′, one computes Σ_{(x,y*)∈D′} r(ŷ(x), y*) as a measure of empirical reward. Since models with larger empirical reward are preferred, ideally one hopes to maximize empirical reward during training.

However, since empirical reward is not amenable to numerical optimization, one often considers optimizing alternative differentiable objectives. The maximum likelihood (ML) framework tries to minimize the negative log-likelihood of the parameters given the data,

  L_ML(θ; D) = Σ_{(x,y*)∈D} −log p_θ(y* | x) .   (2)

Minimizing this objective increases the conditional probability of the target outputs, log p_θ(y* | x), while decreasing the conditional probability of alternative incorrect outputs. According to this objective, all negative outputs are equally wrong, and none is preferred over the others.

By contrast, reinforcement learning (RL) advocates optimizing expected reward (with a maximum entropy regularizer [38]), which is formulated as minimization of the following objective,

  L_RL(θ; τ, D) = Σ_{(x,y*)∈D} { −τ H(p_θ(y | x)) − Σ_{y∈Y} p_θ(y | x) r(y, y*) } ,   (3)

where r(y, y*) denotes the reward function, e.g.
negative edit distance or BLEU score, τ controls the degree of regularization, and H(p) is the entropy of a distribution p, i.e. H(p(y)) = −Σ_{y∈Y} p(y) log p(y). It is well-known that optimizing L_RL(θ; τ) using SGD is challenging because of the large variance of the gradients. Below we describe how the ML and RL objectives are related, and propose a hybrid between the two that combines their benefits for supervised learning.

Let us define a distribution in the output space, termed the exponentiated payoff distribution, that is central in linking the ML and RL objectives:

  q(y | y*; τ) = (1 / Z(y*, τ)) exp{ r(y, y*) / τ } ,   (4)

where Z(y*, τ) = Σ_{y∈Y} exp{ r(y, y*) / τ }. One can verify that the global minimum of L_RL(θ; τ), i.e. the optimal regularized expected reward, is achieved when the model distribution matches the exponentiated payoff distribution, i.e. p_θ(y | x) = q(y | y*; τ). To see this, we re-express the objective function in (3) in terms of a KL divergence between p_θ(y | x) and q(y | y*; τ),

  Σ_{(x,y*)∈D} D_KL( p_θ(y | x) ‖ q(y | y*; τ) ) = (1/τ) L_RL(θ; τ) + const ,   (5)

where the constant const on the RHS is Σ_{(x,y*)∈D} log Z(y*, τ). Thus, the minimum of D_KL(p_θ ‖ q) and L_RL is achieved when p_θ = q. At τ = 0, when there is no entropy regularization, the optimal p_θ is a delta distribution, p_θ(y | x) = δ(y | y*), where δ(y | y*) = 1 at y = y* and 0 at y ≠ y*.
Note that δ(y | y*) is equivalent to the exponentiated payoff distribution in the limit as τ → 0.

Returning to the log-likelihood objective, one can verify that (2) is equivalent to a KL divergence in the opposite direction between a delta distribution δ(y | y*) and the model distribution p_θ(y | x),

  Σ_{(x,y*)∈D} D_KL( δ(y | y*) ‖ p_θ(y | x) ) = L_ML(θ) .   (6)

There is no constant on the RHS, as the entropy of a delta distribution is zero, i.e. H(δ(y | y*)) = 0.

We propose a method called reward-augmented maximum likelihood (RML), which generalizes ML by allowing a non-zero temperature parameter in the exponentiated payoff distribution, while still optimizing the KL divergence in the ML direction. The RML objective function takes the form,

  L_RML(θ; τ, D) = Σ_{(x,y*)∈D} { −Σ_{y∈Y} q(y | y*; τ) log p_θ(y | x) } ,   (7)

which can be re-expressed in terms of a KL divergence as follows,

  Σ_{(x,y*)∈D} D_KL( q(y | y*; τ) ‖ p_θ(y | x) ) = L_RML(θ; τ) + const ,   (8)

where the constant const is −Σ_{(x,y*)∈D} H(q(y | y*; τ)). Note that the temperature parameter, τ ≥ 0, serves as a hyper-parameter that controls the smoothness of the optimal distribution around correct targets by taking into account the reward function in the output space. The objective functions L_RL(θ; τ) and L_RML(θ; τ) have the same global optimum of p_θ, but they optimize a KL divergence in opposite directions. We characterize the difference between these two objectives below, showing that they are equivalent up to their first order Taylor approximations.
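The identity in (5) is easy to check numerically on a small output space. The sketch below uses made-up rewards and an arbitrary model distribution (our own toy values, natural-log KL):

```python
import numpy as np

rng = np.random.default_rng(1)
tau = 0.7
r = rng.normal(size=8)                  # made-up rewards r(y, y*) over |Y| = 8 outputs
p = rng.dirichlet(np.ones(8))           # an arbitrary model distribution p_theta(y | x)

Z = np.exp(r / tau).sum()
q = np.exp(r / tau) / Z                 # exponentiated payoff distribution, Eq. (4)

kl_pq = np.sum(p * np.log(p / q))       # D_KL(p || q)
L_RL = tau * np.sum(p * np.log(p)) - np.sum(p * r)   # Eq. (3): -tau*H(p) - E_p[r]

# Eq. (5) for a single example: D_KL(p || q) = (1/tau) L_RL + log Z
print(np.allclose(kl_pq, L_RL / tau + np.log(Z)))
```

Since the gap between the two sides is exactly log Z, which does not depend on p, minimizing D_KL(p ‖ q) over p and minimizing L_RL over p are the same problem, with the common minimizer p = q.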
For optimization convenience, we focus on minimizing L_RML(θ; τ) to achieve a good solution for L_RL(θ; τ).

2.1 Optimization

Optimizing the reward augmented maximum likelihood (RML) objective, L_RML(θ; τ), is straightforward if one can draw unbiased samples from q(y | y*; τ). We can express the gradient of L_RML in terms of an expectation over samples from q(y | y*; τ),

  ∇_θ L_RML(θ; τ) = E_{q(y|y*;τ)} [ −∇_θ log p_θ(y | x) ] .   (9)

Thus, to estimate ∇_θ L_RML(θ; τ) given a mini-batch of examples for SGD, one draws y samples given the mini-batch y*'s and then optimizes log-likelihood on such samples by following the mean gradient. At a temperature τ = 0, this reduces to always sampling y*, hence ML training with no sampling.

By contrast, the gradient of L_RL(θ; τ), based on likelihood ratio methods, takes the form,

  ∇_θ L_RL(θ; τ) = E_{p_θ(y|x)} [ −∇_θ log p_θ(y | x) · r(y, y*) ] .   (10)

There are several critical differences between (9) and (10) that make SGD optimization of L_RML(θ; τ) more desirable. First, in (9), one has to sample from a stationary distribution, the so-called exponentiated payoff distribution, whereas in (10) one has to sample from the model distribution as it is evolving. Not only does sampling from the model potentially slow down training, one also needs to employ several tricks to get a better estimate of the gradient of L_RL [25]. A body of literature in reinforcement learning focuses on reducing the variance of (10) by using sophisticated techniques such as actor-critic methods [30, 9]. Further, the reward is often sparse in a high-dimensional output space, which makes finding any reasonable prediction challenging when (10) is used to refine a randomly initialized model. Thus, smart model initialization is needed.
By contrast, we initialize the models randomly and refine them using (9).

2.2 Sampling from the exponentiated payoff distribution

To compute the gradient of the model using the RML approach, one needs to sample auxiliary outputs from the exponentiated payoff distribution, q(y | y*; τ). This sampling is the price that we have to pay to learn with rewards. One should contrast this with loss-augmented inference in structured large margin methods, and sampling from the model in RL. We believe sampling outputs proportional to exponentiated rewards is more efficient and effective in many cases.

Experiments in this paper use reward values defined by either negative Hamming distance or negative edit distance. We sample from q(y | y*; τ) by stratified sampling, where we first select a particular distance, and then sample an output with that distance value. Here we focus on edit distance sampling, as Hamming distance sampling is a simpler special case. Given a sentence y* of length m, we count the number of sentences within an edit distance e, where e ∈ {0, . . . , 2m}. Then, we reweight the counts by exp{−e/τ} and normalize. Let c(e, m) denote the number of sentences at an edit distance e from a sentence of length m. First, note that a deletion can be thought of as a substitution with a nil token. This works out nicely because, given a vocabulary of size v, for each insertion we have v options, and for each substitution we have v − 1 options, but including the nil token, there are v options for substitutions too. When e = 1, there are m possible substitutions and m + 1 insertions. Hence, in total there are (2m + 1)v sentences at an edit distance of 1.
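The Hamming-distance special case mentioned above can be sketched directly: a length-m sequence over a vocabulary of size v has C(m, d)(v − 1)^d neighbors at Hamming distance d, so one can first draw d with probability proportional to C(m, d)(v − 1)^d exp(−d/τ), and then apply d random substitutions. This is our own minimal implementation of that idea, not the paper's code:

```python
import numpy as np
from math import comb

def sample_hamming(y_star, vocab_size, tau, rng):
    """Stratified sample from q(y | y*; tau) with reward = -Hamming distance."""
    m = len(y_star)
    # count of sequences at each distance d, reweighted by exp(-d / tau)
    weights = np.array([comb(m, d) * (vocab_size - 1) ** d * np.exp(-d / tau)
                        for d in range(m + 1)])
    probs = weights / weights.sum()
    d = rng.choice(m + 1, p=probs)                    # step 1: pick a distance
    y = list(y_star)
    for i in rng.choice(m, size=d, replace=False):    # step 2: d random substitutions
        y[i] = rng.choice([t for t in range(vocab_size) if t != y_star[i]])
    return y, d

rng = np.random.default_rng(0)
y_star = [3, 1, 4, 1, 5, 9, 2, 6]
y, d = sample_hamming(y_star, vocab_size=10, tau=0.8, rng=rng)
print(d, sum(a != b for a, b in zip(y, y_star)))      # the two numbers match
```

Because substituted positions are drawn without replacement and never reuse the original token, the returned sequence is at exactly the sampled distance d.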
Note that exact computation of c(e, m) is difficult if we consider all edge cases, for example when there are repeated words in y*, but ignoring such edge cases we can come up with approximate counts that are reliable for sampling. When e > 1, we estimate c(e, m) by

  c(e, m) = Σ_{s=0}^{m} C(m, s) C(m + e − 2s, e − s) v^e ,   (11)

where s enumerates the number of substitutions. Once s tokens are substituted, those s positions lose their significance, and the insertions before and after such tokens can be merged. Hence, given s substitutions, there are really m − s reference positions for the e − s possible insertions. Finally, one can sample according to BLEU score or other sequence metrics by importance sampling, where the proposal distribution could be the edit distance sampling above.

3 RML analysis

In the RML framework, we find the model parameters by minimizing the objective (7) instead of optimizing the RL objective, i.e. the regularized expected reward in (3). The difference lies in minimizing D_KL( q(y | y*; τ) ‖ p_θ(y | x) ) instead of D_KL( p_θ(y | x) ‖ q(y | y*; τ) ). For convenience, let us refer to q(y | y*; τ) as q, and to p_θ(y | x) as p. Here, we characterize the difference between the two divergences, D_KL(q ‖ p) − D_KL(p ‖ q), and use this analysis to motivate the RML approach.

We will initially consider the KL divergence in its more general form as a Bregman divergence, which will make some of the key properties clearer. A Bregman divergence is defined by a strictly convex, differentiable, closed potential function F : F → R [3].
Given F and two points p, q ∈ F, the corresponding Bregman divergence D_F : F × F → R+ is defined by

  D_F(p ‖ q) = F(p) − F(q) − (p − q)^T ∇F(q) ,   (12)

the difference between the strictly convex potential at p and its first order Taylor approximation expanded about q. Clearly this definition is not symmetric between p and q. By the strict convexity of F it follows that D_F(p ‖ q) ≥ 0, with D_F(p ‖ q) = 0 if and only if p = q. To characterize the difference between opposite Bregman divergences, we provide a simple result that relates the two directions under suitable conditions. Let H_F denote the Hessian of F.

Proposition 1. For any twice differentiable strictly convex closed potential F, and p, q ∈ int(F):

  D_F(q ‖ p) = D_F(p ‖ q) + (1/4) (p − q)^T ( H_F(a) − H_F(b) ) (p − q)   (13)

for some a = (1 − α)p + αq, (0 ≤ α ≤ 1/2), and b = (1 − β)q + βp, (0 ≤ β ≤ 1/2). (See supplementary material.)

For probability vectors p, q ∈ Δ^|Y| and a potential F(p) = −τ H(p), we have D_F(p ‖ q) = τ D_KL(p ‖ q). Let f* : R^|Y| → Δ^|Y| denote the normalized exponential operator that takes a real-valued logit vector and turns it into a probability vector. Let r and s denote real-valued logit vectors such that q = f*(r/τ) and p = f*(s/τ). Below, we characterize the gap between D_KL(p(y) ‖ q(y)) and D_KL(q(y) ‖ p(y)) in terms of the difference between s(y) and r(y).

Proposition 2.
The KL divergence between p and q in the two directions can be expressed as,

  D_KL(p ‖ q) = D_KL(q ‖ p) + (1/(4τ²)) Var_{y∼f*(a/τ)}[ s(y) − r(y) ] − (1/(4τ²)) Var_{y∼f*(b/τ)}[ s(y) − r(y) ]
              < D_KL(q ‖ p) + (1/τ²) ‖s − r‖₂² ,

for some a = (1 − α)s + αr, (0 ≤ α ≤ 1/2), and b = (1 − β)r + βs, (0 ≤ β ≤ 1/2). (See supplementary material.)

Given Proposition 2, one can relate the two objectives, L_RL(θ; τ) (5) and L_RML(θ; τ) (8), by

  L_RL = τ L_RML + (1/(4τ)) Σ_{(x,y*)∈D} { Var_{y∼f*(a/τ)}[ s(y) − r(y) ] − Var_{y∼f*(b/τ)}[ s(y) − r(y) ] } + const ,   (14)

where s(y) denotes the τ-scaled logits predicted by the model, such that p_θ(y | x) = f*(s(y)/τ), and r(y) = r(y, y*). The gap between the regularized expected reward (5) and the τ-scaled RML criterion (8) is simply a difference of two variances, whose magnitude decreases with increasing regularization. Proposition 2 also shows an opportunity for learning algorithms: if τ is chosen so that q = f*(r/τ), then f*(a/τ) and f*(b/τ) have lower variance than p (which can always be achieved for sufficiently small τ provided p is not deterministic); then the expected regularized reward under p, and its gradient for training, can be exactly estimated, in principle, by including the extra variance terms and sampling from more focused distributions than p.
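The looser bound in Proposition 2 is easy to sanity-check numerically with random logits. The interpolation points a and b are existential, so only the final inequality is checked here (a quick sketch with our own made-up values):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(2)
tau = 0.5
r = rng.normal(size=10)        # reward logits, q = f*(r / tau)
s = rng.normal(size=10)        # model logits,  p = f*(s / tau)
p, q = softmax(s / tau), softmax(r / tau)

kl_pq = np.sum(p * np.log(p / q))
kl_qp = np.sum(q * np.log(q / p))

# Bound from Proposition 2: D_KL(p||q) < D_KL(q||p) + (1/tau^2) ||s - r||_2^2
print(kl_pq < kl_qp + np.sum((s - r) ** 2) / tau ** 2)
```

The slack in the bound comes from replacing each variance of s(y) − r(y), which is at most the squared range of that vector, by the squared 2-norm of s − r.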
Although we have not yet incorporated approximations to the additional variance terms into RML, this is an interesting research direction.

4 Related Work

The literature on structured output prediction is vast, falling into three broad categories: (a) supervised learning approaches that ignore task reward and use supervision; (b) reinforcement learning approaches that use only task reward and ignore supervision; and (c) hybrid approaches that attempt to exploit both supervision and task reward. This paper clearly falls in category (c).

Work in category (a) includes classical conditional random fields [17] and conditional log-likelihood training of RNNs [29, 2]. It also includes approaches that attempt to perturb the training inputs and supervised training structures to improve the robustness (and hopefully the generalization) of the conditional models (e.g. see [4, 16]). These approaches offer improvements to standard maximum likelihood estimation, but they are fundamentally limited by not incorporating a task reward.

By contrast, work in category (b) includes reinforcement learning approaches that only consider task reward and do not use any other supervision. Beyond the traditional reinforcement learning approaches, such as policy gradient [37, 31], actor-critic [30], and Q-learning [34], this category includes SEARN [8]. There is some relationship between the work presented here and work on relative entropy policy search [23], and policy optimization via expectation maximization [35] and KL-divergence [14, 33]; however, none of these bridge the gap between the two directions of the KL-divergence, nor do they consider any supervision data as we do here.

There is also a substantial body of related work in category (c), which considers how to exploit supervision information while training with a task reward metric.
A canonical example is large margin structured prediction [32, 11], which explicitly uses supervision and considers an upper bound surrogate for the task loss. This approach requires loss augmented inference that cannot be efficiently achieved for general task losses. We are not aware of successful large-margin methods for neural sequence prediction, but a related approach by [39] for neural machine translation builds on SEARN [8]. Some form of inference during training is still needed, and the characteristics of the objective are not well studied. We also mentioned the work on maximizing task reward by bootstrapping from a maximum likelihood policy [25, 27], but such an approach only makes limited use of supervision. Some work in robotics has considered exploiting supervision as a means to provide indirect sampling guidance to improve policy search methods that maximize task reward [18, 19, 26], but these approaches do not make use of maximum likelihood training. An interesting work is [15], which explicitly incorporates supervision in the policy evaluation phase of a policy iteration procedure that otherwise seeks to maximize task reward. However, this approach only considers a greedy policy form that does not lend itself to being represented as a deep RNN, and has not been applied to structured output prediction. Most relevant are ideas for improving approximate maximum likelihood training for intractable models by passing the gradient calculation through an approximate inference procedure [10, 28]. These works, however, are specialized to particular approximate inference procedures and, by directly targeting expected reward, are subject to the variance problems that motivated this work.

One advantage of the RML framework is its computational efficiency at training time. By contrast, RL and scheduled sampling [4] require sampling from the model, which can slow down the gradient computation by 2x.
Structural SVM requires loss-augmented inference, which is often more expensive than sampling from the model. Our framework only requires sampling from a fixed exponentiated payoff distribution, which can be thought of as a form of input pre-processing. This pre-processing can be parallelized with model training by having a separate thread handle data loading and augmentation. Recently, we were informed of the unpublished work of Volkovs et al. [36], which also proposes an objective like RML, albeit with a different derivation. No theoretical relation was established to entropy regularized RL, nor was the method applied to neural nets for sequences, but large gains were reported over several baselines when applying the technique to ranking problems with CRFs.

5 Experiments

We compare our approach, reward augmented maximum likelihood (RML), with standard maximum likelihood (ML) training on sequence prediction tasks using state-of-the-art attention-based recurrent neural networks [29, 2]. Our experiments demonstrate that the RML approach considerably outperforms the ML baseline on both speech recognition and machine translation tasks.

5.1 Speech recognition

For experiments on speech recognition, we use the TIMIT dataset, a standard benchmark for clean phone recognition. This dataset consists of recordings from different speakers reading ten phonetically rich sentences covering major dialects of American English. We use the standard train / dev / test splits suggested by the Kaldi toolkit [24].

As the sequence prediction model, we use an attention-based encoder-decoder recurrent model of [5] with three 256-dimensional LSTM layers for encoding and one 256-dimensional LSTM layer for decoding.
We do not modify the neural network architecture or its gradient computation in any way, but we only change the output targets fed into the network for gradient computation and SGD updates. The input to the network is a standard sequence of 123-dimensional log-mel filter response statistics. Given each input, we generate new outputs around the ground truth targets by sampling according to the exponentiated payoff distribution. We use negative edit distance as the measure of reward.

[Figure 1: Fraction of different numbers of edits applied to a sequence of length 20 for different τ ∈ {0.6, 0.7, 0.8, 0.9}. At τ = 0.9, augmentations with 5 to 9 edits are sampled with a probability > 0.1. (View in color.)]

Table 1: Phone error rates (PER) for different methods on TIMIT dev and test sets. Average PER of 4 independent training runs is reported.

Method         | Dev set            | Test set
ML baseline    | 20.87 (-0.2, +0.3) | 22.18 (-0.4, +0.2)
RML, τ = 0.60  | 19.92 (-0.6, +0.3) | 21.65 (-0.5, +0.4)
RML, τ = 0.65  | 19.64 (-0.2, +0.5) | 21.28 (-0.6, +0.4)
RML, τ = 0.70  | 18.97 (-0.1, +0.1) | 21.28 (-0.5, +0.4)
RML, τ = 0.75  | 18.44 (-0.4, +0.4) | 20.15 (-0.4, +0.4)
RML, τ = 0.80  | 18.27 (-0.2, +0.1) | 19.97 (-0.1, +0.2)
RML, τ = 0.85  | 18.10 (-0.4, +0.3) | 19.97 (-0.3, +0.2)
RML, τ = 0.90  | 18.00 (-0.4, +0.3) | 19.89 (-0.4, +0.7)
RML, τ = 0.95  | 18.46 (-0.1, +0.1) | 20.12 (-0.2, +0.1)
RML, τ = 1.00  | 18.78 (-0.6, +0.8) | 20.41 (-0.2, +0.5)
Our output augmentation process allows insertions, deletions, and substitutions.

An important hyper-parameter in our framework is the temperature parameter, τ, controlling the degree of output augmentation. We investigate the impact of this hyper-parameter and report results for τ selected from a candidate set of τ ∈ {0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0}. At a temperature of τ = 0, outputs are not augmented at all, but as τ increases, more augmentation is generated. Figure 1 depicts the fraction of different numbers of edits applied to a sequence of length 20 for different values of τ. These edits typically include a very small number of deletions, and roughly equal numbers of insertions and substitutions. For insertions and substitutions we uniformly sample elements from a vocabulary of 61 phones. According to Figure 1, at τ = 0.6, more than 60% of the outputs remain intact, while at τ = 0.9, almost all target outputs are augmented, with 5 to 9 edits being sampled with a probability larger than 0.1. We note that the augmentation becomes more severe as the outputs get longer.

The phone error rates (PER) on both dev and test sets for different values of τ and the ML baseline are reported in Table 1. Each model is trained and tested 4 times, using different random seeds. In Table 1, we report the average PER across the runs, and in parentheses the difference of the average error to the minimum and maximum error. We observe that a temperature of τ = 0.9 provides the best results, outperforming the ML baseline by 2.9% PER on the dev set and 2.3% PER on the test set. The results consistently improve as the temperature increases from 0.6 to 0.9, and they get worse beyond τ = 0.9. It is surprising to us that not only does the model train with such a large amount of augmentation at τ = 0.9, but it also significantly improves upon the baseline.
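The approximate counts c(e, m) from Eq. (11) that drive this stratified sampling are easy to compute directly; as a sanity check, e = 1 recovers the (2m + 1)v closed form derived in Section 2.2. This is a rough sketch of the reweighting mechanics only: it uses the raw, unscaled edit distance, whereas the exact fractions shown in Figure 1 additionally depend on how the reward is scaled.

```python
import numpy as np
from math import comb

def c(e, m, v):
    """Approximate count of sentences at edit distance e from a length-m
    sentence over a vocabulary of size v, per Eq. (11); e = 0 is y* itself.
    The s range is capped at min(m, e) so no binomial gets a negative argument."""
    if e == 0:
        return 1
    return sum(comb(m, s) * comb(m + e - 2 * s, e - s) * v ** e
               for s in range(min(m, e) + 1))

m, v = 20, 61
print(c(1, m, v) == (2 * m + 1) * v)   # 41 * 61, as derived in the text

# Stratified weights over edit distances: reweight counts by exp(-e / tau).
tau = 0.9
w = np.array([c(e, m, v) * np.exp(-e / tau) for e in range(2 * m + 1)], dtype=float)
p = w / w.sum()                         # distribution over the number of edits
print(abs(p.sum() - 1.0) < 1e-9)
```

Drawing an edit count e from this distribution, then applying e random edits, is the stratified sampler of Section 2.2 for edit-distance reward.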
Finally, we note that previous work [6, 7] suggests several refinements that improve sequence to sequence models on TIMIT, such as adding noise to the weights and using a more focused, forward-moving attention mechanism. While these refinements are interesting and could be combined with the RML framework, we do not implement them in this work, focusing instead on a fair comparison between the ML baseline and the RML method.

Method          Average BLEU   Best BLEU
ML baseline     36.50          36.87
RML, τ = 0.75   36.62          36.91
RML, τ = 0.80   36.80          37.11
RML, τ = 0.85   36.91          37.23
RML, τ = 0.90   36.69          37.07
RML, τ = 0.95   36.57          36.94

Table 2: Tokenized BLEU scores on WMT'14 English to French, evaluated on the newstest-2014 set. The RML approach with different τ considerably improves upon the maximum likelihood baseline.

5.2 Machine translation

We evaluate the effectiveness of the proposed approach on the WMT'14 English to French machine translation benchmark. Translation quality is assessed using tokenized BLEU scores, to be consistent with previous work on neural machine translation [29, 2, 22]. Models are trained on the full 36M sentence pairs from the WMT'14 training set and evaluated on the 3003 sentence pairs of the newstest-2014 test set. To keep the sampling process efficient and simple on such a large corpus, we augment the output sentences based only on Hamming distance (i.e., edit distance without insertions or deletions). For each sentence we sample a single output at each step. One could also consider insertions and deletions, or sampling according to exponentiated sentence BLEU scores, but we leave that to future work.

As the conditional sequence prediction model, we use an attention-based encoder-decoder recurrent neural network similar to [2], but with multi-layer encoder and decoder networks consisting of three layers of 1024 LSTM cells.
As suggested by [2], for computing the softmax attention vectors we use a feedforward neural network with 1024 hidden units, which operates on the last encoder layer and the first decoder layer. In all of the experiments, we keep the network architecture and the hyper-parameters fixed. All of the models achieve their peak performance after about 4 epochs of training, once we anneal the learning rates. To reduce noise in the BLEU score evaluation, we report both the peak BLEU score and the BLEU score averaged over about 70 evaluations of the model during the fifth epoch of training. We perform beam search decoding with a beam size of 8.

Table 2 summarizes our experimental results on WMT'14. We note that our ML translation baseline is quite strong, if not the best among neural machine translation models [29, 2, 22], achieving very competitive performance for a single model. Even given such a strong baseline, the RML approach consistently improves the results. Our best model, with a temperature of τ = 0.85, improves average BLEU by 0.4 points and best BLEU by 0.35 points, a considerable improvement. Again we observe that the results consistently improve as the amount of augmentation increases from τ = 0.75 to τ = 0.85, and then start to degrade with more augmentation.

Details. We train the models using asynchronous SGD with 12 replicas, without momentum. We use mini-batches of size 128. We initially use a learning rate of 0.5, which we exponentially decay to 0.05 after 800K steps. We keep evaluating the models between 1.1 and 1.3 million steps and report the average and peak BLEU scores in Table 2. We use a vocabulary of 200K words for the source language and 80K words for the target language. We only consider training sentences that are up to 80 tokens long. We replace rare words with several UNK tokens based on their first and last characters.
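Returning to the Hamming-distance augmentation described above: restricting edits to substitutions makes exact sampling straightforward, because for a length-n sentence and vocabulary size V there are exactly C(n, e)·(V−1)^e sentences at Hamming distance e, so the marginal probability of e substitutions under the exponentiated payoff distribution is proportional to C(n, e)·(V−1)^e·exp(−e/τ). The sketch below follows that derivation; the integer token-id representation and the function name are our assumptions, and log-space weights are used to avoid overflow on long sentences.

```python
import math
import random


def sample_hamming_augmented(sentence, vocab_size, tau, rng=random):
    """Sample an augmented target with probability proportional to
    exp(-HammingDistance(out, sentence) / tau).

    `sentence` is a list of integer token ids in [0, vocab_size).
    Step 1: draw the number of substituted positions e, whose marginal is
            proportional to C(n, e) * (V - 1)^e * exp(-e / tau).
    Step 2: pick e positions uniformly and replace each token with a
            uniformly chosen *different* token id.
    """
    n, v = len(sentence), vocab_size
    log_w = [math.log(math.comb(n, e)) + e * math.log(v - 1) - e / tau
             for e in range(n + 1)]
    mx = max(log_w)
    weights = [math.exp(lw - mx) for lw in log_w]  # stable exponentiation
    e = rng.choices(range(n + 1), weights=weights)[0]
    out = list(sentence)
    for i in rng.sample(range(n), e):
        draw = rng.randrange(v - 1)
        out[i] = draw if draw < out[i] else draw + 1  # skip the old id
    return out
```

Note that both n and V inflate the weight on larger e, which is consistent with the observation above that augmentation becomes more severe as the outputs get longer.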
At inference time, we replace UNK tokens in the output sentences by copying source words according to the largest attention activations, as suggested by [22].

6 Conclusion

We present a learning algorithm for structured output prediction that generalizes maximum likelihood training by enabling direct optimization of a task reward metric. Our method is computationally efficient and simple to implement. It only requires augmentation of the output targets used within a log-likelihood objective. We show how using augmented outputs sampled according to edit distance improves a maximum likelihood baseline by a considerable margin, on both machine translation and speech recognition tasks. We believe this framework is applicable to a wide range of probabilistic models with arbitrary reward functions. In the future, we intend to explore the applicability of this framework to other probabilistic models on tasks with more complicated evaluation metrics.

References
[1] D. Andor, C. Alberti, D. Weiss, A. Severyn, A. Presta, K. Ganchev, S. Petrov, and M. Collins. Globally normalized transition-based neural networks. arXiv:1603.06042, 2016.
[2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. ICLR, 2015.
[3] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. JMLR, 2005.
[4] S. Bengio, O. Vinyals, N. Jaitly, and N. M. Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. NIPS, 2015.
[5] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals. Listen, attend and spell. ICASSP, 2016.
[6] J. Chorowski, D. Bahdanau, K. Cho, and Y. Bengio. End-to-end continuous speech recognition using attention-based recurrent NN: First results. arXiv:1412.1602, 2014.
[7] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio. Attention-based models for speech recognition. NIPS, 2015.
[8] H. Daumé, III, J. Langford, and D. Marcu. Search-based structured prediction. Mach. Learn. J., 2009.
[9] T. Degris, P. M. Pilarski, and R. S. Sutton. Model-free reinforcement learning with continuous action in practice. ACC, 2012.
[10] J. Domke. Generic methods for optimization-based modeling. AISTATS, 2012.
[11] K. Gimpel and N. A. Smith. Softmax-margin CRFs: Training log-linear models with cost functions. NAACL, 2010.
[12] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv:1503.02531, 2015.
[13] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.
[14] H. J. Kappen, V. Gómez, and M. Opper. Optimal control as a graphical model inference problem. Mach. Learn. J., 2012.
[15] B. Kim, A. M. Farahmand, J. Pineau, and D. Precup. Learning from limited demonstrations. NIPS, 2013.
[16] A. Kumar, O. Irsoy, J. Su, J. Bradbury, R. English, B. Pierce, P. Ondruska, I. Gulrajani, and R. Socher. Ask me anything: Dynamic memory networks for natural language processing. ICML, 2016.
[17] J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML, 2001.
[18] S. Levine and V. Koltun. Guided policy search. ICML, 2013.
[19] S. Levine and V. Koltun. Variational policy search via trajectory optimization. NIPS, 2013.
[20] D. Lopez-Paz, B. Schölkopf, L. Bottou, and V. Vapnik. Unifying distillation and privileged information. ICLR, 2016.
[21] M.-T. Luong, H. Pham, and C. D. Manning. Effective approaches to attention-based neural machine translation. EMNLP, 2015.
[22] M.-T. Luong, I. Sutskever, Q. V. Le, O. Vinyals, and W. Zaremba. Addressing the rare word problem in neural machine translation. ACL, 2015.
[23] J. Peters, K. Mülling, and Y. Altün. Relative entropy policy search. AAAI, 2010.
[24] D. Povey, A. Ghoshal, G. Boulianne, et al. The Kaldi speech recognition toolkit. ASRU, 2011.
[25] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence level training with recurrent neural networks. ICLR, 2016.
[26] S. Shen, Y. Cheng, Z. He, W. He, H. Wu, M. Sun, and Y. Liu. Minimum risk training for neural machine translation. ACL, 2016.
[27] D. Silver et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.
[28] V. Stoyanov, A. Ropson, and J. Eisner. Empirical risk minimization of graphical model parameters given approximate inference, decoding, and model structure. AISTATS, 2011.
[29] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. NIPS, 2014.
[30] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[31] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. NIPS, 2000.
[32] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. NIPS, 2004.
[33] E. Todorov. Linearly-solvable Markov decision problems. NIPS, 2006.
[34] H. van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. arXiv:1509.06461, 2015.
[35] N. Vlassis, M. Toussaint, G. Kontes, and S. Piperidis. Learning model-free robot control by a Monte Carlo EM algorithm. Autonomous Robots, 2009.
[36] M. Volkovs, H. Larochelle, and R. Zemel. Loss-sensitive training of probabilistic conditional random fields. arXiv:1107.1805v1, 2011.
[37] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. J., 1992.
[38] R. J. Williams and J. Peng. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 1991.
[39] S. Wiseman and A. M. Rush. Sequence-to-sequence learning as beam-search optimization. arXiv:1606.02960, 2016.
", "award": [], "sourceid": 943, "authors": [{"given_name": "Mohammad", "family_name": "Norouzi", "institution": "Google"}, {"given_name": "Samy", "family_name": "Bengio", "institution": "Google Brain"}, {"given_name": "zhifeng", "family_name": "Chen", "institution": "Google Brain"}, {"given_name": "Navdeep", "family_name": "Jaitly", "institution": "Google Brain"}, {"given_name": "Mike", "family_name": "Schuster", "institution": "Google"}, {"given_name": "Yonghui", "family_name": "Wu", "institution": null}, {"given_name": "Dale", "family_name": "Schuurmans", "institution": "Alberta"}]}