{"title": "Decoding with Value Networks for Neural Machine Translation", "book": "Advances in Neural Information Processing Systems", "page_first": 178, "page_last": 187, "abstract": "Neural Machine Translation (NMT) has become a popular technology in recent years, and beam search is its de facto decoding method due to its reduced search space and computational complexity. However, since it only searches for local optima at each time step through one-step forward looking, it usually cannot output the best target sentence. Inspired by the success and methodology of AlphaGo, in this paper we propose using a prediction network to improve beam search, which takes the source sentence $x$, the currently available decoding output $y_1,\\cdots, y_{t-1}$ and a candidate word $w$ at step $t$ as inputs and predicts the long-term value (e.g., BLEU score) of the partial target sentence if it is completed by the NMT model. Following the practice in reinforcement learning, we call this prediction network \\emph{value network}. Specifically, we propose a recurrent structure for the value network, and train its parameters from bilingual data. At test time, when choosing a word $w$ for decoding, we consider both its conditional probability given by the NMT model and its long-term value predicted by the value network. 
Experiments show that such an approach can significantly improve the translation accuracy on several translation tasks.", "full_text": "Decoding with Value Networks for Neural Machine Translation

Di He1 (di_he@pku.edu.cn), Hanqing Lu2 (hanqinglu@cmu.edu), Yingce Xia3 (xiayingc@mail.ustc.edu.cn), Tao Qin4 (taoqin@microsoft.com), Liwei Wang1,5 (wanglw@cis.pku.edu.cn), Tie-Yan Liu4 (tie-yan.liu@microsoft.com)

1Key Laboratory of Machine Perception, MOE, School of EECS, Peking University
2Carnegie Mellon University
3University of Science and Technology of China
4Microsoft Research
5Center for Data Science, Peking University, Beijing Institute of Big Data Research

Abstract

Neural Machine Translation (NMT) has become a popular technology in recent years, and beam search is its de facto decoding method due to its reduced search space and computational complexity. However, since it only searches for local optima at each time step through one-step forward looking, it usually cannot output the best target sentence. Inspired by the success and methodology of AlphaGo, in this paper we propose using a prediction network to improve beam search, which takes the source sentence x, the currently available decoding output y_1, · · · , y_{t−1} and a candidate word w at step t as inputs and predicts the long-term value (e.g., BLEU score) of the partial target sentence if it is completed by the NMT model. Following the practice in reinforcement learning, we call this prediction network value network. Specifically, we propose a recurrent structure for the value network, and train its parameters from bilingual data. 
At test time, when choosing a word w for decoding, we consider both its conditional probability given by the NMT model and its long-term value predicted by the value network. Experiments show that such an approach can significantly improve the translation accuracy on several translation tasks.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

1 Introduction

Neural Machine Translation (NMT), which is based on deep neural networks and provides an end-to-end solution to machine translation, has attracted much attention from the research community [2, 6, 12, 20] and has gradually been adopted by industry in the past several years [18, 22]. NMT uses an RNN-based encoder-decoder framework to model the entire translation process. In training, it maximizes the likelihood of a target sentence given a source sentence. In testing, given a source sentence x, it tries to find a sentence y* in the target language that maximizes the conditional probability P(y|x). Since the number of possible target sentences is exponentially large, finding the optimal y* is NP-hard. Thus beam search is commonly employed to find a reasonably good y.

Beam search is a heuristic search algorithm that maintains the top-scoring partial sequences, expanded in a left-to-right fashion. In particular, it keeps a pool of candidates, each of which is a partial sequence. At each time step, the algorithm expands each candidate by appending a new word, and then keeps the top-ranked new candidates scored by the NMT model. 
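The expand-and-prune loop just described can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; `expand_fn` is a hypothetical stand-in for one NMT decoder step returning log-probabilities of next words.

```python
def beam_step(beam, expand_fn, k):
    """One iteration of standard beam search.

    beam: list of (tokens, cumulative_log_prob) partial hypotheses.
    expand_fn: maps a token list to a {word: log_prob} dict (stand-in
               for the NMT model's next-word distribution).
    k: beam size; only the top-k expanded candidates survive.
    """
    expanded = [(tokens + [w], score + lp)
                for tokens, score in beam
                for w, lp in expand_fn(tokens).items()]
    # Keep the top-ranked new candidates by cumulative log-probability.
    return sorted(expanded, key=lambda cand: cand[1], reverse=True)[:k]
```

Because pruning uses only the one-step score, a word that looks locally best can crowd out a prefix that would have led to a better complete sentence.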
The algorithm terminates when it reaches the maximum decoding depth or when all sentences are completely generated, i.e., all sentences end with the end-of-sentence (EOS) symbol.

While NMT with beam search has proved successful, it has several known issues, including exposure bias [9], loss-evaluation mismatch [9] and label bias [16], which have been studied. However, we observe that there is still an important issue associated with beam search of NMT, the myopic bias, which unfortunately is largely ignored, to the best of our knowledge. Beam search tends to focus more on short-term reward: at iteration t, for a candidate y_1, · · · , y_{t−1}, each next word w is ranked only by its one-step conditional probability, so a locally preferred word may be kept even when it leads to a worse complete translation.

To train the value network, we generate pairs of partial sentences (y_p,1, y_p,2) for a source sentence x such that avg_bleu(x, y_p,1) > avg_bleu(x, y_p,2). Denote ω as the parameter of the value function described in Section 4.1. We design the loss function as follows:

L(ω) = Σ_{(x, y_p,1, y_p,2)} exp(v_ω(x, y_p,2) − v_ω(x, y_p,1)),    (7)

where avg_bleu(x, y_p,1) > avg_bleu(x, y_p,2).

Algorithm 2 Beam search with value network in NMT
1: Input: Testing example x, neural machine translation model P(y|x) with target vocabulary V, value network model v(x, y), beam search size K, maximum search depth L, weight α.
2: Set S = ∅, U = ∅ as candidate sets; t = 0.
3: repeat
4:   t = t + 1.
5:   U_expand ← {y_i + {w} | y_i ∈ U, w ∈ V}.
6:   U ← {top (K − |S|) candidates that maximize α × (1/t) log P(y|x) + (1 − α) × log v(x, y) | y ∈ U_expand}.
7:   U_complete ← {y | y ∈ U, y_t = EOS}.
8:   U ← U \ U_complete.
9:   S ← S ∪ U_complete.
10: until |S| = K or t = L
11: Output: y = argmax_{y ∈ S ∪ U} α × (1/|y|) log P(y|x) + (1 − α) × log v(x, y).

4.4 Inference

Since the value network estimates the long-term reward of a state, it is helpful for enhancing the decoding process of NMT. 
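The pairwise ranking loss of Eq. (7) can be sketched as follows. This is an illustrative reimplementation under simplifying assumptions: the inputs are scalar scores standing in for the value-network outputs v_ω, and all names are hypothetical.

```python
import math

def pairwise_rank_loss(pairs):
    """Eq. (7): sum over training pairs of exp(v(x, y_p2) - v(x, y_p1)),
    where y_p1 is the partial sentence with the higher avg_bleu.

    Each pair is (v_better, v_worse): the value-network outputs for the
    higher-BLEU and lower-BLEU partial sentence respectively.
    """
    return sum(math.exp(v_worse - v_better) for v_better, v_worse in pairs)

# Ranking a pair correctly contributes a small term; ranking it wrongly
# is penalized exponentially.
correct = pairwise_rank_loss([(2.0, 0.5)])   # exp(-1.5) ≈ 0.223
wrong = pairwise_rank_loss([(0.5, 2.0)])     # exp(1.5) ≈ 4.482
assert correct < wrong
```

Minimizing this loss pushes v_ω(x, y_p,1) above v_ω(x, y_p,2), i.e., the value network learns to rank partial sentences by their expected final BLEU.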
For example, at a certain decoding step, the NMT model may prefer word w_1 over w_2 according to the conditional probability, without knowing that picking w_2 would be the better choice for future decoding. As the value network provides sufficient information on the future reward, if its outputs show that picking w_2 is better than picking w_1, we can take both the NMT probability and the future reward into consideration to choose a better action.

In this paper, we simply linearly combine the outputs of the NMT model and the value network, motivated by the success of AlphaGo [11]. We first compute the normalized log probability of each candidate, and then linearly combine it with the logarithm of the predicted value. In detail, given a translation model P(y|x), a value network v(x, y) and a hyperparameter α ∈ (0, 1), the score of a partial sequence y for x is computed by

α × (1/|y|) log P(y|x) + (1 − α) × log v(x, y),    (8)

where |y| is the length of y. The details of the decoding process are presented in Algorithm 2, and we call our value network-based decoding algorithm NMT-VNN for short.

5 Experiments

5.1 Settings

We compare our proposed NMT-VNN with two baselines. The first is classic NMT with beam search [2] (NMT-BS). The second [16] trains a predictor that can evaluate the quality of any partial sequence, e.g., the partial BLEU score¹, and then uses the predictor to select words instead of the probability. The main difference between [16] and ours is that they predict the local improvement of BLEU for any single word, while we aim at predicting the final BLEU score and use the predicted score to select words. We refer to their work as beam search optimization (NMT-BSO). For NMT-BS, we directly used the open source code [2]. 
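The score of Eq. (8), and the candidate selection it drives, can be sketched as below. The candidate log-probabilities and values are invented for illustration, and α = 0.5 is an arbitrary choice rather than a tuned setting from the paper.

```python
import math

def combined_score(log_p, value, length, alpha=0.5):
    """Eq. (8): alpha * (1/|y|) * log P(y|x) + (1 - alpha) * log v(x, y)."""
    return alpha * log_p / length + (1 - alpha) * math.log(value)

def select(candidates, alpha=0.5):
    """Selection step in the spirit of Algorithm 2: order partial
    sentences by the combined NMT / value-network score.

    candidates: {tokens: (cumulative log P(y|x), v(x, y) in (0, 1))}.
    """
    return sorted(candidates,
                  key=lambda y: combined_score(*candidates[y], len(y), alpha),
                  reverse=True)

cands = {
    ("the", "cat"): (-1.0, 0.2),  # higher NMT probability, low predicted value
    ("the", "dog"): (-1.6, 0.6),  # lower probability, higher predicted BLEU
}
# The value network flips the preference toward the second candidate.
assert select(cands)[0] == ("the", "dog")
```

With α closer to 1 the ranking reduces to length-normalized beam search; with α closer to 0 it trusts the value network alone.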
NMT-BSO was implemented by ourselves based on the open source code [2].

We tested our proposed algorithms and the baselines on three language pairs: English→French (En→Fr), English→German (En→De), and Chinese→English (Zh→En). In detail, we used the same bilingual corpora from WMT'14 as used in [2], which contain 12M, 4.5M and 10M training sentence pairs for the three tasks respectively. Following common practice, for En→Fr and En→De we concatenated newstest2012 and newstest2013 as the validation set, and used newstest2014 as the test set. For Zh→En, we used the NIST 2006 and NIST 2008 datasets for testing, and the NIST 2004 dataset for validation. For all Chinese datasets, we used a public tool for word segmentation. In all experiments, validation sets were only used for early stopping and hyperparameter tuning.

For NMT-VNN and NMT-BS, we need to train an NMT model first. We followed [2] in setting the experimental parameters for training the NMT model. For each language, we constructed the vocabulary from the most common 30K words in the parallel corpora; out-of-vocabulary words were replaced with a special token "UNK". Each word was embedded into a vector space of 620 dimensions, and the dimension of the recurrent unit was 1000. We removed sentences with more than 50 words from the training set. The batch size was set to 80, with 20 batches pre-fetched and sorted by sentence length. The NMT model was trained with asynchronous SGD on four K40m GPUs for about seven days. For NMT-BSO, we implemented the algorithm and trained the model in the same environment.

For the value network used in NMT-VNN, we set the same parameters for the encoder-decoder layers as in the NMT model. Additionally, in the SM module and CC module, we set the functions µ_SM and µ_CC as single-layer feed-forward networks with 1000 output nodes. In Algorithm 1, we set the hyperparameter K = 20 to estimate the value of any partial sequence. 
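Algorithm 1 itself is not reproduced in this excerpt; as a hedged sketch of what the K = 20 setting suggests, a Monte-Carlo estimate of a partial sentence's value (its avg_bleu) could look like the following, where `complete_fn` (an NMT sampler) and `bleu_fn` (a BLEU scorer) are hypothetical stand-ins for components the paper takes from the NMT model.

```python
def estimate_value(partial, complete_fn, bleu_fn, reference, k=20):
    """Estimate avg_bleu(x, y_p): complete the partial sentence k times
    with the translation model and average the BLEU of the results."""
    completions = [complete_fn(partial) for _ in range(k)]
    return sum(bleu_fn(y, reference) for y in completions) / k

# Toy stand-ins: a "sampler" that appends a fixed suffix, and a unigram
# overlap score in place of real BLEU.
toy_complete = lambda p: p + ["sentence"]
toy_bleu = lambda hyp, ref: len(set(hyp) & set(ref)) / len(ref)
v = estimate_value(["a", "short"], toy_complete, toy_bleu,
                   reference=["a", "short", "sentence"], k=20)
assert v == 1.0
```

Estimates of this kind supply the avg_bleu labels used to rank the training pairs for the value network.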
We adopted mini-batch training with a batch size of 80, and the value network was trained with AdaDelta [21] on one K40m GPU for about three days.

¹If the ground truth is y*, the partial BLEU on the partial sequence y