{"title": "Cold-Start Reinforcement Learning with Softmax Policy Gradient", "book": "Advances in Neural Information Processing Systems", "page_first": 2817, "page_last": 2826, "abstract": "Policy-gradient approaches to reinforcement learning have two common and undesirable overhead procedures, namely warm-start training and sample variance reduction. In this paper, we describe a reinforcement learning method based on a softmax value function that requires neither of these procedures. Our method combines the advantages of policy-gradient methods with the efficiency and simplicity of maximum-likelihood approaches. We apply this new cold-start reinforcement learning method in training sequence generation models for structured output prediction problems. Empirical evidence validates this method on automatic summarization and image captioning tasks.", "full_text": "Cold-Start Reinforcement Learning with Softmax Policy Gradient\n\nNan Ding\nGoogle Inc.\nVenice, CA 90291\ndingnan@google.com\n\nRadu Soricut\nGoogle Inc.\nVenice, CA 90291\nrsoricut@google.com\n\nAbstract\n\nPolicy-gradient approaches to reinforcement learning have two common and\nundesirable overhead procedures, namely warm-start training and sample variance\nreduction. In this paper, we describe a reinforcement learning method based on a\nsoftmax value function that requires neither of these procedures. Our method\ncombines the advantages of policy-gradient methods with the efficiency and simplicity\nof maximum-likelihood approaches. We apply this new cold-start reinforcement\nlearning method in training sequence generation models for structured output\nprediction problems. Empirical evidence validates this method on automatic\nsummarization and image captioning tasks.\n\n1 Introduction\n\nReinforcement learning is the study of optimal sequential decision-making in an environment [16]. 
Its\nrecent developments underpin a large variety of applications related to robotics [11, 5] and games [20].\nPolicy search in reinforcement learning refers to the search for optimal parameters for a given policy\nparameterization [5]. Policy search based on policy-gradient [26, 21] has been recently applied to\nstructured output prediction for sequence generation. These methods alleviate two common problems\nthat approaches based on training with the Maximum-likelihood Estimation (MLE) objective exhibit,\nnamely the exposure-bias problem [24, 19] and the wrong-objective problem [19, 15] (more on this\nin Section 2). As a result of addressing these problems, policy-gradient methods achieve improved\nperformance compared to MLE training in various tasks, including machine translation [19, 7], text\nsummarization [19], and image captioning [19, 15].\nPolicy-gradient methods for sequence generation work as follows: first the model proposes a sequence,\nand the ground-truth target is used to compute a reward for the proposed sequence with respect to\nthe reward of choice (using metrics known to correlate well with human-rated correctness, such\nas ROUGE [13] for summarization, BLEU [18] for machine translation, CIDEr [23] or SPICE [1]\nfor image captioning, etc.). The reward is used as a weight for the log-likelihood of the proposed\nsequence, and learning is done by optimizing the weighted average of the log-likelihood of the\nproposed sequences. The policy-gradient approach works around the difficulty of differentiating the\nreward function (most such functions are non-differentiable) by using it as a weight. However, since\nsequences proposed by the model are also used as the target of the model, they are very noisy and\ntheir initial quality is extremely poor. The difficulty of aligning the model output distribution with\nthe reward distribution over the large search space of possible sequences makes training slow and\ninefficient\u2217. 
As a result, overhead procedures such as warm-start training with the MLE objective\nand sophisticated methods for sample variance reduction are required to train with policy gradient.\n\n\u2217Search space size is $O(V^T)$, where $V$ is the number of word types in the vocabulary (typically between $10^4$\nand $10^6$) and $T$ is the sequence length (typically between 10 and 50), hence between $10^{40}$ and $10^{300}$.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nThe fundamental reason for the inefficiency of policy-gradient\u2013based reinforcement learning is the\nlarge discrepancy between the model-output distribution and the reward distribution, especially in\nthe early stages of training. If, instead of generating the target based solely on the model-output\ndistribution, we generate it based on a proposal distribution that incorporates both the model-output\ndistribution and the reward distribution, learning would be efficient, and neither warm-start training\nnor sample variance reduction would be needed. The outstanding problem is finding a value function\nthat induces such a proposal distribution.\nIn this paper, we describe precisely such a value function, which in turn gives us a Softmax Policy\nGradient (SPG) method. The softmax terminology comes from the equation that defines this value\nfunction, see Section 3. The gradient of the softmax value function is equal to the average of the\ngradient of the log-likelihood of the targets whose proposal distribution combines both model output\ndistribution and reward distribution. Although this distribution is infeasible to sample exactly, we\nshow that one can draw samples approximately, based on an efficient forward-pass sampling scheme.\nTo balance the importance between the model output distribution and the reward distribution, we use\na bang-bang [8] mixture model to combine the two distributions. 
Such a scheme removes the need for\nfine-tuning the weights across different datasets and throughout the learning epochs. In addition to\nusing a main metric as the task reward (ROUGE, CIDEr, etc.), we show that one can also incorporate\nadditional, task-specific metrics to enforce various properties on the output sequences (Section 4).\nWe numerically evaluate our method on two sequence generation benchmarks, a headline-generation\ntask and an image-caption\u2013generation task (Section 5). In both cases, the SPG method significantly\nimproves the accuracy, compared to maximum-likelihood and other competing methods. Finally, it is\nworth noting that although the training and inference of the SPG method in this paper are mainly based\non sequence learning, the idea can be extended to other reinforcement learning applications.\n\n2 Limitations of Existing Sequence Learning Regimes\n\nOne of the standard approaches to sequence-learning training is Maximum-likelihood Estimation\n(MLE). Given a set of inputs $X = \\{x^i\\}$ and target sequences $Y = \\{y^i\\}$, the MLE loss function is:\n\n$$L_{MLE}(\\theta) = \\sum_i L^i_{MLE}(\\theta), \\quad \\text{where} \\quad L^i_{MLE}(\\theta) = -\\log p_\\theta(y^i|x^i). \\tag{1}$$\n\nHere $x^i$ and $y^i = \\{y^i_1, \\ldots, y^i_T\\}$ denote the input and the target sequence of the $i$-th example,\nrespectively. For instance, in the image captioning task, $x^i$ is the image of the $i$-th example, and $y^i$ is\nthe groundtruth caption of the $i$-th example.\nAlthough widely used in many different applications, MLE for sequence learning suffers\nfrom the exposure-bias problem [24, 19]. Exposure-bias refers to training procedures that produce\nbrittle models that have only been exposed to their training data distribution but not to their own\npredictions. At training-time, $\\log p_\\theta(y^i|x^i) = \\sum_t \\log p_\\theta(y^i_t|x^i, y^i_{1 \\ldots t-1})$, i.e. the loss of the $t$-th\nword is conditional on the true previous-target tokens $y^i_{1 \\ldots t-1}$. However, since $y^i_{1 \\ldots t-1}$ are unavailable\nduring inference, replacing them with tokens $z^i_{1 \\ldots t-1}$ generated by $p_\\theta(z^i_{1 \\ldots t-1}|x^i)$ yields a significant\ndiscrepancy between how the model is used at training time versus inference time. The exposure-bias\nproblem has recently received attention in neural-network settings with the \u201cdata as demonstrator\u201d [24]\nand \u201cscheduled sampling\u201d [3] approaches. Although improving model performance in practice, such\nproposals have been shown to be statistically inconsistent [10], and still need to perform MLE-based\nwarm-start training.\nA more general approach to MLE is the Reward Augmented Maximum Likelihood (RAML)\nmethod [17]. RAML makes the correct observation that, under MLE, all alternative outputs are\nequally penalized through normalization, regardless of their relationship to the ground-truth target.\nInstead, RAML corrects for this shortcoming using an objective of the form:\n\n$$L^i_{RAML}(\\theta) = -\\sum_{z^i} r_R(z^i|y^i) \\log p_\\theta(z^i|x^i), \\tag{2}$$\n\nwhere $r_R(z^i|y^i) = \\frac{\\exp(R(z^i|y^i)/\\tau)}{\\sum_{z^i} \\exp(R(z^i|y^i)/\\tau)}$. This formulation uses $R(z^i|y^i)$ to denote the value of a\nsimilarity metric $R$ between $z^i$ and $y^i$ (the reward), with $y^i = \\operatorname{argmax}_{z^i} R(z^i|y^i)$; $\\tau$ is a temperature\nhyper-parameter to control the peakiness of this reward distribution. Since the sum over all $z^i$ for\nthe reward distribution $r_R(z^i|y^i)$ in Eq. (2) is infeasible to compute, a standard approach is to\ndraw $J$ samples $z^{ij}$ from the reward distribution, and approximate the expectation by Monte Carlo\nintegration:\n\n$$L^i_{RAML}(\\theta) \\simeq -\\frac{1}{J} \\sum_{j=1}^{J} \\log p_\\theta(z^{ij}|x^i). \\tag{3}$$\n\nAlthough a clear improvement over Eq. (1), the sampling for $z^{ij}$ in Eq. (3) is solely based on\n$r_R(z^i|y^i)$ and completely ignores the model probability. At the same time, this technique does not\naddress the exposure bias problem at all.\nA different approach, based on reinforcement learning methods, achieves sequence learning following\na policy-gradient method [21]. Its appeal is that it not only solves the exposure-bias problem, but also\ndirectly alleviates the wrong-objective problem [19, 15] of MLE approaches. Wrong-objective refers\nto the critique that MLE-trained models tend to have suboptimal performance because such models\nare trained on a convenient objective (i.e., maximum likelihood) rather than a desirable objective\n(e.g., a metric known to correlate well with human-rated correctness). The policy-gradient method\nuses a value function $V_{PG}$, which is equivalent to a loss $L_{PG}$ defined as:\n\n$$L^i_{PG}(\\theta) = -V^i_{PG}(\\theta), \\quad V^i_{PG}(\\theta) = E_{p_\\theta(z^i|x^i)}[R(z^i|y^i)]. \\tag{4}$$\n\nThe gradient for Eq. (4) is:\n\n$$\\frac{\\partial}{\\partial\\theta} L^i_{PG}(\\theta) = -\\sum_{z^i} p_\\theta(z^i|x^i) R(z^i|y^i) \\frac{\\partial}{\\partial\\theta} \\log p_\\theta(z^i|x^i). \\tag{5}$$\n\nSimilar to Eq. (3), one can draw $J$ samples $z^{ij}$ from $p_\\theta(z^i|x^i)$ to approximate the expectation by Monte-Carlo\nintegration:\n\n$$\\frac{\\partial}{\\partial\\theta} L^i_{PG}(\\theta) \\simeq -\\frac{1}{J} \\sum_{j=1}^{J} R(z^{ij}|y^i) \\frac{\\partial}{\\partial\\theta} \\log p_\\theta(z^{ij}|x^i). \\tag{6}$$\n\nHowever, the large discrepancy between the model prediction distribution $p_\\theta(z^i|x^i)$ and the values of\nthe reward $R(z^i|y^i)$, which is especially acute during the early training stages, makes the Monte-Carlo\nintegration extremely inefficient. As a result, this method also requires a warm-start phase in which\nthe model distribution achieves some local maximum with respect to a reward-metric\u2013free objective\n(e.g., MLE), followed by a model refinement phase in which reward-metric\u2013based PG updates are\nused to refine the model [19, 7, 15]. 
Although this combination achieves better results in practice\ncompared to pure likelihood-based approaches, it is unsatisfactory from a theoretical and modeling\nperspective, as well as inefficient from a speed-to-convergence perspective. Both these issues are\naddressed by the value function we describe next.\n\n3 Softmax Policy Gradient (SPG) Method\n\nIn order to smoothly incorporate both the model distribution $p_\\theta(z^i|x^i)$ and the reward metric $R(z^i|y^i)$,\nwe replace the value function from Eq. (4) with a Softmax value function for Policy Gradient (SPG),\n$V_{SPG}$, equivalent to a loss $L_{SPG}$ defined as:\n\n$$L^i_{SPG}(\\theta) = -V^i_{SPG}(\\theta), \\quad V^i_{SPG}(\\theta) = \\log\\left(E_{p_\\theta(z^i|x^i)}[\\exp(R(z^i|y^i))]\\right). \\tag{7}$$\n\nBecause the value function for example $i$ is equal to $\\mathrm{Softmax}_{z^i}(\\log p_\\theta(z^i|x^i) + R(z^i|y^i))$, where\n$\\mathrm{Softmax}_{z^i}(\\cdot) = \\log \\sum_{z^i} \\exp(\\cdot)$, we call it the softmax value function. Note that the softmax value\nfunction from Eq. (7) is the dual of the entropy-regularized policy search (REPS) objective [5, 16]\n$L(q) = E_q[R] + KL(q|p_\\theta)$. However, our learning and sampling procedures are significantly\ndifferent from REPS, as shown in what follows.\nThe gradient for Eq. (7) is:\n\n$$\\frac{\\partial}{\\partial\\theta} L^i_{SPG}(\\theta) = -\\frac{1}{\\sum_{z^i} p_\\theta(z^i|x^i) \\exp(R(z^i|y^i))} \\left( \\sum_{z^i} p_\\theta(z^i|x^i) \\exp(R(z^i|y^i)) \\frac{\\partial}{\\partial\\theta} \\log p_\\theta(z^i|x^i) \\right) = -\\sum_{z^i} q_\\theta(z^i|x^i, y^i) \\frac{\\partial}{\\partial\\theta} \\log p_\\theta(z^i|x^i), \\tag{8}$$\n\nwhere $q_\\theta(z^i|x^i, y^i) = \\frac{1}{\\sum_{z^i} p_\\theta(z^i|x^i) \\exp(R(z^i|y^i))} p_\\theta(z^i|x^i) \\exp(R(z^i|y^i))$.\n\nThere are several advantages associated with the gradient from Eq. (8).\nFirst, $q_\\theta(z^i|x^i, y^i)$ takes into account both $p_\\theta(z^i|x^i)$\nand $R(z^i|y^i)$. As a result, Monte Carlo integration\nover $q_\\theta$-samples approximates Eq. (8) better, and has\nsmaller variance compared to Eq. (5). 
This allows\nour model to start learning from scratch without the\nwarm-start and variance-reduction crutches needed\nby previously-proposed PG approaches.\nSecond, as Figure 1 shows, the samples for the SPG\nmethod (pentagons) lie between the ground-truth target\ndistribution (triangle and circles) and the model\ndistribution (squares). These targets are both easier\nfor $p_\\theta$ to learn than ground-truth\u2013only targets\nlike the ones for MLE (triangle) and RAML (circles),\nand also carry more information about the ground-truth target compared to model-only samples (PG\nsquares). This formulation allows us to directly address the exposure-bias problem, by allowing the\nmodel distribution to learn at training time how to deal with events conditioned on model-generated\ntokens, similar to what happens at inference time (more on this in Section 3.2). At the same time,\nthe updates used for learning rely heavily on the influence of the reward metric $R(z^i|y^i)$, therefore\ndirectly addressing the wrong-objective problem. Together, these properties allow the model to\nachieve improved accuracy.\nThird, although $q_\\theta$ is infeasible for exact sampling, since both $p_\\theta(z^i|x^i)$ and $\\exp(R(z^i|y^i))$ are\nfactorizable across $z^i_t$ (where $z^i_t$ denotes the $t$-th word of the $i$-th output sequence), we can apply\nefficient approximate inference for the SPG method as shown in the next section.\n\nFigure 1: Comparing the target samples for\nMLE, RAML (the $r_R$ distribution), PG (the\n$p_\\theta$ distribution), and SPG (the $q_\\theta$ distribution).\n\n3.1 Inference\n\nIn order to estimate the gradient from Eq. (8) with Monte-Carlo integration, one needs to be able\nto draw samples from $q_\\theta(z^i|x^i, y^i)$. 
To tackle this problem, we first decompose $R(z^i|y^i)$ along the\n$t$-axis:\n\n$$R(z^i|y^i) = \\sum_{t=1}^{T} \\underbrace{R(z^i_{1:t}|y^i) - R(z^i_{1:t-1}|y^i)}_{\\triangleq \\Delta r^i_t(z^i_t|y^i, z^i_{1:t-1})},$$\n\nwhere $R(z^i_{1:t}|y^i) - R(z^i_{1:t-1}|y^i)$ characterizes the reward increment for $z^i_t$. Using the reward\nincrement notation, we can rewrite:\n\n$$q_\\theta(z^i|x^i, y^i) = \\frac{1}{Z_\\theta(x^i, y^i)} \\prod_{t=1}^{T} \\exp(\\log p_\\theta(z^i_t|z^i_{1:t-1}, x^i) + \\Delta r^i_t(z^i_t|y^i, z^i_{1:t-1})),$$\n\nwhere $Z_\\theta(x^i, y^i)$ is the partition function equal to the sum over all configurations of $z^i$. Since the\nnumber of such configurations grows exponentially with respect to the sequence-length $T$, directly\ndrawing from $q_\\theta(z^i|x^i, y^i)$ is infeasible. To make the inference efficient, we replace $q_\\theta(z^i|x^i, y^i)$\nwith the following approximate distribution:\n\n$$\\tilde{q}_\\theta(z^i|x^i, y^i) = \\prod_{t=1}^{T} \\tilde{q}_\\theta(z^i_t|x^i, y^i, z^i_{1:t-1}),$$\n\nwhere\n\n$$\\tilde{q}_\\theta(z^i_t|x^i, y^i, z^i_{1:t-1}) = \\frac{1}{\\tilde{Z}_\\theta(x^i, y^i, z^i_{1:t-1})} \\exp(\\log p_\\theta(z^i_t|z^i_{1:t-1}, x^i) + \\Delta r^i_t(z^i_t|y^i, z^i_{1:t-1})).$$\n\nBy replacing $q_\\theta$ in Eq. (8) with $\\tilde{q}_\\theta$, we obtain:\n\n$$\\frac{\\partial}{\\partial\\theta} L^i_{SPG}(\\theta) = -\\sum_{z^i} q_\\theta(z^i|x^i, y^i) \\frac{\\partial}{\\partial\\theta} \\log p_\\theta(z^i|x^i) \\simeq -\\sum_{z^i} \\tilde{q}_\\theta(z^i|x^i, y^i) \\frac{\\partial}{\\partial\\theta} \\log p_\\theta(z^i|x^i) \\triangleq \\frac{\\partial}{\\partial\\theta} \\tilde{L}^i_{SPG}(\\theta). \\tag{9}$$\n\nCompared to $Z_\\theta(x^i, y^i)$, $\\tilde{Z}_\\theta(x^i, y^i, z^i_{1:t-1})$ sums over the configurations of one $z^i_t$ only. Therefore,\nthe cost of drawing one $z^i$ from $\\tilde{q}_\\theta(z^i|x^i, y^i)$ grows only linearly with respect to $T$. 
Furthermore, for\ncommon reward metrics such as ROUGE and CIDEr, the computation of $\\Delta r^i_t(z^i_t|y^i, z^i_{1:t-1})$ can be\ndone in $O(T)$ instead of $O(V)$ (where $V$ is the size of the state space for $z^i_t$, i.e., vocabulary size).\nThat is because the maximum number of unique words in $y^i$ is $T$, and any words not in $y^i$ have the\nsame reward increment. When we limit ourselves to $J = 1$ sample for each example in Eq. (9), the\napproximate SPG inference time of each example is similar to the inference time for the gradient of\nthe MLE objective. Combined with the empirical findings in Section 5 (Figure 3) where the steps\nfor convergence are comparable, we conclude that the time for convergence for the SPG method is\nsimilar to the MLE-based method.\n\n3.2 Bang-bang Rewarded SPG Method\n\nOne additional difficulty for the SPG method is that the model\u2019s log-probability values\n$\\log p_\\theta(z^i_t|z^i_{1:t-1}, x^i)$ and the reward-increment values $R(z^i_{1:t}|y^i) - R(z^i_{1:t-1}|y^i)$ are not on the\nsame scale. In order to balance the impact of these two factors, we need to weigh them appropriately.\nFormally, we achieve this by adding a weight $w^i_t$ to the reward increments: $\\Delta r^i_t(z^i_t|y^i, z^i_{1:t-1}, w^i_t) \\triangleq w^i_t \\cdot \\Delta r^i_t(z^i_t|y^i, z^i_{1:t-1})$\nso that the total reward $R(z^i|y^i, w^i) = \\sum_{t=1}^{T} \\Delta r^i_t(z^i_t|y^i, z^i_{1:t-1}, w^i_t)$. The approximate\nproposal distribution becomes $\\tilde{q}_\\theta(z^i|x^i, y^i, w^i) = \\prod_{t=1}^{T} \\tilde{q}_\\theta(z^i_t|x^i, y^i, z^i_{1:t-1}, w^i_t)$, where\n\n$$\\tilde{q}_\\theta(z^i_t|x^i, y^i, z^i_{1:t-1}, w^i_t) \\propto \\exp(\\log p_\\theta(z^i_t|z^i_{1:t-1}, x^i) + \\Delta r^i_t(z^i_t|y^i, z^i_{1:t-1}, w^i_t)).$$\n\nThe challenge in this case is to choose an appropriate weight $w^i_t$, because $\\log p_\\theta(z^i_t|z^i_{1:t-1}, x^i)$ varies\nheavily for different $i$, $t$, as well as across different iterations and tasks.\nIn order to minimize the efforts for fine-tuning the reward weights, we propose a bang-bang rewarded\nsoftmax value function, equivalent to a loss $L_{BBSPG}$ defined as:\n\n$$L^i_{BBSPG}(\\theta) = -\\sum_{w^i} p(w^i) \\log\\left(E_{p_\\theta(z^i|x^i)}[\\exp(R(z^i|y^i, w^i))]\\right), \\tag{10}$$\n\n$$\\text{and} \\quad \\frac{\\partial}{\\partial\\theta} \\tilde{L}^i_{BBSPG}(\\theta) = -\\sum_{w^i} p(w^i) \\underbrace{\\sum_{z^i} \\tilde{q}_\\theta(z^i|x^i, y^i, w^i) \\frac{\\partial}{\\partial\\theta} \\log p_\\theta(z^i|x^i)}_{\\triangleq -\\frac{\\partial}{\\partial\\theta} \\tilde{L}^i_{SPG}(\\theta|w^i)}, \\tag{11}$$\n\nwhere $p(w^i) = \\prod_t p(w^i_t)$ and $p(w^i_t = 0) = p_{drop} = 1 - p(w^i_t = W)$. Here $W$ is a sufficiently\nlarge number (e.g., 10,000), $p_{drop}$ is a hyper-parameter in $[0, 1]$. The name bang-bang is borrowed\nfrom control theory [8], and refers to a system which switches abruptly between two extreme states\n(namely $W$ and 0).\nWhen $w^i_t = W$, the term $\\Delta r^i_t(z^i_t|y^i, z^i_{1:t-1}, w^i_t)$\noverwhelms $\\log p_\\theta(z^i_t|z^i_{1:t-1}, x^i)$, so the sampling of\n$z^i_t$ is decided by the reward increment of $z^i_t$. It is important\nto emphasize that in general the groundtruth\nlabel $y^i_t \\neq \\operatorname{argmax}_{z^i_t} \\Delta r^i_t(z^i_t|y^i, z^i_{1:t-1})$, because\n$z^i_{1:t-1}$ may not be the same as $y^i_{1:t-1}$ (see an example\nin Figure 2). The only special case is when\n$p_{drop} = 0$, which forces $w^i_t$ to always equal $W$, and\nimplies $z^i_t$ is always equal\u2020 to $y^i_t$ (and therefore the\nSPG method reduces to the MLE method).\nOn the other hand, when $w^i_t = 0$, by definition\n$\\Delta r^i_t(z^i_t|y^i, z^i_{1:t-1}, w^i_t) = 0$. In this case, the sampling\nof $z^i_t$ is based only on the model prediction\ndistribution $p_\\theta(z^i_t|z^i_{1:t-1}, x^i)$, the same situation we\nhave at inference time. Furthermore, we have the\nfollowing lemma (with the proof provided in the Supplementary Material):\n\nLemma 1. When $w^i_t = 0$,\n\n$$\\sum_{z^i} \\tilde{q}_\\theta(z^i|x^i, y^i, w^i) \\frac{\\partial}{\\partial\\theta} \\log p_\\theta(z^i_t|x^i, z^i_{1:t-1}) = 0.$$\n\nAs a result, $\\frac{\\partial}{\\partial\\theta} \\tilde{L}^i_{SPG}(\\theta|w^i)$ is very different from traditional PG-method gradients, in that only the $z^i_t$\nwith $w^i_t \\neq 0$ are included. To see that, using the fact that $\\log p_\\theta(z^i|x^i) = \\sum_{t=1}^{T} \\log p_\\theta(z^i_t|x^i, z^i_{1:t-1})$,\n\n$$\\frac{\\partial}{\\partial\\theta} \\tilde{L}^i_{SPG}(\\theta|w^i) = -\\sum_{z^i} \\tilde{q}_\\theta(z^i|x^i, y^i, w^i) \\sum_t \\frac{\\partial}{\\partial\\theta} \\log p_\\theta(z^i_t|x^i, z^i_{1:t-1}) = -\\sum_t \\sum_{z^i} \\tilde{q}_\\theta(z^i|x^i, y^i, w^i) \\frac{\\partial}{\\partial\\theta} \\log p_\\theta(z^i_t|x^i, z^i_{1:t-1}). \\tag{12}$$\n\nUsing the result of Lemma 1, Eq. (12) is equal to:\n\n$$\\frac{\\partial}{\\partial\\theta} \\tilde{L}^i_{SPG}(\\theta|w^i) = -\\sum_{\\{t: w^i_t \\neq 0\\}} \\sum_{z^i} \\tilde{q}_\\theta(z^i|x^i, y^i, w^i) \\frac{\\partial}{\\partial\\theta} \\log p_\\theta(z^i_t|x^i, z^i_{1:t-1}) = -\\sum_{z^i} \\tilde{q}_\\theta(z^i|x^i, y^i, w^i) \\sum_{\\{t: w^i_t \\neq 0\\}} \\frac{\\partial}{\\partial\\theta} \\log p_\\theta(z^i_t|x^i, z^i_{1:t-1}). \\tag{13}$$\n\nUsing Monte-Carlo integration, we approximate Eq. (11) by first drawing $w^{ij}$ from $p(w^i)$ and then\niteratively drawing $z^{ij}_t$ from $\\tilde{q}_\\theta(z^i_t|x^i, z^{ij}_{1:t-1}, y^i, w^{ij}_t)$ for $t = 1, \\ldots, T$. For larger values of $p_{drop}$,\nthe $w^{ij}$ sample contains more $w^{ij}_t = 0$ and the resulting $z^{ij}$ contains proportionally more samples\nfrom the model prediction distribution (with a direct effect on alleviating the exposure-bias problem).\nAfter $z^{ij}$ is obtained, only the log-likelihoods of $z^{ij}_t$ with $w^{ij}_t \\neq 0$ are included in the loss:\n\n$$\\frac{\\partial}{\\partial\\theta} \\tilde{L}^i_{BBSPG}(\\theta) \\simeq -\\frac{1}{J} \\sum_{j=1}^{J} \\sum_{\\{t: w^{ij}_t \\neq 0\\}} \\frac{\\partial}{\\partial\\theta} \\log p_\\theta(z^{ij}_t|x^i, z^{ij}_{1:t-1}). \\tag{14}$$\n\nThe details about the gradient evaluation for the bang-bang rewarded softmax value function are\ndescribed in Algorithm 1 of the Supplementary Material.\n\n\u2020This follows from recursively applying $R$\u2019s property that $y^i_t = \\operatorname{argmax}_{z^i_t} \\Delta r^i_t(z^i_t|y^i, z^i_{1:t-1} = y^i_{1:t-1})$.\n\nFigure 2: An example of sequence generation\nwith the bang-bang reward weights. $z_4$ = \u201cin\u201d is sampled from the model distribution\nsince $w_4 = 0$. Although $w_5 = W$, $z_5$ = \u201cthe\u201d $\\neq y_5$ because $z_4$ = \u201cin\u201d.\n[Figure 2 contents: $t = 1, \\ldots, 7$; $y$ = \u201ca man is sitting in the park\u201d; $w = (W, W, W, 0, W, \\ldots)$; $z$ = \u201ca man is in the ...\u201d; $\\operatorname{argmax} \\Delta r_5(z_5|y, z_{1:4})$ = \u2018the\u2019 $\\neq y_5$ = \u2018in\u2019.]\n\n4 Additional Reward Functions\n\nBesides the main reward function $R(z^i|y^i)$, additional reward functions can be used to enforce\ndesirable properties for the output sequences. For instance, in summarization, we occasionally find\nthat the decoded output sequence contains repeated words, e.g. \"US R&B singer Marie Marie Marie\nMarie ...\". 
In this framework, this can be directly fixed by using an additional auxiliary reward\nfunction that simply negatively rewards two consecutive identical tokens in the generated sequence:\n\n$$\\mathrm{DUP}^i_t = \\begin{cases} -1 & \\text{if } z^i_t = z^i_{t-1}, \\\\ 0 & \\text{otherwise.} \\end{cases}$$\n\nIn conjunction with the bang-bang weight scheme, the introduction of such a reward function has the\nimmediate effect of severely penalizing such \u201cstuttering\u201d in the model output; the decoded sequence\nafter applying the DUP negative reward becomes: \"US R&B singer Marie Christina has ...\".\nAdditionally, we can use the same approach to correct for certain biases in the forward sampling\napproximation. For example, the following function negatively rewards the end-of-sentence symbol\nwhen the length of the output sequence is less than that of the ground-truth target sequence $|y^i|$:\n\n$$\\mathrm{EOS}^i_t = \\begin{cases} -1 & \\text{if } z^i_t \\text{ is the end-of-sentence symbol and } t < |y^i|, \\\\ 0 & \\text{otherwise.} \\end{cases}$$\n\nA more detailed discussion about such reward functions is available in the Supplementary Material.\nDuring training, we linearly combine the main reward function with the auxiliary functions:\n\n$$\\Delta r^i_t(z^i_t|y^i, z^i_{1:t-1}, w^i_t) = w^i_t \\cdot \\left( R(z^i_{1:t}|y^i) - R(z^i_{1:t-1}|y^i) + \\mathrm{DUP}^i_t + \\mathrm{EOS}^i_t \\right),$$\n\nwith $W = 10,000$. During testing, since the ground-truth target $y^i$ is unavailable, this becomes:\n\n$$\\Delta r^i_t(z^i_t|y^i, z^i_{1:t-1}, W) = W \\cdot \\mathrm{DUP}^i_t.$$\n\n5 Experiments\n\nWe numerically evaluate the proposed softmax policy gradient (SPG) method on two sequence\ngeneration benchmarks: a document-summarization task for headline generation, and an automatic\nimage-captioning task. We compare the results of the SPG method against the standard maximum\nlikelihood estimation (MLE) method, as well as the reward augmented maximum likelihood (RAML)\nmethod [17]. 
Our experiments indicate that the SPG method significantly outperforms the other\napproaches on both the summarization and image-captioning tasks.\nWe implemented all the algorithms using TensorFlow 1.0 [6]. For the RAML method, we used\n$\\tau = 0.85$ which was the best performer in [17]. For the SPG algorithm, all the results were obtained\nusing a variant of ROUGE [13] as the main reward metric $R$, and $J = 1$ (sample one target for each\nexample, see Eq. (14)). We report the impact of $p_{drop}$ for values in $\\{0.2, 0.4, 0.6, 0.8\\}$.\nIn addition to using the main reward-metric for sampling targets, we also used it to weight the loss\nfor target $z^{ij}$, as we found that it improved the performance of the SPG algorithm. We also applied\na naive version of the policy gradient (PG) algorithm (without any variance reduction) by setting\n$p_{drop} = 0.0$, $W \\to 0$, but failed to train any meaningful model with cold-start. When starting from a\npre-trained MLE checkpoint, we found that it was unable to improve the original MLE result. This\nresult confirms that variance-reduction is a requirement for the PG method to work, whereas our SPG\nmethod is free of such requirements.\n\n5.1 Summarization Task: Headline Generation\n\nHeadline generation is a standard text generation task, taking as input a document and generating a\nconcise summary/headline for it. In our experiments, the supervised data comes from the English\nGigaword [9], and consists of news-articles paired with their headlines. We use a training set of\nabout 6 million article-headline pairs, in addition to two randomly-extracted validation and evaluation\nsets of 10K examples each. In addition to the Gigaword evaluation set, we also report results on the\nstandard DUC-2004 test set. The DUC-2004 consists of 500 news articles paired with four different\nhuman-generated groundtruth summaries, capped at 75 bytes.\u2021 The expected output is a summary of\nroughly 14 words, created based on the input article.\nWe use the sequence-to-sequence recurrent neural network\nwith attention model [2]. For encoding, we use\na three-layer, 512-dimensional bidirectional RNN architecture,\nwith a Gated Recurrent Unit (GRU) as the\nunit-cell [4]; for decoding, we use a similar three-layer,\n512-dimensional GRU-based architecture. Both the encoder\nand decoder networks use a shared vocabulary\nand embedding matrix for encoding/decoding the word\nsequences, with a vocabulary consisting of 220K word\ntypes and a 512-dimensional embedding. We truncate\nthe encoding sequences to a maximum of 30 tokens, and the decoding sequences to a maximum of\n15 tokens. The model is optimized using ADAGRAD with a mini-batch size of 200, a learning rate\nof 0.01, and gradient clipping with norm equal to 4. We use 40 workers for computing the updates,\nand 10 parameter servers for model storing and (asynchronous and distributed) updating. We run\nthe training procedure for 10M steps and pick the checkpoint with the best ROUGE-2 score on the\nGigaword validation set.\nWe report ROUGE-L scores on the Gigaword evaluation set, as well as the DUC-2004 set, in Table 1.\nThe scores are computed using the standard pyrouge package\u00a7, with standard errors computed using\nbootstrap resampling [12]. As the numerical values indicate, the maximum performance is achieved\nwhen $p_{drop}$ is in mid-range, with 37.8 F1 ROUGE-L at $p_{drop} = 0.4$ on the large Gigaword evaluation\nset (a larger range for $p_{drop}$ between 0.4 and 0.8 gives comparable scores on the smaller DUC-2004\nset). These numbers are significantly better compared to RAML (36.4 on Gigaword-10K), which in\nturn is significantly better compared to MLE (35.2).\n\nMethod     Gigaword-10K    DUC-2004\nMLE        35.2 \u00b1 0.3      22.6 \u00b1 0.6\nRAML       36.4 \u00b1 0.2      23.1 \u00b1 0.6\nSPG 0.2    36.6 \u00b1 0.2      23.5 \u00b1 0.6\nSPG 0.4    37.8 \u00b1 0.2      24.3 \u00b1 0.5\nSPG 0.6    37.4 \u00b1 0.2      24.1 \u00b1 0.5\nSPG 0.8    37.3 \u00b1 0.2      24.6 \u00b1 0.5\n\nTable 1: The F1 ROUGE-L scores (with\nstandard errors) for headline generation.\n\n\u2021This dataset is available by request at http://duc.nist.gov/data.html.\n\u00a7Available at pypi.python.org/pypi/pyrouge/0.1.3\n\n5.2 Automatic Image-Caption Generation\n\nFor the image-captioning task, we use the standard\nMSCOCO dataset [14]. The MSCOCO dataset contains\n82K training images and 40K validation images, each\nwith at least 5 groundtruth captions. The results are\nreported using the numerical values for the C40 testset\nreported by the MSCOCO online evaluation server\u00b6.\nFollowing standard practice, we combine the training\nand validation datasets for training our model, and hold\nout a subset of 4K images as our validation set.\n\nMethod     Validation-4K CIDEr    Validation-4K ROUGE-L    C40 CIDEr\nMLE        0.968                  37.7 \u00b1 0.1               0.94\nRAML       0.997                  38.0 \u00b1 0.1               0.97\nSPG 0.2    1.001                  38.0 \u00b1 0.1               0.98\nSPG 0.4    1.013                  38.1 \u00b1 0.1               1.00\nSPG 0.6    1.033                  38.2 \u00b1 0.1               1.01\nSPG 0.8    1.009                  37.7 \u00b1 0.1               1.00\n\nTable 2: The CIDEr (with the coco-caption\npackage) and ROUGE-L (with the pyrouge\npackage) scores for image captioning on\nMSCOCO.\n\nOur model architecture is simple, following the\nShow-and-Tell approach [25]. We\nuse a single 512-dimensional RNN architecture with an\nLSTM unit-cell, with a dropout rate of 0.3 applied\nto both input and output of the LSTM layer. We use the same vocabulary size of 8,854\nword-types as in [25], with 512-dimensional word-embeddings. We truncate the decoding sequences\nto a maximum of 15 tokens. The input image is embedded by first passing it through a pretrained\nInception-V3 network [22], and then projecting the result to a 512-dimensional vector. 
The model is optimized\nusing ADAGRAD with a mini-batch size of 25, a learning rate of 0.01, and gradient clipping with\nnorm equal to 4. We run the training procedure for 4M steps and pick the checkpoint with the best\nCIDEr score [23] on our held-out 4K validation set.\nWe report both CIDEr and ROUGE-L scores on our\n4K Validation set, as well as CIDEr scores on the official\nC40 testset as reported by the MSCOCO online\nevaluation server, in Table 2. The CIDEr scores are reported\nusing the coco-caption evaluation toolkit\u2016, while\nROUGE-L scores are reported using the standard pyrouge\npackage (note that these ROUGE-L scores are\ngenerally lower than those reported by the coco-caption\ntoolkit, as pyrouge reports an average score over multiple\nreferences, while coco-caption reports the maximum).\nThe evaluation results indicate that the SPG method is\nsuperior to both the MLE and RAML methods. The\nmaximum score is obtained with $p_{drop} = 0.6$, with a\nCIDEr score of 1.01 on the C40 testset. In contrast,\non the same testset, the RAML method has a CIDEr\nscore of 0.97, and the MLE method a score of 0.94. In\nFigure 3, we show that the number of steps for SPG to converge is similar to the one for MLE/RAML.\nWith the per-step inference cost of those methods being similar (see Section 3.1), the overall convergence\ntime for the SPG method is similar to the MLE and RAML methods.\n\nFigure 3: Number of training steps vs.\nCIDEr scores (on Validation-4K) for various\nlearning regimes.\n\n6 Conclusion\n\nThe reinforcement learning method presented in this paper, based on a softmax value function, is\nan efficient policy-gradient approach that eliminates the need for warm-start training and sample\nvariance reduction during policy updates. We show that this approach allows us to tackle sequence\ngeneration tasks by training models that avoid two long-standing issues: the exposure-bias problem\nand the wrong-objective problem. 
Experimental results confirm that the proposed method achieves superior performance on two different structured output prediction problems, one for text-to-text (automatic summarization) and one for image-to-text (automatic image captioning). We plan to explore and exploit the properties of this method for other reinforcement learning problems, and to study the impact of various, more advanced reward functions on the performance of the learned models.\n\n\u00b6Available at http://mscoco.org/dataset/#captions-eval.\n\u2016Available at https://github.com/tylin/coco-caption.\n\n[Figure 3 plot: x-axis Steps (0 to 2,500,000); y-axis CIDEr score (0.90 to 1.04); curves for MLE, RAML, and SPG 0.6]\n\nAcknowledgments\n\nWe are grateful to Sebastian Goodman for his contributions to the experiment code. We would also like to acknowledge Ning Ye and Zhenhai Zhu for their help with the image captioning model calibration, as well as the anonymous reviewers for their valuable comments.\n\nReferences\n[1] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. SPICE: semantic propositional image caption evaluation. In ECCV, 2016.\n\n[2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR, 2015.\n\n[3] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems 28, pages 1171\u20131179. 2015.\n\n[4] K. Cho, B. van Merrienboer, C. G\u00fcl\u00e7ehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of EMNLP, pages 1724\u20131734, 2014.\n\n[5] Marc P. Deisenroth, Gerhard Neumann, and Jan Peters. A survey on policy search for robotics. Foundations and Trends\u00ae in Robotics, 2(1\u20132):1\u2013142, 2013. ISSN 1935-8253.\n\n[6] M. 
Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/.\n\n[7] Y. Wu et al. Google\u2019s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144, 2016.\n\n[8] L. C. Evans. An introduction to mathematical optimal control theory. Preprint, version 0.2.\n\n[9] David Graff and Christopher Cieri. English Gigaword Fifth Edition LDC2003T05. In Linguistic Data Consortium, Philadelphia, 2003.\n\n[10] Ferenc Huszar. How (not) to train your generative model: Scheduled sampling, likelihood, adversary? CoRR, abs/1511.05101, 2015.\n\n[11] Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238\u20131274, 2013.\n\n[12] Philipp Koehn. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP, pages 388\u2013395, 2004.\n\n[13] Chin-Yew Lin and Franz Josef Och. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of ACL, 2004.\n\n[14] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll\u00e1r, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. CoRR, abs/1405.0312, 2014.\n\n[15] Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and Kevin Murphy. Optimization of image description metrics using policy gradient methods. In International Conference on Computer Vision (ICCV), 2017.\n\n[16] Gergely Neu, Anders Jonsson, and Vicen\u00e7 G\u00f3mez. A unified view of entropy-regularized markov decision processes. CoRR, abs/1705.07798, 2017.\n\n[17] M. Norouzi, S. Bengio, Z. Chen, N. Jaitly, M. Schuster, Y. Wu, and D. Schuurmans. 
Reward augmented maximum likelihood for neural structured prediction. In Advances in Neural Information Processing Systems 29, pages 1723\u20131731, 2016.\n\n[18] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: A method for automatic evaluation of machine translation. In Proceedings of ACL, 2002.\n\n[19] Marc\u2019Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. CoRR, abs/1511.06732, 2015.\n\n[20] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484\u2013489, 2016.\n\n[21] RS Sutton, D McAllester, S Singh, and Y Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS, 1999.\n\n[22] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015.\n\n[23] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.\n\n[24] Arun Venkatraman, Martial Hebert, and J. Andrew Bagnell. Improving multi-step prediction of learned time series models. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 3024\u20133030. AAAI Press, 2015.\n\n[25] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.\n\n[26] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. 
Machine Learning, 8(3):229\u2013256, 1992.\n", "award": [], "sourceid": 1608, "authors": [{"given_name": "Nan", "family_name": "Ding", "institution": "Google"}, {"given_name": "Radu", "family_name": "Soricut", "institution": "Google"}]}