{"title": "Memory Augmented Policy Optimization for Program Synthesis and Semantic Parsing", "book": "Advances in Neural Information Processing Systems", "page_first": 9994, "page_last": 10006, "abstract": "We present Memory Augmented Policy Optimization (MAPO), a simple and novel way to leverage a memory buffer of promising trajectories to reduce the variance of policy gradient estimate. MAPO is applicable to deterministic environments with discrete actions, such as structured prediction and combinatorial optimization tasks. We express the expected return objective as a weighted sum of two terms: an\nexpectation over the high-reward trajectories inside the memory buffer, and a separate expectation over trajectories outside the buffer. To make an efficient algorithm of MAPO, we propose: (1) memory weight clipping to accelerate and stabilize training; (2) systematic exploration to discover high-reward trajectories; (3) distributed sampling from inside and outside of the memory buffer to scale up training. MAPO improves the sample efficiency and robustness of policy gradient, especially on tasks with sparse rewards. We evaluate MAPO on weakly supervised program synthesis from natural language (semantic parsing). On the WikiTableQuestions benchmark, we improve the state-of-the-art by 2.6%, achieving an accuracy of 46.3%. On the WikiSQL benchmark, MAPO achieves an accuracy of 74.9% with only weak supervision, outperforming several strong baselines with full supervision. 
Our source code is available at https://goo.gl/TXBp4e", "full_text": "Memory Augmented Policy Optimization for\n\nProgram Synthesis and Semantic Parsing\n\nChen Liang\nGoogle Brain\n\ncrazydonkey200@gmail.com\n\nMohammad Norouzi\n\nGoogle Brain\n\nmnorouzi@google.com\n\nJonathan Berant\n\nTel-Aviv University, AI2\njoberant@cs.tau.ac.il\n\nQuoc Le\n\nGoogle Brain\n\nqvl@google.com\n\nNi Lao\n\nSayMosaic Inc.\n\nni.lao@mosaix.ai\n\nAbstract\n\nWe present Memory Augmented Policy Optimization (MAPO), a simple and novel\nway to leverage a memory buffer of promising trajectories to reduce the variance\nof policy gradient estimates. MAPO is applicable to deterministic environments\nwith discrete actions, such as structured prediction and combinatorial optimization.\nOur key idea is to express the expected return objective as a weighted sum of two\nterms: an expectation over the high-reward trajectories inside a memory buffer,\nand a separate expectation over trajectories outside of the buffer. To design an\nef\ufb01cient algorithm based on this idea, we propose: (1) memory weight clipping to\naccelerate and stabilize training; (2) systematic exploration to discover high-reward\ntrajectories; (3) distributed sampling from inside and outside of the memory buffer\nto speed up training. MAPO improves the sample ef\ufb01ciency and robustness of\npolicy gradient, especially on tasks with sparse rewards. We evaluate MAPO on\nweakly supervised program synthesis from natural language (semantic parsing). On\nthe WIKITABLEQUESTIONS benchmark, we improve the state-of-the-art by 2.6%,\nachieving an accuracy of 46.3%. On the WIKISQL benchmark, MAPO achieves\nan accuracy of 74.9% with only weak supervision, outperforming several strong\nbaselines with full supervision. 
Our source code is available at goo.gl/TXBp4e.

1 Introduction
There has been a recent surge of interest in applying policy gradient methods to various application domains including program synthesis [26, 17, 68, 10], dialogue generation [25, 11], architecture search [69, 71], games [53, 31] and continuous control [44, 50]. Simple policy gradient methods like REINFORCE [58] use Monte Carlo samples from the current policy to perform an on-policy optimization of the expected return objective. This often leads to unstable learning dynamics and poor sample efficiency, sometimes even underperforming random search [30].
The difficulty of gradient-based policy optimization stems from a few sources: (1) policy gradient estimates have a large variance; (2) samples from a randomly initialized policy often attain small rewards, resulting in slow training progress in the initial phase (cold start); (3) random policy samples do not explore the search space efficiently and systematically. These issues can be especially prohibitive in applications such as program synthesis and robotics [4], where the search space is large and the rewards are delayed and sparse. In such domains, a high reward is only achieved after a long sequence of correct actions. For instance, in program synthesis, only a few programs in the large combinatorial space of programs may correspond to the correct functional form. Unfortunately, relying on policy samples to explore the space often leads to forgetting a high-reward trajectory unless it is re-sampled frequently [26, 3].

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Learning through reflection on past experiences ("experience replay") is a promising direction to improve data efficiency and learning stability.
It has recently been widely adopted in various deep RL algorithms, but its theoretical analysis and empirical comparison are still lacking. As a result, defining the optimal strategy for prioritizing and sampling from past experiences remains an open question. There have been various attempts to incorporate off-policy samples within the policy gradient framework to improve the sample efficiency of the REINFORCE and actor-critic algorithms (e.g., [12, 57, 51, 15]). Most of these approaches utilize samples from an old policy through (truncated) importance sampling to obtain a low-variance but biased estimate of the gradients. Previous work has aimed to incorporate a replay buffer into policy gradient in the general RL setting of stochastic dynamics and possibly continuous actions. By contrast, we focus on deterministic environments with discrete actions and develop an unbiased policy gradient estimator with low variance (Figure 1).
This paper presents MAPO: a simple and novel way to incorporate a memory buffer of promising trajectories within the policy gradient framework. We express the expected return objective as a weighted sum of an expectation over the trajectories inside the memory buffer and a separate expectation over unknown trajectories outside of the buffer. The gradient estimates are unbiased and attain lower variance. Because high-reward trajectories remain in the memory, it is not possible to forget them. To make MAPO efficient, we propose three techniques: (1) memory weight clipping to accelerate and stabilize training; (2) systematic exploration of the search space to efficiently discover high-reward trajectories; (3) distributed sampling from inside and outside of the memory buffer to scale up training.
We assess the effectiveness of MAPO on weakly supervised program synthesis from natural language (see Section 2).
Program synthesis presents a unique opportunity to study generalization in the context of policy optimization, besides being an important real-world application. On the challenging WIKITABLEQUESTIONS [39] benchmark, MAPO achieves an accuracy of 46.3% on the test set, significantly outperforming the previous state-of-the-art of 43.7% [67]. Interestingly, on the WIKISQL [68] benchmark, MAPO achieves an accuracy of 74.9% without the supervision of gold programs, outperforming several strong fully supervised baselines.

Year | Venue    | Position | Event | Time
2001 | Hungary  | 2nd      | 400m  | 47.12
2003 | Finland  | 1st      | 400m  | 46.69
2005 | Germany  | 11th     | 400m  | 46.62
2007 | Thailand | 1st      | relay | 182.05
2008 | China    | 7th      | relay | 180.32

Table 1: x: Where did the last 1st place finish occur? y: Thailand

2 The Problem of Weakly Supervised Contextual Program Synthesis
Consider the problem of learning to map a natural language question x to a structured query a in a programming language such as SQL (e.g., [68]), or converting a textual problem description into a piece of source code as in programming competitions (e.g., [5]). We call these problems contextual program synthesis and aim at tackling them in a weakly supervised setting – i.e., no correct action sequence a, which corresponds to a gold program, is given as part of the training data, and training needs to solve the hard problem of exploring a large program space. Table 1 shows an example question-answer pair. The model needs to first discover the programs that can generate the correct answer in a given context, and then learn to generalize to new contexts.
We formulate the problem of weakly supervised contextual program synthesis as follows: generate a program using a parametric function, â = f(x; θ), where θ denotes the model parameters. The quality of a program â is measured by a scoring or reward function R(â | x, y).
The reward function may evaluate a program by executing it on a real environment and comparing the output against the correct answer. For example, it is natural to define a binary reward that is 1 when the output equals the answer and 0 otherwise. We assume that the context x includes both a natural language input and an environment, for example an interpreter or a database, on which the program will be executed. Given a dataset of context-answer pairs, {(x_i, y_i)}_{i=1}^N, the goal is to find optimal parameters θ* that parameterize a mapping x → a with maximum empirical return on a held-out test set.
One can think of the problem of contextual program synthesis as an instance of reinforcement learning (RL) with sparse terminal rewards and deterministic transitions, for which generalization plays a key role. There have been some recent attempts in the RL community to study generalization to unseen initial conditions (e.g., [45, 35]). However, most prior work aims to maximize empirical return on the training environment [6, 9]. The problem of contextual program synthesis presents a natural application of RL for which generalization is the main concern.

3 Optimization of Expected Return via Policy Gradients
To learn a mapping of (context x) → (program a), we optimize the parameters of a conditional distribution π_θ(a | x) that assigns a probability to each program given the context. That is, π_θ is a distribution over the countable set of all possible programs, denoted A. Thus ∀a ∈ A: π_θ(a | x) ≥ 0 and Σ_{a∈A} π_θ(a | x) = 1.
Then, to synthesize a program for a novel context, one finds the most likely program under the distribution π_θ via exact or approximate inference, â ≈ argmax_{a∈A} π_θ(a | x).
Autoregressive models present a tractable family of distributions that estimate the probability of a sequence of tokens, one token at a time, often from left to right. To handle variable sequence length, one includes a special end-of-sequence token at the end of the sequences. We express the probability of a program a given x as

π_θ(a | x) ≡ ∏_{t=1}^{|a|} π_θ(a_t | a_{<t}, x),

where a_{<t} ≡ (a_1, ..., a_{t−1}) denotes a prefix of the program a. One often uses a recurrent neural network (e.g., [20]) to predict the probability of each token given the prefix and the context.
In the absence of ground truth programs, policy gradient techniques present a way to optimize the parameters of a stochastic policy π_θ via optimization of expected return. Given a training dataset of context-answer pairs, {(x_i, y_i)}_{i=1}^N, the objective is expressed as E_{a∼π_θ(a|x)} R(a | x, y). The reward function R(a | x, y) evaluates a complete program a, based on the context x and the correct answer y. These assumptions characterize the problem of program synthesis well, but they also apply to many other discrete optimization and structured prediction domains.
Simplified notation. In what follows, we simplify the notation by dropping the dependence of the policy and the reward on x and y.
We use the notation π_θ(a) instead of π_θ(a | x) and R(a) instead of R(a | x, y) to make the formulation less cluttered, but the equations hold in the general case.
We express the expected return objective in the simplified notation as,

O_ER(θ) = Σ_{a∈A} π_θ(a) R(a) = E_{a∼π_θ(a)} R(a).   (1)

The REINFORCE [58] algorithm presents an elegant and convenient way to estimate the gradient of the expected return (1) using Monte Carlo (MC) samples. Using K trajectories sampled i.i.d. from the current policy π_θ, denoted {a^(1), ..., a^(K)}, the gradient estimate can be expressed as,

∇_θ O_ER(θ) = E_{a∼π_θ(a)} ∇ log π_θ(a) R(a) ≈ (1/K) Σ_{k=1}^K ∇ log π_θ(a^(k)) [R(a^(k)) − b],   (2)

where a baseline b is subtracted from the returns to reduce the variance of the gradient estimates. This formulation enables direct optimization of O_ER via MC sampling from an unknown search space, which also serves the purpose of exploration. To improve such exploration behavior, one often includes the entropy of the policy as an additional term inside the objective to prevent early convergence. However, the key limitation of the formulation stems from the difficulty of estimating the gradients accurately using only a few fresh samples.

4 MAPO: Memory Augmented Policy Optimization
We consider an RL environment with a finite number of discrete actions, deterministic transitions, and deterministic terminal returns. In other words, the set of all possible action trajectories A is countable, even though possibly infinite, and re-evaluating the return of a trajectory R(a) twice results in the same value.
These assumptions characterize the problem of program synthesis well, but also apply to many structured prediction problems [47, 37] and combinatorial optimization domains (e.g., [7]).
To reduce the variance in gradient estimation and prevent forgetting high-reward trajectories, we introduce a memory buffer, which saves a set of promising trajectories denoted B ≡ {a^(i)}_{i=1}^M. Previous works [26, 2, 60] utilized a memory buffer by adopting a training objective similar to

O_AUG(θ) = λ O_ER(θ) + (1 − λ) Σ_{a∈B} log π_θ(a),   (3)

which combines the expected return objective with a maximum likelihood objective over the memory buffer B. This training objective no longer directly optimizes the expected return, because the second term introduces bias into the gradient. When the trajectories in B are not gold trajectories but high-reward trajectories collected during exploration, uniformly maximizing the likelihood of each trajectory in B could be problematic.

Figure 1: Overview of MAPO compared with experience replay using importance sampling.

For example, in program synthesis, there can sometimes be spurious programs [40] that get the right answer, and thus receive high reward, for a wrong reason, e.g., using 2 + 2 to answer the question "what is two times two". Maximizing the likelihood of those high-reward but spurious programs will bias the gradient during training.
We aim to utilize the memory buffer in a principled way.
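For concreteness, the augmented objective of Eq. (3) can be sketched over a toy enumerable policy; the function and argument names are illustrative, and the mixing coefficient λ is exposed as `lam`:

```python
import math

def augmented_objective(policy_probs, rewards, buffer_trajs, lam=0.5):
    """O_AUG = λ·O_ER + (1 − λ)·Σ_{a∈B} log π_θ(a)  (Eq. 3).

    `policy_probs` maps each trajectory to π_θ(a) and `rewards` maps it to
    R(a); `buffer_trajs` is the memory buffer B. The second term maximizes
    the likelihood of every buffered trajectory uniformly, which is the
    source of the bias toward spurious high-reward programs.
    """
    expected_return = sum(policy_probs[a] * rewards[a] for a in policy_probs)
    buffer_log_lik = sum(math.log(policy_probs[a]) for a in buffer_trajs)
    return lam * expected_return + (1.0 - lam) * buffer_log_lik
```

The sketch makes the bias visible: the log-likelihood term pulls probability mass toward every buffered trajectory regardless of whether it is spurious.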
Our key insight is that one can re-express the expected return objective as a weighted sum of two terms: an expectation over the trajectories inside the memory buffer, and a separate expectation over the trajectories outside the buffer,

O_ER(θ) = Σ_{a∈B} π_θ(a) R(a) + Σ_{a∈(A−B)} π_θ(a) R(a)   (4)
        = π_B E_{a∼π⁺_θ(a)} R(a) + (1 − π_B) E_{a∼π⁻_θ(a)} R(a),   (5)

where the first term is the expectation inside B and the second the expectation outside B; A − B denotes the set of trajectories not included in the memory buffer; π_B = Σ_{a∈B} π_θ(a) denotes the total probability of the trajectories in the buffer; and π⁺_θ(a) and π⁻_θ(a) denote normalized probability distributions inside and outside of the buffer,

π⁺_θ(a) = π_θ(a)/π_B if a ∈ B, and 0 if a ∉ B;   π⁻_θ(a) = 0 if a ∈ B, and π_θ(a)/(1 − π_B) if a ∉ B.   (6)

The policy gradient can be expressed as,

∇_θ O_ER(θ) = π_B E_{a∼π⁺_θ(a)} ∇ log π_θ(a) R(a) + (1 − π_B) E_{a∼π⁻_θ(a)} ∇ log π_θ(a) R(a).   (7)

The second expectation can be estimated by sampling from π⁻_θ(a), which can be done through rejection sampling: sample from π_θ(a) and reject the sample if a ∈ B. If the memory buffer only contains a small number of trajectories, the first expectation can be computed exactly by enumerating all the trajectories in the buffer. The variance in gradient estimation is reduced because we get an exact estimate of the first expectation while sampling from a smaller stochastic space of measure (1 − π_B). If the memory buffer contains a large number of trajectories, the first expectation can be approximated by sampling.
Then, we get a stratified sampling estimator of the gradient. The trajectories inside and outside the memory buffer are two mutually exclusive and collectively exhaustive strata, and the variance reduction still holds. The weights for the first and second expectations are π_B and 1 − π_B respectively. We call π_B the memory weight.
In the following we present three techniques that make MAPO an efficient algorithm.
4.1 Memory Weight Clipping
Policy gradient methods usually suffer from a cold start problem. A key observation is that a "bad" policy, one that achieves low expected return, will assign small probabilities to the high-reward trajectories, which in turn causes them to be ignored during gradient estimation. So it is hard to improve from a random initialization, i.e., the cold start problem, or to recover from a bad update, i.e., the brittleness problem. Ideally, we want to force the policy gradient estimates to pay at least some attention to the high-reward trajectories. Therefore, we adopt a clipping mechanism over the memory weight π_B, which ensures that the weight given to the memory buffer is greater than or equal to α: if π_B < α, it is clipped to α. The new gradient estimate is,

∇_θ O^c_ER(θ) = π^c_B E_{a∼π⁺_θ(a)} ∇ log π_θ(a) R(a) + (1 − π^c_B) E_{a∼π⁻_θ(a)} ∇ log π_θ(a) R(a),   (8)

where π^c_B = max(π_B, α) is the clipped memory weight. At the beginning of training, the clipping is active and introduces a bias, but it accelerates and stabilizes training. Once the policy is off the ground, the memory weights are almost never clipped, given that they are naturally larger than α, and the gradients are no longer biased.
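A minimal sketch of the per-sample weights implied by Eqs. (4)–(8), assuming a toy policy whose probabilities can be enumerated (names are illustrative): enumerated buffer trajectories carry weight π^c_B · π_θ(a)/π_B, and each of n rejection samples from outside the buffer carries weight (1 − π^c_B)/n.

```python
def mapo_gradient_weights(policy_probs, buffer_trajs, outside_samples, alpha=0.1):
    """Weights multiplying each sample's R(a)·∇log π_θ(a) term in Eq. (8)."""
    pi_b = sum(policy_probs[a] for a in buffer_trajs)
    pi_b_clipped = max(pi_b, alpha)  # memory weight clipping, π^c_B = max(π_B, α)
    # Exact (enumerated) expectation inside the buffer, rescaled by π^c_B.
    inside = {a: pi_b_clipped * policy_probs[a] / pi_b for a in buffer_trajs}
    # Monte Carlo estimate outside the buffer via rejection samples.
    w_out = (1.0 - pi_b_clipped) / max(len(outside_samples), 1)
    return inside, {a: w_out for a in outside_samples}
```

When π_B ≥ α the clip is inactive and each buffer trajectory's weight reduces to π_θ(a), recovering the unbiased estimator of Eq. (7).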
See Section 5.4 for an empirical analysis of the clipping.
4.2 Systematic Exploration
To discover high-reward trajectories for the memory buffer B, we need to efficiently explore the search space. Exploration using policy samples suffers from repeated samples, which are a waste of computation in deterministic environments. So we propose systematic exploration to improve efficiency. More specifically, we keep a set B^e of fully explored partial sequences, which can be efficiently implemented using a Bloom filter. We then use it to enforce that the policy only takes actions that lead to unexplored sequences. Using a Bloom filter, we can store billions of sequences in B^e with only several gigabytes of memory. The pseudo code of this approach is shown in Algorithm 1. We warm start the memory buffer using systematic exploration from a random policy, as it can be trivially parallelized. In parallel to training, we continue the systematic exploration with the current policy to discover new high-reward trajectories.

Algorithm 1 Systematic Exploration
Input: context x, policy π, fully explored sub-sequences B^e, high-reward sequences B
Initialize: empty sequence a_{0:0}
while true do
    V = {a | a_{0:t−1} || a ∉ B^e}
    if V == ∅ then
        B^e ← B^e ∪ a_{0:t−1}; break
    sample a_t ∼ π^V(a | a_{0:t−1})
    a_{0:t} ← a_{0:t−1} || a_t
    if a_t == EOS then
        if R(a_{0:t}) > 0 then B ← B ∪ a_{0:t}
        B^e ← B^e ∪ a_{0:t}; break

4.3 Distributed Sampling
An exact computation of the first expectation of (5) requires an enumeration over the memory buffer.
The cost of gradient computation will grow linearly w.r.t. the number of trajectories in the buffer, so it can be prohibitively slow when the buffer contains a large number of trajectories. Alternatively, we can approximate the first expectation using sampling. As mentioned above, this can be viewed as stratified sampling and the variance is still reduced. Although the cost of gradient computation now grows linearly w.r.t. the number of samples instead of the total number of trajectories in the buffer, the cost of sampling still grows linearly w.r.t. the size of the memory buffer, because we need to compute the probability of each trajectory with the current model.
A key insight is that if the bottleneck is in sampling, the cost can be distributed through an actor-learner architecture similar to [15]. See Supplemental Material D for a figure depicting the actor-learner architecture. Each actor uses its model to sample trajectories from inside the memory buffer through renormalization (π⁺_θ in (6)), and uses rejection sampling to pick trajectories from outside the memory (π⁻_θ in (6)). It also computes the weights for these trajectories using the model. These trajectories and their weights are then pushed to a queue of samples. The learner fetches samples from the queue and uses them to compute gradient estimates to update the parameters. By distributing the cost of sampling to a set of actors, the training can be accelerated almost linearly w.r.t. the number of actors.
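Under the same toy enumerable-policy assumption, one actor's sampling step can be sketched as follows (Python's `queue` module stands in for the distributed sample queue; all names are illustrative):

```python
import queue
import random

def actor_step(policy_probs, memory, sample_queue, alpha=0.1, n_outside=4):
    """One actor iteration: push weighted samples for both strata.

    `policy_probs` maps every trajectory to π_θ(a) (a toy enumerable policy
    standing in for the seq2seq model); `memory` is the buffer B. A buffer
    sample is drawn by renormalization (π⁺_θ); outside samples are drawn by
    rejection sampling from π_θ (π⁻_θ).
    """
    pi_b = sum(policy_probs[a] for a in memory)
    w_inside = max(pi_b, alpha)  # clipped memory weight π^c_B
    mem = list(memory)
    inside = random.choices(mem, [policy_probs[a] / pi_b for a in mem])[0]
    sample_queue.put((inside, w_inside))
    trajs = list(policy_probs)
    weights = [policy_probs[a] for a in trajs]
    drawn = 0
    while drawn < n_outside:  # rejection sampling: resample until a ∉ B
        a = random.choices(trajs, weights)[0]
        if a not in memory:
            sample_queue.put((a, (1.0 - w_inside) / n_outside))
            drawn += 1
```

A learner would then pop `(trajectory, weight)` pairs from the queue and apply weight · R(a) · ∇log π_θ(a) updates.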
In our experiments, we got a ~20 times speedup from distributed sampling with 30 actors.

Algorithm 2 MAPO
Input: data {(x_i, y_i)}_{i=1}^N, memories {(B_i, B^e_i)}_{i=1}^N, constants α, ε, M
repeat  ▹ for all actors
    Initialize training batch D ← ∅
    Get a batch of inputs C
    for (x_i, y_i, B^e_i, B_i) ∈ C do
        Algorithm1(x_i, π^old_θ, B^e_i, B_i)
        Sample a⁺_i ∼ π^old_θ over B_i
        w⁺_i ← max(π^old_θ(B_i), α)
        D ← D ∪ {(a⁺_i, R(a⁺_i), w⁺_i)}
        Sample a_i ∼ π^old_θ
        if a_i ∉ B_i then
            w_i ← 1 − w⁺_i
            D ← D ∪ {(a_i, R(a_i), w_i)}
    Push D to training queue
until converge or early stop
repeat  ▹ for the learner
    Get a batch D from training queue
    for (a_i, R(a_i), w_i) ∈ D do
        dθ ← dθ + w_i R(a_i) ∇ log π_θ(a_i)
    update θ using dθ
    π^old_θ ← π_θ  ▹ once every M batches
until converge or early stop
Output: final parameters θ

4.4 Final Algorithm
The final training procedure is summarized in Algorithm 2. As mentioned above, we adopt the actor-learner architecture for distributed training. It uses multiple actors to collect training samples asynchronously and one learner to update the parameters based on the training samples. Each actor interacts with a set of environments to generate new trajectories. For efficiency, an actor uses a stale policy (π^old_θ), which is often a few steps behind the policy of the learner and is synchronized periodically. To apply MAPO, each actor also maintains a memory buffer B_i to save the high-reward trajectories. To prepare training samples for the learner, the actor picks n_b samples from inside B_i and also performs rejection sampling with n_o on-policy samples, both according to the actor's policy π^old_θ. We then use the actor policy to compute a weight max(π_θ(B), α) for the samples in the memory buffer, and use 1 − max(π_θ(B), α) for samples outside of the buffer. These samples are pushed to a
These samples are pushed to a\nqueue and the learner reads from the queue to compute gradients and update the parameters.\n\n\u2713\n\n\u2713\n\n5 Experiments\nWe evaluate MAPO on two program synthesis from natural language (also known as semantic\nparsing) benchmarks, WIKITABLEQUESTIONS and WIKISQL, which requires generating programs\nto query and process data from tables to answer natural language questions. We \ufb01rst compare\nMAPO to four common baselines, and ablate systematic exploration and memory weight clipping\nto show their utility. Then we compare MAPO to the state-of-the-art on these two benchmarks. On\nWIKITABLEQUESTIONS, MAPO is the \ufb01rst RL-based approach that signi\ufb01cantly outperforms the\nprevious state-of-the-art. On WIKISQL, MAPO trained with weak supervision (question-answer\npairs) outperforms several strong models trained with full supervision (question-program pairs).\n\n5.1 Experimental setup\nDatasets. WIKITABLEQUESTIONS [39] contains tables extracted from Wikipedia and question-\nanswer pairs about the tables. See Table 1 as an example. There are 2,108 tables and 18,496 question-\nanswer pairs splitted into train/dev/test set.. We follow the construction in [39] for converting a table\ninto a directed graph that can be queried, where rows and cells are converted to graph nodes while\ncolumn names become labeled directed edges. For the questions, we use string match to identify\nphrases that appear in the table. We also identify numbers and dates using the CoreNLP annotation\nreleased with the dataset. The task is challenging in several aspects. First, the tables are taken from\nWikipedia and cover a wide range of topics. Second, at test time, new tables that contain unseen\ncolumn names appear. Third, the table contents are not normalized as in knowledge-bases like\nFreebase, so there are noises and ambiguities in the table annotation. Last, the semantics are more\ncomplex comparing to previous datasets like WEBQUESTIONSSP [62]. 
It requires multi-step reasoning using a large set of functions, including comparisons, superlatives, aggregations, and arithmetic operations [39]. See Supplementary Material A for more details about the functions.
WIKISQL [68] is a recent large-scale dataset on learning natural language interfaces for databases. It also uses tables extracted from Wikipedia, but is much larger and is annotated with programs (SQL). There are 24,241 tables and 80,654 question-program pairs split into train/dev/test sets. Compared to WIKITABLEQUESTIONS, the semantics are simpler because the SQL queries use fewer operators (column selection, aggregation, and conditions). We perform similar preprocessing as for WIKITABLEQUESTIONS. Most of the state-of-the-art models rely on question-program pairs for supervised training, while we only use the question-answer pairs for weakly supervised training.
Model architecture. We adopt the Neural Symbolic Machines framework [26], which combines (1) a neural "programmer", a seq2seq model augmented by a key-variable memory that can translate a natural language utterance to a program as a sequence of tokens, and (2) a symbolic "computer", a Lisp interpreter that implements a domain-specific language with built-in functions and provides code assistance by eliminating syntactically or semantically invalid choices.
For the Lisp interpreter, we added functions according to [67, 34] for the WIKITABLEQUESTIONS experiments and used the subset of functions equivalent to column selection, aggregation, and conditions for WIKISQL. See Supplementary Material A for more details about the functions used.
We implemented the seq2seq model augmented with key-variable memory from [26] in TensorFlow [1]. Some minor differences are: (1) we used a bi-directional LSTM for the encoder; (2) we used a two-layer LSTM with skip-connections in both the encoder and decoder.
GloVe [43] embeddings are used for the embedding layer in the encoder and also to create embeddings for column names by averaging the embeddings of the words in a name.

Figure 2: Comparison of dev set accuracy curves of MAPO and 3 baselines. Results on WIKITABLEQUESTIONS are on the left and results on WIKISQL are on the right. Each plot shows the average of 5 runs with a bar of one standard deviation. The horizontal coordinate (training steps) is in log scale.

Following [34, 24], we also add a binary feature in each step of the encoder, indicating whether this word is found in the table, and an integer feature for a column name counting how many of the words in the column name appear in the question. For the WIKITABLEQUESTIONS dataset, we use the CoreNLP annotation of numbers and dates released with the dataset. For the WIKISQL dataset, only numbers are used, so we use a simple parser to identify and parse the numbers in the questions, and the tables are already preprocessed. The tokens of the numbers and dates are anonymized as two special tokens. The hidden size of the LSTM is 200. We keep the GloVe embeddings fixed during training, but project them to 200 dimensions using a trainable linear transformation. The same architecture is used for both datasets.
Training Details. We first apply systematic exploration using a random policy to discover high-reward programs to warm start the memory buffer of each example. For WIKITABLEQUESTIONS, we generated 50k programs per example using systematic exploration with pruning rules inspired by the grammars from [67] (see Supplementary E). We apply 0.2 dropout on both encoder and decoder. Each batch includes samples from 25 examples. For the experiments on WIKISQL, we generated 1k programs per example due to computational constraints. Because the dataset is much larger, we do not use any regularization. Each batch includes samples from 125 examples.
We use distributed sampling for WIKITABLEQUESTIONS. For WIKISQL, due to computational constraints, we truncate each memory buffer to the top 5 programs and then enumerate all 5 programs for training. For both experiments, the samples outside the memory buffer are drawn using rejection sampling from 1 on-policy sample per example. At inference time, we apply beam search of size 5. We evaluate the model periodically on the dev set to select the best model. We apply a distributed actor-learner architecture for training. The actors use CPUs to generate new trajectories and push the samples into a queue. The learner reads batches of data from the queue and uses a GPU to accelerate training (see Supplementary D). We use the Adam optimizer for training and the learning rate is 10^−3. All the hyperparameters are tuned on the dev set. We train the model for 25k steps on WIKITABLEQUESTIONS and 15k steps on WIKISQL.

5.2 Comparison to baselines
We first compare MAPO against the following baselines using the same neural architecture.
§ REINFORCE: We use on-policy samples to estimate the gradient of the expected return as in (2), not utilizing any form of memory buffer.
§ MML: Maximum Marginal Likelihood maximizes the marginal probability of the memory buffer as in O_MML(θ) = (1/N) log ∏_i Σ_{a∈B_i} π_θ(a) = (1/N) Σ_i log Σ_{a∈B_i} π_θ(a). Assuming binary rewards and assuming that the memory buffer contains almost all of the trajectories with a reward of 1, MML optimizes the marginal probability of generating a rewarding program. Note that under these assumptions, the expected return can be expressed as O_ER(θ) ≈ (1/N) Σ_i Σ_{a∈B_i} π_θ(a). Comparing the two objectives, we can see that MML maximizes the product of marginal probabilities, whereas expected return maximizes the sum.
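The contrast between the two objectives can be made concrete with a small sketch (toy per-example buffer probabilities; names are illustrative):

```python
import math

def mml_objective(buffer_probs_per_example):
    """O_MML = (1/N) Σ_i log Σ_{a∈B_i} π_θ(a): a product of marginals in log space."""
    n = len(buffer_probs_per_example)
    return sum(math.log(sum(probs)) for probs in buffer_probs_per_example) / n

def er_objective(buffer_probs_per_example):
    """O_ER ≈ (1/N) Σ_i Σ_{a∈B_i} π_θ(a), assuming binary rewards and a buffer
    containing nearly all reward-1 trajectories."""
    n = len(buffer_probs_per_example)
    return sum(sum(probs) for probs in buffer_probs_per_example) / n
```

Because the log makes O_MML a product of per-example marginals, one example with a near-zero marginal dominates the objective, whereas the sum in O_ER degrades gracefully.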
More discussion of these two objectives can be found in [17, 36, 48].
• Hard EM: The Expectation-Maximization algorithm is commonly used to optimize the marginal likelihood in the presence of latent variables. Hard EM uses the samples with the highest probability to approximate the gradient of O_MML.
• IML: Iterative Maximum Likelihood training [26, 2] uniformly maximizes the likelihood of all the trajectories with the highest rewards, O_ML(θ) = Σ_{a∈B} log π_θ(a).
Because the memory buffer is too large to enumerate, we use samples from the buffer to approximate the gradient for MML and IML, and use the samples with the highest π_θ(a) for Hard EM.
We show the results in Table 2 and the dev accuracy curves in Figure 2. Removing systematic exploration or memory weight clipping significantly weakens MAPO because high-reward trajectories are either not found or easily forgotten. REINFORCE barely learns anything because, starting from a random policy, most samples result in a reward of zero. MML and Hard EM converge faster, but the learned models underperform MAPO, which suggests that expected return is a better objective. IML runs faster because it samples randomly from the buffer, but its objective is prone to spurious programs.

5.3 Comparison to state-of-the-art

On WIKITABLEQUESTIONS (Table 3), MAPO is the first RL-based approach that significantly outperforms the previous state-of-the-art, by 2.6%. Unlike previous work, MAPO does not require manual feature engineering or additional human annotation¹. On WIKISQL (Table 4), even though MAPO does not exploit ground-truth programs (weak supervision), it outperforms many strong baselines trained using programs (full supervision). The techniques introduced in other models could be incorporated to further improve MAPO's results, but we leave that as future work.
We also qualitatively analyzed a trained model and found that it can generate fairly complex programs; see Supplementary Material B for some examples of generated programs. We select the best model based on validation accuracy and report its test accuracy. Given the variance caused by the non-linear optimization procedure, we also report the mean accuracy and standard deviation based on 5 runs, although these are not available for other models.

5.4 Analysis of Memory Weight Clipping

In this subsection, we present an analysis of the bias introduced by memory weight clipping. We define the clipping fraction as the percentage of examples where the clipping is active, in other words, the percentage of examples with a non-empty memory buffer for which π_B < α. It is also the fraction of examples whose gradient computation is biased by the clipping, so the higher the value, the more bias; the gradient is unbiased when the clipping fraction is zero. In Figure 3, one can observe that the clipping fraction approaches zero towards the end of training and is negatively correlated with the training accuracy. In our experiments, we found that a fixed clipping threshold works well, but one could also gradually decrease the clipping threshold to completely remove the bias.

                 WIKITABLE     WIKISQL
REINFORCE        < 10          < 10
MML (Soft EM)    39.7 ± 0.3    70.7 ± 0.1
Hard EM          39.3 ± 0.6    70.2 ± 0.3
IML              36.8 ± 0.5    70.1 ± 0.2
MAPO             42.3 ± 0.3    72.2 ± 0.2
MAPO w/o SE      < 10          < 10
MAPO w/o MWC     < 10          < 10

Table 2: Ablation study for Systematic Exploration (SE) and Memory Weight Clipping (MWC). We report mean accuracy % and its standard deviation on dev sets based on 5 runs.

                                E.S.   Dev.   Test
Pasupat & Liang (2015) [39]     -      37.0   37.1
Neelakantan et al. (2017) [34]  1      34.1   34.2
Neelakantan et al. (2017) [34]  15     37.5   37.7
Haug et al. (2017) [18]         1      -      34.8
Haug et al. (2017) [18]         15     -      38.7
Zhang et al. (2017) [67]        -      40.4   43.7
MAPO                            1      42.7   43.8
MAPO (mean of 5 runs)           -      42.3   43.1
MAPO (std of 5 runs)            -      0.5    0.3
MAPO (ensembled)                10     -      46.3

Table 3: Results on WIKITABLEQUESTIONS. E.S. is the ensemble size, when applicable.

Fully supervised                Dev.   Test
Zhong et al. (2017) [68]        60.8   59.4
Wang et al. (2017) [56]         67.1   66.8
Xu et al. (2017) [61]           69.8   68.0
Huang et al. (2018) [22]        68.3   68.0
Yu et al. (2018) [63]           74.5   73.5
Sun et al. (2018) [54]          75.1   74.6
Dong & Lapata (2018) [14]       79.0   78.5
Weakly supervised               Dev.   Test
MAPO                            72.2   72.6
MAPO (mean of 5 runs)           72.2   72.1
MAPO (std of 5 runs)            0.2    0.3
MAPO (ensemble of 10)           -      74.9

Table 4: Results on WIKISQL. Unlike other methods, MAPO only uses weak supervision.

¹Krishnamurthy et al. [24] achieved 45.9 accuracy when trained on the data collected with dynamic programming and pruned with more human annotations [41, 32].

Figure 3: The clipping fraction and training accuracy w.r.t. the training steps (log scale).

6 Related work
Program synthesis & semantic parsing. There has been a surge of recent interest in applying reinforcement learning to program synthesis [10, 2, 64, 33] and combinatorial optimization [70, 7]. Different from these efforts, we focus on contextualized program synthesis, where generalization to new contexts is important. Semantic parsing [65, 66, 27] maps natural language to executable symbolic representations. Training semantic parsers through weak supervision is challenging because the model must interact with a symbolic interpreter through non-differentiable operations to search over a large space of programs [8, 26].
Previous work [17, 34] reports negative results when applying simple policy gradient methods like REINFORCE [58], which highlights the difficulty of exploration and optimization when applying RL techniques. MAPO takes advantage of the discrete and deterministic nature of program synthesis and significantly improves upon REINFORCE.
Experience replay. An experience replay buffer [28] enables storage and usage of past experiences to improve the sample efficiency of RL algorithms. Prioritized experience replay [49] prioritizes replays based on temporal-difference error for more efficient optimization. Hindsight experience replay [4] incorporates goals into replays to deal with sparse rewards. MAPO also uses past experiences to tackle sparse reward problems, but by storing and reusing high-reward trajectories, similar to [26, 38]. Previous work [26] assigns a fixed weight to the trajectories, which introduces bias into the policy gradient estimates. More importantly, the policy is often trained equally on trajectories that have the same reward, which is prone to spurious programs. By contrast, MAPO uses the trajectories in a principled way to obtain an unbiased, low-variance gradient estimate.
Variance reduction. Policy optimization via gradient descent is challenging because of: (1) large variance in gradient estimates; (2) small gradients in the initial phase of training. Prior variance reduction approaches [59, 58, 29, 16] mainly relied on control variate techniques, introducing a critic model [23, 31, 51]. MAPO takes a different approach, reformulating the gradient as a combination of expectations inside and outside a memory buffer. Standard solutions to the small gradient problem involve supervised pretraining [52, 19, 46] or using supervised data to generate rewarding samples [36, 13], which cannot be applied when supervised data are not available.
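The reformulation of the gradient as a combination of expectations inside and outside the buffer can be sketched on a toy enumerable problem. The distribution, rewards, and clipping threshold below are made up, grad-log-probabilities are mocked as scalars, and real MAPO samples the outside term from large spaces rather than a 5-trajectory one; this is an illustrative sketch, not the paper's implementation.

```python
import random

# Toy setup (all numbers made up): 5 trajectories with policy probabilities
# and binary rewards; the two high-reward trajectories sit in the memory buffer.
probs   = {"t1": 0.05, "t2": 0.15, "t3": 0.30, "t4": 0.40, "t5": 0.10}
rewards = {"t1": 1.0,  "t2": 1.0,  "t3": 0.0,  "t4": 0.0,  "t5": 0.0}
buffer  = {"t1", "t2"}
ALPHA   = 0.1  # clipping threshold for the memory weight

def glp(t):
    # Stand-in for grad log pi_theta(t); a fixed scalar keeps the demo simple.
    return 1.0

def mapo_grad(n_outside=1000, seed=0):
    """Estimate the gradient as w * E_inside[R * glp] + (1 - w) * E_outside[R * glp],
    enumerating the (small) inside term exactly and clipping w = max(pi_B, alpha)."""
    rng = random.Random(seed)
    pi_b = sum(probs[t] for t in buffer)   # total probability mass of the buffer
    w = max(pi_b, ALPHA)                   # memory weight clipping
    inside = sum(probs[t] / pi_b * rewards[t] * glp(t) for t in buffer)
    # Outside term: rejection sampling, i.e. on-policy samples with
    # buffer trajectories rejected.
    outs = []
    while len(outs) < n_outside:
        t = rng.choices(list(probs), weights=list(probs.values()))[0]
        if t not in buffer:
            outs.append(rewards[t] * glp(t))
    outside = sum(outs) / len(outs)
    return w * inside + (1 - w) * outside

print(mapo_grad())  # close to 0.2, the exact expected return in this toy
```

Because the inside term is computed exactly over the buffer, the only sampling noise comes from the (here zero-reward) outside term, which illustrates why the stratified estimator has lower variance than pure on-policy sampling.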
MAPO reduces the variance by sampling from a smaller stochastic space or through stratified sampling, and accelerates and stabilizes training by clipping the weight of the memory buffer.
Exploration. Recently there has been a lot of work on improving exploration [42, 55, 21] by introducing additional rewards based on information gain or pseudo-counts. For program synthesis [5, 34, 10], the search spaces are enumerable and deterministic, so we propose to conduct systematic exploration, which ensures that only novel trajectories are generated.

7 Conclusion
We present Memory Augmented Policy Optimization (MAPO), which incorporates a memory buffer of promising trajectories to reduce the variance of policy gradients. We propose 3 techniques to enable an efficient algorithm for MAPO: (1) memory weight clipping to accelerate and stabilize training; (2) systematic exploration to efficiently discover high-reward trajectories; (3) distributed sampling from inside and outside the memory buffer to scale up training. MAPO is evaluated on real-world program synthesis from natural language / semantic parsing tasks. On WIKITABLEQUESTIONS, MAPO is the first RL approach that significantly outperforms the previous state-of-the-art; on WIKISQL, MAPO trained with only weak supervision outperforms several strong baselines trained with full supervision.

Acknowledgments
We would like to thank Dan Abolafia, Ankur Taly, Thanapon Noraset, Arvind Neelakantan, Wenyun Zuo, Chenchen Pan and Mia Liang for helpful discussions. Jonathan Berant was partially supported by The Israel Science Foundation grant 942/16.

References
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Gregory S.
Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian J. Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Józefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Gordon Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul A. Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda B. Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467, 2016.

[2] Daniel A. Abolafia, Mohammad Norouzi, and Quoc V. Le. Neural program synthesis with priority queue training. arXiv:1801.03526, 2018.

[3] Daniel A. Abolafia, Mohammad Norouzi, Jonathan Shen, Rui Zhao, and Quoc V. Le. Neural program synthesis with priority queue training. arXiv:1801.03526, 2018.

[4] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. NIPS, 2017.

[5] M. Balog, A. L. Gaunt, M. Brockschmidt, S. Nowozin, and D. Tarlow. DeepCoder: Learning to write programs. ICLR, 2017.

[6] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. JMLR, 2013.

[7] Irwan Bello, Hieu Pham, Quoc V. Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning. arXiv:1611.09940, 2016.

[8] Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on Freebase from question-answer pairs. EMNLP, 2013.

[9] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym.
arXiv:1606.01540, 2016.

[10] Rudy Bunel, Matthew Hausknecht, Jacob Devlin, Rishabh Singh, and Pushmeet Kohli. Leveraging grammar and reinforcement learning for neural program synthesis. ICLR, 2018.

[11] Abhishek Das, Satwik Kottur, José M. F. Moura, Stefan Lee, and Dhruv Batra. Learning cooperative visual dialog agents with deep reinforcement learning. arXiv:1703.06585, 2017.

[12] Thomas Degris, Martha White, and Richard S. Sutton. Off-policy actor-critic. ICML, 2012.

[13] Nan Ding and Radu Soricut. Cold-start reinforcement learning with softmax policy gradient. NIPS, 2017.

[14] Li Dong and Mirella Lapata. Coarse-to-fine decoding for neural semantic parsing. arXiv:1805.04793, 2018.

[15] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv:1802.01561, 2018.

[16] Will Grathwohl, Dami Choi, Yuhuai Wu, Geoff Roeder, and David Duvenaud. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. arXiv:1711.00123, 2017.

[17] Kelvin Guu, Panupong Pasupat, Evan Liu, and Percy Liang. From language to programs: Bridging reinforcement learning and maximum marginal likelihood. ACL, 2017.

[18] Till Haug, Octavian-Eugen Ganea, and Paulina Grnarova. Neural multi-step reasoning for question answering on semi-structured tables. ECIR, 2018.

[19] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Andrew Sendonaris, Gabriel Dulac-Arnold, Ian Osband, John Agapiou, Joel Z. Leibo, and Audrunas Gruslys. Deep Q-learning from demonstrations. AAAI, 2018.

[20] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory.
Neural Computation, 1997.

[21] Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. VIME: Variational information maximizing exploration. NIPS, 2016.

[22] Po-Sen Huang, Chenglong Wang, Rishabh Singh, Wen-tau Yih, and Xiaodong He. Natural language to structured query generation via meta-learning. arXiv:1803.02400, 2018.

[23] Vijay R. Konda and John N. Tsitsiklis. Actor-critic algorithms. NIPS, 2000.

[24] Jayant Krishnamurthy, Pradeep Dasigi, and Matt Gardner. Neural semantic parsing with type constraints for semi-structured tables. EMNLP, 2017.

[25] Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. Deep reinforcement learning for dialogue generation. arXiv:1606.01541, 2016.

[26] Chen Liang, Jonathan Berant, Quoc Le, Kenneth D. Forbus, and Ni Lao. Neural symbolic machines: Learning semantic parsers on Freebase with weak supervision. ACL, 2017.

[27] P. Liang, M. I. Jordan, and D. Klein. Learning dependency-based compositional semantics. ACL, 2011.

[28] Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321, 1992.

[29] Hao Liu, Yihao Feng, Yi Mao, Dengyong Zhou, Jian Peng, and Qiang Liu. Sample-efficient policy optimization with Stein control variate. arXiv:1710.11198, 2017.

[30] Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search provides a competitive approach to reinforcement learning. arXiv:1803.07055, 2018.

[31] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning.
ICML, 2016.

[32] Pramod Kaushik Mudrakarta, Ankur Taly, Mukund Sundararajan, and Kedar Dhamdhere. It was the training data pruning too! arXiv:1803.04579, 2018.

[33] Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Bridging the gap between value and policy based reinforcement learning. NIPS, 2017.

[34] Arvind Neelakantan, Quoc V. Le, Martín Abadi, Andrew D. McCallum, and Dario Amodei. Learning a natural language interface with neural programmer. arXiv:1611.08945, 2016.

[35] Alex Nichol, Vicki Pfau, Christopher Hesse, Oleg Klimov, and John Schulman. Gotta learn fast: A new benchmark for generalization in RL. arXiv:1804.03720, 2018.

[36] Mohammad Norouzi, Samy Bengio, Navdeep Jaitly, Mike Schuster, Yonghui Wu, Dale Schuurmans, et al. Reward augmented maximum likelihood for neural structured prediction. NIPS, 2016.

[37] Sebastian Nowozin, Christoph H. Lampert, et al. Structured learning and prediction in computer vision. Foundations and Trends in Computer Graphics and Vision, 6(3–4):185–365, 2011.

[38] Junhyuk Oh, Yijie Guo, Satinder Singh, and Honglak Lee. Self-imitation learning. ICML, 2018.

[39] Panupong Pasupat and Percy Liang. Compositional semantic parsing on semi-structured tables. ACL, 2015.

[40] Panupong Pasupat and Percy Liang. Inferring logical forms from denotations. ACL, 2016.

[41] Panupong Pasupat and Percy Liang. Inferring logical forms from denotations. ACL, 2016.

[42] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction.
ICML, 2017.

[43] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. EMNLP, 2014.

[44] Jan Peters and Stefan Schaal. Policy gradient methods for robotics. IROS, 2006.

[45] Aravind Rajeswaran, Kendall Lowrey, Emanuel V. Todorov, and Sham M. Kakade. Towards generalization and simplicity in continuous control. NIPS, 2017.

[46] Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. ICLR, 2016.

[47] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. AISTATS, 2011.

[48] Nicolas Le Roux. Tighter bounds lead to improved classifiers. ICLR, 2017.

[49] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. ICLR, 2016.

[50] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. ICML, 2015.

[51] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv:1707.06347, 2017.

[52] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[53] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 2017.

[54] Yibo Sun, Duyu Tang, Nan Duan, Jianshu Ji, Guihong Cao, Xiaocheng Feng, Bing Qin, Ting Liu, and Ming Zhou.
Semantic parsing with syntax- and table-aware SQL generation. arXiv:1804.08338, 2018.

[55] Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, OpenAI Xi Chen, Yan Duan, John Schulman, Filip DeTurck, and Pieter Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. NIPS, 2017.

[56] Chenglong Wang, Marc Brockschmidt, and Rishabh Singh. Pointing out SQL queries from text. ICLR, 2018.

[57] Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay. ICLR, 2017.

[58] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, pages 229–256, 1992.

[59] Cathy Wu, Aravind Rajeswaran, Yan Duan, Vikash Kumar, Alexandre M. Bayen, Sham Kakade, Igor Mordatch, and Pieter Abbeel. Variance reduction for policy gradient with action-dependent factorized baselines. ICLR, 2018.

[60] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144, 2016.

[61] Xiaojun Xu, Chang Liu, and Dawn Song. SQLNet: Generating structured queries from natural language without reinforcement learning.
ICLR, 2018.

[62] Wen-tau Yih, Matthew Richardson, Chris Meek, Ming-Wei Chang, and Jina Suh. The value of semantic parse labeling for knowledge base question answering. ACL, 2016.

[63] Tao Yu, Zifan Li, Zilin Zhang, Rui Zhang, and Dragomir Radev. TypeSQL: Knowledge-based type-aware neural text-to-SQL generation. arXiv:1804.09769, 2018.

[64] Wojciech Zaremba and Ilya Sutskever. Reinforcement learning neural Turing machines. arXiv:1505.00521, 2015.

[65] M. Zelle and R. J. Mooney. Learning to parse database queries using inductive logic programming. AAAI, pages 1050–1055, 1996.

[66] L. S. Zettlemoyer and M. Collins. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. UAI, pages 658–666, 2005.

[67] Yuchen Zhang, Panupong Pasupat, and Percy Liang. Macro grammars and holistic triggering for efficient semantic parsing. ACL, 2017.

[68] Victor Zhong, Caiming Xiong, and Richard Socher. Seq2SQL: Generating structured queries from natural language using reinforcement learning. arXiv:1709.00103, 2017.

[69] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. ICLR, 2016.

[70] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. arXiv:1611.01578, 2016.

[71] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition.
arXiv:1707.07012, 2017.