{"title": "Improving Neural Program Synthesis with Inferred Execution Traces", "book": "Advances in Neural Information Processing Systems", "page_first": 8917, "page_last": 8926, "abstract": "The task of program synthesis, or automatically generating programs that are consistent with a provided specification, remains a challenging task in artificial intelligence. As in other fields of AI, deep learning-based end-to-end approaches have made great advances in program synthesis. However, more so than other fields such as computer vision, program synthesis provides greater opportunities to explicitly exploit structured information such as execution traces, which contain a superset of the information input/output pairs. While they are highly useful for program synthesis, as execution traces are more difficult to obtain than input/output pairs, we use the insight that we can split the process into two parts: infer the trace from the input/output example, then infer the program from the trace. This simple modification leads to state-of-the-art results in program synthesis in the Karel domain, improving accuracy to 81.3% from the 77.12% of prior work.", "full_text": "Improving Neural Program Synthesis with Inferred\n\nExecution Traces\n\nRichard Shin\u2217\nUC Berkeley\n\nricshin@cs.berkeley.edu\n\nIllia Polosukhin\nNEAR Protocol\n\nillia@nearprotocol.com\n\nDawn Song\nUC Berkeley\n\ndawnsong@cs.berkeley.edu\n\nAbstract\n\nThe task of program synthesis, or automatically generating programs that are\nconsistent with a provided speci\ufb01cation, remains a challenging task in arti\ufb01cial\nintelligence. As in other \ufb01elds of AI, deep learning-based end-to-end approaches\nhave made great advances in program synthesis. However, compared to other\n\ufb01elds such as computer vision, program synthesis provides greater opportunities\nto explicitly exploit structured information such as execution traces. 
While execution traces can provide highly detailed guidance for a program synthesis method, they are more difficult to obtain than more basic forms of specification such as input/output pairs. Therefore, we use the insight that we can split the process into two parts: infer traces from input/output examples, then infer programs from traces. Our application of this idea leads to state-of-the-art results in program synthesis in the Karel domain, improving accuracy to 81.3% from the 77.12% of prior work.

1 Introduction

The task of program synthesis is to automatically generate a computer program that satisfies some specification. It is a problem that has been studied since the earliest days of artificial intelligence [Waldinger and Lee, 1969, Manna and Waldinger, 1975]. With the renewed popularity of neural networks for machine learning in recent years, neural approaches to program synthesis have correspondingly attracted greater attention from the research community. One set of approaches, referred to as neural program induction by Devlin et al. [2017b], involves learning parameters for a neural network architecture with a design inspired by existing computational structures such as stacks [Grefenstette et al., 2015, Joulin and Mikolov, 2015], random-access and associative memory [Kurach et al., 2016, Graves et al., 2016], and GPUs [Kaiser and Sutskever, 2015].

A different set of approaches, neural program synthesis, instead learns to generate explicit discrete programs in a domain-specific language from a specification that consists of as few as 5 input/output example pairs.
Several recent papers have proposed neural network-based approaches to program synthesis from input/output examples [Parisotto et al., 2017, Devlin et al., 2017b, Bunel et al., 2018]. These methods use an end-to-end encoder-decoder approach, where a neural network learns to generate a program from an encoding of a program specification (a set of input/output examples), trained on a large synthetic dataset.

End-to-end approaches like these have been particularly successful in perceptual domains like computer vision, where designing intermediate representations is a big challenge. In contrast, within program synthesis, there exists a great deal of structure and auxiliary information the model could learn to exploit in addition to the typically used input/output examples. An example is program execution traces, which have also been used to great effect by prior work [Reed and de Freitas, 2016, Wang et al., 2018].

∗Work partially performed at NEAR.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Given that an execution trace can be a strict superset of an input/output example, intuition suggests that program synthesis from execution traces should be easier than synthesis from I/O examples. Since we can replay an execution trace with the interpreter to obtain detailed information about the program state at each step, a trace-based synthesis model can rely upon the interpreter to handle the semantics of the DSL's operations, and focus more on how to infer control flow constructs by reconciling the different paths taken on different inputs. However, execution traces are difficult to obtain, as they are much more challenging for the end user to specify, so it is hard to reap their benefits.

In this work, we use the insight that if encoder-decoder neural networks can synthesize programs from input/output examples, they should also be able to infer execution traces.
Thus, we can split the problem into two steps: use input/output examples to infer execution traces, and then use execution traces to infer the program. Our empirical results show that this modification leads to state-of-the-art results on the Karel [Pattis, 1981] program synthesis task, improving upon Bunel et al. [2018] from 77.12% to 81.3% accuracy.

Our analysis shows greater accuracy on programs of varying lengths and complexities, demonstrating the general utility of the approach. This is despite the fact that we only use straightforward maximum likelihood training, which is easier to tune than the reinforcement learning methods of prior work. Nevertheless, as our method is largely orthogonal to prior techniques like reinforcement learning, our research suggests useful future directions for further improving the accuracy of neural program synthesis.

2 Related work

Program synthesis from examples. There have been several practical applications of programming by example based on search techniques and carefully crafted heuristics, such as Gulwani [2011]. More recent work has started to apply deep learning to program synthesis from examples, such as RobustFill [Devlin et al., 2017b], DeepCoder [Balog et al., 2016], Neuro-Symbolic Program Synthesis [Parisotto et al., 2017], and Deep API Programmer [Bhupatiraju et al., 2017]. Gaunt et al. [2016] provide a comparison of various approaches to program synthesis from examples on different benchmarks, showing the limitations of existing gradient descent models. Bunel et al. [2018], which also uses the domain of synthesizing Karel programs from examples, learns to predict the correct program with a deep learning model by leveraging the syntax constraints of the programming language and training via reinforcement learning to generate more consistent programs.

Program induction.
Another line of recent work aims to teach neural networks the functional behavior of programs, by augmenting the neural architecture with additional computational modules such as Neural Turing Machines [Graves et al., 2014a], Neural GPUs [Kaiser and Sutskever, 2015], and stack-augmented RNNs [Joulin and Mikolov, 2015]. However, these approaches require a large dataset of input-output pairs to learn a single program, and can have trouble generalizing to unseen inputs. An alternative approach involves training neural networks to reproduce execution traces, as in Neural Programmer-Interpreters [Reed and De Freitas, 2015]. Even though this approach generalizes better, it still requires training a separate model per task. Devlin et al. [2017a] overcome this weakness by using a meta-learning approach, where the model observes a new task at test time, induces a program, and produces answers on unseen inputs. Nevertheless, the learned program is induced latently within the weights and activations of a neural network, which limits our ability to interpret which program has been learned and may limit the complexity of programs that the model can represent and execute. Indeed, Bunel et al. [2018], which synthesizes explicit Karel programs, obtained more accurate results than Devlin et al. [2017a], which uses an induced latent representation of Karel programs.

Leveraging interpreters for inverse graphics. In recent years, there has been work on learning the semantics of interpreters for inverse graphics with neural networks: learning how to control a drawing engine to reproduce a given picture, or in other words, recovering its underlying structure. In Ellis et al. [2017], the authors first infer the execution trace of a drawing program and then leverage a generic program search algorithm on the trace. Ganin et al.
[2018] instead simultaneously use techniques from reinforcement learning and adversarial training to teach an agent how to generate a program which renders the desired image. These methods provide evidence that having explicit prediction of traces or steps aids in learning the semantics of an interpreter, which is an important component of program synthesis.

3 Background

Prog p := def main() : s
Stmt s := while(b) : s | repeat(r) : s | s1; s2 | a | if(b) : s | if(b) : s1 else : s2
Cond b := markersPresent() | leftIsClear() | rightIsClear() | frontIsClear() | not(b)
Action a := move() | turnLeft() | turnRight() | pickMarker() | putMarker()
Cste r := 0 | 1 | ··· | 19

Figure 1: The syntax of the Karel DSL as used in this paper. Figure from Devlin et al. [2017a].

Problem domain. Karel is an educational programming language (Pattis [1981]), used for example in Stanford CS introductory classes and the Hour of Code initiative. It features an agent inside a grid world, where certain cells can contain markers or walls (but not both); the agent cannot enter cells where there is a wall. The agent can take the following actions: moving forward (move), turning left or right (turnLeft, turnRight), and modifying the world state by removing or adding markers at the current location (pickMarker, putMarker).

Karel programs, which are imperative, can contain branching statements (if, ifElse), while loops which execute as long as a condition is true, and repeat loops which execute for a fixed number of repetitions. The following conditions are available: whether the cell at the agent's location contains markers (markersPresent), and whether there are any walls nearby (frontIsClear, leftIsClear, rightIsClear).

Devlin et al. [2017a] introduced the use of the Karel domain for program induction, where a neural network learns to represent a program; in this work, we tackle Karel program synthesis.
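To make the world semantics above concrete, they can be sketched as a toy Python model. This is purely illustrative and not the dataset's actual interpreter; in particular, real Karel treats moving into a wall as a crash, which we simplify here to a no-op, and all names are ours.

```python
class KarelWorld:
    """Toy model of the Karel domain: an agent on a grid with walls and markers."""

    DIRS = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # N, E, S, W as (row, col) deltas

    def __init__(self, rows, cols, walls=(), markers=None, pos=(0, 0), heading=1):
        self.rows, self.cols = rows, cols
        self.walls = set(walls)              # cells the agent may not enter
        self.markers = dict(markers or {})   # cell -> marker count
        self.pos, self.heading = pos, heading

    def _clear(self, h):
        """True if the cell in direction h is inside the grid and not a wall."""
        dr, dc = self.DIRS[h]
        r, c = self.pos[0] + dr, self.pos[1] + dc
        return 0 <= r < self.rows and 0 <= c < self.cols and (r, c) not in self.walls

    # Conditions available to programs
    def frontIsClear(self):
        return self._clear(self.heading)

    def leftIsClear(self):
        return self._clear((self.heading - 1) % 4)

    def rightIsClear(self):
        return self._clear((self.heading + 1) % 4)

    def markersPresent(self):
        return self.markers.get(self.pos, 0) > 0

    # Actions
    def move(self):
        if self.frontIsClear():  # real Karel would raise an error here instead
            dr, dc = self.DIRS[self.heading]
            self.pos = (self.pos[0] + dr, self.pos[1] + dc)

    def turnLeft(self):
        self.heading = (self.heading - 1) % 4

    def turnRight(self):
        self.heading = (self.heading + 1) % 4

    def putMarker(self):
        self.markers[self.pos] = self.markers.get(self.pos, 0) + 1

    def pickMarker(self):
        if self.markersPresent():
            self.markers[self.pos] -= 1
```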
The goal in this domain is to learn how to generate a program in the Karel DSL given a small set of input and output grids. Formally, we are given a set of n input-output world pairs {(I_1, O_1), ..., (I_n, O_n)}, with some hidden program π which satisfies the property that executing π on I_1 results in O_1, on I_2 results in O_2, and so on. Our task is to recover a program π̂ by observing the n input/output pairs, such that π̂ is semantically equivalent to π; in other words, π̂ should have the same effect as π on any input world, but they do not need to be textually equivalent. However, note that the problem is under-specified: with n input-output pairs, it is not possible to disambiguate among all possible π, so the model needs to pick the most promising among the possibilities.

Our work is based on Bunel et al. [2018], which applied a neural encoder-decoder approach to Karel program synthesis, similar to the work of Devlin et al. [2017b] and Parisotto et al. [2017] for a string-editing domain. Bunel et al. [2018] used both supervised learning on a randomly-generated synthetic dataset to train their model, as well as a reinforcement learning-based approach to further improve the model's program synthesis accuracy.
As part of their work, they developed a deep learning architecture for Karel program synthesis which we use as the basis for our approach.

4 Approach

4.1 Motivation

Past work in program synthesis, program induction, program repair, and other areas of machine learning has explored the benefits of learning using execution traces.

Figure 2: Architecture of the I/O → TRACE model.

For example, in program induction, the Neural Programmer-Interpreter [Reed and de Freitas, 2016] receives execution traces as supervision, and can learn complex tasks more quickly and accurately; given recursive traces, it can exhibit perfect generalization [Cai et al., 2017]. Other program induction models such as the Neural Turing Machine [Graves et al., 2014b] and Neural GPU [Kaiser and Sutskever, 2016] have the harder task of learning directly from input/output examples, and thus need a very large amount of training data and careful hyper-parameter tuning. In program repair, Wang et al. [2018] improve upon baseline methods by learning models that use execution traces. Ellis et al. [2017] and Ganin et al. [2018] generate execution traces for the purposes of inverse graphics. Even for generative modeling of images, learning to generate the picture incrementally and additively (similar to execution traces in our setting) has been shown to improve performance [Gregor et al., 2015].

We hypothesize that any program synthesis model must be able to internally reason about the semantics of the DSL and in particular its atomic operations. For example, in the Flash Fill system [Gulwani, 2011], this knowledge is explicitly specified by the system's creators.
In contrast, neural program synthesis methods need to learn both the semantics of the DSL and how to synthesize programs in the DSL entirely from the training data.

Even beyond the past work using traces and our hypothesis above, our intuition as programmers suggests that it should be easier to synthesize a Karel program given not only the input/output examples, but also the list of steps taken by the correct program to transform the input into the output. However, it is impractical to expect an end user specifying the desired program to also provide the correct execution trace for the program, since devising the trace is almost as much work as writing the actual program itself.

However, if neural networks can learn to synthesize programs from input/output examples, as shown by Bunel et al. [2018] and others, it follows that they should also be able to synthesize the execution trace from input/output examples. Indeed, we can expect this to be an easier problem, given that the execution trace does not contain any control flow constructs. Furthermore, as the model iteratively generates the execution trace, we can evaluate the partial trace with the Karel interpreter and provide its internal state to the model to help guide the model's next output.

Once we have a model that can recover the correct execution trace from the input/output examples for a desired program, it becomes possible to train and use a program synthesis model that takes both input/output examples and corresponding execution traces as the program specification.
By including information extracted from the Karel interpreter, as we run the execution trace, about the current state of the world at each point in the trace, the program synthesis model has less need to internally reason about the program's semantics, as some of that work is effectively offloaded to the Karel interpreter.

In the subsequent sections, we detail how we built two models to split the Karel program synthesis problem ("I/O → CODE") into two parts: I/O → TRACE, then TRACE → CODE.

Figure 3: Architecture of the TRACE → CODE model. We follow the architecture in Bunel et al. [2018], but add an execution trace input.

4.2 Predicting execution traces from input/output pairs

In this work, an execution trace refers to an ordered sequence of actions (action_1, ..., action_T). In the case of Karel, the actions are move, turn{Left,Right}, and {put,pick}Marker. For a given original training example (π, {(I_1, O_1), ..., (I_N, O_N)}), we can generate N training examples (I_1, O_1, (action_1, ..., action_T)_1), ..., (I_N, O_N, (action_1, ..., action_T)_N) for trace prediction by running π on these I/O pairs and recording the actions taken by the program. Thus, if the original training data contained K examples where each example contained N I/O pairs, then we obtain an I/O-pair-to-trace dataset of K · N examples suitable for supervised learning.

Figure 2 shows the deep learning model architecture we used for this task. To encode the input/output examples, we use a convolutional neural network with a final fully-connected layer, taken from Bunel et al. [2018]. To generate the sequence of actions, we use a two-layer LSTM decoder.
At each step of the decoder, it receives as input the concatenation of the following: 1) an embedding of the action taken in the previous step, 2) the input/output pair embedding, and 3) an embedding of the current state of the grid after executing all of the past actions. In theory, the LSTM can learn to keep track of the current state of the grid internally, rendering the third input unnecessary; however, as we will see in Section 5.4, explicitly using the Karel interpreter to track the current grid state helps the model better understand how the grid is changing after each action and maintain more context.

4.3 Synthesizing programs from input/output examples and execution traces

To create a model which uses both a set of input/output examples and execution traces for generating the desired program, we started with the architecture from Bunel et al. [2018] and extended it to also take the execution trace as an input. Specifically, we add a bidirectional LSTM to the architecture which is responsible for producing an embedding of the execution trace for each step in the trace.

Given an execution trace (action_{1,n}, ..., action_{T,n}) and an initial state state_{1,n} := I_n for the nth input/output example, we can use the Karel interpreter to replay the actions to obtain state_{2,n}, ..., state_{T+1,n}. For Karel, each state contains the full grid world and the objects within it: the location of the agent, its orientation, the current number of markers in each cell, and the locations of walls (which cannot be manipulated by the agent and therefore do not change through the course of execution).
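Replaying a trace to recover the intermediate states can be sketched as follows. This is a simplified illustration (not the authors' code) in which a state is just the agent's position and heading, omitting markers and walls:

```python
DIRS = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # N, E, S, W as (row, col) deltas

def apply_action(state, action):
    """Apply one trace action to a simplified state ((row, col), heading)."""
    (r, c), h = state
    if action == "move":
        dr, dc = DIRS[h]
        return ((r + dr, c + dc), h)
    if action == "turnLeft":
        return ((r, c), (h - 1) % 4)
    if action == "turnRight":
        return ((r, c), (h + 1) % 4)
    # pickMarker / putMarker do not change position or heading in this sketch
    return state

def replay(initial_state, actions):
    """Return [state_1, ..., state_{T+1}]: the state before and after each action."""
    states = [initial_state]
    for a in actions:
        states.append(apply_action(states[-1], a))
    return states
```

The returned list has length T + 1 for a trace of T actions, matching the state sequence used in the text.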
Given the mismatch in lengths, and also given that the actions semantically occur in between the states, we interleave the two to create an input of length T + (T + 1) = 2T + 1: (state_{1,n}, action_{1,n}, state_{2,n}, ..., action_{T,n}, state_{T+1,n}).

To provide this input to the bidirectional LSTM, we embed state_{t,n} and action_{t,n} in the following way. For each state, we evaluate the four conditionals (markersPresent, frontIsClear, leftIsClear, rightIsClear) that can influence the program's flow of execution, embed each boolean value, and concatenate them. For each action, we look up the corresponding embedding from a table. Both sets of embeddings are randomly initialized and learned. We will denote the 2T + 1 outputs for the nth trace from the bidirectional trace LSTM as T_{1,n}, ..., T_{2T+1,n}.

We also tried a variant where the input consists of T + 1 elements. First, we append a final end-of-trace token to the list of actions as action_{T+1,n}, to make the lengths match. We then embed each state_{t,n} and action_{t,n}, and concatenate the embeddings together. In addition to the conditional values, we also tried providing an embedding of the grid itself to the LSTM, using a similar convolutional neural net as used to encode each grid in the I/O → TRACE model. However, we found that both variants were inferior compared to the method described in the previous paragraph.

Following Bunel et al. [2018], we also encode each input/output example using a convolutional neural network with a final fully-connected layer. We generate the program one token at a time with a decoder LSTM.
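The interleaved length-(2T + 1) trace input described above can be constructed as a simple helper (illustrative, not from the paper's code):

```python
def interleave(states, actions):
    """Build (state_1, action_1, state_2, ..., action_T, state_{T+1}).

    Expects len(states) == len(actions) + 1; the result has length 2T + 1.
    """
    assert len(states) == len(actions) + 1
    seq = [states[0]]
    for action, next_state in zip(actions, states[1:]):
        seq += [action, next_state]
    return seq
```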
As in Bunel et al. [2018] and Devlin et al. [2017b], we run a separate LSTM for each input/output pair; the LSTMs have separate states but shared weights. At decoding step i, the LSTM for the nth input/output example receives the concatenation of the following:

• an embedding of the token generated in step i − 1, or of a start-of-sequence token at step 0 (the start of decoding),
• the I/O pair embedding for (I_n, O_n),
• the context vector c_{i−1,n} for step i − 1: a weighted sum of T_{1,n}, ..., T_{2T+1,n}, computed using a multiplicative attention mechanism based on o_{i−1,n}, the LSTM's output at step i − 1. Note that c_{0,n} = 0.

We obtain the decoder LSTM output o_{i,n} for each of the N input/output pairs, compute the context vector c_{i,n} as described above, and concatenate them to obtain õ_{i,n} = Concat(o_{i,n}, c_{i,n}). To compute the logits over the ith program token, we compute W · MaxPool(õ_{i,1}, ..., õ_{i,N}), where W ∈ R^{v×d} and v is the size of the output vocabulary. See Figure 3 for a visual depiction of the overall architecture.

To train this model, we tried two different sources of supervision. Recall that each entry in the provided training data consists of a program π and 5 input/output pairs {(I_1, O_1), ..., (I_5, O_5)}. First, we can execute π on I_1, ..., I_5 to obtain the execution trace on each input; we refer to this trace as the gold trace. However, we will not have access to the gold trace when we wish to actually use this model for program synthesis from input/output examples.

Second, we can use the I/O → TRACE model from Section 4.2 to infer a valid trace for the given I/O pair; we refer to this trace as the inferred trace. Unfortunately, this model does not always succeed at recovering a correct trace for the I/O pair, in which case we substitute a trace containing a single UNK action and two grid states, the input grid and the output grid.
Furthermore, the trace may deviate from the actions taken by the true program even if the final state is identical, as certain actions can be permuted without any effect on the output (such as turnLeft and pickMarker), and certain sequences of actions (such as turnLeft then turnRight) are no-ops. Nevertheless, we will only have access to the inferred trace at inference time, so it is useful to match the training and test distributions more closely.

5 Experiments

5.1 Training dataset and procedure

To train and test our models, we used the same dataset as Bunel et al. [2018], from https://bit.ly/karel-dataset. The training dataset consists of 1,116,854 entries, and the test dataset contains 2,500 entries. Each entry in the dataset contains a Karel program and 6 input/output pairs which satisfy that program. For training the I/O → TRACE model, we used all 6 input/output pairs within each entry for a total of 6,701,124 training traces. For training the TRACE → CODE model (and our reimplementation of the I/O → CODE model from Bunel et al. [2018]), we randomly sample 5 out of the 6 input/output examples (and corresponding traces) each time we sample an entry from the training data. In general, we endeavored to follow the training regime from the prior work as closely as possible, although we discovered that SGD with gradient clipping worked better for training the models than Adam. For all of the evaluations of TRACE → CODE, we used beam search with size 50.

Table 1: Comparison of our best model with previous work from Bunel et al. [2018].
"Gen." stands for generalization accuracy.

                                       Top-1                  Top-50
                                       Exact Match   Gen.     Gen.     Guided Search
MLE [Bunel et al., 2018]               39.94%        71.91%   86.37%   −
RL_beam_div_opt [Bunel et al., 2018]   32.17%        77.12%   85.38%   −
I/O → CODE, MLE                        40.1%         73.5%    85.8%    84.6%
I/O → TRACE → CODE, MLE                42.8%         81.3%    90.8%    88.8%

5.2 Performance metrics

When evaluating the model, we use beam search both to get outputs that have higher log likelihood than what can be obtained with greedy decoding, and also to obtain multiple candidate sequences. As such, we use multiple criteria to report the performance of the models.

First, for purposes of comparison, we use the same metrics as the prior work: Top-K Exact Match, which measures how often one of the top K output programs of the model textually matches the original program exactly; and Top-K Generalization, which denotes the fraction of test instances for which one of the top K output programs has the correct behavior across the 5 input/output examples used to specify the program to the model, as well as the held-out 6th input/output example.

As an alternative to these metrics, we suggest using what we call Top-K Model-Guided Search Accuracy. In this metric, we consider the top K program outputs in order, from most likely to least likely. We test each candidate program on the 5 input/output examples that specify the program, and check whether it works correctly on all 5. We return the first such program (the top-ranked one) as the solution, and then test it on the held-out 6th input/output example to report the accuracy. The motivation for this approach is two-fold. First, we already have the 5 I/O examples that specify the program for the model to produce, and so we might as well use them to filter any unsatisfactory outputs of the model, to reap the benefits of having a precisely checkable specification for the correct answer.
Second, this metric is more comparable to other methods in the literature that use a search-based method for program synthesis, either with handwritten heuristics or with machine learning models (such as [Balog et al., 2016]); indeed, such methods will often try thousands or millions of candidate programs, rather than the comparatively small K = 50 which we used for our experiments.

5.3 Evaluation of I/O → TRACE → CODE

In Table 1, we compare our best I/O → TRACE → CODE model (created by gluing together I/O → TRACE and TRACE → CODE) against the previous work of Bunel et al. [2018]. We reimplemented their MLE model (labeled as I/O → CODE), obtaining slightly better results compared to theirs. We note that we did not implement the RL_beam_div_opt training method of Bunel et al. [2018], and so our results are all based on MLE training. Nevertheless, our I/O → TRACE → CODE method outperforms all others on all metrics, including the best result in Bunel et al. [2018]. We anticipate that using reinforcement learning methods (such as RL_beam_div_opt) could improve our method's accuracy even further.

We also analyzed how the models performed on various slices of the test data in Table 2: programs with no control flow (only actions); programs with conditionals (if or ifElse) but not loops (repeat or while); programs with loops but no conditionals; and programs containing at least one control flow element. We also partitioned the data into three buckets depending on the length of the gold program. We can observe that I/O → TRACE → CODE improves upon I/O → CODE within every slice of the data.
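The Top-K Model-Guided Search procedure from Section 5.2 can be sketched as follows. Here `run_program` is an assumed helper standing in for the Karel interpreter, and `candidates` are beam-search outputs ordered from most to least likely:

```python
def guided_search(candidates, spec_pairs, heldout_pair, run_program):
    """Return (program, generalizes): the top-ranked candidate consistent with
    all specification I/O pairs, and whether it also handles the held-out pair.

    run_program(prog, inp) -> output is a hypothetical interpreter hook.
    """
    for prog in candidates:  # most likely first
        if all(run_program(prog, inp) == out for inp, out in spec_pairs):
            held_in, held_out = heldout_pair
            return prog, run_program(prog, held_in) == held_out
    return None, False  # no candidate satisfies the specification
```

Reporting the fraction of test instances where `generalizes` is true yields the guided-search accuracy column of Table 1.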
The magnitude of the improvement is most significant on long programs, which provides supporting evidence for our hypothesis that the I/O → CODE model needs to internally keep track of the Karel state but has trouble doing so.

Table 2: Comparing performance on different slices of data

Slice                   % of dataset   I/O → CODE   I/O → TRACE → CODE   ∆%
No control flow         26.4%          100.0%       100.0%               +0.0%
Only Conditions         15.6%          87.4%        91.0%                +3.6%
Only Loops              29.9%          91.3%        94.3%                +3.0%
With all control flow   73.6%          79.0%        84.8%                +5.8%
Program length 0-15     44.8%          99.5%        99.5%                +0.0%
Program length 15-30    40.7%          80.8%        86.9%                +6.1%
Program length 30+      14.5%          48.6%        61.0%                +12.4%

Table 3: Evaluation of I/O → TRACE models.

           Top-1                    Top-10
           Exact Match   Correct    Exact Match   Correct
No grids   58.7%         92.5%      59.3%         95.9%
LGRL       57.6%         94.8%      58.0%         97.4%
PRESNET    57.8%         95.2%      58.2%         98.0%

5.4 Evaluating I/O → TRACE and TRACE → CODE separately

For the first part of our approach (I/O → TRACE), in Table 3 we show results of predicting the 5 execution traces from the 5 input/output examples used to specify a Karel program synthesis task. In this table, we consider a result to be correct if all 5 predicted traces transform the corresponding input states to the output states when executed in the Karel interpreter. As discussed in Section 4.2, there exist many possible execution traces which transform a given input state to the output state; therefore, the exact match accuracy is much lower than the correctness metric.

The LGRL model uses the same convolutional encoder architecture as Bunel et al. [2018]. We also augmented it with residual connections between layers; results for this variant are reported as the PRESNET model.
Furthermore, to confirm that the interpreter's current state helps the I/O → TRACE model produce correct traces, we trained a variant ("No grids") which omits the grid state inputs.

For the second part (TRACE → CODE), Table 4 compares a model trained on gold traces against one trained on traces inferred by the best I/O → TRACE model. Due to the distributional differences between the gold and inferred traces, the model trained on gold traces does poorly on inferred traces. We also tried an evaluation using the gold execution traces from the test set. As discussed earlier in Section 4.3, the gold execution traces would not normally be available at test time, so this evaluation serves as a hypothetical comparison against our main result.

6 Discussion and Future Work

From our results, we consider our hypothesis confirmed: it is beneficial to use traces to explicitly train a model that learns the semantics of the interpreter, separately from the task of synthesizing the code for the correct program.

Interestingly, the predicted trace and gold trace fail to match exactly in half of the cases even though the predicted trace is correct. Indeed, the TRACE → CODE model trained on gold traces does not perform as well on inferred traces. That said, when we evaluated TRACE → CODE on gold traces at validation time, it outperformed the model trained on inferred traces. Since I/O → TRACE independently predicts traces for each I/O pair, we hypothesize that the predicted traces lack consistency with each other compared to the gold traces. Thus, for future work we suggest investigating inferring the execution trace for a given I/O pair conditioned on already-generated execution traces for the same underlying program but on different I/O pairs.
Table 4: Evaluation of TRACE → CODE models.

Train traces   Test traces   Exact Match   Correct   Guided Search
Gold           Inferred      39.2%         76.5%     81.8%
Inferred       Inferred      42.8%         81.3%     88.8%
Gold           Gold          54.0%         86.4%     92.4%

We also noticed that the TRACE → CODE model trained on predicted traces performed much worse when evaluated on gold traces than on predicted traces. This phenomenon suggests that training a TRACE → CODE model on a mixture of traces sampled from I/O → TRACE and gold traces may improve the model's resilience and further improve accuracy.
We also leave for future work exploring the use of reinforcement learning objectives (similar to Bunel et al. [2018]) for training these models. We see two possible ways of applying such objectives: training the I/O → TRACE and TRACE → CODE models separately with the reward of passing unseen tests; and training the I/O → TRACE and TRACE → CODE models jointly end-to-end, where traces are decoded into symbolic form and reinforcement learning allows learning signals to propagate between the two parts.

Acknowledgements

This material is in part based upon work supported by Berkeley Deep Drive, the National Science Foundation under Grant No. TWC-1409915, and the Defense Advanced Research Projects Agency under Grant No. FA8750-17-2-0091. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the above organizations.

References

Matej Balog, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. DeepCoder: Learning to write programs. CoRR, abs/1611.01989, 2016. URL http://arxiv.org/abs/1611.01989.

Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, and Pushmeet Kohli. Deep API programmer: Learning to program with APIs.
CoRR, abs/1704.04327, 2017.

Rudy Bunel, Matthew Hausknecht, Jacob Devlin, Rishabh Singh, and Pushmeet Kohli. Leveraging grammar and reinforcement learning for neural program synthesis. International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H1Xw62kRZ.

Jonathon Cai, Richard Shin, and Dawn Song. Making neural programming architectures generalize via recursion. In International Conference on Learning Representations, 2017. URL http://arxiv.org/abs/1511.06279.

Jacob Devlin, Rudy R Bunel, Rishabh Singh, Matthew Hausknecht, and Pushmeet Kohli. Neural program meta-induction. In Advances in Neural Information Processing Systems, pages 2077–2085, 2017a.

Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, and Pushmeet Kohli. RobustFill: Neural program learning under noisy I/O. In International Conference on Machine Learning, pages 990–998, 2017b.

Kevin Ellis, Daniel Ritchie, Armando Solar-Lezama, and Joshua B. Tenenbaum. Learning to infer graphics programs from hand-drawn images. CoRR, abs/1707.09627, 2017. URL http://arxiv.org/abs/1707.09627.

Yaroslav Ganin, Tejas Kulkarni, Igor Babuschkin, S. M. Ali Eslami, and Oriol Vinyals. Synthesizing programs for images using reinforced adversarial learning. CoRR, abs/1804.01118, 2018. URL http://arxiv.org/abs/1804.01118.

Alexander L Gaunt, Marc Brockschmidt, Rishabh Singh, Nate Kushman, Pushmeet Kohli, Jonathan Taylor, and Daniel Tarlow. TerpreT: A probabilistic programming language for program induction. arXiv preprint arXiv:1608.04428, 2016.

Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. CoRR, abs/1410.5401, 2014a.

Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines.
CoRR, abs/1410.5401, 2014b. URL http://arxiv.org/abs/1410.5401.

Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476, 2016.

Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. Learning to transduce with unbounded memory. In Advances in Neural Information Processing Systems, pages 1828–1836, 2015.

Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. DRAW: A recurrent neural network for image generation. CoRR, abs/1502.04623, 2015. URL http://arxiv.org/abs/1502.04623.

Sumit Gulwani. Automating string processing in spreadsheets using input-output examples. In Proceedings of the 38th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '11, pages 317–330, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0490-0. doi: 10.1145/1926385.1926423. URL http://doi.acm.org/10.1145/1926385.1926423.

Armand Joulin and Tomas Mikolov. Inferring algorithmic patterns with stack-augmented recurrent nets. In NIPS, 2015.

Łukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. arXiv preprint arXiv:1511.08228, 2015.

Łukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. In International Conference on Learning Representations, 2016. URL http://arxiv.org/abs/1511.08228.

Karol Kurach, Marcin Andrychowicz, and Ilya Sutskever. Neural random-access machines. ICLR, 2016. URL http://arxiv.org/abs/1511.06392.

Zohar Manna and Richard Waldinger. Knowledge and reasoning in program synthesis. In Proceedings of the 4th International Joint Conference on Artificial Intelligence - Volume 1, IJCAI'75, pages 288–295, San Francisco, CA, USA, 1975. Morgan Kaufmann Publishers Inc.
URL http://dl.acm.org/citation.cfm?id=1624626.1624670.

Emilio Parisotto, Abdel-rahman Mohamed, Rishabh Singh, Lihong Li, Dengyong Zhou, and Pushmeet Kohli. Neuro-symbolic program synthesis. In International Conference on Learning Representations, 2017.

Richard E Pattis. Karel the robot: a gentle introduction to the art of programming. John Wiley & Sons, Inc., 1981.

Scott Reed and Nando de Freitas. Neural programmer-interpreters. arXiv preprint arXiv:1511.06279, 2015.

Scott Reed and Nando de Freitas. Neural programmer-interpreters. In International Conference on Learning Representations, 2016. URL http://arxiv.org/abs/1511.06279.

Richard J. Waldinger and Richard C. T. Lee. PROW: A step toward automatic program writing. In Proceedings of the 1st International Joint Conference on Artificial Intelligence, IJCAI'69, pages 241–252, San Francisco, CA, USA, 1969. Morgan Kaufmann Publishers Inc. URL http://dl.acm.org/citation.cfm?id=1624562.1624586.

Ke Wang, Zhendong Su, and Rishabh Singh. Dynamic neural program embeddings for program repair. International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=BJuWrGW0Z.