{"title": "Automatic Program Synthesis of Long Programs with a Learned Garbage Collector", "book": "Advances in Neural Information Processing Systems", "page_first": 2094, "page_last": 2103, "abstract": "We consider the problem of generating automatic code given sample input-output pairs. We train a neural network to map from the current state and the outputs to the program's next statement. The neural network optimizes multiple tasks concurrently: the next operation out of a set of high level commands, the operands of the next statement, and which variables can be dropped from memory. Using our method we are able to create programs that are more than twice as long as existing state-of-the-art solutions, while improving the success rate for comparable lengths, and cutting the run-time by two orders of magnitude. Our code, including an implementation of various literature baselines, is publicly available at https://github.com/amitz25/PCCoder", "full_text": "Automatic Program Synthesis of Long Programs\n\nwith a Learned Garbage Collector\n\nAmit Zohar1\n\nLior Wolf 1 2\n\n1The School of Computer Science , Tel Aviv University\n\n2Facebook AI Research\n\nAbstract\n\nWe consider the problem of generating automatic code given sample input-output\npairs. We train a neural network to map from the current state and the outputs\nto the program\u2019s next statement. The neural network optimizes multiple tasks\nconcurrently: the next operation out of a set of high level commands, the operands\nof the next statement, and which variables can be dropped from memory. Using\nour method we are able to create programs that are more than twice as long as\nexisting state-of-the-art solutions, while improving the success rate for comparable\nlengths, and cutting the run-time by two orders of magnitude. 
Our code, including\nan implementation of various literature baselines, is publicly available at\nhttps://github.com/amitz25/PCCoder\n\n1 Introduction\n\nAutomatic program synthesis has been repeatedly identified as a key goal of AI research. However,\ndespite decades of interest, this goal is still largely unrealized. In this work, we study program\nsynthesis in a Domain Specific Language (DSL), following [1]. This allows us to focus on high-level\nprograms, in which complex methods are implemented in a few lines of code. Our input is a handful\nof input/output examples, and the desired output is a program that agrees with all of these examples.\nIn [1], the authors train a neural network to predict the existence of functions in the program and then\nemploy a highly optimized search in order to find a correct solution based on this prediction. While\nthis approach works for short programs, the number of possible solutions grows exponentially with\nprogram length, making it infeasible to identify the solution based on global properties.\nOur work employs a step-wise approach to the program synthesis problem. Given the current state,\nour neural network directly predicts the next statement, including both the function (operator) and\nparameters (operands). We then perform a beam search based on the network\u2019s predictions, reapplying\nthe neural network at each step. The more accurate the neural network, the fewer programs one needs\nto include in the search before identifying the correct solution.\nSince the number of variables increases with the program\u2019s length, some of the variables in memory\nneed to be discarded. We therefore train a second network to predict the variables that are to be\ndiscarded. 
Training the new network does not only enable us to solve more complex problems, but\nalso serves as an auxiliary task that improves the generalization of the statement prediction network.\nOur approach is relatively slow in the number of evaluations per time unit. However, it turns out to be\nmuch more ef\ufb01cient with respect to selecting promising search directions, and we present empirical\nresults that demonstrate that: (i) For the length and level of accuracy reported in [1], our method is\ntwo orders of magnitude faster. (ii) For the level of accuracy and time budget used in that paper, we\nare able to solve programs more than two times as complex, as measured by program length (iii) For\nthe time budget and length tested in the previous work, we are able to achieve a near perfect accuracy.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\f(a)\n\n(b)\n\nFigure 1: (a) An example of a program of 4 statements in the DSL that receives a single int array as\ninput, and 2 input-output samples. (b) The environment of the program after executing line 3.\n\n2 General Formulation\n\nOur programs are represented by the Domain Speci\ufb01c Language (DSL) de\ufb01ned by [1]. Each program\nis a sequence of function calls with their parameters, and each function call creates a new variable\ncontaining the return value. The parameters are either inputs, previously computed variables, or a\nlambda function. There are 19 lambda functions de\ufb01ned in this DSL, such as increasing every element\nof the list by one (+1), thresholding (>0) or parity checks (%2==0), and are used, for example, as\ninputs to the MAP operator that applies these functions to each element in a list.\nEach variable in the DSL is either an integer in the range of [-256,255] or an array of integers, which\nis bounded by the length of 20. The output of the program is de\ufb01ned as the return value of its last\nstatement. See Fig. 
1(a) for a sample program and appendix F of [1] for the DSL speci\ufb01cation.\nFormally, consider the set of all programs with length of at most t and number of inputs of at most n.\nLet F be the set of all operators (functions) in our DSL and let L be the set of all lambdas. The set of\nvalues for each operand is M = {i \u2208 N | 1 \u2264 i \u2264 t + n} \u222a L. For an operator o with q operands,\nthe statement would include the operator and a list of operands p = [p1, ..., pq] \u2208 Mq. Many of the\nelements of M q are invalid if a variable pj was not initialized yet, or if the value pj is not of the\nsuitable type for operand j of the operator o. We denote the set of possible statements in a program\nby S \u2286 F \u00d7 M r, where r is the maximal number of operands for an operator in the DSL. S contains\nboth valid and invalid statements, and is \ufb01nite and \ufb01xed. The network we train learns to identify the\nvalid subset, and the constraints are not dictated apriori, e.g., by means of type checking.\nIn imperative programming, the program advances by changing its state one statement at a time. We\nwish to represent these states during the execution. Naturally, these states depend on the input. For\nconvenience, we also include the output in the state, since it is also an input of our network. To that\nend we de\ufb01ne the state of a program as the sequence of all the variable values acquired thus far,\nstarting with the program\u2019s input, and concatenated with the desired output. Before performing any\nline of code, the state contains the inputs and the output. After each line of code, a new variable is\nadded to the state.\nA single input/output pair is insuf\ufb01cient for de\ufb01ning a program, since many (inequivalent) potential\nprograms may be compatible with the pair. Therefore, following [1] the synthesis problem is de\ufb01ned\nby a set of input/output pairs. 
Since each input-output example results in a different program state,\nwe define the environment e of a program at a certain step of its execution, as the concatenation\nof all its states (one for each input/output example), see Fig. 1(b). Our objective is thus to learn a\nfunction f : E \u2192 S that maps from the elements of the space of program environments E to the next\nstatement to perform.\nBy limiting the program\u2019s length to t, we are able to directly represent the mapping f by a\nfeed-forward neural network. However, this comes at a cost of limiting the program\u2019s length by a fixed\nmaximal length. To remedy this, we learn a second function g : E \u2192 {0, 1}^{t+n} that predicts for each\nvariable in the program whether it can be dropped. During the program generation, we can use this\ninformation to forgo variables that are no longer needed, thereby making space for new variables.\nThis enables us to generate programs longer than t.\n\nFigure 2: A depiction of the embedding module of our network. In this example there are n = 2\nexamples, v = 2 variables, arrays are of maximum length l = 2 and the embedding dimension is\nd = 1. For simplicity, the variable type one-hot encoding is omitted. (i) A program environment is\ngiven as input to the network. (ii) Each variable is represented by a compact fixed-length vector. N\nstands for NULL. (iii) The integers of each variable are embedded using a learned look-up table. (iv)\nEach vector is passed through a 56-unit linear layer. (v) The resulting vectors are concatenated to a\nsingle fixed-length vector.\n\nNote that our problem formulation considers the identification of any program that is compatible\nwith all samples to be a success. This is in contrast to some recent work [2, 3, 4], focused on\nstring processing, where success is measured on unseen samples. 
Since any string can be converted\nto any other string by simple insertions, deletions, and replacements, string processing requires\nadditional constraints. However, in the context of our DSL, \ufb01nding any solution is challenging by\nitself. Another difference from string processing is that strings can be broken into segments that are\ndealt separately [5].\nWithin the context of string processing, [6] is perhaps the closest work to ours. Their approach is to\ntreat the problem as a sequence-to-sequence problem, from samples to code. As part of our ablation\nanalysis we compare our method with this method. Our setting also differs from frameworks that\ncomplete partial programs, where parts of the program are provided or otherwise constrained [7, 8].\n\n3 Method\n\nWe generate our train and test data similarly to [1]. First, we generate random programs from the\nDSL. Then, we prune away programs that contain redundant variables, and programs for which an\nequivalent program exists in the dataset (could be shorter). Equivalence is approximated by identical\nbehavior on a set of input-output examples. We generate valid inputs for the programs by bounding\ntheir output value to our DSL\u2019s predetermined range and then propagating these constraints backward\nthrough the program.\n\n3.1 Representing the Environment\n\nThe environment is the input to both networks f and g and is represented as a set of k state-vectors,\nwhere k is the number of samples. Each state vector is represented by a \ufb01xed-length array of v\nvariables, which includes both generated variables and the program\u2019s initial input, which is assumed\nto contain at least one variable and up to n variables. If there are less than v variables, a special\nNULL value appears in the missing values of the array.\nEach variable is \ufb01rst represented as a \ufb01xed-length vector, similarly to what is done in [1]. 
The vector\ncontains two bits for a one-hot encoding of the type of the variable (number or list), an embedding\nvector of size d = 20 to capture the numeric value of the number or the \ufb01rst element of the list, and\nl \u2212 1 additional cells of length d for the values of the other list elements. In our implementation,\nfollowing the DSL de\ufb01nition of [1], l = 20 is the maximal array size. Note that the embedding of a\nnumber variable and a list of length one differs only in the one-hot type encoding.\nThe embedding of length d is learned as part of the training process using a Look Up Table that maps\nintegers from [\u2212256, 256] to Rd, where 256 indicates a NULL value, or a missing element of the list\nif its length is shorter than l.\nA single variable is thus represented, both in our work and in [1], by a vector of length l\u00b7 d + 2 = 402.\nAt this point the architectures diverge. In our work, this vector is passed through a single layer\nneural network with 56 units and a SELU activation function [9]. A state is then represented\n\n3\n\n\fFigure 3: The general architecture of our network. Each state vector is \ufb01rst embedded and then\npassed through a dense block before \ufb01nally pooling the results together to a single vector. The result\nis then passed to 3 parallel feed-forward layers that perform the predictions.\n\nby a concatenation of v + 1 activation vectors of length 56, with the additional vector being the\nrepresentation of the output. Thus, we obtain a state vector of length s = 56(v + 1).\n\n3.2 Network Architecture\nWe compute f and g based on the embedding of the environment, which is a set of k vectors in Rs.\nSince the order of the examples is arbitrary, this embedding is a permutation-invariant function of\nthese state vectors.\nSince we perform a search with our network, an essential desired property for our architecture is to\nhave high model capacity while retaining fast run times. 
To this end, we use a densely connected\nblock [10], an architecture that has recently achieved state-of-the-art results on image classi\ufb01cation\nwith relatively small depth. In our work, the convolution layers are replaced with fully connected\nlayers. Permutation invariance is then obtained by employing average pooling.\nEach state vector is thus passed through a single 10-layer fully-connected dense block. The i-th layer\nin the block receives as input the activations of all layers 1, 2, .., i\u2212 1, and has 56 hidden units, except\nfor the last layer which has 256 units. SELU [9] activations are used between the layers of the dense\nblock. The resulting k vectors are pooled together via simple arithmetic averaging.\nIn order to learn the environment embedding and to perform prediction, we learn networks for three\ndifferent classi\ufb01cation tasks, out of which only the two matching functions f and g are used during\ntest time. Statement prediction: An approximation of f from the formulation in Sec. 2. Since we\nlimit the amount of variables in the program, there is a \ufb01nite set of possible statements. The problem\nis therefore cast as a multiclass classi\ufb01cation problem. Variable dropping: An approximation of g\nfrom the formulation. This head is a multi-label binary classi\ufb01er that predicts for each variable in\nthe program whether it can be dropped. At test-time, when we run out of variables, we drop those\nthat were deemed most probable to be unnecessary by this head. Operator prediction: This head\npredicts the function of the next statement, where a function is de\ufb01ned as the operator in combination\nwith the lambda function in case it expects one, but without the other operands. For example, a\nmap with (+1) is considered a different function from a map with (\u22121). While the next function\ninformation is already contained inside the predicted statement, we found this auxiliary training\ntask to be highly bene\ufb01cial. 
Intuitively, this objective is easier to learn than the objective of the first\nclassification network, since the number of classes is much smaller and the distinction between the\nclasses is clearer. In addition, it provides a hierarchical structure to the space of possible statements.\nIn our implementation, all three classifiers are linear. The size of the output dimensions for the\nstatement prediction and variable dropping are thus |S| and [0, 1]^v respectively, and the size of the\noutput of the function prediction is bounded by |F||L|.\nDuring training, for each program in our dataset we execute the program line by line and generate\nthe following information: (i) The program environment achieved by running the program for every\nexample simultaneously, up until the current line of code. (ii) The next statement of the program. (iii)\nThe function of the next statement. (iv) A label for each variable that indicates whether it is used in\nthe rest of the program. If not, it can be dropped.\nCross entropy loss is used for all three tasks, where for variable-dropping multi-label binary cross\nentropy is used. Concretely, denote the predicted statements, functions and drop probability for\nvariable i by S, F, Di respectively, and by \u02c6S, \u02c6F, \u02c6Di the corresponding ground truths. Our model is\ntrained with the following unweighted loss:\n\nL = CE(S, \u02c6S) + CE(F, \u02c6F) + \u2211_{i=1}^{v} CE(Di, \u02c6Di)\n\nFor optimization, we use Adam [11] with a learning rate of 0.001 and batch size of 100.\n\n3.3 The Search Method\n\nSearching for the program based on the network\u2019s output is a tree-search, where the nodes are program\nenvironments and the edges are all possible statements, ordered by their likelihood according to the\nnetwork. During search, we maintain a program environment that represents the program predicted\nthus far. 
In each step, we query the network and update the environment according to the predicted\nstatement. If the maximum number of variables v = t + n is exceeded, we drop the variable predicted\nmost likely to be insigni\ufb01cant by the model\u2019s drop (\u201cgarbage collection\u201d) sub-network.\nSince with in\ufb01nite time any underlying program can be reached and since the benchmarks are\ntime-based, we require a search method that is both exhaustive and capable of being halted, yielding\nintermediate results. To this end, we use CAB [12], a simple extension of beam search that provides\nthese exact properties.\nCAB operates in a spiral manner. First, a regular beam search is run with initial heuristic pruning\nrules. If a desired solution could not be found, the pruning rules are weakened and a new beam search\nbegins. This operation is performed in a loop until a solution is found, or when a speci\ufb01ed timeout\nhas been reached. Speci\ufb01cally, the parameters of CAB in our setting are: \u03b1, which is the beam size;\n\u03b2, which is the number of statement predictions to examine; and c, a constant value that is added to \u03b2\nafter every beam search. After every iteration, CAB doubles \u03b1 and \u03b2 is increased by c, ensuring that\nthe vast majority of paths explored in the next iteration will be new. We employ \u03b1 = 100, \u03b2 = 10,\nand c = 10. It is interesting to note that in [1], a search that is similar in spirit to CAB was attempted\nin one experiment, but resulted in reduced performance.\n\n4 Experiments\n\nWe ran a series of experiments testing the required runtime to reach a certain level of success. As\nbaselines we employed the best variant of DeepCoder [1], and the untrained DFS search employed as\na baseline in that paper. 
We employed various variants of our Predict and Collect Coder (PCCoder)\nmethod, in order to understand its sensitivity and to \ufb01nd out which factors contribute to its success.\nAll our experiments were performed using F64s Azure cloud instances. Each search is a Cython\nprogram that runs on a single thread.\nIn each line of experiments, we trained our model on a dataset consisting of programs of length of up\nto t1. We then sampled w test programs of length t2, which are guaranteed to be semantically disjoint\nfrom the dataset.\nThe \ufb01rst line of experiments was done similarly to [1], where our model was trained on programs\nof length of up to t1 = 4 and tested on w = 100 programs of length t2 = 5. For the environment\nrepresentation, we used a memory buffer that can hold up to four variables in addition to the input.\nThis means that at test-time for programs with three inputs a variable is dropped for the last statement.\nSince there is no public implementation for [1], we have measured our dataset on our reimplementation\nthereof. Differences in results between the article and our reimplementation can be explained by\nthe fact that our code is in optimized Cython, whereas [1] used C++. To verify that our data is\nconsistent with [1] in terms of dif\ufb01culty, we have also provided results for baseline search of our\nreimplementation and the article.\nThe results are reported in the same manner used in [1]. For every problem, de\ufb01ned by a set of \ufb01ve\ninput/output pairs, the method is run with a timeout of 10,000s. We then consider the time that it took\nto obtain a certain ratio of success (number of solved problems over the number of problems after a\ncertain number of seconds). The results are reported in Tab. 1. As can be seen, our reimplementation\nof DeepCoder is slower than the original implementation (discarding hardware differences). 
However,\nthe new method (PCCoder) is considerably faster than the original DeepCoder method, even though\nour implementation is less optimized.\n\nTable 1: Recreating the experiment of [1], where training is done on programs shorter than 5\nstatements and testing is done on programs of length 5. The table shows the time required to achieve\na solution for a given ratio of the programs. Success ratios that could not be reached in the timeout of\n10,000 seconds are marked by blanks.\n\nMethod               20%    40%    60%    80%   95%\nBaseline [1]         163s   2887s  6832s  -     -\nBaseline (reimpl.)   484s   4310s  9628s  -     -\nDeepCoder [1]        24s    514s   2654s  -     -\nDeepCoder (reimpl.)  67s    1465s  3826s  -     -\nPCCoder              5s     41s    259s   782s  2793s\n\nTable 2: Comparison for programs of varying lengths between our PCCoder8 method with memory\nsize 8, our PCCoder11 with memory size 11, and DeepCoder (\u2217reimplementation)\n\nLength  Model       Total solved\n5       DeepCoder\u2217  63.0%\n5       PCCoder8    99.2%\n5       PCCoder11   99.6%\n8       DeepCoder\u2217  11.2%\n8       PCCoder8    90.0%\n8       PCCoder11   91.2%\n10      DeepCoder\u2217  7.0%\n10      PCCoder8    73.0%\n10      PCCoder11   74.0%\n12      DeepCoder\u2217  4.0%\n12      PCCoder8    65.4%\n12      PCCoder11   67.2%\n14      DeepCoder\u2217  2.0%\n14      PCCoder8    52.0%\n14      PCCoder11   60.4%\n\nTimeout needed to solve\n\n168s\n127s\n\n132s 1743s\n129s 1587s\n\n-\n13s\n11s\n-\n\n-\n45s\n37s\n-\n\n4s\n4s\n-\n24s\n21s\n-\n\n454s 1050s 4759s\n310s\n988s 4382s\n\n5% 10% 20% 40% 60% 70% 80% 90% 99%\n0.6s 12s\n1s\n1s\n1s\n2s\n578s 2979s\n1s\n4s\n642s\n2s\n5s\n-\n3s\n8s\n-\n2s\n1s\n\n25s 458s 2004s\n2s\n3s\n-\n4s\n8s\n-\n6s 143s 1642s 3426s\n12s 135s 1299s 3285s\n-\n5s 295s 2725s\n21s 378s 2589s\n-\n39s 747s\n18s 162s 4476s\n\n2s\n5s\n-\n4s\n7s\n-\n4s\n11s\n-\n3s\n2s\n\n-\n\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n\n-\n-\n-\n-\n-\n-\n-\n-\n-\n\n-\n\n-\n-\n-\n-\n-\n-\n\n-\n\n-\n\n-\n-\n-\n-\n-\n-\n-\n-\n-\n\n5s\n6s\n-\n\n-\n\n-\n\n-\n-\n\n-\n\n-\n\nIn the second experiment, we trained our model on a single dataset 
consisting of programs of length\nof up to t1 = 12. We then sampled w = 500 programs of lengths 5, 8, 10, 12, 14 and tasked our\nmodel with solving them. In total, the dataset consists of 143000 train programs of varying lengths\nand 2500 test programs. For each test length t2, the model is allowed to output programs of maximum\nlength t2. The memory of the programs was limited to either 11 (including up to three inputs) or 8.\nFor comparison, results of our reimplementation of [1] for the same tests are also provided.\nIn this experiment, in order to limit the total runtime, we employed a timeout of 5000s. The results\nare reported in Tab. 2. As can be seen, for programs of length 5, when trained on longer programs\nthan what was done in the \ufb01rst experiment, our method outperforms DeepCoder with an even larger\ngap in performance.\nOverall, within the timeout boundary of 5000s, our method, with a memory of size 11, solves 60% of\nthe programs of length 14, and 67% of those of length 12. DeepCoder, meanwhile, fails to surpass\nthe 5% mark for these lengths. The results for memory size 11 and memory size 8 are similar, except\nfor length 14. Considering that with three inputs, the \ufb01rst memory would need to discard variables\nafter step 8, both variants make an extensive use of the garbage collection mechanism.\nTab. 3 compares the length of the predicted program with the length of the ground-truth. As can be\nseen, for lengths 5 and 8 most predictions are equal in length to the solutions, whereas for larger\nlengths it is common to observe shorter solutions than the target program.\n\n6\n\n\fTable 3: The length of the predicted program (columns) vs. 
corresponding ground-truth length (rows).\n\nModel      Length  1    2    3    4    5    6    7    8    9    10   11   12   13   14\nPCCoder8   5       0%   0%   1%   14%  85%  -    -    -    -    -    -    -    -    -\nPCCoder8   8       0%   0%   0%   1%   10%  32%  23%  34%  -    -    -    -    -    -\nPCCoder8   10      0%   0%   0%   2%   5%   16%  16%  32%  16%  13%  -    -    -    -\nPCCoder8   12      0%   0%   0%   3%   3%   9%   27%  22%  11%  9%   8%   8%   -    -\nPCCoder8   14      0%   0%   0%   0%   1%   5%   10%  26%  20%  12%  7%   7%   4%   8%\nPCCoder11  5       0%   0%   2%   16%  82%  -    -    -    -    -    -    -    -    -\nPCCoder11  8       0%   0%   1%   1%   8%   25%  29%  36%  -    -    -    -    -    -\nPCCoder11  10      0%   0%   0%   2%   5%   12%  20%  26%  19%  16%  -    -    -    -\nPCCoder11  12      0%   0%   0%   2%   2%   9%   19%  21%  14%  15%  10%  8%   -    -\nPCCoder11  14      0%   0%   0%   0%   1%   9%   7%   23%  17%  13%  19%  7%   1%   3%\n\nTable 4: CIDEr scores of PCCoder\u2019s predictions with respect to the ground-truth. Each program is\nrepresented as a sentence, and each function is a word. Only successful predictions are measured.\n\nLength     5      8      10     12     14\nPCCoder8   72.87  48.47  33.11  27.26  19.66\nPCCoder11  72.33  52.93  38.48  28.41  21.90\n\nFollowing this, one may wonder what is the amount of similarity between the target program and the\nfound one, when the search is successful. We employ the CIDEr [13] score in order to compute this\nsimilarity. CIDEr is considered a relatively semantic measure for similarity between two sentences,\nand is computed based on a weighted combination of n-grams that takes into account the words\u2019\nfrequency. In our experiment, we consider each program to be a sentence and each valid combination\nof functions and lambdas to be a word. We use the training set of programs of length of up to t1 = 12\nto compute the background frequencies. The results are presented in Tab. 4. As can be seen, for\nlength 5 the similarity is relatively high, whereas for larger lengths the score gradually decreases.\nWe next perform an ablation analysis in order to study the contribution of each component. 
In this\nexperiment, we trained different models on a dataset of 79000 programs of length of up to t1 = 8. We\nthen tested their performance on 1000 test programs of length t2 = 8. All variants employ a memory\nsize of v = 8, which means that variables are discarded after step \ufb01ve for programs with 3 inputs.\nSpeci\ufb01cally, these are the variants tested: PCCoder - the original PCCoder model. PCCoder_SD - the\nPCCoder model with the auxiliary function task discarded (called SD since statement and variable\ndrop predictions are still active). PCCoder_SF - the PCCoder model with the variable dropping\ntask discarded. The variables are discarded at random. PCCoder_S - the PCCoder model with both\nthe variable dropping and function prediction tasks discarded. PCCoder_FixedGC - the PCCoder\nmodel where the GC is learned after the representation is \ufb01xed. PCCoder_ImportGC - the PCCoder\nmodel where GC is not used as an auxiliary task, but computed separately using the original network.\nPCCoder_Linear - the PCCoder model with the dense block replaced by 3 feed-forward layers of 256\nunits. This results in a similar architecture to [1], except for differences in input and embedding. Note\nthat training deeper networks, without employing dense connections failed to improve the results.\nPCCoder_ResNet - the PCCoder model with ResNet blocks instead of the dense block. We found\nthat using 4 ResNet blocks, each comprised of 2 fully-connected layers with 256 units and a SELU\nactivation worked best. PCCoder_DFS - the PCCoder model, but with DFS search instead of CAB.\nThe width of the search is limited to ensure that we try multiple predictions for the lower depths\nof the search. PCCoder_Ten - the original PCCoder model trained on the same dataset but with 10\nexamples for each program (both at train and test time). PCCoder_Ten5 - the results of PCCoder_Ten\nwhere at run time we provide only 5 samples instead of 10.\nThe results are reported in Tab. 5. 
As can be seen, all variants except PCCoder_Ten5 result in a loss of\nperformance. The loss of the variable drop is less detrimental than the drop of the auxiliary function\nprediction task, pointing to difficulty of training to predict specific statements. The GC component\nhas a positive effect both as an auxiliary task and on test-time results. Linear and ResNet, which\nemploy a shallower network, are faster for easy programs, but are, however, challenged by more\ndifficult ones.\n\nTable 5: Comparison between several variants of our method. Training was done on programs of\nlength of up to 8 and tested on 1,000 programs of length 8, with a timeout of 1,000 seconds.\n\nModel             Total solved  20%   40%   60%   70%   80%\nPCCoder           83%           3s    10s   84s   202s  635s\nPCCoder_SD        70%           5s    66s   347s  953s  -\nPCCoder_SF        77%           6s    11s   161s  741s  -\nPCCoder_S         66%           7s    126s  660s  -     -\nPCCoder_FixedGC   79%           4s    13s   116s  285s  -\nPCCoder_ImportGC  80%           3s    12s   96s   263s  939s\nPCCoder_Linear    73%           0.7s  13s   323s  643s  -\nPCCoder_ResNet    76%           2s    9s    121s  414s  -\nPCCoder_DFS       67%           29s   310s  692s  -     -\nPCCoder_Ten       78%           1s    12s   78s   229s  -\nPCCoder_Ten5      84%           0.6s  9s    70s   135s  530s\n\nTable 6: Comparison between multiple recently proposed methods of program synthesis on our DSL.\nSimilarly to the ablation experiment, training was done on programs of length of up to 8 and tested\non 1,000 programs of length 8, with a timeout of 1,000 seconds.\n\nModel                Total solved\nPCCoder              83%\nPCCoder_No_IO        4%\nDeepCoder            5%\nDeepCoder_CAB        4%\nRobustFill (Attn A)  2%\nRobustFill (Attn B)  4%\n\nTimeout needed to solve\n1% 2%\n5%\n0.3s\n0.2s\n0.6s\n51s\n50s\n245s\n44s\n3s\n131s\n843s\n8s\n7s\n11s\n\n4%\n0.4s\n554s\n311s\n626s\n\n517s\n\n-\n-\n-\n\n971s\n\n-\n\n-\n\n
The beam search is bene\ufb01cial and CAB is considerably more ef\ufb01cient than DFS.\nIt has been argued in [14] that learning from only a few sample input/output pairs is hard and one\nshould use more samples. Clearly, training on more samples cannot hurt as they can always be\nignored. However, the less examples one uses during test-time, the more freedom the algorithm\nhas to come up with compatible solutions, making the synthesis problem easier. Our experiment\nwith PCCoder_Ten shows that, for a network trained and tested on 10 samples vs a network trained\nand employed on 5 samples, a higher success rate is achieved with less input samples. However,\nas indicated by PCCoder_Ten5, if more samples are used during training but only 5 samples are\nemployed at test-time, the success rate of the model improves.\nIn the next experiment, we evaluate a few other program synthesis methods on our DSL. To this end,\nwe employ the same experimental conditions as in the ablation experiment. These are the methods\nassessed: (i) Two of the variants of RobustFill [6], reimplemented for our DSL. For fairness, CAB\nis used. (ii) The original DeepCoder model and a variant of DeepCoder where CAB search is used\ninstead of DFS. (iii) A variant of our PCCoder model where intermediate variable values are not used,\ncausing the state to be \ufb01xed per input-output pair.\nAs can be seen in Tab. 6, using CAB with DeepCoder slightly degrades results in comparison to DFS.\nA possible reason is that the method\u2019s predicted probabilities do not depend on previous statements.\nFurthermore, the Attn.B variant of RobustFill and the variant of PCCoder without IO states both\nperform comparably to DeepCoder. Considering that the programs being generated are reasonably\nlong and that all three methods do not track variable values during execution, this can be expected.\nA speci\ufb01c program can be the target of many random environments. 
An auxiliary experiment is\nperformed in order to evaluate the invariance of the learned representation to the specific environment\nused. We sampled eight different programs of length 12 and created ten random environments for\neach, with each environment comprising five input/output samples. As can be seen in\nFig. 4, the environments for each program are grouped together.\n\nFigure 4: A 2D t-SNE plot of the embedding of the environments in R^256 (after average pooling over\nthe samples). The experiment consists of eight different programs of length 12. For each program,\nten random environments were created, each including five input-output samples. The eight colors\nrepresent the associated programs of each environment.\n\n5 Conclusions\n\nWe present a new algorithm that achieves a sizable improvement over the recent baseline, which is a\nstrong method that was compared to other existing approaches from the automatic code synthesis\nliterature. It would be interesting to compare the performance of the method to human programmers\nwho can reason, implement, and execute. We hypothesize that, within this restricted domain of code\nsynthesis, the task would be extremely challenging for humans, even for programs of shorter lengths.\nSince any valid solution is a correct output, and since we employ a search process, one may wonder\nwhether Reinforcement Learning (RL) would be suitable for the problem. We have attempted to\napply a few RL methods as well as to incorporate elements from the RL literature in our work. These\nattempts are detailed in the supplementary material.\n\nAcknowledgments\n\nThis work was supported by an ICRC grant.\n\nReferences\n\n[1] Matej Balog, Alexander L Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow.\nDeepcoder: Learning to write programs. In ICLR, 2017.\n\n[2] Ashwin Kalyan, Abhishek Mohta, Oleksandr Polozov, Dhruv Batra, Prateek Jain, and Sumit\nGulwani. 
Neural-guided deductive search for real-time program synthesis from examples. In\nICLR, 2018.\n\n[3] Sumit Gulwani. Programming by examples: Applications, algorithms, and ambiguity resolution.\nIn Proceedings of the 8th International Joint Conference on Automated Reasoning - Volume\n9706, pages 9\u201314, Berlin, Heidelberg, 2016. Springer-Verlag.\n\n[4] Rishabh Singh and Sumit Gulwani. Predicting a correct program in programming by example.\nIn CAV, July 2015.\n\n[5] Oleksandr Polozov and Sumit Gulwani. Flashmeta: A framework for inductive program\nsynthesis. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-\nOriented Programming, Systems, Languages, and Applications, pages 107\u2013126, New York, NY,\nUSA, 2015. ACM.\n\n[6] Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed,\nand Pushmeet Kohli. Robustfill: Neural program learning under noisy i/o. In ICML, 2017.\n\n[7] Alexander L Gaunt, Marc Brockschmidt, Rishabh Singh, Nate Kushman, Pushmeet Kohli,\nJonathan Taylor, and Daniel Tarlow. Terpret: A probabilistic programming language for\nprogram induction. arXiv preprint arXiv:1608.04428, 2016.\n\n[8] Sebastian Riedel, Matko Bosnjak, and Tim Rockt\u00e4schel. Programming with a differentiable\nforth interpreter. CoRR, abs/1605.06640, 2016.\n\n[9] G\u00fcnter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing\nneural networks. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan,\nand R. Garnett, editors, NIPS, pages 971\u2013980, 2017.\n\n[10] G. Huang, Z. Liu, L. v. d. Maaten, and K. Q. Weinberger. Densely connected convolutional\nnetworks. In CVPR, pages 2261\u20132269, July 2017.\n\n[11] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR,\nabs/1412.6980, 2014.\n\n[12] Weixiong Zhang. Complete anytime beam search. 
In AAAI, AAAI \u201998/IAAI \u201998, pages 425\u2013430,\nMenlo Park, CA, USA, 1998. American Association for Artificial Intelligence.\n\n[13] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image\ndescription evaluation. In Proceedings of the IEEE conference on computer vision and pattern\nrecognition, pages 4566\u20134575, 2015.\n\n[14] Anonymous Reviewer 3. ICLR open review of \u201cDeepcoder: Learning to write programs\u201d.\nhttps://openreview.net/forum?id=ByldLrqlx&noteId=HJor8rbNg, 2017.\n", "award": [], "sourceid": 1073, "authors": [{"given_name": "Amit", "family_name": "Zohar", "institution": "Tel Aviv University"}, {"given_name": "Lior", "family_name": "Wolf", "institution": "Facebook AI Research"}]}