{"title": "Neural Program Meta-Induction", "book": "Advances in Neural Information Processing Systems", "page_first": 2080, "page_last": 2088, "abstract": "Most recently proposed methods for neural program induction work under the assumption of having a large set of input/output (I/O) examples for learning any given input-output mapping. This paper aims to address the problem of data and computation efficiency of program induction by leveraging information from related tasks. Specifically, we propose two novel approaches for cross-task knowledge transfer to improve program induction in limited-data scenarios. In our first proposal, portfolio adaptation, a set of induction models is pretrained on a set of related tasks, and the best model is adapted towards the new task using transfer learning. In our second approach, meta program induction, a $k$-shot learning approach is used to make a model generalize to new tasks without additional training. To test the efficacy of our methods, we constructed a new benchmark of programs written in the Karel programming language. Using an extensive experimental evaluation on the Karel benchmark, we demonstrate that our proposals dramatically outperform the baseline induction method that does not use knowledge transfer. We also analyze the relative performance of the two approaches and study conditions in which they perform best. In particular, meta induction outperforms all existing approaches under extreme data sparsity (when a very small number of examples are available), i.e., fewer than ten. As the number of available I/O examples increases (i.e., to a thousand or more), portfolio adapted program induction becomes the best approach. 
For intermediate data sizes, we demonstrate that the combined method of adapted meta program induction has the strongest performance.", "full_text": "Neural Program Meta-Induction\n\nJacob Devlin\u2217\n\nGoogle\n\njacobdevlin@google.com\n\nRudy Bunel\u2217\n\nUniversity of Oxford\n\nrudy@robots.ox.ac.uk\n\nRishabh Singh\n\nMicrosoft Research\n\nrisin@microsoft.com\n\nMatthew Hausknecht\n\nMicrosoft Research\n\nmahauskn@microsoft.com\n\nPushmeet Kohli\u2217\n\nDeepMind\n\npushmeet@google.com\n\nAbstract\n\nMost recently proposed methods for Neural Program Induction work under the\nassumption of having a large set of input/output (I/O) examples for learning any\nunderlying input-output mapping. This paper aims to address the problem of data\nand computation efficiency of program induction by leveraging information from\nrelated tasks. Specifically, we propose two approaches for cross-task knowledge\ntransfer to improve program induction in limited-data scenarios. In our first\nproposal, portfolio adaptation, a set of induction models is pretrained on a set of\nrelated tasks, and the best model is adapted towards the new task using transfer\nlearning. In our second approach, meta program induction, a k-shot learning\napproach is used to make a model generalize to new tasks without additional training.\nTo test the efficacy of our methods, we constructed a new benchmark of programs\nwritten in the Karel programming language [17]. Using an extensive experimental\nevaluation on the Karel benchmark, we demonstrate that our proposals dramatically\noutperform the baseline induction method that does not use knowledge transfer. We\nalso analyze the relative performance of the two approaches and study conditions\nin which they perform best. In particular, meta induction outperforms all existing\napproaches under extreme data sparsity (when a very small number of examples are\navailable), i.e., fewer than ten. 
As the number of available I/O examples increases\n(i.e., to a thousand or more), portfolio adapted program induction becomes the best\napproach. For intermediate data sizes, we demonstrate that the combined method\nof adapted meta program induction has the strongest performance.\n\n1 Introduction\n\nNeural program induction has been a very active area of research in the last few years, but this past\nwork has made a highly variable set of assumptions about the amount of training data and types of\ntraining signals that are available. One common scenario is example-driven algorithm induction,\nwhere the goal is to learn a model which can perform a specific task (i.e., an underlying program\nor algorithm), such as sorting a list of integers [7, 11, 12, 21]. Typically, the goal of these works is\nto compare a newly proposed network architecture to a baseline model, and the system is trained\non input/output examples (I/O examples) as a standard supervised learning task. For example, for\ninteger sorting, the I/O examples would consist of pairs of unsorted and sorted integer lists, and the\nmodel would be trained to minimize the cross-entropy loss of the output sequence. In this way, the\ninduction task is similar to a standard sequence generation task such as machine translation or\nimage captioning. In these works, the authors typically assume that a near-infinite number of I/O\nexamples corresponding to a particular task is available.\n\n\u2217Work performed at Microsoft Research.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nOther works have made different assumptions about data: Li et al. [14] train models from scratch\nusing 32 to 256 I/O examples. Lake et al. [13] learn to induce complex concepts from several\nhundred examples. Devlin et al. [5], Duan et al. [6], and Santoro et al. 
[19] are able to perform\ninduction using as few as one I/O example, but these works assume that a large set of background tasks\nfrom the same task family is available for training. Neelakantan et al. [16] and Andreas et al. [1]\nalso develop models which can perform induction on new tasks that were not seen at training time,\nbut these models are conditioned on a natural language representation rather than I/O examples.\nThese varying assumptions about data are all reasonable in differing scenarios. For example, in a\nscenario where a reference implementation of the program is available, it is reasonable to expect that\nan unlimited number of I/O examples can be generated, but it may be unreasonable to assume that any\nsimilar program will also be available. However, we can also consider a scenario like FlashFill [9],\nwhere the goal is to learn a regular-expression-based string transformation program from\nuser-provided examples, such as \u201cJohn Smith \u2192 Smith, J.\u201d. Here, it is only reasonable to assume\nthat a handful of I/O examples are available for a particular task, but that many examples are available\nfor other tasks in the same family (e.g., \u201cFrank Miller \u2192 Frank M\u201d).\nIn this work, we compare several different techniques for neural program induction, with a particular\nfocus on how the relative accuracy of these techniques differs as a function of the available training\ndata. In other words, if technique A is better than technique B when only five I/O examples are\navailable, does this mean A will also be better than B when 50 I/O examples are available? What\nabout 1000? 100,000? 
How does this performance change if data for many related tasks is available?\nTo answer these questions, we evaluate four general techniques for cross-task knowledge sharing:\n\u2022 Plain Program Induction (PLAIN) - Supervised learning is used to train a model which\ncan perform induction on a single task, i.e., read in an input example for the task and predict\nthe corresponding output. No cross-task knowledge sharing is performed.\n\u2022 Portfolio-Adapted Program Induction (PLAIN+ADAPT) - Simple transfer learning is\nused to adapt a model which has been trained on a related task to a new task.\n\u2022 Meta Program Induction (META) - A k-shot learning-style model is used to represent an\nexponentially large family of tasks, where the training I/O examples corresponding to a task are\ndirectly conditioned on as input to the network. This model can generalize to new tasks\nwithout any additional training.\n\u2022 Adapted Meta Program Induction (META+ADAPT) - The META model is adapted to a\nspecific new task using round-robin hold-one-out training on the task\u2019s I/O examples.\n\nWe evaluate these techniques on a synthetic domain described in Section 2, using a simple but strong\nnetwork architecture. All models are fully example-driven, so the underlying program representation\nis only used to generate I/O examples, and is not used when training or evaluating the model.\n\n2 Karel Domain\n\nIn order to ground the ideas presented here, we describe our models in relation to a particular\nsynthetic domain called \u201cKarel\u201d. Karel is an educational programming language developed at\nStanford University in the 1980s [17]. In this language, a virtual agent named Karel the Robot moves\naround a 2D grid world placing markers and avoiding obstacles. The domain-specific language (DSL)\nfor Karel is moderately complex, as it allows if/then/else blocks, for loops, and while loops,\nbut does not allow variable assignments. 
Compared to the current program induction benchmarks,\nKarel introduces a new challenge of learning programs with complex control flow, where the\nstate-of-the-art program synthesis techniques involving constraint-solving and enumeration do not scale\nbecause of the prohibitively large search space. Karel is also an interesting domain as it is used\nfor example-driven programming in an introductory Stanford programming course (footnote 2). In this course,\nstudents are provided with several I/O grids corresponding to some underlying Karel program that\nthey have never seen before, and must write a single program which can be run on all inputs to\ngenerate the corresponding outputs. This differs from typical programming assignments, since the\nprogram specification is given in the form of I/O examples rather than natural language. An example\nis given in Figure 1. Note that inducing Karel programs is not a toy reinforcement learning task.\nSince the example I/O grids are of varying dimensions, the learning task is not to induce a single\ntrace that only works on grids of a fixed size, but rather to induce a program that can perform the\ndesired action on \u201carbitrary-size grids\u201d, thereby forcing it to use the loop structure appropriately.\n\nFootnote 2: The programs are written manually by students; the course is not used to teach program induction or synthesis.\n\nFigure 1: Karel Domain: On the left, a sample task from the Karel domain with two training I/O\nexamples (I1, O1), (I2, O2) and one test I/O example (\u00ce, \u00d4). The computer is Karel, the circles\nrepresent markers and the brick wall represents obstacles. 
On the right, the language spec for Karel.\n\nIn this work, we only explore the induction variant of Karel, where instead of attempting to synthesize\nthe program, we attempt to directly generate the output grid \u00d4 from a corresponding input grid \u00ce.\nAlthough the underlying program is used to generate the training data, it is not used by the model in\nany way, so in principle it does not have to explicitly exist. For example, a more complex real-world\nanalogue would be a system where a user controls a drone to provide examples of a task such as\n\u201cFly around the boundary of the forest, and if you see a deer, take a picture of it, then return home.\u201d\nSuch a task might be difficult to represent using a program, but could be possible with a sufficiently\npowerful and well-trained induction model, especially if cross-task knowledge sharing is used.\n\n3 Plain Program Induction\n\nIn this work, plain program induction (denoted as PLAIN) refers to the supervised training of a\nparametric model using a set of input/output examples (I1, O1), ..., (IN, ON), such that the model\ncan take some new \u00ce as input and emit the corresponding \u00d4. In this scenario, all I/O examples in\ntraining and test correspond to the same task (i.e., underlying program or algorithm), such as sorting\na list of integers. Examples of past work in plain program induction using neural networks include\n[7, 11, 12, 8, 4, 20, 2].\nFor the Karel domain, we use a simple architecture shown on the left side of Figure 2. Each cell of\nthe input grid is represented as a 16-dimensional n-hot vector encoding the objects\nin that cell, i.e., (AgentFacingNorth, AgentFacingEast, ..., OneMarker, TwoMarkers,\n..., Obstacle). Additionally, instead of predicting the output grid directly, we use an\nLSTM to predict the delta between the input grid and output grid as a series of tokens. 
For example, AgentRow=+1 AgentCol=+2 HeroDir=south MarkerRow=0 MarkerCol=0\nMarkerCount=+2 would indicate that the agent has moved north 1 row, east 2 columns, is facing south,\nand has also added two markers at its starting position. This sequence can be deterministically applied to\nthe input to create the output grid. Specific details about the model architecture and training are given\nin Section 8.\n\n4 Portfolio-Adapted Program Induction\n\nMost past work in neural program induction assumes that a very large amount of training data is\navailable to train a model for a particular task, and ignores data sparsity issues entirely. However, in a practical\nscenario such as the FlashFill domain described in Section 1 or the real-world Karel analogue\ndescribed in Section 2, I/O examples for a new task must be provided by the user. In this case, it may\nbe unrealistic to expect more than a handful of I/O examples corresponding to a new task.\n\nFigure 2: Network Architecture: Diagrams for the general network architectures used for the Karel\ndomain. Specifics of the model are provided in Section 8.\n\nOf course, it is typically infeasible to train a deep neural network from scratch with only a handful of\ntraining examples. Instead, we consider a scenario where data is available for a number of background\ntasks from the same task family. In the Karel domain, the task family is simply any task from the\nKarel DSL, but in principle the task family can be a more abstract concept such as \u201cThe set of\nstring transformations that a user might perform on columns in a spreadsheet.\u201d\nOne way of taking advantage of such background tasks is with straightforward transfer learning,\nwhich we refer to as portfolio-adapted program induction (denoted as PLAIN+ADAPT). Here, we\nhave a portfolio of models, each trained on the I/O examples of a single background task. 
To train an induction model\nfor a new task, we select the \u201cbest\u201d background model and use it as an initialization point for training\nour new model. This is analogous to the type of transfer learning used in standard supervised\ntasks like image recognition or machine translation [10, 15]. To select this\nbackground model, we score the training I/O examples for the new task with each model in the\nportfolio, and select the model that assigns them the highest log-likelihood.\n\n5 Meta Program Induction\n\nAlthough we expect that PLAIN+ADAPT will allow us to learn an induction model with fewer I/O\nexamples than training from scratch, it is still subject to the normal pitfalls of SGD-based training.\nIn particular, it is typically very difficult to train powerful DNNs using very few I/O examples (e.g.,\n< 100) without encountering significant overfitting.\nAn alternative method is to train a single network which represents an entire (exponentially large)\nfamily of tasks, in which the latent representation of a particular task is obtained by conditioning on\nthe training I/O examples for that task. We refer to this type of model as meta induction (denoted as\nMETA) because instead of using SGD to learn a latent representation of a particular task based on I/O\nexamples, we are using SGD to learn how to learn a latent task representation based on I/O examples.\nMore specifically, our meta induction architecture takes as input a set of demonstration examples\n(I1, O1), ..., (Ik, Ok) and an additional eval input \u00ce, and emits the corresponding output \u00d4. A diagram\nis shown in Figure 2. The number of demonstration examples k is typically small, e.g., 1 to 5. At\ntraining time, we are given a large number of tasks with k + 1 examples each. During training,\none example is chosen at random to serve as the eval example, and the others are used as the\ndemonstration examples. 
At test time, we are given k I/O examples which correspond to a new task\nthat was not seen at training, along with one or more eval inputs \u00ce. Then, we are able to generate\nthe corresponding \u00d4 for the new task without performing any SGD. The META model could also be\ndescribed as a k-shot learning system, closely related to Duan et al. [6] and Santoro et al. [19].\nIn a scenario where a moderate number of I/O examples are available at test time, e.g., 10 to 100,\nperforming meta induction is non-trivial. It is not computationally feasible to train a model which is\ndirectly conditioned on k = 100 examples, and using a larger value of k at test time than training\ntime creates an undesirable mismatch. So, if the model is trained using k examples but n examples\nare available at test time (n > k), the approach we take is to randomly sample a number of k-sized\nsets and ensemble the softmax log probabilities for each output token. There are (n\nchoose k) total subsets available, but we found little improvement in using more than 2n/k subsets. We set\nk = 5 in all experiments, and present results using different values of n in Section 8.\n\n6 Adapted Meta Program Induction\n\nThe previous approach of using n > k I/O examples at test\ntime is reasonable, but certainly not optimal. An\nalternative approach is to combine the best aspects of META\nand PLAIN+ADAPT, and adapt the meta model to a\nparticular new task using SGD. To do this, we can repeatedly\nsample k + 1 I/O examples from the n total examples\nprovided, and fine-tune the META model for the new task\nin the exact manner that it was trained originally. 
For\ndecoding, we still perform the same algorithm as the META\nmodel, but the weights have been adapted for the particular\ntask being decoded.\nIn order to mitigate overfitting, we found that it is useful\nto perform \u201cdata-mixture regularization,\u201d where the I/O examples for the new task are mixed with\nrandom training data corresponding to other tasks. In all experiments here we sample 10% of the I/O\nexamples in a minibatch from the new task and 90% from random training tasks. It is possible that\nunderfitting could occur in this scenario, but note that the meta network is already trained to represent\nan exponentially large number of tasks, so using a single task for 10% of the data is quite significant. Results\nwith data mixture adaptation are shown in Figure 3, which demonstrates that this acts as a strong\nregularizer and moderately improves held-out loss.\n\nFigure 3: Data-Mixture Regularization\n\n7 Comparison with Existing Work on Neural Program Induction\n\nThere has been a large amount of past work in neural program induction, and many of these works\nhave made different assumptions about the conditions of the induction scenario. Here, our goal is to\ncompare the four techniques presented here to each other and to past work across several attributes:\n\u2022 Example-Driven Induction - \u2713 = The system is trained using I/O examples as specification.\n\u2717 = The system uses some other specification, such as natural language.\n\u2022 No Explicit Program Representation - \u2713 = The system can be trained without any explicit\nprogram or program trace. \u2717 = The system requires a program or program trace.\n\u2022 Task-Specific Learning - \u2713 = The model is trained to maximize performance on a particular\ntask. \u2717 = The model is trained for a family of tasks.\n\u2022 Cross-Task Knowledge Sharing - \u2713 = The system uses information from multiple tasks\nwhen training a model for a new task. 
\u2717 = The system uses information from only a single\ntask for each model.\n\nThe comparison is presented in Table 1. The PLAIN technique is closely related to the example-driven\ninduction models such as Neural Turing Machines [7] or Neural RAM [12], which typically have not\nfocused on cross-task knowledge transfer. The META model is closely related to the k-shot imitation\nlearning approaches [6, 5, 19], but these papers did not explore task-specific adaptation.\n\n8 Experimental Results\n\nIn this section we evaluate the four techniques PLAIN, PLAIN+ADAPT, META, META+ADAPT on the\nKarel domain. The primary goal is to compare performance relative to the number of training I/O\nexamples available for the test task.\n\nSystem | Example-Driven Induction | No Explicit Program or Trace | Task-Specific Learning | Cross-Task Knowledge Sharing\nNovel Architectures Applied to Program Induction\nNTM [7], Stack RNN [11], NRAM [12], Neural Transducers [8], Learn Algo [21], Others [4, 20, 2, 13] | \u2713 | \u2713 | \u2713 | \u2717\nTrace-Augmented Induction\nNPI [18] | \u2713 | \u2717 | \u2713 | \u2713\nRecursive NPI [3], NPL [14] | \u2713 | \u2717 | \u2713 | \u2717\nNon Example-Driven Induction (e.g., Natural Language-Driven Induction)\nInducing Latent Programs [16], Neural Module Networks [1] | \u2717 | \u2713 | \u2713 | \u2713\nk-shot Imitation Learning\n1-Shot Imitation Learning [6], RobustFill [5], Meta-Learning [19] | \u2713 | \u2713 | \u2717 | \u2713\nTechniques Explored in This Work\nPlain Program Induction | \u2713 | \u2713 | \u2713 | \u2717\nPortfolio-Adapted Program Induction | \u2713 | \u2713 | \u2713 | \u2713 (Weak)\nMeta Program Induction | \u2713 | \u2713 | \u2717 | \u2713 (Strong)\nAdapted Meta Program Induction | \u2713 | \u2713 | \u2713 | \u2713 (Strong)\n\nTable 1: Comparison with Existing Work: Comparison of existing work across several attributes.\n\nFor the primary experiments reported here, the 
overall network architecture is sketched in Figure 2,\nwith details as follows: The input encoder is a 3-layer CNN with an FC+relu layer on top. The output\ndecoder is a 1-layer LSTM. For the META model, the task encoder uses a 1-layer CNN to encode the\ninput and output for a single example, which are concatenated on the feature map dimension and fed\nthrough a 6-layer CNN with an FC+relu layer on top. Multiple I/O examples are combined with\nmax-pooling on the final vector. All convolutional layers use a 3 \u00d7 3 kernel with a 64-dimensional\nfeature map. The fully-connected and LSTM layers are 1024-dimensional. Different model sizes are\nexplored later in this section. The dropout, learning rate, and batch size were optimized with grid\nsearch for each value of n using a separate set of validation tasks. Training was performed using\nSGD + momentum and gradient clipping using an in-house toolkit.\nAll training, validation, and test programs were generated by treating the Karel DSL as a probabilistic\ncontext-free grammar and performing top-down expansion with uniform probability at each node.\nThe input grids were generated by creating a grid of a random size and inserting the agent, markers,\nand obstacles at random. The output grid was generated by executing the program on the input grid,\nand if the agent ran into an obstacle or did not move, then the example was thrown out and a new\ninput grid was generated. We limit the nesting depth of control flow to be at most 4 (i.e., at most 4\nnested if/while blocks can be chosen in a valid program). We sample I/O grids of size n \u00d7 m, where\nn and m are integers sampled uniformly from the range 2 to 20. We sample programs of up to 20\nstatements. 
Every program and I/O grid in the training/validation/test set is unique.\nResults are presented in Figure 4, evaluated on 25 test tasks with 100 eval examples each (footnote 3). The x-axis\nrepresents the number of training/demonstration I/O examples available for the test task, denoted as\nn. The PLAIN system was trained only on these n examples directly. The PLAIN+ADAPT system was\nalso trained on these n examples, but was initialized from the best model in a portfolio of m models that had been\ntrained on d examples each. Three different values of m and d are shown in the figure. The META\nmodel in this figure was trained on 1,000,000 tasks with 6 I/O examples each, but smaller amounts of\nMETA training are shown in Figure 5. A point-by-point analysis is given below:\n\nFootnote 3: Note that each task and eval example is evaluated independently, so the size of the test set does not affect\nthe accuracy.\n\nFigure 4: Induction Results: Comparison of the four induction techniques on the Karel scenario.\nThe accuracy denotes the total percentage of examples for which the 1-best output grid was exactly\nequal to the reference.\n\n\u2022 PLAIN vs. PLAIN+ADAPT: PLAIN+ADAPT significantly outperforms PLAIN unless n is\nvery large (10k+), in which case both systems perform equally well. This result makes sense,\nsince we expect that much of the representation learning (e.g., how to encode an I/O grid\nwith a CNN) will be independent of the exact task.\n\n\u2022 PLAIN+ADAPT Model Portfolio Size: Here, we compare the three model portfolio settings\nshown for PLAIN+ADAPT. The number of available models (m = 1 vs. m = 25) only\nhas a small effect on accuracy, and this effect is only present for small values of n (e.g.,\nn < 100) when the absolute performance is poor in any case. 
This implies that the majority\nof cross-task knowledge sharing is independent of the exact details of a task.\nOn the other hand, the number of examples used to train each model in the portfolio\n(d = 1,000 vs. d = 100,000) has a much larger effect, especially for moderate values of\nn, e.g., 50 to 100. This makes sense, as we would not expect a significant benefit from\nadaptation unless (a) d \u226b n, and (b) n is large enough to train a robust model.\n\n\u2022 META vs. META+ADAPT: META+ADAPT does not improve over META for small values of\nn, which is in line with the common observation that SGD-based training is difficult using a\nsmall number of samples. However, for large values of n, the accuracy of META+ADAPT\nincreases significantly while the accuracy of the META model remains flat.\n\n\u2022 PLAIN+ADAPT vs. META+ADAPT: Perhaps the most interesting result in the entire chart\nis the fact that the accuracy crosses over, and PLAIN+ADAPT outperforms META+ADAPT by\na significant margin for large values of n (i.e., 1000+). Intuitively, this makes sense, since\nthe meta induction model was trained to represent an exponentially large family of tasks moderately\nwell, rather than represent a single task with extreme precision.\nBecause the network architecture of the META model is a superset of the PLAIN model,\nthese results imply that for a large value of n, the model is becoming stuck in a poor local\noptimum (footnote 4). To validate this hypothesis, we performed adaptation on the meta network after\nrandomly re-initializing all of the weights, and found that in this case the performance of\nMETA+ADAPT matches that of PLAIN+ADAPT for large values of n. 
This confirms that the\npre-trained meta network is actually a worse starting point than training from scratch when\na large number of training I/O examples are available.\n\nLearning Curves: The left side of Figure 5 presents average held-out loss for the various techniques\nusing 50 and 1000 training I/O examples. 
Epoch 0 on the META+ADAPT curve corresponds to the META\nloss. We can see that the PLAIN+ADAPT loss starts out very high, but the model is able to adapt to the\nnew task quickly. The META+ADAPT loss starts out very strong, but only improves by a small amount\nwith adaptation. For 1000 I/O examples, PLAIN+ADAPT is able to overtake the META+ADAPT model by a small\namount, supporting what was observed in Figure 4.\n\nFootnote 4: Since the DNN is over-parameterized relative to the number of training examples, the system is able to\noverfit the training examples in all cases. Therefore \u201cpoor local optimum\u201d refers to the model\u2019s ability to\ngeneralize to the test examples.\n\nFigure 5: Ablation results for Karel Induction.\n\nVarying the Model Size: Here, we present results on three architectures: Large = 64-dim feature\nmap, 1024-dim FC/RNN (used in the primary results); Medium = 32-dim feature map, 256-dim\nFC/RNN; Small = 16-dim feature map, 64-dim FC/RNN. All models use the structure described\nearlier in this section. We can see in the center of Figure 5 that model size has a much larger impact on\nthe META model than on the PLAIN model, which is intuitive: representing an entire family of tasks from a given\ndomain requires significantly more parameters than representing a single task. We can also see that the larger\nmodels outperform the smaller models for any value of n, which is likely because the dropout ratio\nwas selected for each model size and value of n to mitigate overfitting.\nVarying the Amount of META Training: The META model presented in Figure 4 represents a very\noptimistic scenario in which the model is trained on 1,000,000 background tasks with 6 I/O examples each. On\nthe right side of Figure 5, we present META results using 100,000 and 10,000 training tasks. 
We see a\nsignificant loss in accuracy, which demonstrates that it is quite challenging to train a META model\nthat can generalize to new tasks.\n\n9 Conclusions\n\nIn this work, we have contrasted two techniques for using cross-task knowledge sharing to improve\nneural program induction, which are referred to as adapted program induction and meta program\ninduction. Both of these techniques can be used to improve accuracy on a new task by using models\nthat were trained on related tasks from the same family. However, adapted induction uses a transfer\nlearning style approach while meta induction uses a k-shot learning style approach.\nWe applied these techniques to a challenging induction domain based on the Karel programming\nlanguage, and found that each technique, including unadapted induction, performs best under certain\nconditions. Specifically, the preferred technique depends on the number of I/O examples (n) that\nare available for the new task we want to learn, as well as the amount of background data available.\nThese conclusions can be summarized by the following table:\n\nTechnique | Background Data Required | When to Use\nPLAIN | None | n is very large (10,000+)\nPLAIN+ADAPT | Few related tasks (1+) with a large number of I/O examples (1,000+) | n is fairly large (1,000 to 10,000)\nMETA | Many related tasks (100k+) with a small number of I/O examples (5+) | n is small (1 to 20)\nMETA+ADAPT | Same as META | n is moderate (20 to 100)\n\nAlthough we have only applied these techniques to a single domain, we believe that these conclusions\nare highly intuitive, and should generalize across domains. In future work, we plan to explore\nmore principled methods for adapted meta induction, in order to improve upon results in the very\nlimited-example scenario.\n\nReferences\n[1] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In CVPR, pages 39\u201348, 2016.\n\n[2] Marcin Andrychowicz and Karol Kurach. 
Learning efficient algorithms with hierarchical attentive memory. CoRR, abs/1602.03218, 2016.\n\n[3] Jonathon Cai, Richard Shin, and Dawn Song. Making neural programming architectures generalize via recursion. In ICLR, 2017.\n\n[4] Ivo Danihelka, Greg Wayne, Benigno Uria, Nal Kalchbrenner, and Alex Graves. Associative long short-term memory. ICML, 2016.\n\n[5] Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, and Pushmeet Kohli. RobustFill: Neural program learning under noisy I/O. CoRR, abs/1703.07469, 2017.\n\n[6] Yan Duan, Marcin Andrychowicz, Bradly C. Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. CoRR, abs/1703.07326, 2017.\n\n[7] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.\n\n[8] Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. Learning to transduce with unbounded memory. NIPS, 2015.\n\n[9] Sumit Gulwani, William R Harris, and Rishabh Singh. Spreadsheet data manipulation using examples. Communications of the ACM, 2012.\n\n[10] Mi-Young Huh, Pulkit Agrawal, and Alexei A. Efros. What makes imagenet good for transfer learning? CoRR, abs/1608.08614, 2016. URL http://arxiv.org/abs/1608.08614.\n\n[11] Armand Joulin and Tomas Mikolov. Inferring algorithmic patterns with stack-augmented recurrent nets. In NIPS, pages 190\u2013198, 2015.\n\n[12] Karol Kurach, Marcin Andrychowicz, and Ilya Sutskever. Neural random-access machines. ICLR, 2016.\n\n[13] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332\u20131338, 2015.\n\n[14] Chengtao Li, Daniel Tarlow, Alexander L. Gaunt, Marc Brockschmidt, and Nate Kushman. Neural program lattices. In ICLR, 2017.\n\n[15] Minh-Thang Luong and Christopher D. Manning. 
Stanford neural machine translation systems for spoken language domains. 2015.\n\n[16] Arvind Neelakantan, Quoc V. Le, and Ilya Sutskever. Neural programmer: Inducing latent programs with gradient descent. ICLR, 2016.\n\n[17] Richard E Pattis. Karel the robot: a gentle introduction to the art of programming. John Wiley & Sons, Inc., 1981.\n\n[18] Scott Reed and Nando de Freitas. Neural programmer-interpreters. ICLR, 2016.\n\n[19] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International conference on machine learning, pages 1842\u20131850, 2016.\n\n[20] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. NIPS, 2015.\n\n[21] Wojciech Zaremba, Tomas Mikolov, Armand Joulin, and Rob Fergus. Learning simple algorithms from examples. CoRR, abs/1511.07275, 2015. URL http://arxiv.org/abs/1511.07275.\n", "award": [], "sourceid": 1255, "authors": [{"given_name": "Jacob", "family_name": "Devlin", "institution": "Microsoft Research"}, {"given_name": "Rudy", "family_name": "Bunel", "institution": "Oxford University"}, {"given_name": "Rishabh", "family_name": "Singh", "institution": "Microsoft Research"}, {"given_name": "Matthew", "family_name": "Hausknecht", "institution": "Microsoft Research"}, {"given_name": "Pushmeet", "family_name": "Kohli", "institution": "DeepMind"}]}