{"title": "Adaptive Neural Compilation", "book": "Advances in Neural Information Processing Systems", "page_first": 1444, "page_last": 1452, "abstract": "This paper proposes an adaptive neural-compilation framework to address the problem of learning efficient program. Traditional code optimisation strategies used in compilers are based on applying pre-specified set of transformations that make the code faster to execute without changing its semantics. In contrast, our work involves adapting programs to make them more efficient while considering correctness only on a target input distribution. Our approach is inspired by the recent works on differentiable representations of programs. We show that it is possible to compile programs written in a low-level language to a differentiable representation. We also show how programs in this representation can be optimised to make them efficient on a target distribution of inputs. Experimental results demonstrate that our approach enables learning specifically-tuned algorithms for given data distributions with a high success rate.", "full_text": "Adaptive Neural Compilation\n\nRudy Bunel\u2217\n\nUniversity of Oxford\n\nrudy@robots.ox.ac.uk\n\nAlban Desmaison\u2217\nUniversity of Oxford\n\nalban@robots.ox.ac.uk\n\nPushmeet Kohli\nMicrosoft Research\n\npkohli@microsoft.com\n\nPhilip H.S. Torr\n\nUniversity of Oxford\n\nphilip.torr@eng.ox.ac.uk\n\nM. Pawan Kumar\nUniversity of Oxford\n\npawan@robots.ox.ac.uk\n\nAbstract\n\nThis paper proposes an adaptive neural-compilation framework to address the\nproblem of learning ef\ufb01cient programs. Traditional code optimisation strategies\nused in compilers are based on applying pre-speci\ufb01ed set of transformations that\nmake the code faster to execute without changing its semantics. In contrast, our\nwork involves adapting programs to make them more ef\ufb01cient while considering\ncorrectness only on a target input distribution. 
Our approach is inspired by recent work on differentiable representations of programs. We show that it is possible to compile programs written in a low-level language to a differentiable representation. We also show how programs in this representation can be optimised to make them efficient on a target input distribution. Experimental results demonstrate that our approach enables learning specifically-tuned algorithms for given data distributions with a high success rate.

1 Introduction

Algorithm design often requires making simplifying assumptions about the input data. Consider, for instance, the computational problem of accessing an element in a linked list. Without knowledge of the input data distribution, one can only specify an algorithm that runs in time linear in the number of elements of the list. However, suppose all the linked lists that we encountered in practice were ordered in memory. Then it would be advantageous to design an algorithm specifically for this task, as it can lead to a constant running time. Unfortunately, the input data distribution of a real-world problem cannot be easily specified as in the above simple example. The best that one can hope for is to obtain samples drawn from the distribution. A natural question arises from these observations: "How can we adapt a generic algorithm for a computational task using samples from an unknown input data distribution?"

The process of finding the most efficient implementation of an algorithm has received considerable attention in the theoretical computer science and code optimisation communities. Recently, Conditionally Correct Superoptimization [14] was proposed as a method for leveraging samples of the input data distribution to go beyond semantically equivalent optimisation and towards data-specific performance improvements.
The underlying procedure is based on a stochastic search over the space of all possible programs. Additionally, they restrict their applications to reasonably small, loop-free programs, thereby limiting their impact in practice.

In this work, we take inspiration from the recent wave of machine-learning frameworks for estimating programs. Using recurrent models, Graves et al. [2] introduced a fully differentiable representation of a program, enabling the use of gradient-based methods to learn a program from examples. Many other models that have been published recently [3, 5, 6, 8] build and improve on the early work by Graves et al. [2]. Unfortunately, these models are usually complex to train and need to rely on methods such as curriculum learning or gradient noise to reach good solutions, as shown by Neelakantan et al. [10]. Moreover, their interpretability is limited: the learnt model is too complex for the underlying algorithm to be recovered and transformed into a regular computer program.

* The first two authors contributed equally.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

The main focus of the machine-learning community has thus far been on learning programs from scratch, with little emphasis on running time. However, for nearly all computational problems, it is feasible to design generic algorithms for the worst case. We argue that a more pragmatic goal for the machine-learning community is to design methods for adapting existing programs to specific input data distributions. To this end, we propose the Adaptive Neural Compiler (ANC). We design a compiler capable of mechanically converting algorithms to a differentiable representation, thereby providing an adequate initialisation for the difficult problem of optimal program learning.
We then present a method to improve this compiled program using data-driven optimisation, alleviating the need to perform a wide search over the set of all possible programs. We show experimentally that this framework is capable of adapting simple generic algorithms to perform better on given datasets.

2 Related Work

The idea of compiling programs to neural networks has previously been explored in the literature. Siegelmann [15] described how to build a neural network that would perform the same operations as a given program. A compiler targeting an extended version of Pascal was designed by Gruau et al. [4]. A complete implementation was achieved when Neto et al. [11] wrote a compiler for NETDEF, a language based on the Occam programming language. While these methods allow us to obtain an exact representation of a program as a neural network, they do not lend themselves to optimisation to improve the original program. Indeed, in their formulation, each elementary step of a program is expressed as a group of neurons with a precise topology, set of weights and biases, thereby rendering learning via gradient descent infeasible: performing gradient descent in this parameter space would result in invalid operations and thus is unlikely to lead to any improvement. The recent work by Reed and de Freitas [12] on Neural Programmer-Interpreters (NPI) can also be seen as a way to compile any program into a neural network. It does so by learning a model that mimics the program. While more flexible than previous approaches, the NPI only learns to reproduce an existing program; therefore it cannot be used to find a new and possibly better program.

Another approach to this learning problem is the one taken by the code optimisation community. By exploring the space of all possible programs, either exhaustively [9] or in a stochastic manner [13], they search for programs having the same results but being more efficient.
The work of Sharma et al. [14] broadens the space of acceptable improvements to data-specific optimisations, as opposed to the provably equivalent transformations that were previously the only ones considered. However, this method is still reliant on non-gradient-based methods for efficient exploration of the space. By representing everything in a differentiable manner, we aim to obtain gradients to guide the exploration.

Recently, Graves et al. [2] introduced a learnable representation of programs, called the Neural Turing Machine (NTM). The NTM uses an LSTM as a Controller, which outputs commands to be executed by a deterministic differentiable Machine. From examples of input/output sequences, they manage to learn a Controller such that the model becomes capable of performing simple algorithmic tasks. Extensions of this model have been proposed in [3, 5], where the memory tape was replaced by differentiable versions of stacks or lists. Kurach et al. [8] modified the NTM to introduce a notion of pointers, making it more amenable to representing traditional programs. Parallel works have been using reinforcement learning techniques such as the REINFORCE algorithm [1, 16, 17] or Q-learning [18] to be able to work with non-differentiable versions of the above-mentioned models. All these models are trained only with a loss based on the difference between the output of the model and the expected output. This weak supervision makes learning more difficult: for instance, the Neural RAM [8] requires a high number of random restarts before converging to a correct solution [10], even when using the best hyperparameters obtained through a large grid search.

In our work, we will first show that we can design a new neural compiler whose target is a Controller-Machine model.
This makes the compiled model amenable to learning from examples. Moreover, we can use it as initialisation for the learning procedure, allowing us to aim for the more complex task of finding an efficient algorithm.

Figure 1: Model components. (a) General view of the whole model: starting from IR^0, R^0 and M^0, the Controller and the Machine alternate, producing IR^t, R^t, M^t and a stop flag at each step, until the final memory tape M^T is output. (b) Machine instructions:

    Inst    arg1   arg2   output      side effect
    STOP     -      -     0           stop = 1
    ZERO     -      -     0           -
    INC      a      -     a+1         -
    DEC      a      -     a-1         -
    ADD      a      b     a+b         -
    SUB      a      b     a-b         -
    MIN      a      b     min(a,b)    -
    MAX      a      b     max(a,b)    -
    READ     a      -     m^t_a       memory access
    WRITE    a      b     0           m^t_a = b
    JEZ      a      b     0           IR^t = b if a = 0

3 Model

Our model is composed of two parts: (i) a Controller, in charge of specifying what should be executed; and (ii) a Machine, following the commands of the Controller. We start by describing the global architecture of the model. For the sake of simplicity, the general description will present a non-differentiable version of the model. Section 3.2 will then explain the modifications required to make this model completely differentiable. A more detailed description of the model is provided in the supplementary material.

3.1 General Model

We first define, for each timestep t, the memory tape that contains M integer values M^t = {m^t_1, m^t_2, ..., m^t_M}, the registers that contain R values R^t = {r^t_1, r^t_2, ..., r^t_R}, and the instruction register that contains a single value IR^t. We also define a set of instructions that can be executed, whose main role is to perform computations using the registers, for example adding the values contained in two registers. We also define as a side effect any action that involves elements other than the input and output values of the instruction.
Interaction with the memory is an example of such a side effect. All the instructions, their computations and side effects are detailed in Figure 1b.

As can be seen in Figure 1a, the execution model takes as input an initial memory tape M^0 and outputs a final memory tape M^T after T steps. At each step t, the Controller uses the instruction register IR^t to compute the command for the Machine. The command is a 4-tuple (e, a, b, o). The first element e is the instruction that should be executed by the Machine, enumerated as an integer. The elements a and b specify which registers should be used as arguments for the given instruction. The last element o specifies in which register the output of the instruction should be written. For example, the command {ADD, 2, 3, 1} means that only the value of the first register should change, following r^{t+1}_1 = ADD(r^t_2, r^t_3). The Machine then executes this command, updating the values of the memory, the registers and the instruction register. The Machine always performs two other operations apart from the required instruction: it outputs a stop flag that allows the model to decide when to stop the execution, and it increments the instruction register IR^t by one at each iteration.

3.2 Differentiability

The model presented above is a simple execution machine, but it is not differentiable. In order to be able to train this model end-to-end from a loss defined over the final memory tape, we need to make every intermediate operation differentiable. To achieve this, we replace every discrete value in our model by a multinomial distribution over all the possible values that could have been taken. Moreover, each hard choice that would have been non-differentiable is replaced by a continuous soft choice.
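For reference, one step of the discrete Machine of Section 3.1 can be sketched as follows. This is an illustrative sketch only, not the paper's implementation: the instruction encoding, the zero-indexed register/memory layout and the modular wrap-around of values are all assumptions.

```python
# Hypothetical sketch of one discrete (non-differentiable) Machine step,
# executing a command 4-tuple (e, a, b, o) as described above.
def step(cmd, regs, mem, ir):
    """Execute one command; return updated registers, memory, IR and stop flag."""
    e, a, b, o = cmd          # instruction name, argument registers, output register
    x, y = regs[a], regs[b]   # argument *values* read from the registers
    stop = False
    M = len(mem)              # values are assumed to live in {0, ..., M-1}
    if e == "STOP":
        out, stop = 0, True
    elif e == "ZERO":
        out = 0
    elif e == "INC":
        out = (x + 1) % M
    elif e == "DEC":
        out = (x - 1) % M
    elif e == "ADD":
        out = (x + y) % M
    elif e == "SUB":
        out = (x - y) % M
    elif e == "MIN":
        out = min(x, y)
    elif e == "MAX":
        out = max(x, y)
    elif e == "READ":
        out = mem[x]          # side effect: memory access at address x
    elif e == "WRITE":
        mem[x] = y            # side effect: m_a = b
        out = 0
    elif e == "JEZ":
        out = 0
        if x == 0:
            ir = y - 1        # -1 compensates for the unconditional increment below
    regs[o] = out
    return regs, mem, ir + 1, stop
```

For instance, `step(("ADD", 0, 1, 2), [2, 3, 0], list(range(10)), 0)` writes 5 into the third register and advances the instruction register to 1.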
We will henceforth use bold letters to indicate the probabilistic version of a value. First, the memory tape M^t is replaced by an M × M matrix **M**^t, where **M**^t_{i,j} corresponds to the probability of m^t_i taking the value j. The same change is applied to the registers R^t, replacing them with an R × M matrix **R**^t, where **R**^t_{i,j} represents the probability of r^t_i taking the value j. Finally, the instruction register is transformed in the same way as the other registers, from a single value IR^t to a vector of size M, noted **IR**^t, where the i-th element represents the probability of it taking the value i.

The Machine does not contain any learnable parameter and will just execute a given command. To make it differentiable, the Machine now takes as input four probability distributions e^t, a^t, b^t and o^t, where e^t is a distribution over instructions, and a^t, b^t and o^t are distributions over registers. We compute the argument values arg1^t and arg2^t as convex combinations of delta-function probability distributions of the different register values:

    arg1^t = Σ_{i=1}^{R} a^t_i r^t_i,        arg2^t = Σ_{i=1}^{R} b^t_i r^t_i,        (1)

where a^t_i and b^t_i are the i-th values of the vectors a^t and b^t. Using these values, we can compute the output value of each instruction k using the following formula:

    out^t_{k,c} = Σ_{0 ≤ i,j ≤ M} arg1^t_i · arg2^t_j · 1[g_k(i, j) = c mod M],   ∀ 0 ≤ c ≤ M,        (2)

where g_k is the function associated with the k-th instruction as presented in Figure 1b, out^t_{k,c} is the probability for instruction k to output the value c at time-step t, and arg1^t_i is the probability of the first argument having the value i at time-step t.
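Equations (1) and (2), together with the soft register write described next, can be sketched as follows. Array shapes and function names are illustrative assumptions, not the paper's code.

```python
import numpy as np

def soft_args(a, b, R):
    # Eq. (1): convex combinations of the register value distributions.
    # a, b: length-R distributions over registers; R: an R x M matrix whose
    # rows are per-register value distributions. Each result has length M.
    return a @ R, b @ R

def soft_out(g, arg1, arg2):
    # Eq. (2): distribution over the output value of an instruction with
    # value semantics g(i, j), e.g. g = lambda i, j: i + j for ADD.
    M = arg1.shape[0]
    out = np.zeros(M)
    for i in range(M):
        for j in range(M):
            out[g(i, j) % M] += arg1[i] * arg2[j]
    return out

def soft_write(R, o, out):
    # Soft write: r_i <- r_i * (1 - o_i) + out * o_i for every register i,
    # where o is the distribution over the output register.
    return R * (1 - o)[:, None] + o[:, None] * out[None, :]
```

With Dirac-delta distributions these operations reduce exactly to the discrete Machine; with smoothed distributions they propagate gradients.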
Since the executed instruction is controlled by the probability e^t, the output over all instructions is also a convex combination: out^t = Σ_{k=1}^{N} e^t_k out^t_k, where N is the number of instructions. This value is then stored into the registers by performing a soft write parametrised by o^t: the value stored in the i-th register at time t + 1 is r^{t+1}_i = r^t_i (1 − o^t_i) + out^t o^t_i, allowing the choice of the output register to be differentiable.

A special case is associated with the stop signal. When executing the model, we keep track of the probability that the program should have terminated before this iteration, based on the probability associated at each iteration with the specific instruction that controls this flag. Once this probability goes over a threshold η_stop ∈ (0, 1], the execution is halted. We applied the same techniques to make the side effects differentiable; this is presented in the supplementary material.

The Controller is the only learnable part of our model. The first learnable parts are the initial values for the registers R^0 and for the instruction register IR^0. The second learnable part is the set of parameters of the Controller, which computes the required distributions using:

    e^t = W_e · IR^t,    a^t = W_a · IR^t,    b^t = W_b · IR^t,    o^t = W_o · IR^t,        (3)

where W_e is an N × M matrix and W_a, W_b and W_o are R × M matrices. A representation of these matrices can be found in Figure 2c. The Controller as defined above is composed of four independent, fully-connected layers. In Section 4.3 we will see that this complexity is sufficient for our model to be able to represent any program. Henceforth, we will denote by θ = {R^0, IR^0, W_e, W_a, W_b, W_o} the set of learnable parameters.

4 Adaptive Neural Compiler

We will now present the Adaptive Neural Compiler.
Its goal is to find the best set of weights θ* for a given dataset, such that our model performs the correct input/output mapping as efficiently as it can. We begin by describing our learning objective in detail. The two subsequent sections will focus on making the optimisation of our learning objective computationally feasible.

4.1 Objective function

Our goal is to solve a given algorithmic problem efficiently. The algorithmic problem is defined as a set of input/output pairs. We also have access to a generic program that is able to perform the required mapping. In our example of accessing elements in a linked list, the transformation would consist of writing down the desired value at the specified position on the tape. The program given to us would iteratively go through the elements of the linked list, find the desired value and write it down at the desired position. If there exists some bias that would allow this traversal to be faster, we expect the program to exploit it.

Our approach to this problem is to construct a differentiable objective function that maps controller parameters to a loss. We define this loss based on the states of the memory tape and the outputs of the Controller at each step of the execution. The precise mathematical formulation for each term of the loss is given in the supplementary material. Here we present the motivation behind each of them.

Correctness: We first want the final memory tape to match the expected output for a given input.

Halting: To prevent programs from taking an infinite amount of time without stopping, we define a maximum number of iterations T_max after which the execution is halted.
Moreover, we add a penalty if the Controller did not halt before this limit.

Efficiency: We penalise each iteration taken by the program in which it does not stop.

Confidence: We finally make sure that if the Controller wants to stop, the current output is correct.

If only the correctness term were considered, nothing would encourage the learnt algorithm to halt as soon as it has finished. If only correctness and halting were considered, then the program may not halt as early as possible. Confidence enables the algorithm to better evaluate when to stop.

The loss is a weighted sum of the four above-mentioned terms. We denote the loss of the i-th training sample, given parameters θ, as L_i(θ). Our learning objective is then specified as:

    min_θ Σ_i L_i(θ)    s.t. θ ∈ Θ,        (4)

where Θ is the set over the parameters such that the outputs of the Controller, the initial values of each register and of the instruction register are all probability distributions.

The above optimisation is a highly non-convex problem. In the rest of this section, we first present a small modification to the model that removes the constraints, so that standard gradient descent-based methods can be used. Moreover, a good initialisation is helpful for solving such non-convex problems; we therefore introduce a Neural Compiler that provides one.

4.2 Reformulation

In order to use gradient descent methods without having to project the parameters onto Θ, we alter the formulation of the Controller. We use softmax layers to learn unnormalised scores that are then mapped to probability distributions. We add one after each linear layer of the Controller and for the initial values of the registers.
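A minimal sketch of this reparametrisation, assuming the linear Controller of Eq. (3) with illustrative names:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax mapping unconstrained scores to a distribution.
    e = np.exp(z - z.max())
    return e / e.sum()

def controller(ir, W_e, W_a, W_b, W_o):
    # The weight matrices now hold arbitrary real scores; a softmax after each
    # linear layer yields the probability distributions e, a, b and o.
    return (softmax(W_e @ ir), softmax(W_a @ ir),
            softmax(W_b @ ir), softmax(W_o @ ir))
```

Whatever values the weights take, every output is a valid probability distribution, so no projection onto Θ is needed during training.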
This way, we transform the constrained optimisation problem into an unconstrained one, allowing us to use standard gradient descent methods. As discussed in other works [10], this kind of model is hard to train and requires a high number of random restarts before converging to a good solution. We will now present a Neural Compiler that provides good initialisations to help with this problem.

4.3 Neural Compiler

The goal of the Neural Compiler is to convert an algorithm, written as an unambiguous program, into a set of parameters. These parameters, when put into the Controller, will reproduce the exact steps of the algorithm. This is very similar to the problem framed by Reed and de Freitas [12], but we show here a way to accomplish it without any learning.

The different steps of the compilation are illustrated in Figure 2. The first step is to go from the written version of the program to the equivalent list of low-level instructions. This step can be seen as going from Figure 2a to Figure 2b. The illustrative example uses a fairly low-level language, but traditional features of programming languages such as loops or if-statements can be supported using the JEZ instruction. The use of constants as arguments or as values is handled by introducing new registers that hold these values. The value required to be passed as the target position to the JEZ instruction can be resolved at compile time.

Having obtained this intermediate representation, generating the parameters is straightforward. As can be seen in Figure 2b, each line contains one instruction, the two input registers and the output register, and corresponds to a command that the Controller will have to output. If we ensure that IR is a Dirac-delta distribution on a given value, then the matrix-vector product is equivalent to selecting a row of the weight matrix. As IR is incremented at each iteration, the Controller outputs the rows of the matrix in order.
We thus have a one-to-one mapping between the lines of the intermediate representation and the rows of the weight matrix. An example of these matrices can be found in Figure 2c. The weight matrix has 10 rows, corresponding to the number of lines of code of our intermediate representation. For example, on the first line of the matrix corresponding to the output (2c-iv), we see that the fifth element has value 1. This is linked to the first line of code, where the output of the READ operation is stored into the fifth register.

Figure 2: Example of the compilation process.

(2a) Input program, written to perform the ListK task. Given a pointer to the head of a linked list, an integer k, a target cell and a linked list, write in the target cell the k-th element of the list:

    var head = 0;
    var nb_jump = 1;
    var out_write = 2;

    nb_jump = READ(nb_jump);
    out_write = READ(out_write);
    loop: head = READ(head);
          nb_jump = DEC(nb_jump);
          JEZ(nb_jump, end);
          JEZ(0, loop);
    end:  head = INC(head);
          head = READ(head);
          WRITE(out_write, head);
          STOP();

(2b) Intermediate representation of the program, i.e. the instructions that a Random Access Machine would need to perform to execute it:

    Initial registers:
    R1 = 6; R2 = 2; R3 = 0; R4 = 2; R5 = 1; R6 = 0; R7 = 0;
    Program:
    0 : R5 = READ (R5, R7)
    1 : R4 = READ (R4, R7)
    2 : R6 = READ (R6, R7)
    3 : R5 = DEC  (R5, R7)
    4 : R7 = JEZ  (R5, R1)
    5 : R3 = JEZ  (R3, R2)
    6 : R6 = INC  (R6, R7)
    7 : R6 = READ (R6, R7)
    8 : R7 = WRITE(R4, R6)
    9 : R7 = STOP (R7, R7)

(2c) Representation of the weights that encode the intermediate representation, as four blocks: (i) Instr., (ii) Arg1, (iii) Arg2, (iv) Out. Each row of the matrices corresponds to one state/line. The initial values of the registers are also parameters of the model, omitted here.

With this representation, we can note that the number of parameters is linear in the number of lines of code in the original program, and that the largest representable number in our Machine needs to be greater than the number of lines in our program.

Moreover, any program written in a regular assembly language can be rewritten to use only our restricted set of instructions. This can be done, firstly, because all the conditionals of the assembly language can be expressed as a combination of arithmetic and JEZ instructions; secondly, because all the arithmetic operations can be represented as a combination of our simple arithmetic operations, loops and if statements. This means that any program that can run on a regular computer can first be rewritten to use our restricted set of instructions and then compiled down to a set of weights for our model. Even though other models use an LSTM as controller, we showed here that a Controller composed of simple linear functions is expressive enough. The advantage of this simpler model is that we can easily interpret its weights in a way that is not possible with a recurrent network as a controller.

The most straightforward way to leverage the results of the compilation is to initialise the Controller with the weights obtained by compiling the generic algorithm. To account for the extra softmax layer, we need to multiply the weights produced by the compiler by a large constant so that the Controller outputs Dirac-delta distributions. Some results associated with this technique can be found in Section 5.1. However, if we initialise with exactly this sharp set of parameters, the training procedure is not able to move away from the initialisation, as the gradients associated with the softmax in this region are very small.
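The compilation step itself can be sketched as follows: each line of the intermediate representation becomes a one-hot slice of the weight matrices (one per IR state, selected by the one-hot IR vector), scaled by a sharpness constant to account for the softmax. The instruction ordering, the zero-based register indices and the `sharpness` name are assumptions for illustration.

```python
import numpy as np

INSTRUCTIONS = ["STOP", "ZERO", "INC", "DEC", "ADD", "SUB",
                "MIN", "MAX", "READ", "WRITE", "JEZ"]

def compile_lines(lines, n_regs, M, sharpness=1.0):
    """lines: list of (instr_name, arg1_reg, arg2_reg, out_reg) tuples,
    one per line of the intermediate representation."""
    n = len(lines)
    assert n <= M, "IR must be able to index every line of the program"
    W_e = np.zeros((len(INSTRUCTIONS), M))
    W_a = np.zeros((n_regs, M))
    W_b = np.zeros((n_regs, M))
    W_o = np.zeros((n_regs, M))
    for state, (instr, a, b, o) in enumerate(lines):
        # A one-hot IR at position `state` selects this slice of each matrix.
        W_e[INSTRUCTIONS.index(instr), state] = sharpness
        W_a[a, state] = sharpness
        W_b[b, state] = sharpness
        W_o[o, state] = sharpness
    return W_e, W_a, W_b, W_o
```

The parameter count is visibly linear in the number of lines, and a large `sharpness` followed by a softmax yields near-Dirac command distributions.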
Instead, we initialise the controller with a non-ideal version of the generic algorithm. This means that the choice with the highest probability in the output of the Controller is correct, but the probability of the other choices is not zero. As can be seen in Section 5.2, this allows the Controller to learn by gradient descent a new algorithm, different from the original one, that has a lower loss than the ideal version of the compiled program.

5 Experiments

We performed two sets of experiments. The first shows the capability of the Neural Compiler to perfectly reproduce any given program. The second shows that our Neural Compiler can adapt and improve the performance of programs. We present results of data-specific optimisation being carried out and show decreases in runtime for all the algorithms; additionally, for some algorithms, we show that the runtime falls in a different computational-complexity class altogether. All the code required to reproduce these experiments is available online.¹

¹ https://github.com/albanD/adaptive-neural-compilation

5.1 Compilation

The compiler described in Section 4.3 allows us to go from a program written using our instruction set to a set of weights θ for our Controller.
To illustrate this point, we implemented simple programs that can solve the tasks introduced by Kurach et al. [8] and a shortest-path problem. One of these implementations can be found in Figure 2a, while the others are available in the supplementary materials. These programs are written in a specific language, and are transformed by the Neural Compiler into parameters for the model. As expected, the resulting models solve the original tasks exactly and can generalise to any input sequence.

5.2 ANC experiments

In addition to being able to reproduce any given program, as was done by Reed and de Freitas [12], we have the possibility of optimising the resulting program further. We exhibit this by compiling programs down to our model and optimising their performance. The efficiency gains for these tasks come either from finding simpler, equivalent algorithms or from exploiting some bias in the data to either remove instructions or change the underlying algorithm.

We identify three different levels of interpretability for our model. In the first type, the weights contain only Dirac-delta distributions: there is an exact one-to-one mapping between lines in the weight matrices and lines of assembly code. In the second type, where all probabilities are Dirac-delta except the ones associated with the execution of the JEZ instruction, we can recover an exact algorithm that uses if statements to enumerate the different cases arising from this conditional jump. In the third type, where any operation other than JEZ is executed in a soft way or uses a soft argument, it is not possible to recover a program that will be as efficient as the learned one.

We briefly present here the considered tasks and biases, and refer the reader to the supplementary materials for a detailed encoding of the input/output tapes.

1. Access: Given a value k and an array A, return A[k]. In the biased version, the value of k is always the same, so the address of the required element can be stored in a constant. This is similar to the optimisation known as constant folding.

2. Swap: Given an array A and two pointers p and q, swap the elements A[p] and A[q]. In the biased version, p and q are always the same, so reading them can be avoided.

3. Increment: Given an array, increment all its elements by 1. In the biased version, the array is of fixed size and the elements of the array have the same value, so you do not need to read all of them when going through the array.

4. ListK: Given a pointer to the head of a linked list, a number k and a linked list, find the value of the k-th element.
In the biased version, the linked list is organised in order in memory, as an array would be, so the address of the k-th value can be computed in constant time. This is the example developed in Figure 2.

5. Addition: Two values are written on the tape and should be summed. No data bias is introduced, but the starting algorithm is inefficient: it performs the addition as a series of increment operations. The more efficient approach is to add the two numbers directly.

6. Sort: Given an array A, sort it. In the biased version, only the start of the array might be unsorted. Once the start has been arranged, the end of the array can be safely ignored.

For each of these tasks, we perform a grid search on the loss parameters and on our hyper-parameters. Training is performed using Adam [7], and success rates are obtained by running the optimisation with 100 different random seeds. We consider that a program has been successfully optimised when two conditions are fulfilled. First, it needs to output the correct solution for all test cases presenting the same bias. Second, the average number of iterations taken to solve a problem must have decreased. Note that if we cared only about the first criterion, the methods presented in Section 5.1 would already provide a success rate of 100%, without requiring any training.

The results are presented in Table 1. For each of these tasks, we manage to find faster algorithms. In the simple cases of Access and Swap, the optimal algorithms for the considered bias are obtained. They are found by incorporating heuristics into the algorithm and storing constants in the initial values of the registers.
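When a learned controller is of the first interpretability type discussed above (near-Dirac weights), the program can be read back by taking the most likely choice per IR state, reversing the compilation mapping. A hypothetical sketch of this decoding, with an assumed instruction ordering and register naming:

```python
import numpy as np

INSTRUCTIONS = ["STOP", "ZERO", "INC", "DEC", "ADD", "SUB",
                "MIN", "MAX", "READ", "WRITE", "JEZ"]

def decode(W_e, W_a, W_b, W_o, n_lines):
    """Recover one assembly-style line per IR state via per-column argmax.
    Only faithful when the softmaxed weights are (close to) Dirac deltas."""
    lines = []
    for t in range(n_lines):
        instr = INSTRUCTIONS[int(W_e[:, t].argmax())]
        a, b, o = (int(W[:, t].argmax()) for W in (W_a, W_b, W_o))
        lines.append(f"R{o} = {instr}(R{a}, R{b})")
    return lines
```

For type-two weights the same idea applies everywhere except the JEZ columns, which must be expanded into explicit if statements; for type-three weights no equally efficient discrete program can be recovered.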
The learned programs for these tasks are always in the first case of interpretability, which means that we can recover the most efficient algorithm from the learned weights.

Table 1: Average number of iterations required to solve instances of the problems for the original program, the best learned program and the ideal algorithm for the biased dataset. We also include the success rate of reaching a more efficient algorithm across multiple random restarts.

                 Access   Increment   Swap   ListK   Addition   Sort
Generic             6         40       10      18       20       38
Learned             4         16        6      11        9       18
Ideal               4         34        6      10        6       9.5
Success Rate       37%        84%      27%     19%      12%      74%

While ListK and Addition have lower success rates, the improvements between the original and learned algorithms are still significant. Both were initialised with iterative algorithms of O(n) complexity. The learned programs solve the given problems in constant O(1) time, making the runtime independent of the input. Achieving this means that the equivalence between the two approaches has been identified, similar to how optimising compilers operate. Moreover, on the ListK task, some learned programs correspond to the second type of interpretability: these programs use soft jumps to condition the execution on the value of k. Even though such programs would not generalise to other values of k, some learned programs for this task achieve type-one interpretability, and a study of the learned algorithms reveals that they generalise to any value of k.
Finally, the Increment task achieves an unexpected result: it outperforms our best possible algorithm. Looking at the learned program, we can see that it leverages the possibility of performing soft writes over multiple elements of the memory at the same time to reduce its runtime.
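As a rough illustration of how a single soft write can touch several memory cells at once, consider the following toy Python encoding; representing memory as one value distribution per cell and a write as an address distribution is our own simplification, not the paper's exact model:

```python
def soft_write(memory, addr_dist, value_dist):
    # memory:     one value distribution per cell (list of lists)
    # addr_dist:  probability that the write targets each cell
    # value_dist: distribution over the value being written
    # Every cell with non-zero address probability is blended towards
    # the written value, so one instruction updates several cells.
    return [
        [(1 - p) * old + p * new for old, new in zip(cell, value_dist)]
        for cell, p in zip(memory, addr_dist)
    ]
```

Writing with an address distribution spread over several cells updates all of them in one step, at the price of leaving each cell with a confidence below 1 on its new value.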
The Increment task is the only case where we see a learned program associated with the third type of interpretability. While our ideal algorithm would assign a confidence of 1 to the output, this algorithm is unable to do so; however, its confidence of 0.9 is high enough for it to be considered a correct algorithm.
In practice, for all but the simplest tasks, we observe that further optimisation is possible, as some useless instructions remain present. Some transformations of the controller are indeed difficult to achieve through the local changes operated by the gradient-descent algorithm. An analysis of these failure modes of our algorithm can be found in the supplementary materials. This motivates us to envision the use of approaches other than gradient descent to address these issues.

6 Discussion

The work presented here is a first step towards adaptive learning of programs. It opens up several interesting directions for future research. For example, the definition of efficiency that we considered in this paper is flexible. We chose to look only at the average number of operations executed to generate the output from the input. We leave the study of other potential measures, such as Kolmogorov complexity and source lines of code, for future work.
As shown in the experiments section, our current method is very good at finding efficient solutions for simple programs. For more complex programs, only a solution close to the initialisation can be found. Even though training heuristics could help with the tasks considered here, they would likely not scale up to real applications. Indeed, the main problem we identified is that gradient-descent-based optimisation is unable to explore the space of programs effectively, as it performs only local transformations. In future work, we want to explore different optimisation methods. One approach would be to mix global and local exploration to improve the quality of the solutions.
A more ambitious plan would be to leverage the structure of the problem and use techniques from combinatorial optimisation to try to solve the original discrete problem.

Acknowledgments
We would like to thank Siddharth Narayanaswamy and Diane Bouchacourt for helpful discussions and proof-reading the paper. This work was supported by the EPSRC, Leverhulme Trust, Clarendon Fund and the ERC grant ERC-2012-AdG 321162-HELIOS, EPSRC/MURI grant ref EP/N019474/1, EPSRC grant EP/M013774/1, EPSRC Programme Grant Seebibyte EP/M013774/1 and the Microsoft Research PhD Scholarship Program.

References

[1] Marcin Andrychowicz and Karol Kurach. Learning efficient algorithms with hierarchical attentive memory. CoRR, 2016.

[2] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. CoRR, 2014.

[3] Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. Learning to transduce with unbounded memory. In NIPS, 2015.

[4] Frédéric Gruau, Jean-Yves Ratajszczak, and Gilles Wiber. A neural compiler. Theoretical Computer Science, 1995.

[5] Armand Joulin and Tomas Mikolov. Inferring algorithmic patterns with stack-augmented recurrent nets. In NIPS, 2015.

[6] Łukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. In ICLR, 2016.

[7] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

[8] Karol Kurach, Marcin Andrychowicz, and Ilya Sutskever. Neural random-access machines. In ICLR, 2016.

[9] Henry Massalin. Superoptimizer: a look at the smallest program. In ACM SIGPLAN Notices, volume 22, pages 122–126. IEEE Computer Society Press, 1987.

[10] Arvind Neelakantan, Luke Vilnis, Quoc V Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. Adding gradient noise improves learning for very deep networks. In ICLR, 2016.

[11] João Pedro Neto, Hava Siegelmann, and Félix Costa.
Symbolic processing in neural networks. Journal of the Brazilian Computer Society, 2003.

[12] Scott Reed and Nando de Freitas. Neural programmer-interpreters. In ICLR, 2016.

[13] Eric Schkufza, Rahul Sharma, and Alex Aiken. Stochastic superoptimization. In ACM SIGARCH Computer Architecture News, 2013.

[14] Rahul Sharma, Eric Schkufza, Berkeley Churchill, and Alex Aiken. Conditionally correct superoptimization. In OOPSLA, 2015.

[15] Hava Siegelmann. Neural programming language. In AAAI, 1994.

[16] Ronald Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.

[17] Wojciech Zaremba and Ilya Sutskever. Reinforcement learning neural Turing machines. arXiv preprint arXiv:1505.00521, 2015.

[18] Wojciech Zaremba, Tomas Mikolov, Armand Joulin, and Rob Fergus. Learning simple algorithms from examples. CoRR, 2015.