{"title": "Learning to learn by gradient descent by gradient descent", "book": "Advances in Neural Information Processing Systems", "page_first": 3981, "page_last": 3989, "abstract": "The move from hand-designed features to learned features in machine learning has been wildly successful. In spite of this, optimization algorithms are still designed by hand. In this paper we show how the design of an optimization algorithm can be cast as a learning problem, allowing the algorithm to learn to exploit structure in the problems of interest in an automatic way. Our learned algorithms, implemented by LSTMs, outperform generic, hand-designed competitors on the tasks for which they are trained, and also generalize well to new tasks with similar structure. We demonstrate this on a number of tasks, including simple convex problems, training neural networks, and styling images with neural art.", "full_text": "Learning to learn by gradient descent\nby gradient descent\n\nMarcin Andrychowicz1, Misha Denil1, Sergio G\u00f3mez Colmenarejo1, Matthew W. Hoffman1,\n\nDavid Pfau1, Tom Schaul1, Brendan Shillingford1,2, Nando de Freitas1,2,3\n\n1Google DeepMind\n\n2University of Oxford\n\n3Canadian Institute for Advanced Research\n\nmarcin.andrychowicz@gmail.com\n\n{mdenil,sergomez,mwhoffman,pfau,schaul}@google.com\n\nbrendan.shillingford@cs.ox.ac.uk, nandodefreitas@google.com\n\nAbstract\n\nThe move from hand-designed features to learned features in machine learning has\nbeen wildly successful. In spite of this, optimization algorithms are still designed\nby hand. In this paper we show how the design of an optimization algorithm can be\ncast as a learning problem, allowing the algorithm to learn to exploit structure in\nthe problems of interest in an automatic way. Our learned algorithms, implemented\nby LSTMs, outperform generic, hand-designed competitors on the tasks for which\nthey are trained, and also generalize well to new tasks with similar structure. 
We\ndemonstrate this on a number of tasks, including simple convex problems, training\nneural networks, and styling images with neural art.\n\n1 Introduction\n\nFrequently, tasks in machine learning can be expressed as the problem of optimizing an objective\nfunction $f(\\theta)$ defined over some domain $\\theta \\in \\Theta$. The goal in this case is to find the minimizer\n$\\theta^* = \\arg\\min_{\\theta \\in \\Theta} f(\\theta)$. While any method capable of minimizing this objective function can be\napplied, the standard approach for differentiable functions is some form of gradient descent, resulting\nin a sequence of updates\n\n$$\\theta_{t+1} = \\theta_t - \\alpha_t \\nabla f(\\theta_t) .$$\n\nThe performance of vanilla gradient descent, however, is hampered by the fact that it only makes use\nof gradients and ignores second-order information. Classical optimization techniques correct this\nbehavior by rescaling the gradient step using curvature information, typically via the Hessian matrix\nof second-order partial derivatives\u2014although other choices such as the generalized Gauss-Newton\nmatrix or Fisher information matrix are possible.\nMuch of the modern work in optimization is based around designing update rules tailored to specific\nclasses of problems, with the types of problems of interest differing between different research\ncommunities. For example, in the deep learning community we have seen a proliferation of optimization\nmethods specialized for high-dimensional, non-convex optimization problems. These include\nmomentum [Nesterov, 1983, Tseng, 1998], Rprop [Riedmiller and Braun, 1993], Adagrad [Duchi\net al., 2011], RMSprop [Tieleman and Hinton, 2012], and ADAM [Kingma and Ba, 2015]. More\nfocused methods can also be applied when more structure of the optimization problem is known\n[Martens and Grosse, 2015]. In contrast, communities who focus on sparsity tend to favor very\ndifferent approaches [Donoho, 2006, Bach et al., 2012]. 
This is even more the case for combinatorial\noptimization for which relaxations are often the norm [Nemhauser and Wolsey, 1988].\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nThis industry of optimizer design allows different communities to create optimization methods\nwhich exploit structure in their problems of interest at the expense of potentially poor\nperformance on problems outside of that scope. Moreover the No Free Lunch Theorems for Optimization\n[Wolpert and Macready, 1997] show that in the setting of combinatorial optimization,\nno algorithm is able to do better than a random strategy in expectation. This suggests that\nspecialization to a subclass of problems is in fact the only way that improved performance can be\nachieved in general.\nIn this work we take a different tack and instead propose to replace hand-designed update rules\nwith a learned update rule, which we call the optimizer $g$, specified by its own set of parameters\n$\\phi$. This results in updates to the optimizee $f$ of the form\n\n$$\\theta_{t+1} = \\theta_t + g_t(\\nabla f(\\theta_t), \\phi) . \\qquad (1)$$\n\nA high level view of this process is shown in Figure 1. In what follows we will explicitly model\nthe update rule $g$ using a recurrent neural network (RNN) which maintains its own state and hence\ndynamically updates as a function of its iterates.\n\nFigure 1: The optimizer (left) is provided with\nperformance of the optimizee (right) and proposes\nupdates to increase the optimizee\u2019s performance.\n[Figure labels: parameter updates, optimizer, optimizee, error signal]\n[photos: Bobolas, 2009, Maley, 2011]\n\n1.1 Transfer learning and generalization\nThe goal of this work is to develop a procedure for constructing a learning algorithm which performs\nwell on a particular class of optimization problems. 
Casting algorithm design as a learning problem\nallows us to specify the class of problems we are interested in through example problem instances.\nThis is in contrast to the ordinary approach of characterizing properties of interesting problems\nanalytically and using these analytical insights to design learning algorithms by hand.\nIt is informative to consider the meaning of generalization in this framework. In ordinary statistical\nlearning we have a particular function of interest, whose behavior is constrained through a data set of\nexample function evaluations. In choosing a model we specify a set of inductive biases about how\nwe think the function of interest should behave at points we have not observed, and generalization\ncorresponds to the capacity to make predictions about the behavior of the target function at novel\npoints. In our setting the examples are themselves problem instances, which means generalization\ncorresponds to the ability to transfer knowledge between different problems. This reuse of problem\nstructure is commonly known as transfer learning, and is often treated as a subject in its own right.\nHowever, by taking a meta-learning perspective, we can cast the problem of transfer learning as one\nof generalization, which is much better studied in the machine learning community.\nOne of the great success stories of deep-learning is that we can rely on the ability of deep networks to\ngeneralize to new examples by learning interesting sub-structures. In this work we aim to leverage\nthis generalization power, but also to lift it from simple supervised learning to the more general\nsetting of optimization.\n\n1.2 A brief history and related work\nThe idea of using learning to learn or meta-learning to acquire knowledge or inductive biases has a\nlong history [Thrun and Pratt, 1998]. More recently, Lake et al. [2016] have argued forcefully for\nits importance as a building block in arti\ufb01cial intelligence. Similarly, Santoro et al. 
[2016] frame\nmulti-task learning as generalization, however unlike our approach they directly train a base learner\nrather than a training algorithm. In general these ideas involve learning which occurs at two different\ntime scales: rapid learning within tasks and more gradual, meta learning across many different tasks.\nPerhaps the most general approach to meta-learning is that of Schmidhuber [1992, 1993]\u2014building\non work from [Schmidhuber, 1987]\u2014which considers networks that are able to modify their own\nweights. Such a system is differentiable end-to-end, allowing both the network and the learning\nalgorithm to be trained jointly by gradient descent with few restrictions. However this generality\ncomes at the expense of making the learning rules very difficult to train. Alternatively, the work\nof Schmidhuber et al. [1997] uses the Success Story Algorithm to modify its search strategy rather\nthan gradient descent; a similar approach has been recently taken in Daniel et al. [2016] which uses\nreinforcement learning to train a controller for selecting step-sizes.\nBengio et al. [1990, 1995] propose to learn updates which avoid back-propagation by using simple\nparametric rules. In relation to the focus of this paper the work of Bengio et al. could be characterized\nas learning to learn without gradient descent by gradient descent. The work of Runarsson and\nJonsson [2000] builds upon this work by replacing the simple rule with a neural network.\nCotter and Conwell [1990], and later Younger et al. [1999], also show fixed-weight recurrent neural\nnetworks can exhibit dynamic behavior without need to modify their network weights. Similarly this\nhas been shown in a filtering context [e.g. Feldkamp and Puskorius, 1998], which is directly related\nto simple multi-timescale optimizers [Sutton, 1992, Schraudolph, 1999].\nFinally, the work of Younger et al. [2001] and Hochreiter et al. 
[2001] connects these different threads\nof research by allowing for the output of backpropagation from one network to feed into an additional\nlearning network, with both networks trained jointly. Our approach to meta-learning builds on this\nwork by modifying the network architecture of the optimizer in order to scale this approach to larger\nneural-network optimization problems.\n\n2 Learning to learn with recurrent neural networks\n\nIn this work we consider directly parameterizing the optimizer. As a result, in a slight abuse of notation\nwe will write the final optimizee parameters $\\theta^*(f, \\phi)$ as a function of the optimizer parameters $\\phi$ and\nthe function in question. We can then ask the question: What does it mean for an optimizer to be\ngood? Given a distribution of functions $f$ we will write the expected loss as\n\n$$\\mathcal{L}(\\phi) = \\mathbb{E}_f\\left[ f\\left(\\theta^*(f, \\phi)\\right) \\right] . \\qquad (2)$$\n\nAs noted earlier, we will take the update steps $g_t$ to be the output of a recurrent neural network $m$,\nparameterized by $\\phi$, whose state we will denote explicitly with $h_t$. Next, while the objective function\nin (2) depends only on the final parameter value, for training the optimizer it will be convenient to\nhave an objective that depends on the entire trajectory of optimization, for some horizon $T$,\n\n$$\\mathcal{L}(\\phi) = \\mathbb{E}_f\\left[ \\sum_{t=1}^{T} w_t f(\\theta_t) \\right] \\quad \\text{where} \\quad \\theta_{t+1} = \\theta_t + g_t , \\quad \\begin{bmatrix} g_t \\\\ h_{t+1} \\end{bmatrix} = m(\\nabla_t, h_t, \\phi) . \\qquad (3)$$\n\nHere $w_t \\in \\mathbb{R}_{\\geq 0}$ are arbitrary weights associated with each time-step and we will also use the notation\n$\\nabla_t = \\nabla_\\theta f(\\theta_t)$. This formulation is equivalent to (2) when $w_t = \\mathbf{1}[t = T]$, but later we will describe\nwhy using different weights can prove useful.\nWe can minimize the value of $\\mathcal{L}(\\phi)$ using gradient descent on $\\phi$. The gradient estimate $\\partial \\mathcal{L}(\\phi) / \\partial \\phi$ can\nbe computed by sampling a random function $f$ and applying backpropagation to the computational\ngraph in Figure 2. We allow gradients to flow along the solid edges in the graph, but gradients\nalong the dashed edges are dropped. 
Ignoring gradients along the dashed edges amounts to making\nthe assumption that the gradients of the optimizee do not depend on the optimizer parameters, i.e.\n$\\partial \\nabla_t / \\partial \\phi = 0$. This assumption allows us to avoid computing second derivatives of $f$.\nExamining the objective in (3) we see that the gradient is non-zero only for terms where $w_t \\neq 0$. If\nwe use $w_t = \\mathbf{1}[t = T]$ to match the original problem, then gradients of trajectory prefixes are zero\nand only the final optimization step provides information for training the optimizer. This renders\nBackpropagation Through Time (BPTT) inefficient. We solve this problem by relaxing the objective\nsuch that $w_t > 0$ at intermediate points along the trajectory. This changes the objective function, but\nallows us to train the optimizer on partial trajectories. For simplicity, in all our experiments we use\n$w_t = 1$ for every $t$.\n\nFigure 2: Computational graph used for computing the gradient of the optimizer.\n\n2.1 Coordinatewise LSTM optimizer\n\nOne challenge in applying RNNs in our setting is that we want to be able to optimize at least tens of\nthousands of parameters. Optimizing at this scale with a fully connected RNN is not feasible as it\nwould require a huge hidden state and an enormous number of parameters. To avoid this difficulty we\nwill use an optimizer $m$ which operates coordinatewise on the parameters of the objective function,\nsimilar to other common update rules like RMSprop and ADAM. 
This coordinatewise network\narchitecture allows us to use a very small network that only looks at a single coordinate to define the\noptimizer and share optimizer parameters across different parameters of the optimizee.\nDifferent behavior on each coordinate is achieved by using separate activations for each objective\nfunction parameter. In addition to allowing us to use a small network for this optimizer, this setup has\nthe nice effect of making the optimizer invariant to the order of parameters in the network, since the\nsame update rule is used independently on each coordinate.\nWe implement the update rule for each coordinate using a two-layer Long Short Term Memory\n(LSTM) network [Hochreiter and Schmidhuber, 1997], using the now-standard forget gate architecture.\nThe network takes as input the optimizee gradient for a single coordinate as well\nas the previous hidden state and outputs the update for the corresponding optimizee parameter.\nWe will refer to this architecture, illustrated in Figure 3, as an LSTM optimizer.\nThe use of recurrence allows the LSTM to learn dynamic update rules which integrate information\nfrom the history of gradients, similar to momentum. This is known to have many desirable\nproperties in convex optimization [see e.g. Nesterov, 1983] and in fact many recent learning procedures\u2014such as ADAM\u2014use momentum in\ntheir updates.\n\nFigure 3: One step of an LSTM optimizer. All\nLSTMs have shared parameters, but separate hidden states.\n\nPreprocessing and postprocessing Optimizer inputs and outputs can have very different magnitudes\ndepending on the class of function being optimized, but neural networks usually work robustly\nonly for inputs and outputs which are neither very small nor very large. 
In practice rescaling inputs\nand outputs of an LSTM optimizer using suitable constants (shared across all timesteps and functions\n$f$) is sufficient to avoid this problem. In Appendix A we propose a different method of preprocessing\ninputs to the optimizer which is more robust and gives slightly better performance.\n\nFigure 4: Comparisons between learned and hand-crafted optimizers\u2019 performance. Learned optimizers\nare shown with solid lines and hand-crafted optimizers are shown with dashed lines. Units for the\ny axis in the MNIST plots are logits. Left: Performance of different optimizers on randomly sampled\n10-dimensional quadratic functions. Center: the LSTM optimizer outperforms standard methods\ntraining the base network on MNIST. Right: Learning curves for steps 100-200 by an optimizer\ntrained to optimize for 100 steps (continuation of center plot).\n\n3 Experiments\n\nIn all experiments the trained optimizers use two-layer LSTMs with 20 hidden units in each layer.\nEach optimizer is trained by minimizing Equation 3 using truncated BPTT as described in Section 2.\nThe minimization is performed using ADAM with a learning rate chosen by random search.\nWe use early stopping when training the optimizer in order to avoid overfitting the optimizer. After\neach epoch (some fixed number of learning steps) we freeze the optimizer parameters and evaluate its\nperformance. We pick the best optimizer (according to the final validation loss) and report its average\nperformance on a number of freshly sampled test problems.\nWe compare our trained optimizers with standard optimizers used in Deep Learning: SGD, RMSprop,\nADAM, and Nesterov\u2019s accelerated gradient (NAG). For each of these optimizers and each problem\nwe tuned the learning rate, and report results with the rate that gives the best final error for each\nproblem. When an optimizer has more parameters than just a learning rate (e.g. 
decay coefficients for\nADAM) we use the default values from the optim package in Torch7. Initial values of all optimizee\nparameters were sampled from an IID Gaussian distribution.\n\n3.1 Quadratic functions\n\nIn this experiment we consider training an optimizer on a simple class of synthetic 10-dimensional\nquadratic functions. In particular we consider minimizing functions of the form\n\n$$f(\\theta) = \\| W\\theta - y \\|_2^2$$\n\nfor different 10x10 matrices $W$ and 10-dimensional vectors $y$ whose elements are drawn from an IID\nGaussian distribution. Optimizers were trained by optimizing random functions from this family and\ntested on newly sampled functions from the same distribution. Each function was optimized for 100\nsteps and the trained optimizers were unrolled for 20 steps. We used neither preprocessing nor\npostprocessing.\nLearning curves for different optimizers, averaged over many functions, are shown in the left plot of\nFigure 4. Each curve corresponds to the average performance of one optimization algorithm on many\ntest functions; the solid curve shows the learned optimizer performance and dashed curves show\nthe performance of the standard baseline optimizers. It is clear the learned optimizers substantially\noutperform the baselines in this setting.\n\n3.2 Training a small neural network on MNIST\n\nIn this experiment we test whether trainable optimizers can learn to optimize a small neural network\non MNIST, and also explore how the trained optimizers generalize to functions beyond those they\nwere trained on. To this end, we train the optimizer to optimize a base network and explore a series\nof modifications to the network architecture and training procedure at test time.\n\nFigure 5: Comparisons between learned and hand-crafted optimizers\u2019 performance. Units for the\ny axis are logits. 
Left: Generalization to a different number of hidden units (40 instead of 20).\nCenter: Generalization to a different number of hidden layers (2 instead of 1). This optimization\nproblem is very hard, because the hidden layers are very narrow. Right: Training curves for an MLP\nwith 20 hidden units using ReLU activations. The LSTM optimizer was trained on an MLP with\nsigmoid activations.\n\nFigure 6: Systematic study of final MNIST performance as the optimizee architecture is varied,\nusing sigmoid non-linearities. The vertical dashed line in the left-most plot denotes the architecture\nat which the LSTM is trained and the horizontal line shows the final performance of the trained\noptimizer in this setting.\n\nIn this setting the objective function $f(\\theta)$ is the cross entropy of a small MLP with parameters $\\theta$.\nThe values of $f$ as well as the gradients $\\partial f(\\theta) / \\partial \\theta$ are estimated using random minibatches of 128\nexamples. The base network is an MLP with one hidden layer of 20 units using a sigmoid activation\nfunction. The only sources of variability between different runs are the initial value $\\theta_0$ and randomness\nin minibatch selection. Each optimization was run for 100 steps and the trained optimizers were\nunrolled for 20 steps. We used input preprocessing described in Appendix A and rescaled the outputs\nof the LSTM by the factor 0.1.\nLearning curves for the base network using different optimizers are displayed in the center plot of\nFigure 4. In this experiment NAG, ADAM, and RMSprop exhibit roughly equivalent performance, while the\nLSTM optimizer outperforms them by a significant margin. The right plot in Figure 4 compares the\nperformance of the LSTM optimizer if it is allowed to run for 200 steps, despite having been trained\nto optimize for 100 steps. 
In this comparison we re-used the LSTM optimizer from the previous\nexperiment, and here we see that the LSTM optimizer continues to outperform the baseline optimizers\non this task.\n\nGeneralization to different architectures Figure 5 shows three examples of applying the LSTM\noptimizer to train networks with different architectures than the base network on which it was trained.\nThe modifications are (from left to right) (1) an MLP with 40 hidden units instead of 20, (2) a\nnetwork with two hidden layers instead of one, and (3) a network using ReLU activations instead of\nsigmoid. In the first two cases the LSTM optimizer generalizes well, and continues to outperform\nthe hand-designed baselines despite operating outside of its training regime. However, changing\nthe activation function to ReLU makes the dynamics of the learning procedure sufficiently different\nthat the learned optimizer is no longer able to generalize. Finally, in Figure 6 we show the results\nof systematically varying the tested architecture; for the LSTM results we again used the optimizer\ntrained using 1 layer of 20 units and sigmoid non-linearities. Note that in this setting where the\ntest-set problems are similar enough to those in the training set we see even better generalization than\nthe baseline optimizers.\n\nFigure 7: Optimization performance on the CIFAR-10 dataset and subsets. Shown on the left is the\nLSTM optimizer versus various baselines trained on CIFAR-10 and tested on a held-out test set. The\ntwo plots on the right are the performance of these optimizers on subsets of the CIFAR labels. The\nadditional optimizer LSTM-sub has been trained only on the held-out labels and is hence transferring\nto a completely novel dataset.\n\n3.3 Training a convolutional network on CIFAR-10\n\nNext we test the performance of the trained neural optimizers on optimizing classification performance\nfor the CIFAR-10 dataset [Krizhevsky, 2009]. 
In these experiments we used a model with both\nconvolutional and feed-forward layers. In particular, the model used for these experiments includes\nthree convolutional layers with max pooling followed by a fully-connected layer with 32 hidden units;\nall non-linearities were ReLU activations with batch normalization.\nThe coordinatewise network decomposition introduced in Section 2.1\u2014and used in the previous\nexperiment\u2014utilizes a single LSTM architecture with shared weights, but separate hidden states,\nfor each optimizee parameter. We found that this decomposition was not suf\ufb01cient for the model\narchitecture introduced in this section due to the differences between the fully connected and convo-\nlutional layers. Instead we modify the optimizer by introducing two LSTMs: one proposes parameter\nupdates for the fully connected layers and the other updates the convolutional layer parameters. Like\nthe previous LSTM optimizer we still utilize a coordinatewise decomposition with shared weights\nand individual hidden states, however LSTM weights are now shared only between parameters of the\nsame type (i.e. fully-connected vs. convolutional).\nThe performance of this trained optimizer compared against the baseline techniques is shown in\nFigure 7. The left-most plot displays the results of using the optimizer to \ufb01t a classi\ufb01er on a held-out\ntest set. The additional two plots on the right display the performance of the trained optimizer on\nmodi\ufb01ed datasets which only contain a subset of the labels, i.e. the CIFAR-2 dataset only contains\ndata corresponding to 2 of the 10 labels. 
Additionally we include an optimizer LSTM-sub which was\nonly trained on the held-out labels.\nIn all these examples we can see that the LSTM optimizer learns much more quickly than the baseline\noptimizers, with significant boosts in performance for the CIFAR-5 and especially CIFAR-2 datasets.\nWe also see that the optimizer trained only on a disjoint subset of the data is hardly affected by this\ndifference and transfers well to the additional dataset.\n\n3.4 Neural Art\n\nThe recent work on artistic style transfer using convolutional networks, or Neural Art [Gatys et al.,\n2015], gives a natural testbed for our method, since each content and style image pair gives rise to a\ndifferent optimization problem. Each Neural Art problem starts from a content image, $c$, and a style\nimage, $s$, and is given by\n\n$$f(\\theta) = \\alpha \\mathcal{L}_{\\text{content}}(c, \\theta) + \\beta \\mathcal{L}_{\\text{style}}(s, \\theta) + \\gamma \\mathcal{L}_{\\text{reg}}(\\theta)$$\n\nThe minimizer of $f$ is the styled image. The first two terms try to match the content and style of\nthe styled image to that of their first argument, and the third term is a regularizer that encourages\nsmoothness in the styled image. Details can be found in [Gatys et al., 2015].\n\nFigure 8: Optimization curves for Neural Art. Content images come from the test set, which was not\nused during the LSTM optimizer training. Note: the y-axis is in log scale and we zoom in on the\ninteresting portion of this plot. Left: Applying the training style at the training resolution. Right:\nApplying the test style at double the training resolution.\n\nFigure 9: Examples of images styled using the LSTM optimizer. Each triple consists of the content\nimage (left), style (right) and image generated by the LSTM optimizer (center). Left: The result of\napplying the training style at the training resolution to a test image. 
Right: The result of applying a\nnew style to a test image at double the resolution on which the optimizer was trained.\n\nWe train optimizers using only 1 style and 1800 content images taken from ImageNet [Deng et al.,\n2009]. We randomly select 100 content images for testing and 20 content images for validation of\ntrained optimizers. We train the optimizer on 64x64 content images from ImageNet and one fixed\nstyle image. We then test how well it generalizes to a different style image and higher resolution\n(128x128). Each image was optimized for 128 steps and trained optimizers were unrolled for 32\nsteps. Figure 9 shows the result of styling two different images using the LSTM optimizer. The\nLSTM optimizer uses the input preprocessing described in Appendix A and no postprocessing. See\nAppendix C for additional images.\nFigure 8 compares the performance of the LSTM optimizer to standard optimization algorithms. The\nLSTM optimizer outperforms all standard optimizers if the resolution and style image are the same\nas the ones on which it was trained. Moreover, it continues to perform very well when both the\nresolution and style are changed at test time.\nFinally, in Appendix B we qualitatively examine the behavior of the step directions generated by the\nlearned optimizer.\n\n4 Conclusion\n\nWe have shown how to cast the design of optimization algorithms as a learning problem, which\nenables us to train optimizers that are specialized to particular classes of functions. Our experiments\nhave confirmed that learned neural optimizers compare favorably against state-of-the-art optimization\nmethods used in deep learning. We witnessed a remarkable degree of transfer, with for example the\nLSTM optimizer trained on 12,288 parameter neural art tasks being able to generalize to tasks with\n49,152 parameters, different styles, and different content images all at the same time. 
We observed\nsimilar impressive results when transferring to different architectures in the MNIST task.\nThe results on the CIFAR image labeling task show that the LSTM optimizers outperform hand-\nengineered optimizers when transferring to datasets drawn from the same data distribution.\nReferences\nF. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations\n\nand Trends in Machine Learning, 4(1):1\u2013106, 2012.\n\n8\n\n\fS. Bengio, Y. Bengio, and J. Cloutier. On the search for new learning rules for ANNs. Neural Processing Letters,\n\n2(4):26\u201330, 1995.\n\nY. Bengio, S. Bengio, and J. Cloutier. Learning a synaptic learning rule. Universit\u00e9 de Montr\u00e9al, D\u00e9partement\n\nd\u2019informatique et de recherche op\u00e9rationnelle, 1990.\n\nF. Bobolas. brain-neurons, 2009. URL https://www.flickr.com/photos/fbobolas/3822222947. Cre-\n\native Commons Attribution-ShareAlike 2.0 Generic.\n\nN. E. Cotter and P. R. Conwell. Fixed-weight networks can learn. In International Joint Conference on Neural\n\nNetworks, pages 553\u2013559, 1990.\n\nC. Daniel, J. Taylor, and S. Nowozin. Learning step size controllers for robust neural network training. In\n\nAssociation for the Advancement of Arti\ufb01cial Intelligence, 2016.\n\nJ. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database.\n\nIn Computer Vision and Pattern Recognition, pages 248\u2013255. IEEE, 2009.\n\nD. L. Donoho. Compressed sensing. Transactions on Information Theory, 52(4):1289\u20131306, 2006.\nJ. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization.\n\nJournal of Machine Learning Research, 12:2121\u20132159, 2011.\n\nL. A. Feldkamp and G. V. Puskorius. A signal processing framework based on dynamic neural networks\nwith application to problems in adaptation, \ufb01ltering, and classi\ufb01cation. 
Proceedings of the IEEE, 86(11):\n2259\u20132277, 1998.\n\nL. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm of artistic style. arXiv Report 1508.06576, 2015.\nS. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735\u20131780, 1997.\nS. Hochreiter, A. S. Younger, and P. R. Conwell. Learning to learn using gradient descent. In International\nConference on Artificial Neural Networks, pages 87\u201394. Springer, 2001.\nD. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning\nRepresentations, 2015.\nA. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.\nB. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman. Building machines that learn and think like\npeople. arXiv Report 1604.00289, 2016.\nT. Maley. neuron, 2011. URL https://www.flickr.com/photos/taylortotz101/6280077898. Creative\nCommons Attribution 2.0 Generic.\nJ. Martens and R. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In\nInternational Conference on Machine Learning, pages 2408\u20132417, 2015.\nG. L. Nemhauser and L. A. Wolsey. Integer and combinatorial optimization. John Wiley & Sons, 1988.\nY. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet\nMathematics Doklady, volume 27, pages 372\u2013376, 1983.\nM. Riedmiller and H. Braun. A direct adaptive method for faster backpropagation learning: The RPROP\nalgorithm. In International Conference on Neural Networks, pages 586\u2013591, 1993.\nT. P. Runarsson and M. T. Jonsson. Evolution and design of distributed learning rules. In IEEE Symposium on\nCombinations of Evolutionary Computation and Neural Networks, pages 59\u201363. IEEE, 2000.\nA. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. Meta-learning with memory-augmented\nneural networks. 
In International Conference on Machine Learning, 2016.\nJ. Schmidhuber. Evolutionary principles in self-referential learning; On learning how to learn: The meta-meta-...\nhook. PhD thesis, Institut f. Informatik, Tech. Univ. Munich, 1987.\nJ. Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks.\nNeural Computation, 4(1):131\u2013139, 1992.\nJ. Schmidhuber. A neural network that embeds its own meta-levels. In International Conference on Neural\nNetworks, pages 407\u2013412. IEEE, 1993.\nJ. Schmidhuber, J. Zhao, and M. Wiering. Shifting inductive bias with success-story algorithm, adaptive levin\nsearch, and incremental self-improvement. Machine Learning, 28(1):105\u2013130, 1997.\nN. N. Schraudolph. Local gain adaptation in stochastic gradient descent. In International Conference on\nArtificial Neural Networks, volume 2, pages 569\u2013574, 1999.\nR. S. Sutton. Adapting bias by gradient descent: An incremental version of delta-bar-delta. In Association for\nthe Advancement of Artificial Intelligence, pages 171\u2013176, 1992.\nS. Thrun and L. Pratt. Learning to learn. Springer Science & Business Media, 1998.\nT. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent\nmagnitude. COURSERA: Neural Networks for Machine Learning, 4:2, 2012.\nP. Tseng. An incremental gradient (-projection) method with momentum term and adaptive stepsize rule. Journal\non Optimization, 8(2):506\u2013531, 1998.\nD. H. Wolpert and W. G. Macready. No free lunch theorems for optimization. Transactions on Evolutionary\nComputation, 1(1):67\u201382, 1997.\nA. S. Younger, P. R. Conwell, and N. E. Cotter. Fixed-weight on-line learning. Transactions on Neural Networks,\n10(2):272\u2013283, 1999.\nA. S. Younger, S. Hochreiter, and P. R. Conwell. Meta-learning with backpropagation. In International Joint\nConference on Neural Networks, 2001.", "award": [], "sourceid": 1982, "authors": [{"given_name": "Marcin", "family_name": "Andrychowicz", "institution": "Google Deepmind"}, {"given_name": "Misha", "family_name": "Denil", "institution": "Google DeepMind"}, {"given_name": "Sergio", "family_name": "G\u00f3mez", "institution": "Google DeepMind"}, {"given_name": "Matthew", "family_name": "Hoffman", "institution": "Google DeepMind"}, {"given_name": "David", "family_name": "Pfau", "institution": "Google DeepMind"}, {"given_name": "Tom", "family_name": "Schaul", "institution": "Google Deepmind"}, {"given_name": "Brendan", "family_name": "Shillingford", "institution": ""}, {"given_name": "Nando", "family_name": "de Freitas", "institution": "Google"}]}