{"title": "Sobolev Training for Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 4278, "page_last": 4287, "abstract": "At the heart of deep learning we aim to use neural networks as function approximators - training them to produce outputs from inputs in emulation of a ground truth function or data creation process. In many cases we only have access to input-output pairs from the ground truth, however it is becoming more common to have access to derivatives of the target output with respect to the input -- for example when the ground truth function is itself a neural network such as in network compression or distillation. Generally these target derivatives are not computed, or are ignored. This paper introduces Sobolev Training for neural networks, which is a method for incorporating these target derivatives in addition to the target values while training. By optimising neural networks to not only approximate the function\u2019s outputs but also the function\u2019s derivatives we encode additional information about the target function within the parameters of the neural network. Thereby we can improve the quality of our predictors, as well as the data-efficiency and generalization capabilities of our learned function approximation. We provide theoretical justifications for such an approach as well as examples of empirical evidence on three distinct domains: regression on classical optimisation datasets, distilling policies of an agent playing Atari, and on large-scale applications of synthetic gradients.
In all three domains the use of Sobolev Training, employing target derivatives in addition to target values, results in models with higher accuracy and stronger generalisation.", "full_text": "Sobolev Training for Neural Networks\n\nWojciech Marian Czarnecki, Simon Osindero, Max Jaderberg, Grzegorz Swirszcz, and Razvan Pascanu\n\n{lejlot,osindero,jaderberg,swirszcz,razp}@google.com\n\nDeepMind, London, UK\n\nAbstract\n\nAt the heart of deep learning we aim to use neural networks as function approximators \u2013 training them to produce outputs from inputs in emulation of a ground truth function or data creation process. In many cases we only have access to input-output pairs from the ground truth, however it is becoming more common to have access to derivatives of the target output with respect to the input \u2013 for example when the ground truth function is itself a neural network such as in network compression or distillation. Generally these target derivatives are not computed, or are ignored. This paper introduces Sobolev Training for neural networks, which is a method for incorporating these target derivatives in addition to the target values while training. By optimising neural networks to not only approximate the function\u2019s outputs but also the function\u2019s derivatives we encode additional information about the target function within the parameters of the neural network. Thereby we can improve the quality of our predictors, as well as the data-efficiency and generalization capabilities of our learned function approximation. We provide theoretical justifications for such an approach as well as examples of empirical evidence on three distinct domains: regression on classical optimisation datasets, distilling policies of an agent playing Atari, and on large-scale applications of synthetic gradients.
In all three domains the use of Sobolev Training, employing target derivatives in addition to target values, results in models with higher accuracy and stronger generalisation.\n\n1 Introduction\n\nDeep Neural Networks (DNNs) are one of the main tools of modern machine learning. They have consistently proven to be powerful function approximators, able to model a wide variety of functional forms \u2013 from image recognition [8, 24], through audio synthesis [27], to human-beating policies in the ancient game of Go [22]. In many applications the process of training a neural network consists of receiving a dataset of input-output pairs from a ground truth function, and minimising some loss with respect to the network\u2019s parameters. This loss is usually designed to encourage the network to produce the same output, for a given input, as that from the target ground truth function. Many of the ground truth functions we care about in practice have an unknown analytic form, e.g. because they are the result of a natural physical process, and therefore we only have the observed input-output pairs for supervision. However, there are scenarios where we do know the analytic form and so are able to compute the ground truth gradients (or higher order derivatives); alternatively, these quantities may sometimes be directly observable. A common example is when the ground truth function is itself a neural network; for instance this is the case for distillation [9, 20], compressing neural networks [7], and the prediction of synthetic gradients [12]. Additionally, if we are dealing with an environment/data-generation process (vs. a pre-determined set of data points), then even though we may be dealing with a black box we can still approximate derivatives using finite differences. In this work, we consider how this additional information can be incorporated in the learning process, and what advantages it can provide in terms of data efficiency and performance.
We propose Sobolev Training (ST) for neural networks as a simple and efficient technique for leveraging derivative information about the desired function in a way that can easily be incorporated into any training pipeline using modern machine learning libraries.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nFigure 1: a) Sobolev Training of order 2. Diamond nodes m and f indicate parameterised functions, where m is trained to approximate f. Green nodes receive supervision. Solid lines indicate connections through which error signals from the losses l, l1, and l2 are backpropagated to train m. b) Stochastic Sobolev Training of order 2. If f and m are multivariate functions, the gradients are Jacobian matrices. To avoid computing these high dimensional objects, we can efficiently compute and fit their projections on a random vector vj sampled from the unit sphere.\n\nThe approach is inspired by the work of Hornik [10], who proved universal approximation theorems for neural networks in Sobolev spaces \u2013 metric spaces where distances between functions are defined both in terms of their differences in values and differences in values of their derivatives. In particular, it was shown that a sigmoid network can not only approximate a function\u2019s value arbitrarily well, but that the network\u2019s derivatives with respect to its inputs can approximate the corresponding derivatives of the ground truth function arbitrarily well too. Sobolev Training exploits this property, and tries to match not only the output of the function being trained but also its derivatives.\n\nThere are several related works which have also exploited derivative information for function approximation. For instance Wu et al.
[30] and antecedents propose a technique for Bayesian optimisation with Gaussian Processes (GP), where it was demonstrated that the use of information about gradients and Hessians can improve the predictive power of GPs. In previous work on neural networks, derivatives of predictors have usually been used either to penalise model complexity (e.g. by pushing the Jacobian norm to 0 [19]), or to encode additional, hand-crafted invariances to some transformations (for instance, as in Tangentprop [23]), or estimated derivatives for dynamical systems [6] and very recently to provide additional learning signal during attention distillation [31]\u00b9. Similar techniques have also been used in critic based Reinforcement Learning (RL), where a critic\u2019s derivatives are trained to match its target\u2019s derivatives [29, 15, 5, 4, 26] using small, sigmoid based models. Finally, Hyv\u00e4rinen proposed Score Matching Networks [11], which are based on the somewhat surprising observation that one can model unknown derivatives of the function without actual access to its values \u2013 all that is needed is a sampling based strategy and a specific penalty. However, such an estimator has a high variance [28], thus it is not really useful when true derivatives are given.\n\nTo the best of our knowledge and despite its simplicity, the proposal to directly match network derivatives to the true derivatives of the target function has been minimally explored for deep networks, especially modern ReLU based models. In our method, we show that by using the additional knowledge of derivatives with Sobolev Training we are able to train better models \u2013 models which achieve lower approximation errors and generalise to test data better \u2013 and reduce the sample complexity of learning.
The contributions of our paper are therefore threefold: (1) We introduce Sobolev Training \u2013 a new paradigm for training neural networks. (2) We look formally at the implications of matching derivatives, extending previous results of Hornik [10] and showing that modern architectures are well suited for such training regimes. (3) We provide empirical evidence demonstrating that Sobolev Training leads to improved performance and generalisation, particularly in low data regimes. Example domains are: regression on classical optimisation problems; policy distillation from RL agents trained on the Atari domain; and training deep, complex models using synthetic gradients \u2013 we report the first successful attempt to train a large-scale ImageNet model using synthetic gradients.\n\n\u00b9 Please refer to the Supplementary Materials, Section 5, for details.\n\n2 Sobolev Training\n\nWe begin by introducing the idea of training using Sobolev spaces. When learning a function f, we may have access to not only the output values $f(x_i)$ for training points $x_i$, but also the values of its j-th order derivatives with respect to the input, $D^j_x f(x_i)$. In other words, instead of the typical training set consisting of pairs $\\{(x_i, f(x_i))\\}_{i=1}^N$ we have access to (K + 2)-tuples $\\{(x_i, f(x_i), D^1_x f(x_i), ..., D^K_x f(x_i))\\}_{i=1}^N$. In this situation, the derivative information can easily be incorporated into training a neural network model of f by making derivatives of the neural network match the ones given by f.\n\nConsidering a neural network model m parameterised with \u03b8, one typically seeks to minimise the empirical error in relation to f according to some loss function $\\ell$:\n\n$$\\sum_{i=1}^N \\ell(m(x_i|\\theta), f(x_i)).$$\n\nWhen learning in Sobolev spaces, this is replaced with:\n\n$$\\sum_{i=1}^N \\left[ \\ell(m(x_i|\\theta), f(x_i)) + \\sum_{j=1}^K \\ell_j\\left(D^j_x m(x_i|\\theta), D^j_x f(x_i)\\right) \\right], \\quad (1)$$\n\nwhere $\\ell_j$ are loss functions measuring error on j-th order derivatives. This causes the neural network to encode derivatives of the target function in its own derivatives. Such a model can still be trained using backpropagation and off-the-shelf optimisers.\n\nA potential concern is that this optimisation might be expensive when either the output dimensionality of f or the order K are high, however one can reduce this cost through stochastic approximations. Specifically, if f is a multivariate function, instead of a vector gradient, one ends up with a full Jacobian matrix which can be large. To avoid adding computational complexity to the training process, one can use an efficient, stochastic version of Sobolev Training: instead of computing a full Jacobian/Hessian, one just computes its projection onto a random vector (a direct application of a known estimation trick [19]).
In practice, this means that during training we have a random variable v sampled uniformly from the unit sphere, and we match these random projections instead:\n\n$$\\sum_{i=1}^N \\left[ \\ell(m(x_i|\\theta), f(x_i)) + \\sum_{j=1}^K \\mathbb{E}_{v_j} \\left[ \\ell_j\\left( \\langle D^j_x m(x_i|\\theta), v_j \\rangle, \\langle D^j_x f(x_i), v_j \\rangle \\right) \\right] \\right]. \\quad (2)$$\n\nFigure 1 illustrates compute graphs for non-stochastic and stochastic Sobolev Training of order 2.\n\n3 Theory and motivation\n\nWhile in the previous section we defined Sobolev Training, it is not obvious that modeling the derivatives of the target function f is beneficial to function approximation, or that optimising such an objective is even feasible. In this section we motivate and explore these questions theoretically, showing that the Sobolev Training objective is a well posed one, and that incorporating derivative information has the potential to drastically reduce the sample complexity of learning.\n\nHornik showed [10] that neural networks with non-constant, bounded, continuous activation functions, with continuous derivatives up to order K are universal approximators in the Sobolev spaces of order K, thus showing that sigmoid-networks are indeed capable of approximating elements of these spaces arbitrarily well. However, nowadays we often use activation functions such as ReLU which are neither bounded nor have continuous derivatives. The following theorem shows that for K = 1 we can use the ReLU function (or a similar one, like leaky ReLU) to create neural networks that are universal approximators in Sobolev spaces. We will use the standard symbol $C^1(S)$ (or simply $C^1$) to denote the space of functions which are continuous, differentiable, and have a continuous derivative on a space S [14]. All proofs are given in the Supplementary Materials (SM).\n\nFigure 2: Left: From top: Example of the piece-wise linear function; Two (out of a continuum of) hypotheses consistent with 3 training points, showing that one needs two points to identify each linear segment; The only hypothesis consistent with 3 training points enriched with derivative information. Right: Logarithm of test error (MSE) for various optimisation benchmarks with varied training set size (20, 100 and 10000 points) sampled uniformly from the problem\u2019s domain.\n\nTheorem 1. Let f be a $C^1$ function on a compact set. Then, for every positive \u03b5 there exists a single hidden layer neural network with a ReLU (or a leaky ReLU) activation which approximates f in Sobolev space S1 up to \u03b5 error.\n\nThis suggests that the Sobolev Training objective is achievable, and that we can seek to encode the values and derivatives of the target function in the values and derivatives of a ReLU neural network model. Interestingly, we can show that if we seek to encode an arbitrary function in the derivatives of the model then this is impossible not only for neural networks but also for any arbitrary differentiable predictor on compact sets.\n\nTheorem 2. Let f be a $C^1$ function. Let g be a continuous function satisfying $\\|g - \\frac{\\partial f}{\\partial x}\\|_\\infty > 0$. Then, there exists an $\\eta > 0$ such that for any $C^1$ function h either $\\|f - h\\|_\\infty \\geq \\eta$ or $\\|g - \\frac{\\partial h}{\\partial x}\\|_\\infty \\geq \\eta$.\n\nHowever, when we move to the regime of finite training data, we can encode any arbitrary function in the derivatives (as well as higher order signals if the resulting Sobolev spaces are not degenerate), as shown in the following Proposition.\n\nProposition 1. Given any two functions $f : S \\to \\mathbb{R}$ and $g : S \\to \\mathbb{R}^d$ on $S \\subseteq \\mathbb{R}^d$ and a finite set $\\Sigma \\subset S$, there exists a neural network h with a ReLU (or a leaky ReLU) activation such that $\\forall x \\in \\Sigma : f(x) = h(x)$ and $g(x) = \\frac{\\partial h}{\\partial x}(x)$ (it has 0 training loss).\n\nHaving shown that it is possible to train neural networks to encode both the values and derivatives of a target function, we now formalise one possible way of showing that Sobolev Training has lower sample complexity than regular training.\n\nLet $F$ denote the family of functions parametrised by \u03c9. We define $K_{reg} = K_{reg}(F)$ to be a measure of the amount of data needed to learn some target function f. That is, $K_{reg}$ is the smallest number for which the following holds: for every $f_\\omega \\in F$ and every set of $K_{reg}$ distinct points $(x_1, ..., x_{K_{reg}})$ such that $\\forall_{i=1,...,K_{reg}} f(x_i) = f_\\omega(x_i) \\Rightarrow f = f_\\omega$. $K_{sob}$ is defined analogously, but the final implication is of the form $f(x_i) = f_\\omega(x_i) \\wedge \\frac{\\partial f}{\\partial x}(x_i) = \\frac{\\partial f_\\omega}{\\partial x}(x_i) \\Rightarrow f = f_\\omega$. Straight from the definition there follows:\n\nProposition 2. For any $F$, there holds $K_{sob}(F) \\leq K_{reg}(F)$.\n\nFor many families, the above inequality becomes strict. For example, to determine the coefficients of a polynomial of degree n one needs to compute its values in at least n + 1 distinct points. If we know both the values and the derivatives at each point, it is a well-known fact that only $\\lceil (n+1)/2 \\rceil$ points suffice to determine all the coefficients, since each point then contributes two constraints towards the n + 1 unknown coefficients. We present two more examples in a slightly more formal way. Let $F_G$ denote a family of Gaussian PDFs (parametrised by \u00b5, \u03c3). Let $\\mathbb{R}^d \\supset D = D_1 \\cup ... \\cup D_n$ and let $F_{PL}$ be a family of functions from $D_1 \\times ... \\times D_n$ (Cartesian product of sets $D_i$) to $\\mathbb{R}^n$ of the form $f(x) = [A_1 x_1 + b_1, ..., A_n x_n + b_n]$ (linear element-wise) (Figure 2 Left).\n\nProposition 3. There holds $K_{sob}(F_G) < K_{reg}(F_G)$ and $K_{sob}(F_{PL}) < K_{reg}(F_{PL})$.\n\nThis result relates to Deep ReLU networks as they build a hyperplanes-based model of the target function. If those were parametrised independently one could expect a reduction of sample complexity by d+1 times, where d is the dimension of the function domain. In practice parameters of hyperplanes in such networks are not independent; furthermore, the positions of the hinges change, so the Proposition cannot be directly applied, but it can be seen as an intuitive way to see why the sample complexity drops significantly for Deep ReLU networks too.\n\nFigure 3: Styblinski-Tang function (on the left) and its models using regular neural network training (left part of each plot) and Sobolev Training (right part). We also plot the vector field of the gradients of each predictor underneath the function plot. [Panels: 20 training samples and 100 training samples, each showing Regular and Sobolev.]\n\n4 Experimental Results\n\nWe consider three domains where information about derivatives is available during training\u00b2.\n\n4.1 Artificial Data\n\nFirst, we consider the task of regression on a set of well known low-dimensional functions used for benchmarking optimisation methods.\n\nWe train two hidden layer neural networks with 256 hidden units per layer with ReLU activations to regress towards function values, and verify generalisation capabilities by evaluating the mean squared error on a hold-out test set. Since the task is standard regression, we choose all the losses of Sobolev Training to be L2 errors, and use a first order Sobolev method (second order derivatives of ReLU networks with a linear output layer are constant, zero).
The optimisation is therefore:\n\n$$\\min_{\\theta} \\frac{1}{N} \\sum_{i=1}^N \\|f(x_i) - m(x_i|\\theta)\\|_2^2 + \\|\\nabla_x f(x_i) - \\nabla_x m(x_i|\\theta)\\|_2^2.$$\n\nFigure 2 right shows the results for the optimisation benchmarks. As expected, Sobolev trained networks perform extremely well \u2013 for six out of seven benchmark problems they significantly reduce the testing error, with the obtained errors orders of magnitude smaller than the corresponding errors of the regularly trained networks. The stark difference in approximation error is highlighted in Figure 3, where we show the Styblinski-Tang function and its approximations with both regular and Sobolev Training. It is clear that even in very low data regimes, the Sobolev trained networks can capture the functional shape.\n\nLooking at the results, we make two important observations. First, the effect of Sobolev Training is stronger in low-data regimes, however it does not disappear even in the high data regime, when one has 10,000 training examples for training a two-dimensional function. Second, the only case where regular regression performed better is the regression towards Ackley\u2019s function. This particular example was chosen to show that one possible weak point of our approach might be approximating functions with a very high frequency signal component in the relatively low data regime. Ackley\u2019s function is composed of exponentials of high frequency cosine waves, thus creating an extremely bumpy surface; consequently a method that tries to match the derivatives can behave badly during testing if one does not have enough data to capture this complexity. However, once we have enough training data points, Sobolev trained networks are able to approximate this function better.\n\n\u00b2 All experiments were performed using TensorFlow [2] and the Sonnet neural network library [1].\n\nFigure 4: Test results of distillation of RL agents on three Atari games. Reported test action prediction error (left) is the error of the most probable action predicted between the distilled policy and target policy, and test DKL (right) is the Kullback-Leibler divergence between policies. Numbers in the column titles represent the percentage of the 100K recorded states used for training (the remaining are used for testing). In all scenarios the Sobolev distilled networks are significantly more similar to the target policy. [Legend: Regular distillation, Sobolev distillation.]\n\n4.2 Distillation\n\nAnother possible application of Sobolev Training is to perform model distillation. This technique has many applications, such as network compression [21], ensemble merging [9], or more recently policy distillation in reinforcement learning [20].\n\nWe focus here on a task of distilling a policy. We aim to distill a target policy $\\pi^*(s)$ \u2013 a trained neural network which outputs a probability distribution over actions \u2013 into a smaller neural network $\\pi(s|\\theta)$, such that the two policies $\\pi^*$ and $\\pi$ have the same behaviour. In practice this is often done by minimising an expected divergence measure between $\\pi^*$ and $\\pi$, for example, the Kullback\u2013Leibler divergence $D_{KL}(\\pi(s) \\| \\pi^*(s))$, over states gathered while following $\\pi^*$. Since policies are multivariate functions, direct application of Sobolev Training would mean producing full Jacobian matrices with respect to s, which for large action spaces is computationally expensive.
To avoid this issue we employ a stochastic approximation described in Section 2, thus resulting in the objective\n\n$$\\min_{\\theta} D_{KL}(\\pi(s|\\theta) \\| \\pi^*(s)) + \\alpha \\mathbb{E}_v \\left[ \\| \\nabla_s \\langle \\log \\pi^*(s), v \\rangle - \\nabla_s \\langle \\log \\pi(s|\\theta), v \\rangle \\| \\right],$$\n\nwhere the expectation is taken with respect to v coming from a uniform distribution over the unit sphere, and Monte Carlo sampling is used to approximate it.\n\nAs target policies $\\pi^*$, we use agents playing Atari games [17] that have been trained with A3C [16] on three well known games: Pong, Breakout and Space Invaders. The agent\u2019s policy is a neural network consisting of 3 layers of convolutions followed by two fully-connected layers, which we distill to a smaller network with 2 convolutional layers and a single smaller fully-connected layer (see SM for details). Distillation is treated here as a purely supervised learning problem, as our aim is not to re-evaluate known distillation techniques, but rather to show that if the aim is to minimise a given divergence measure, we can improve distillation using Sobolev Training. Figure 4 shows test error during training with and without Sobolev Training\u00b3. The introduction of Sobolev Training leads to similar effects as in the previous section \u2013 the network generalises much more effectively, and this is especially true in low data regimes. Note the performance gap on Pong is small due to the fact that the optimal policy is quite degenerate for this game\u2074. In all remaining games one can see a significant performance increase from using our proposed method, as well as minor to no overfitting.\n\nDespite looking like a regularisation effect, we stress that Sobolev Training is not trying to find the simplest models for data or suppress the expressivity of the model. This training method aims at matching the original function\u2019s smoothness/complexity and so reduces overfitting by effectively extending the information content of the training set, rather than by imposing a data-independent prior as with regularisation.\n\n\u00b3 Testing is performed on a held out set of episodes, thus there are neither temporal nor causal relations between training and testing.\n\nTable 1: Various techniques for producing synthetic gradients. Green shaded nodes denote nodes that get supervision from the corresponding object from the main network (gradient or loss value). We report accuracy on the test set \u00b1 standard deviation. Backpropagation results are given in parentheses.\n\nSetting | Metric (backpropagation) | Noprop | Direct SG [12] | VFBN [25] | Critic | Sobolev\nCIFAR-10 with 3 synthetic gradient modules | Top 1 (94.3%) | 54.5% \u00b11.15 | 79.2% \u00b10.01 | 88.5% \u00b12.70 | 93.2% \u00b10.02 | 93.5% \u00b10.01\nImageNet with 1 synthetic gradient module | Top 1 (75.0%) | 54.0% \u00b10.29 | - | 57.9% \u00b12.03 | 71.7% \u00b10.23 | 72.0% \u00b10.05\nImageNet with 1 synthetic gradient module | Top 5 (92.3%) | 77.3% \u00b10.06 | - | 81.5% \u00b11.20 | 90.5% \u00b10.15 | 90.8% \u00b10.01\nImageNet with 3 synthetic gradient modules | Top 1 (75.0%) | 18.7% \u00b10.18 | - | 28.3% \u00b15.24 | 65.7% \u00b10.56 | 66.5% \u00b10.22\nImageNet with 3 synthetic gradient modules | Top 5 (92.3%) | 38.0% \u00b10.34 | - | 52.9% \u00b16.62 | 86.9% \u00b10.33 | 87.4% \u00b10.11\n\n4.3 Synthetic Gradients\n\nThe previous experiments have shown how information about the derivatives can boost approximating function values. However, the core idea of Sobolev Training is broader than that, and can be employed in both directions. Namely, if one ultimately cares about approximating derivatives, then additionally approximating values can help this process too.
One recent technique which requires a model of gradients is Synthetic Gradients (SG) [12] \u2013 a method for training complex neural networks in a decoupled, asynchronous fashion. In this section we show how we can use Sobolev Training for SG.\n\nThe principle behind SG is that instead of doing full backpropagation using the chain-rule, one splits a network into two (or more) parts, and approximates partial derivatives of the loss L with respect to some hidden layer activations h with a trainable function SG(h, y|\u03b8). In other words, given that network parameters up to h are denoted by \u0398,\n\n$$\\frac{\\partial L}{\\partial \\Theta} = \\frac{\\partial L}{\\partial h} \\frac{\\partial h}{\\partial \\Theta} \\approx SG(h, y|\\theta) \\frac{\\partial h}{\\partial \\Theta}.$$\n\nIn the original SG paper, this module is trained to minimise $L_{SG}(\\theta) = \\left\\| SG(h, y|\\theta) - \\frac{\\partial L(p_h, y)}{\\partial h} \\right\\|_2^2$, where $p_h$ is the final prediction of the main network for hidden activations h. For the case of learning a classifier, in order to apply Sobolev Training in this context we construct a loss predictor, composed of a class predictor p(\u00b7|\u03b8) followed by the log loss, which gets supervision from the true loss, and the gradient of the prediction gets supervision from the true gradient:\n\n$$m(h, y|\\theta) := L(p(h|\\theta), y), \\quad SG(h, y|\\theta) := \\partial m(h, y|\\theta)/\\partial h,$$\n\n$$L^{sob}_{SG}(\\theta) = \\ell(m(h, y|\\theta), L(p_h, y)) + \\ell_1\\left( \\frac{\\partial m(h, y|\\theta)}{\\partial h}, \\frac{\\partial L(p_h, y)}{\\partial h} \\right).$$\n\n\u2074 For the majority of the time the policy in Pong is uniform, since actions taken when the ball is far away from the player do not matter at all. Only in crucial situations does it peak so that the ball hits the paddle.\n\nIn the Sobolev Training framework, the target function is the loss of the main network $L(p_h, y)$, which we train a model m(h, y|\u03b8) to approximate, and in addition ensure that the model\u2019s derivatives $\\partial m(h, y|\\theta)/\\partial h$ are matched to the true derivatives $\\partial L(p_h, y)/\\partial h$. The model\u2019s derivatives $\\partial m(h, y|\\theta)/\\partial h$ are used as the synthetic gradient to decouple the main network.\n\nThis setting closely resembles what is known in reinforcement learning as critic methods [13].
In\nparticular, if we do not provide supervision on the gradient part, we end up with a loss critic. Similarly,\nif we do not provide supervision at the loss level, but only on the gradient component, we end up with a\nmethod that resembles VFBN [25]. In light of these connections, our approach in this application\nsetting can be seen as a generalisation and unification of several existing ones (see Table 1 for\nillustrations of these approaches).\nOne could ask why we need these additional constraints, and what is gained over using a neural\nnetwork based approximator directly [12]. The answer lies in the fact that gradient vector fields are a\ntiny subset of all vector fields, and while each neural network produces a valid vector field, almost no\n(standard) neural network produces valid gradient vector fields. Using non-gradient vector fields as\nupdate directions for learning can have catastrophic consequences \u2013 learning divergence, oscillations,\nchaotic behaviour, etc. The following proposition makes this observation more formal:\nProposition 4. If an approximator SG(h, y|\u03b8) produces a valid gradient vector field of some scalar\nfunction L, then the approximator\u2019s Jacobian matrix must be symmetric.\n\nIt is worth noting that having a symmetric Jacobian is an extremely rare property for a neural network\nmodel. For example, a linear model has a symmetric Jacobian if and only if its weight matrix is\nsymmetric. If we sample weights i.i.d. from a typical distribution (such as a Gaussian or a uniform distribution on an\ninterval), the probability of sampling such a matrix is 0, but it could be easy to learn with strong,\nsymmetry-enforcing updates. On the other hand, for highly non-linear neural networks, it is not only\nimprobable to randomly find such a model, but enforcing this constraint during learning becomes\nmuch harder too. This might be one of the reasons why linear SG modules work well in Jaderberg et\nal. 
[12], but non-linear convolutional SG struggled to achieve state-of-the-art performance.\nWhen using a Sobolev-like approach, SG always produces a valid gradient vector field by construction,\nthus avoiding the problem described.\nWe perform experiments on decoupling deep convolutional neural network image classifiers using\nsynthetic gradients produced by loss critics that are trained with Sobolev Training, and compare to\nregular loss critic training and regular synthetic gradient training. We report results on CIFAR-10 for\nthree network splits (and therefore three synthetic gradient modules) and on ImageNet with one and\nthree network splits (see footnote 5).\nThe results are shown in Table 1. With a naive SG model, we obtain 79.2% test accuracy on CIFAR-10.\nUsing an SG architecture which resembles a small version of the rest of the model makes learning\nmuch easier and leads to 88.5% accuracy, while Sobolev Training achieves 93.5% final performance.\nThe regular critic also trains well, achieving 93.2%, as the critic forces the lower part of the network\nto provide a representation which it can use to reduce the classification (and not just prediction) error.\nConsequently, it provides a learning signal which is well aligned with the main optimisation. However,\nthis can lead to building representations which are suboptimal for the rest of the network. Adding\nadditional gradient supervision by constructing our Sobolev SG module avoids this issue by making\nsure that synthetic gradients are truly aligned and gives an additional boost to the final accuracy.\nFor ImageNet [3] experiments based on ResNet50 [8], we obtain qualitatively similar results. Due\nto the complexity of the model and an almost 40% gap between no backpropagation and full\nbackpropagation results, the difference between methods with and without loss supervision grows\nsignificantly. This suggests that, at least for ResNet-like architectures, loss supervision is a crucial\ncomponent of an SG module. After splitting ResNet50 into four parts, the Sobolev SG achieves 87.4%\ntop-5 accuracy, while the regular critic SG achieves 86.9%, confirming our claim about suboptimal\nrepresentation being enforced by gradients from a regular critic. Sobolev Training results were also\nmuch more reliable in all experiments (significantly smaller standard deviation of the results).\n\nFootnote 5: N.b. the experiments presented use learning rates, annealing schedules, etc. optimised to maximise the\nbackpropagation baseline, rather than the synthetic gradient decoupled result (details in the SM).\n\n5 Discussion and Conclusion\n\nIn this paper we have introduced Sobolev Training for neural networks \u2013 a simple and effective way\nof incorporating knowledge about derivatives of a target function into the training of a neural network\nfunction approximator. We provided theoretical justification that encoding both a target function\u2019s\nvalue and its derivatives within a ReLU neural network is possible, and that this results in\nmore data-efficient learning. Additionally, we show that our proposal can be efficiently trained using\nstochastic approximations if computationally expensive Jacobians or Hessians are encountered.\nIn addition to toy experiments which validate our theoretical claims, we performed experiments to\nhighlight two very promising areas of application for such models: one being distillation/compression\nof models; the other being the application to various meta-optimisation techniques that build models\nof other models\u2019 dynamics (such as synthetic gradients, learning-to-learn, etc.). 
In both cases we obtain\nsignificant improvement over classical techniques, and we believe there are many other application\ndomains in which our proposal should give a solid performance boost.\nIn this work we focused on encoding true derivatives in the corresponding ones of the neural network.\nAnother possibility for future work is to encode information which one believes to be highly correlated\nwith derivatives. For example, curvature [18] is believed to be connected to uncertainty. Therefore,\ngiven a problem with known uncertainty at training points, one could use Sobolev Training to match\nthe second-order signal to the provided uncertainty signal. Finite differences can also be used to\napproximate gradients for black-box target functions, which could help when, for example, learning a\ngenerative temporal model. Another unexplored path would be to apply Sobolev Training to internal\nderivatives rather than just derivatives with respect to the inputs.\n\nReferences\n\n[1] Sonnet. https://github.com/deepmind/sonnet. 2017.\n\n[2] Mart\u00edn Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S\nCorrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on\nheterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.\n\n[3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical\nimage database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on,\npages 248\u2013255. IEEE, 2009.\n\n[4] Michael Fairbank and Eduardo Alonso. Value-gradient learning. In Neural Networks (IJCNN), The 2012\nInternational Joint Conference on, pages 1\u20138. IEEE, 2012.\n\n[5] Michael Fairbank, Eduardo Alonso, and Danil Prokhorov. Simple and fast calculation of the second-order\ngradients for globalized dual heuristic dynamic programming in neural networks. 
IEEE transactions on\nneural networks and learning systems, 23(10):1671\u20131676, 2012.\n\n[6] A Ronald Gallant and Halbert White. On learning the derivatives of an unknown mapping with multilayer\nfeedforward networks. Neural Networks, 5(1):129\u2013138, 1992.\n\n[7] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with\npruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.\n\n[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.\nIn Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770\u2013778,\n2016.\n\n[9] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv\npreprint arXiv:1503.02531, 2015.\n\n[10] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural networks, 4(2):251\u2013257, 1991.\n\n[11] Aapo Hyv\u00e4rinen. Estimation of non-normalized statistical models using score matching. Journal of\nMachine Learning Research, pages 695\u2013709, 2005.\n\n[12] Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, and Koray\nKavukcuoglu. Decoupled neural interfaces using synthetic gradients. arXiv preprint arXiv:1608.05343,\n2016.\n\n[13] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In NIPS, volume 13, pages 1008\u20131014,\n1999.\n\n[14] Steven G Krantz. Handbook of complex variables. Springer Science & Business Media, 2012.\n\n[15] W Thomas Miller, Paul J Werbos, and Richard S Sutton. Neural networks for control. MIT press, 1995.\n\n[16] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley,\nDavid Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. 
In\nInternational Conference on Machine Learning, pages 1928\u20131937, 2016.\n\n[17] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra,\nand Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602,\n2013.\n\n[18] Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. arXiv preprint\narXiv:1301.3584, 2013.\n\n[19] Salah Rifai, Gr\u00e9goire Mesnil, Pascal Vincent, Xavier Muller, Yoshua Bengio, Yann Dauphin, and Xavier\nGlorot. Higher order contractive auto-encoder. Machine Learning and Knowledge Discovery in Databases,\npages 645\u2013660, 2011.\n\n[20] Andrei A Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirkpatrick,\nRazvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. arXiv\npreprint arXiv:1511.06295, 2015.\n\n[21] Bharat Bhusan Sau and Vineeth N Balasubramanian. Deep model compression: Distilling knowledge from\nnoisy teachers. arXiv preprint arXiv:1610.09650, 2016.\n\n[22] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian\nSchrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go\nwith deep neural networks and tree search. Nature, 529(7587):484\u2013489, 2016.\n\n[23] Patrice Simard, Bernard Victorri, Yann LeCun, and John S Denker. Tangent prop - a formalism for specifying\nselected invariances in an adaptive network. In NIPS, volume 91, pages 895\u2013903, 1991.\n\n[24] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition.\narXiv preprint arXiv:1409.1556, 2014.\n\n[25] Takeru Miyato, Daisuke Okanohara, Shin-ichi Maeda, and Masanori Koyama. Synthetic gradient methods with\nvirtual forward-backward networks. ICLR workshop proceedings, 2017.\n\n[26] Yuval Tassa and Tom Erez. 
Least squares solutions of the HJB equation with neural network value-function\napproximators. IEEE transactions on neural networks, 18(4):1031\u20131041, 2007.\n\n[27] A\u00e4ron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal\nKalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio.\nCoRR abs/1609.03499, 2016.\n\n[28] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation,\n23(7):1661\u20131674, 2011.\n\n[29] Paul J Werbos. Approximate dynamic programming for real-time control and neural modeling. Handbook\nof intelligent control, 1992.\n\n[30] Anqi Wu, Mikio C Aoi, and Jonathan W Pillow. Exploiting gradients and Hessians in Bayesian optimization\nand Bayesian quadrature. arXiv preprint arXiv:1704.00060, 2017.\n\n[31] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance\nof convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928, 2016.", "award": [], "sourceid": 2246, "authors": [{"given_name": "Wojciech", "family_name": "Czarnecki", "institution": "DeepMind"}, {"given_name": "Simon", "family_name": "Osindero", "institution": "DeepMind"}, {"given_name": "Max", "family_name": "Jaderberg", "institution": "DeepMind"}, {"given_name": "Grzegorz", "family_name": "Swirszcz", "institution": "DeepMind @ Google"}, {"given_name": "Razvan", "family_name": "Pascanu", "institution": "Google DeepMind"}]}