{"title": "Neural Tangent Kernel: Convergence and Generalization in Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 8571, "page_last": 8580, "abstract": "At initialization, artificial neural networks (ANNs) are equivalent to Gaussian processes in the infinite-width limit, thus connecting them to kernel methods. We prove that the evolution of an ANN during training can also be described by a kernel: during gradient descent on the parameters of an ANN, the network function (which maps input vectors to output vectors) follows the so-called kernel gradient associated with a new object, which we call the Neural Tangent Kernel (NTK). This kernel is central to describe the generalization features of ANNs. While the NTK is random at initialization and varies during training, in the infinite-width limit it converges to an explicit limiting kernel and stays constant during training. This makes it possible to study the training of ANNs in function space instead of parameter space. Convergence of the training can then be related to the positive-definiteness of the limiting NTK.\n\nWe then focus on the setting of least-squares regression and show that in the infinite-width limit, the network function follows a linear differential equation during training. 
The convergence is fastest along the largest kernel principal components of the input data with respect to the NTK, hence suggesting a theoretical motivation for early stopping.

Finally we study the NTK numerically, observe its behavior for wide networks, and compare it to the infinite-width limit.", "full_text": "Neural Tangent Kernel: Convergence and Generalization in Neural Networks

Arthur Jacot
École Polytechnique Fédérale de Lausanne
arthur.jacot@netopera.net

Franck Gabriel
Imperial College London and École Polytechnique Fédérale de Lausanne
franckrgabriel@gmail.com

Clément Hongler
École Polytechnique Fédérale de Lausanne
clement.hongler@gmail.com

Abstract

At initialization, artificial neural networks (ANNs) are equivalent to Gaussian processes in the infinite-width limit (12; 9), thus connecting them to kernel methods. We prove that the evolution of an ANN during training can also be described by a kernel: during gradient descent on the parameters of an ANN, the network function f_θ (which maps input vectors to output vectors) follows the kernel gradient of the functional cost (which is convex, in contrast to the parameter cost) w.r.t. a new kernel: the Neural Tangent Kernel (NTK). This kernel is central to describing the generalization features of ANNs. While the NTK is random at initialization and varies during training, in the infinite-width limit it converges to an explicit limiting kernel and it stays constant during training. This makes it possible to study the training of ANNs in function space instead of parameter space. Convergence of the training can then be related to the positive-definiteness of the limiting NTK. We then focus on the setting of least-squares regression and show that in the infinite-width limit, the network function f_θ follows a linear differential equation during training.
The convergence is fastest along the largest kernel principal components of the input data with respect to the NTK, hence suggesting a theoretical motivation for early stopping.

Finally we study the NTK numerically, observe its behavior for wide networks, and compare it to the infinite-width limit.

1 Introduction

Artificial neural networks (ANNs) have achieved impressive results in numerous areas of machine learning. While it has long been known that ANNs can approximate any function with sufficiently many hidden neurons (7; 10), it is not known what the optimization of ANNs converges to. Indeed the loss surface of neural network optimization problems is highly non-convex: it has a high number of saddle points which may slow down the convergence (4). A number of results (3; 13; 14) suggest that for wide enough networks, there are very few "bad" local minima, i.e. local minima with much higher cost than the global minimum. More recently, the investigation of the geometry of the loss landscape at initialization has been the subject of a precise study (8).

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

The analysis of the dynamics of training in the large-width limit for shallow networks has seen recent progress as well (11). To the best of the authors' knowledge, the dynamics of deep networks has however remained an open problem until the present paper: see the contributions section below.

A particularly mysterious feature of ANNs is their good generalization properties in spite of their usual over-parametrization (16). It seems paradoxical that a reasonably large neural network can fit random labels, while still obtaining good test accuracy when trained on real data (19). It can be noted that in this case, kernel methods have the same properties (1).

In the infinite-width limit, ANNs have a Gaussian distribution described by a kernel (12; 9).
These kernels are used in Bayesian inference or Support Vector Machines, yielding results comparable to ANNs trained with gradient descent (9; 2). We will see that in the same limit, the behavior of ANNs during training is described by a related kernel, which we call the neural tangent kernel (NTK).

1.1 Contribution

We study the network function f_θ of an ANN, which maps an input vector to an output vector, where θ is the vector of the parameters of the ANN. In the limit as the widths of the hidden layers tend to infinity, the network function at initialization, f_θ, converges to a Gaussian distribution (12; 9).

In this paper, we investigate fully connected networks in this infinite-width limit, and describe the dynamics of the network function f_θ during training:

• During gradient descent, we show that the dynamics of f_θ follows that of the so-called kernel gradient descent in function space with respect to a limiting kernel, which only depends on the depth of the network, the choice of nonlinearity and the initialization variance.

• The convergence properties of ANNs during training can then be related to the positive-definiteness of the infinite-width limit NTK. The values of the network function f_θ outside the training set are described by the NTK, which is crucial to understand how ANNs generalize.

• For a least-squares regression loss, the network function f_θ follows a linear differential equation in the infinite-width limit, and the eigenfunctions of the Jacobian are the kernel principal components of the input data. This shows a direct connection to kernel methods and motivates the use of early stopping to reduce overfitting in the training of ANNs.

• Finally we investigate these theoretical results numerically for an artificial dataset (of points on the unit circle) and for the MNIST dataset.
In particular we observe that the behavior of wide ANNs is close to the theoretical limit.

2 Neural networks

In this article, we consider fully-connected ANNs with layers numbered from 0 (input) to L (output), each containing n_0, ..., n_L neurons, and with a Lipschitz, twice differentiable nonlinearity function σ : ℝ → ℝ with bounded second derivative.¹

This paper focuses on the ANN realization function F^(L) : ℝ^P → 𝓕, mapping parameters θ to functions f_θ in a space 𝓕. The dimension of the parameter space is P = Σ_{ℓ=0}^{L−1} (n_ℓ + 1) n_{ℓ+1}: the parameters consist of the connection matrices W^(ℓ) ∈ ℝ^{n_ℓ × n_{ℓ+1}} and bias vectors b^(ℓ) ∈ ℝ^{n_{ℓ+1}} for ℓ = 0, ..., L − 1. In our setup, the parameters are initialized as iid Gaussians N(0, 1).

For a fixed distribution p^in on the input space ℝ^{n_0}, the function space 𝓕 is defined as {f : ℝ^{n_0} → ℝ^{n_L}}. On this space, we consider the seminorm ‖·‖_{p^in}, defined in terms of the bilinear form

⟨f, g⟩_{p^in} = E_{x∼p^in}[f(x)^T g(x)].

In this paper, we assume that the input distribution p^in is the empirical distribution on a finite dataset x_1, ..., x_N, i.e. the sum of Dirac measures (1/N) Σ_{i=1}^{N} δ_{x_i}.

¹ While these smoothness assumptions greatly simplify the proofs of our results, they do not seem to be strictly needed for the results to hold true.

We define the network function by f_θ(x) := α̃^(L)(x; θ), where the functions α̃^(ℓ)(·; θ) : ℝ^{n_0} → ℝ^{n_ℓ} (called preactivations) and α^(ℓ)(·; θ) : ℝ^{n_0} → ℝ^{n_ℓ} (called activations) are defined from the 0-th to the L-th layer by:

α^(0)(x; θ) = x
α̃^(ℓ+1)(x; θ) = (1/√n_ℓ) W^(ℓ) α^(ℓ)(x; θ) + β b^(ℓ)
α^(ℓ)(x; θ) = σ(α̃^(ℓ)(x; θ)),

where the nonlinearity σ is applied entrywise. The scalar β > 0 is a parameter which allows us to tune the influence of the bias on the training.

Remark 1. Our definition of the realization function F^(L) slightly differs from the classical one. Usually, the factors 1/√n_ℓ and the parameter β are absent and the parameters are initialized using what is sometimes called LeCun initialization, taking W^(ℓ)_{ij} ∼ N(0, 1/n_ℓ) and b^(ℓ)_j ∼ N(0, 1) (or sometimes b^(ℓ)_j = 0) to compensate. While the set of representable functions F^(L)(ℝ^P) is the same for both parametrizations (with or without the factors 1/√n_ℓ and β), the derivatives of the realization function with respect to the connections ∂_{W^(ℓ)_{ij}} F^(L) and bias ∂_{b^(ℓ)_j} F^(L) are scaled by 1/√n_ℓ and β respectively in comparison to the classical parametrization.

The factors 1/√n_ℓ are key to obtaining a consistent asymptotic behavior of neural networks as the widths of the hidden layers n_1, ..., n_{L−1} grow to infinity. However a side effect of these factors is that they greatly reduce the influence of the connection weights during training when n_ℓ is large: the factor β is introduced to balance the influence of the bias and connection weights. In our numerical experiments, we take β = 0.1 and use a learning rate of 1.0, which is larger than usual, see Section 6. This gives a behaviour similar to that of a classical network of width 100 with a learning rate of 0.01.

3 Kernel gradient

The training of an ANN consists in optimizing f_θ in the function space 𝓕 with respect to a functional cost C : 𝓕 → ℝ, such as a regression or cross-entropy cost.
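As a concrete illustration of the parametrization above, the recursion for α̃^(ℓ+1) can be sketched in a few lines of NumPy. This is our own sketch (the function names and layer sizes are ours, not from the paper); all parameters are iid N(0, 1) and the 1/√n_ℓ and β factors live in the forward pass, with the ReLU nonlinearity used in the experiments of Section 6:

```python
import numpy as np

def init_params(layer_sizes, rng):
    # All parameters are iid N(0, 1); the 1/sqrt(n_l) and beta factors live in the forward pass.
    return [(rng.standard_normal((m, n)), rng.standard_normal(n))
            for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(params, x, beta=0.1, sigma=lambda z: np.maximum(z, 0.0)):
    # alpha~^(l+1) = W^(l) alpha^(l) / sqrt(n_l) + beta * b^(l); no nonlinearity on the output.
    a = x
    for i, (W, b) in enumerate(params):
        pre = a @ W / np.sqrt(W.shape[0]) + beta * b
        a = sigma(pre) if i < len(params) - 1 else pre
    return a

rng = np.random.default_rng(0)
params = init_params([2, 500, 500, 1], rng)
out = forward(params, np.array([1.0, 0.0]))
```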
Even for a convex functional cost C, the composite cost C ∘ F^(L) : ℝ^P → ℝ is in general highly non-convex (3). We will show that during training, the network function f_θ follows a descent along the kernel gradient with respect to the Neural Tangent Kernel (NTK), which we introduce in Section 4. This makes it possible to study the training of ANNs in the function space 𝓕, on which the cost C is convex.

A multi-dimensional kernel K is a function ℝ^{n_0} × ℝ^{n_0} → ℝ^{n_L × n_L}, which maps any pair (x, x′) to an n_L × n_L matrix such that K(x, x′) = K(x′, x)^T (equivalently, K is a symmetric tensor in 𝓕 ⊗ 𝓕). Such a kernel defines a bilinear map on 𝓕, taking the expectation over independent x, x′ ∼ p^in:

⟨f, g⟩_K := E_{x,x′∼p^in}[f(x)^T K(x, x′) g(x′)].

The kernel K is positive definite with respect to ‖·‖_{p^in} if ‖f‖_{p^in} > 0 ⟹ ‖f‖_K > 0.

We denote by 𝓕* the dual of 𝓕 with respect to p^in, i.e. the set of linear forms μ : 𝓕 → ℝ of the form μ = ⟨d, ·⟩_{p^in} for some d ∈ 𝓕. Two elements of 𝓕 define the same linear form if and only if they are equal on the data. The constructions in the paper do not depend on the element d ∈ 𝓕 chosen in order to represent μ as ⟨d, ·⟩_{p^in}. Using the fact that the partial application of the kernel K_{i,·}(x, ·) is a function in 𝓕, we can define a map Φ_K : 𝓕* → 𝓕 mapping a dual element μ = ⟨d, ·⟩_{p^in} to the function f_μ = Φ_K(μ) with values:

f_{μ,i}(x) = μ K_{i,·}(x, ·) = ⟨d, K_{i,·}(x, ·)⟩_{p^in}.

For our setup, which is that of a finite dataset x_1, ..., x_N ∈ ℝ^{n_0}, the cost functional C only depends on the values of f ∈ 𝓕 at the data points.
As a result, the (functional) derivative of the cost C at a point f_0 ∈ 𝓕 can be viewed as an element of 𝓕*, which we write ∂^in_f C|_{f_0}. We denote by d|_{f_0} ∈ 𝓕 a corresponding dual element, such that ∂^in_f C|_{f_0} = ⟨d|_{f_0}, ·⟩_{p^in}.

The kernel gradient ∇_K C|_{f_0} ∈ 𝓕 is defined as Φ_K(∂^in_f C|_{f_0}). In contrast to ∂^in_f C, which is only defined on the dataset, the kernel gradient generalizes to values x outside the dataset thanks to the kernel K:

∇_K C|_{f_0}(x) = (1/N) Σ_{j=1}^{N} K(x, x_j) d|_{f_0}(x_j).

A time-dependent function f(t) follows the kernel gradient descent with respect to K if it satisfies the differential equation

∂_t f(t) = −∇_K C|_{f(t)}.

During kernel gradient descent, the cost C(f(t)) evolves as

∂_t C|_{f(t)} = −⟨d|_{f(t)}, ∇_K C|_{f(t)}⟩_{p^in} = −‖d|_{f(t)}‖²_K.

Convergence to a critical point of C is hence guaranteed if the kernel K is positive definite with respect to ‖·‖_{p^in}: the cost is then strictly decreasing except at points such that ‖d|_{f(t)}‖_{p^in} = 0. If the cost is convex and bounded from below, the function f(t) therefore converges to a global minimum as t → ∞.

3.1 Random functions approximation

As a starting point to understand the convergence of ANN gradient descent to kernel gradient descent in the infinite-width limit, we introduce a simple model, inspired by the approach of (15).

A kernel K can be approximated by a choice of P random functions f^(p) sampled independently from any distribution on 𝓕 whose (non-centered) covariance is given by the kernel K:

E[f^(p)_k(x) f^(p)_{k′}(x′)] = K_{kk′}(x, x′).

These functions define a random linear parametrization F^lin : ℝ^P → 𝓕,

θ ↦ f^lin_θ = (1/√P) Σ_{p=1}^{P} θ_p f^(p).

The partial derivatives of the parametrization are given by

∂_{θ_p} F^lin(θ) = (1/√P) f^(p).

Optimizing the cost C ∘ F^lin through gradient descent, the parameters follow the ODE:

∂_t θ_p(t) = −∂_{θ_p}(C ∘ F^lin)(θ(t)) = −(1/√P) ∂^in_f C|_{f^lin_{θ(t)}} f^(p) = −(1/√P) ⟨d|_{f^lin_{θ(t)}}, f^(p)⟩_{p^in}.

As a result, the function f^lin_{θ(t)} evolves according to

∂_t f^lin_{θ(t)} = (1/√P) Σ_{p=1}^{P} ∂_t θ_p(t) f^(p) = −(1/P) Σ_{p=1}^{P} ⟨d|_{f^lin_{θ(t)}}, f^(p)⟩_{p^in} f^(p),

where the right-hand side is equal to the kernel gradient −∇_K̃ C with respect to the tangent kernel

K̃ = Σ_{p=1}^{P} ∂_{θ_p} F^lin(θ) ⊗ ∂_{θ_p} F^lin(θ) = (1/P) Σ_{p=1}^{P} f^(p) ⊗ f^(p).

This is a random n_L-dimensional kernel with values K̃_{ii′}(x, x′) = (1/P) Σ_{p=1}^{P} f^(p)_i(x) f^(p)_{i′}(x′).

Performing gradient descent on the cost C ∘ F^lin is therefore equivalent to performing kernel gradient descent with the tangent kernel K̃ in the function space. In the limit as P → ∞, by the law of large numbers, the (random) tangent kernel K̃ tends to the fixed kernel K, which makes this method an approximation of kernel gradient descent with respect to the limiting kernel K.

4 Neural tangent kernel

For ANNs trained using gradient descent on the composition C ∘ F^(L), the situation is very similar to that studied in Section 3.1.
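Before turning to the NTK itself, the random-features model of Section 3.1 can be checked numerically. The sketch below is ours, not from the paper: it uses random Fourier features, whose non-centered covariance is the RBF kernel (a standard construction), and verifies that the tangent kernel (1/P) Σ_p f^(p)(x) f^(p)(x′) of the induced linear parametrization concentrates around the limiting kernel K as P grows:

```python
import numpy as np

rng = np.random.default_rng(0)
d, P = 3, 200_000

# Random functions f^(p)(x) = sqrt(2) cos(w_p . x + b_p) (random Fourier features):
# their non-centered covariance E[f^(p)(x) f^(p)(x')] is the RBF kernel exp(-|x - x'|^2 / 2).
w = rng.standard_normal((P, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=P)
feat = lambda x: np.sqrt(2.0) * np.cos(w @ x + b)

x = rng.standard_normal(d)
xp = rng.standard_normal(d)

# Tangent kernel of the linear parametrization: K~(x, x') = (1/P) sum_p f^(p)(x) f^(p)(x').
K_tangent = feat(x) @ feat(xp) / P
K_limit = np.exp(-np.sum((x - xp) ** 2) / 2.0)
```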
During training, the network function f_θ evolves along the (negative) kernel gradient

∂_t f_θ(t) = −∇_{Θ^(L)} C|_{f_θ(t)}

with respect to the neural tangent kernel (NTK)

Θ^(L)(θ) = Σ_{p=1}^{P} ∂_{θ_p} F^(L)(θ) ⊗ ∂_{θ_p} F^(L)(θ).

However, in contrast to F^lin, the realization function F^(L) of ANNs is not linear. As a consequence, the derivatives ∂_{θ_p} F^(L)(θ) and the neural tangent kernel depend on the parameters θ. The NTK is therefore random at initialization and varies during training, which makes the analysis of the convergence of f_θ more delicate.

In the next subsections, we show that, in the infinite-width limit, the NTK becomes deterministic at initialization and stays constant during training. Since f_θ at initialization is Gaussian in the limit, the asymptotic behavior of f_θ during training can be made explicit in the function space 𝓕.

4.1 Initialization

As observed in (12; 9), the output functions f_{θ,i} for i = 1, ..., n_L tend to iid Gaussian processes in the infinite-width limit (a proof in our setup is given in the appendix):

Proposition 1. For a network of depth L at initialization, with a Lipschitz nonlinearity σ, and in the limit as n_1, ..., n_{L−1} → ∞, the output functions f_{θ,k}, for k = 1, ..., n_L, tend (in law) to iid centered Gaussian processes of covariance Σ^(L), where Σ^(L) is defined recursively by:

Σ^(1)(x, x′) = (1/n_0) x^T x′ + β²
Σ^(L+1)(x, x′) = E_{f∼N(0,Σ^(L))}[σ(f(x)) σ(f(x′))] + β²,

taking the expectation with respect to a centered Gaussian process f of covariance Σ^(L).

Remark 2.
Strictly speaking, the existence of a suitable Gaussian measure with covariance Σ^(L) is not needed: we only deal with the values of f at x, x′ (the joint measure on f(x), f(x′) is simply a Gaussian vector in 2D). For the same reasons, in the proofs of Proposition 1 and Theorem 1, we will freely speak of Gaussian processes without discussing their existence.

The first key result of our paper (proven in the appendix) is the following: in the same limit, the Neural Tangent Kernel (NTK) converges in probability to an explicit deterministic limit.

Theorem 1. For a network of depth L at initialization, with a Lipschitz nonlinearity σ, and in the limit as the layer widths n_1, ..., n_{L−1} → ∞, the NTK Θ^(L) converges in probability to a deterministic limiting kernel:

Θ^(L) → Θ^(L)_∞ ⊗ Id_{n_L}.

The scalar kernel Θ^(L)_∞ : ℝ^{n_0} × ℝ^{n_0} → ℝ is defined recursively by

Θ^(1)_∞(x, x′) = Σ^(1)(x, x′)
Θ^(L+1)_∞(x, x′) = Θ^(L)_∞(x, x′) Σ̇^(L+1)(x, x′) + Σ^(L+1)(x, x′),

where

Σ̇^(L+1)(x, x′) = E_{f∼N(0,Σ^(L))}[σ̇(f(x)) σ̇(f(x′))],

taking the expectation with respect to a centered Gaussian process f of covariance Σ^(L), and where σ̇ denotes the derivative of σ.

Remark 3. By Rademacher's theorem, σ̇ is defined everywhere, except perhaps on a set of zero Lebesgue measure.

Note that the limiting Θ^(L)_∞ only depends on the choice of σ, the depth of the network and the variance of the parameters at initialization (which is equal to 1 in our setting).

4.2 Training

Our second key result is that the NTK stays asymptotically constant during training.
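The recursions of Proposition 1 and Theorem 1 can be evaluated numerically by Monte Carlo, since only the values of f at the pair (x, x′) enter the expectations (Remark 2): one samples (f(x), f(x′)) from the 2×2 Gaussian with covariance given by Σ^(L). The sketch below is our own (tanh nonlinearity, sample size and seed arbitrary), not code from the paper:

```python
import numpy as np

def limiting_ntk(x, xp, L, beta=0.1, n_mc=400_000, seed=0):
    # Monte Carlo evaluation of the recursions for Sigma^(L) and Theta^(L)_inf (sigma = tanh).
    rng = np.random.default_rng(seed)
    sigma = np.tanh
    dsigma = lambda z: 1.0 - np.tanh(z) ** 2
    n0 = x.size
    # Sigma^(1) restricted to the pair (x, x'), as a 2x2 covariance matrix.
    S = np.array([[x @ x, x @ xp], [xp @ x, xp @ xp]]) / n0 + beta ** 2
    theta = S[0, 1]                      # Theta^(1)_inf(x, x') = Sigma^(1)(x, x')
    for _ in range(L - 1):
        z = rng.multivariate_normal(np.zeros(2), S, size=n_mc)  # (f(x), f(x')) ~ N(0, S)
        s01 = np.mean(sigma(z[:, 0]) * sigma(z[:, 1]))
        S = np.array([[np.mean(sigma(z[:, 0]) ** 2), s01],
                      [s01, np.mean(sigma(z[:, 1]) ** 2)]]) + beta ** 2
        sdot = np.mean(dsigma(z[:, 0]) * dsigma(z[:, 1]))       # Sigma-dot^(L+1)(x, x')
        theta = theta * sdot + S[0, 1]                          # Theta^(L+1)_inf recursion
    return theta

x, xp = np.array([1.0, 0.0]), np.array([0.0, 1.0])
t1 = limiting_ntk(x, xp, L=1)
t3 = limiting_ntk(x, xp, L=3)
```

For orthogonal inputs and depth 1, the value reduces to Σ^(1)(x, x′) = 0 + β² exactly, which gives a quick sanity check of the recursion.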
This applies for a slightly more general definition of training: the parameters are updated according to a training direction d_t ∈ 𝓕:

∂_t θ_p(t) = ⟨∂_{θ_p} F^(L)(θ(t)), d_t⟩_{p^in}.

In the case of gradient descent, d_t = −d|_{f_θ(t)} (see Section 3), but the direction may depend on another network, as is the case for e.g. Generative Adversarial Networks (6). We only assume that the integral ∫_0^T ‖d_t‖_{p^in} dt stays stochastically bounded as the width tends to infinity, which is verified for e.g. least-squares regression, see Section 5.

Theorem 2. Assume that σ is a Lipschitz, twice differentiable nonlinearity function, with bounded second derivative. For any T such that the integral ∫_0^T ‖d_t‖_{p^in} dt stays stochastically bounded, as n_1, ..., n_{L−1} → ∞, we have, uniformly for t ∈ [0, T],

Θ^(L)(t) → Θ^(L)_∞ ⊗ Id_{n_L}.

As a consequence, in this limit, the dynamics of f_θ is described by the differential equation

∂_t f_θ(t) = Φ_{Θ^(L)_∞ ⊗ Id_{n_L}}(⟨d_t, ·⟩_{p^in}).

Remark 4. As the proof of the theorem (in the appendix) shows, the variation during training of the individual activations in the hidden layers shrinks as their width grows. However their collective variation is significant, which allows the parameters of the lower layers to learn: in the formula of the limiting NTK Θ^(L+1)_∞(x, x′) in Theorem 1, the second summand Σ^(L+1) represents the learning due to the last layer, while the first summand represents the learning performed by the lower layers.

As discussed in Section 3, the convergence of kernel gradient descent to a critical point of the cost C is guaranteed for positive definite kernels.
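At finite width, the NTK of a concrete network can be estimated directly from its definition Θ^(L) = Σ_p ∂_{θ_p} F^(L) ⊗ ∂_{θ_p} F^(L). Below is a small finite-difference sketch of ours (not code from the paper); we use tanh to match the smoothness assumptions of Theorem 2, and the flat parameter packing is our own convention:

```python
import numpy as np

def forward(theta, x, sizes, beta=0.1):
    # NTK-parametrized forward pass; theta is a flat vector (our packing convention).
    a, k = x, 0
    for i, (m, n) in enumerate(zip(sizes[:-1], sizes[1:])):
        W = theta[k:k + m * n].reshape(m, n); k += m * n
        b = theta[k:k + n]; k += n
        pre = a @ W / np.sqrt(m) + beta * b
        a = np.tanh(pre) if i < len(sizes) - 2 else pre   # smooth nonlinearity
    return a[0]                                           # scalar output (n_L = 1)

def grad_f(theta, x, sizes, eps=1e-4):
    # d f_theta(x) / d theta_p by central differences, for all p.
    g = np.empty_like(theta)
    for p in range(theta.size):
        tp, tm = theta.copy(), theta.copy()
        tp[p] += eps
        tm[p] -= eps
        g[p] = (forward(tp, x, sizes) - forward(tm, x, sizes)) / (2.0 * eps)
    return g

rng = np.random.default_rng(1)
sizes = [2, 50, 1]
theta = rng.standard_normal(sum((m + 1) * n for m, n in zip(sizes[:-1], sizes[1:])))
x1, x2 = np.array([1.0, 0.0]), np.array([0.6, 0.8])
g1, g2 = grad_f(theta, x1, sizes), grad_f(theta, x2, sizes)
ntk_11, ntk_12, ntk_22 = g1 @ g1, g1 @ g2, g2 @ g2        # entries of Theta^(L)(theta)
```

By construction each NTK value is an inner product of parameter gradients, so the diagonal entries are positive and the 2×2 kernel matrix is positive semi-definite.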
The limiting NTK is positive definite if the span of the derivatives ∂_{θ_p} F^(L), p = 1, ..., P, becomes dense in 𝓕 w.r.t. the p^in-norm as the width grows to infinity. It seems natural to postulate that the span of the preactivations of the last layer (which themselves appear in ∂_{θ_p} F^(L), corresponding to the connection weights of the last layer) becomes dense in 𝓕, for a large family of measures p^in and nonlinearities (see e.g. (7; 10) for classical theorems about ANNs and approximation).

5 Least-squares regression

Given a goal function f* and input distribution p^in, the least-squares regression cost is

C(f) = (1/2) ‖f − f*‖²_{p^in} = (1/2) E_{x∼p^in}[‖f(x) − f*(x)‖²].

Theorems 1 and 2 apply to an ANN trained on such a cost. Indeed the norm of the training direction ‖d(f)‖_{p^in} = ‖f* − f‖_{p^in} is strictly decreasing during training, bounding the integral. We are therefore interested in the behavior of a function f_t during kernel gradient descent with a kernel K (we are of course especially interested in the case K = Θ^(L)_∞ ⊗ Id_{n_L}):

∂_t f_t = Φ_K(⟨f* − f, ·⟩_{p^in}).

The solution of this differential equation can be expressed in terms of the map Π : f ↦ Φ_K(⟨f, ·⟩_{p^in}):

f_t = f* + e^{−tΠ}(f_0 − f*),

where e^{−tΠ} = Σ_{k=0}^{∞} ((−t)^k / k!) Π^k is the exponential of −tΠ. If Π can be diagonalized by eigenfunctions f^(i) with eigenvalues λ_i, the exponential e^{−tΠ} has the same eigenfunctions with eigenvalues e^{−tλ_i}.

For a finite dataset x_1, ..., x_N of size N, the map Π takes the form

Π(f)_k(x) = (1/N) Σ_{i=1}^{N} Σ_{k′=1}^{n_L} f_{k′}(x_i) K_{kk′}(x_i, x).

The map Π has at most N n_L positive eigenfunctions, and they are the kernel principal components f^(1), ..., f^(N n_L) of the data with respect to the kernel K (17; 18). The corresponding eigenvalues λ_i are the variance captured by the component.

Decomposing the difference (f* − f_0) = Δ^0_f + Δ^1_f + ... + Δ^{N n_L}_f along the eigenspaces of Π, the trajectory of the function f_t reads

f_t = f* + Δ^0_f + Σ_{i=1}^{N n_L} e^{−tλ_i} Δ^i_f,

where Δ^0_f is in the kernel (null-space) of Π and Δ^i_f ∝ f^(i).

The above decomposition can be seen as a motivation for the use of early stopping. The convergence is indeed faster along the eigenspaces corresponding to larger eigenvalues λ_i. Early stopping hence focuses the convergence on the most relevant kernel principal components, while avoiding to fit the ones in eigenspaces with lower eigenvalues (such directions are typically the 'noisier' ones: for instance, in the case of the RBF kernel, lower eigenvalues correspond to high-frequency functions).

Note that by the linearity of the map e^{−tΠ}, if f_0 is initialized with a Gaussian distribution (as is the case for ANNs in the infinite-width limit), then f_t is Gaussian for all times t.
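The spectral picture above can be checked on a toy problem. In the sketch below (ours, not from the paper), we use an RBF Gram matrix as a stand-in for the kernel, run an Euler discretization of kernel gradient descent on the data points, and verify that the residual decays fastest along the top eigenvectors of Π:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
X = rng.standard_normal((N, 2))
# Stand-in positive-definite kernel (RBF) in place of Theta^(L)_inf.
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)
y_star = rng.standard_normal(N)

# Euler discretization of kernel gradient descent for the least-squares cost,
# restricted to the data points (f_0 = 0): d f / dt = (1/N) K (y* - f).
v = np.zeros(N)
eta, steps = 0.1, 2000
for _ in range(steps):
    v += eta * K @ (y_star - v) / N

# On the data, Pi acts as K/N; residuals shrink like (1 - eta * lambda_i)^steps,
# the discrete analogue of e^{-t lambda_i}, along each eigenvector.
lam, U = np.linalg.eigh(K / N)          # eigenvalues in ascending order
r0 = U.T @ (np.zeros(N) - y_star)
rT = U.T @ (v - y_star)
decay = np.abs(rT) / np.abs(r0)
```

The component associated with the largest eigenvalue is essentially fit after a few hundred steps, while low-eigenvalue components barely move, which is the mechanism behind the early-stopping motivation above.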
Assuming that the kernel is positive definite on the data (implying that the N n_L × N n_L Gram matrix K̃ = (K_{kk′}(x_i, x_j))_{ik,jk′} is invertible), in the limit t → ∞, we get that f_∞ = f* + Δ^0_f = f_0 − Σ_i Δ^i_f takes the form

f_{∞,k}(x) = κ^T_{x,k} K̃^{−1} y* + (f_0(x) − κ^T_{x,k} K̃^{−1} y_0),

with the N n_L-vectors κ_{x,k}, y* and y_0 given by

κ_{x,k} = (K_{kk′}(x, x_i))_{i,k′}
y* = (f*_k(x_i))_{i,k}
y_0 = (f_{0,k}(x_i))_{i,k}.

The first term, the mean, has an important statistical interpretation: it is the maximum-a-posteriori (MAP) estimate given a Gaussian prior on functions f_k ∼ N(0, Θ^(L)_∞) and the conditions f_k(x_i) = f*_k(x_i). Equivalently, it is equal to kernel ridge regression (18) as the regularization goes to zero (λ → 0). The second term is a centered Gaussian whose variance vanishes on the points of the dataset.

6 Numerical experiments

In the following numerical experiments, fully connected ANNs of various widths are compared to the theoretical infinite-width limit. We choose the size of the hidden layers to all be equal to the same value n := n_1 = ... = n_{L−1} and we take the ReLU nonlinearity σ(x) = max(0, x).

In the first two experiments, we consider the case n_0 = 2. Moreover, the input elements are taken on the unit circle. This can be motivated by the structure of high-dimensional data, where the centered data points often have roughly the same norm.²

In all experiments, we took n_L = 1 (note that by our results, a network with n_L outputs behaves asymptotically like n_L networks with scalar outputs trained independently).
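The t → ∞ mean of Section 5, i.e. the kernel regression formula f_{∞,k}(x) = κ^T_{x,k} K̃^{−1} y* when f_0 = 0, is what the experiments below compare finite networks against. A minimal self-contained sketch (ours, with a stand-in RBF kernel in place of Θ^(L)_∞ and an arbitrary target):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=6)      # 1D training inputs (n_0 = 1, n_L = 1 here)
y_star = np.sin(3.0 * X)                # values of the goal function f* on the data

def k(a, b):
    # Stand-in positive-definite kernel (RBF) in place of Theta^(L)_inf.
    return np.exp(-((a - b) ** 2) / 0.5)

K_gram = k(X[:, None], X[None, :])      # the N x N Gram matrix K~
alpha = np.linalg.solve(K_gram, y_star) # K~^{-1} y*

def f_inf(x):
    # Mean of the limiting distribution for f_0 = 0: kappa_x^T K~^{-1} y*.
    return k(x, X) @ alpha

preds = np.array([f_inf(x) for x in X])
```

As expected from the formula, the mean interpolates the training data exactly, since the variance of the limiting Gaussian vanishes on the dataset.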
Finally, the value of the parameter β is chosen as 0.1, see Remark 1.

6.1 Convergence of the NTK

The first experiment illustrates the convergence of the NTK Θ^(L) of a network of depth L = 4 for two different widths n = 500, 10000. The function Θ^(4)(x_0, x) is plotted for a fixed x_0 = (1, 0) and x = (cos(γ), sin(γ)) on the unit circle in Figure 1. To observe the distribution of the NTK, 10 independent initializations are performed for both widths. The kernels are plotted at initialization t = 0 and then after 200 steps of gradient descent with learning rate 1.0 (i.e. at t = 200). We approximate the function f*(x) = x_1 x_2 with a least-squares cost on random N(0, 1) inputs.

² The classical example is for data following a Gaussian distribution N(0, Id_{n_0}): as the dimension n_0 grows, all data points have approximately the same norm √n_0.

Figure 1: Convergence of the NTK to a fixed limit for two widths n and two times t.
Figure 2: Network function f_θ near convergence for two widths n and 10th, 50th and 90th percentiles of the asymptotic Gaussian distribution.

For the wider network, the NTK shows less variance and is smoother. It is interesting to note that the expectation of the NTK is very close for both network widths. After 200 steps of training, we observe that the NTK tends to "inflate". As expected, this effect is much less apparent for the wider network (n = 10000), where the NTK stays almost fixed, than for the smaller network (n = 500).

6.2 Kernel regression

For a regression cost, the infinite-width limit network function f_θ(t) has a Gaussian distribution for all times t and in particular at convergence t → ∞ (see Section 5). We compared the theoretical Gaussian distribution at t → ∞ to the distribution of the network function f_θ(T) of a finite-width network for a large time T = 1000.
For two different widths n = 50, 1000 and for 10 random initializations each, a network is trained on a least-squares cost on 4 points of the unit circle for 1000 steps with learning rate 1.0 and then plotted in Figure 2.

We also approximated the kernels Θ^(4)_∞ and Σ^(4) using a large-width network (n = 10000) and used them to calculate and plot the 10th, 50th and 90th percentiles of the t → ∞ limiting Gaussian distribution.

The distributions of the network functions are very similar for both widths: their mean and variance appear to be close to those of the limiting distribution t → ∞. Even for relatively small widths (n = 50), the NTK gives a good indication of the distribution of f_θ(t) as t → ∞.

6.3 Convergence along a principal component

We now illustrate our result on the MNIST dataset of handwritten digits made up of grayscale images of dimension 28 × 28, yielding a dimension of n_0 = 784.

We computed the first 3 principal components of a batch of N = 512 digits with respect to the NTK of a high-width network n = 10000 (giving an approximation of the limiting kernel) using a power iteration method. The respective eigenvalues are λ_1 = 0.0457, λ_2 = 0.00108 and λ_3 = 0.00078. The kernel PCA is non-centered, so the first component is almost equal to the constant function, which explains the large gap between the first and second eigenvalues.³ The next two components are much more interesting, as can be seen in Figure 3a, where the batch is plotted with x and y coordinates corresponding to the 2nd and 3rd components.

We have seen in Section 5 how the convergence of kernel gradient descent follows the kernel principal components.
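Non-centered kernel PCA by power iteration, as used above, can be sketched as follows. This is our own illustration (we deflate after each component and use a small random Gram matrix in place of the NTK on the batch):

```python
import numpy as np

def top_eigpair(K, iters=1000, seed=0):
    # Power iteration for the leading eigenpair of a symmetric PSD Gram matrix.
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(K.shape[0])
    for _ in range(iters):
        v = K @ v
        v /= np.linalg.norm(v)
    return v @ K @ v, v                   # Rayleigh-quotient eigenvalue, eigenvector

def top_components(K, k=3):
    # First k (non-centered) kernel principal components, by deflation.
    out = []
    for _ in range(k):
        lam, v = top_eigpair(K)
        out.append((lam, v))
        K = K - lam * np.outer(v, v)      # deflate the component just found
    return out

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 40))
K = A @ A.T / 40                          # small stand-in Gram matrix (the paper uses the NTK)
lams = [lam for lam, _ in top_components(K, k=3)]
```

The eigenvalues returned this way agree with a direct eigendecomposition, and each eigenvector gives the values of a principal component f^(i) on the batch.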
If the difference at initialization f_0 − f* is equal (or proportional) to one of the principal components f^(i), then the function will converge along a straight line (in the function space) to f* at an exponential rate e^{−λ_i t}.

³ It can be observed numerically that if we choose β = 1.0 instead of our recommended 0.1, the gap between the first and the second principal component is about ten times bigger, which makes training more difficult.

Figure 3: (a) The 2nd and 3rd principal components of MNIST. (b) Deviation of the network function f_θ from the straight line. (c) Convergence of f_θ along the 2nd principal component.

We tested whether ANNs of various widths n = 100, 1000, 10000 behave in a similar manner. We set the goal of the regression cost to f* = f_θ(0) + 0.5 f^(2) and let the network converge. At each time step t, we decomposed the difference f_θ(t) − f* into a component g_t proportional to f^(2) and another one h_t orthogonal to f^(2). In the infinite-width limit, the first component decays exponentially fast, ‖g_t‖_{p^in} = 0.5 e^{−λ_2 t}, while the second is null (h_t = 0), as the function converges along a straight line.

As expected, we see in Figure 3b that the wider the network, the less it deviates from the straight line (for each width n we performed two independent trials). As the width grows, the trajectory along the 2nd principal component (shown in Figure 3c) converges to the theoretical limit shown in blue.

A surprising observation is that smaller networks appear to converge faster than wider ones.
This may be explained by the inflation of the NTK observed in our first experiment. Indeed, multiplying the NTK by a factor a is equivalent to multiplying the learning rate by the same factor. However, note that since the NTK of a large-width network is more stable during training, larger learning rates can in principle be taken. One must hence be careful when comparing convergence speeds in terms of the number of steps (rather than in terms of the time t): both the inflation effect and the learning rate must be taken into account.

7 Conclusion

This paper introduces a new tool to study ANNs, the Neural Tangent Kernel (NTK), which describes the local dynamics of an ANN during gradient descent. This leads to a new connection between ANN training and kernel methods: in the infinite-width limit, an ANN can be described in function space directly by the limit of the NTK, an explicit constant kernel Θ(L)∞, which only depends on its depth, nonlinearity and parameter initialization variance. More precisely, in this limit, ANN gradient descent is shown to be equivalent to a kernel gradient descent with respect to Θ(L)∞. The limit of the NTK is hence a powerful tool to understand the generalization properties of ANNs, and it allows one to study the influence of the depth and nonlinearity on the learning abilities of the network. The analysis of training using the NTK allows one to relate the convergence of ANN training to the positive-definiteness of the limiting NTK, and leads to a characterization of the directions favored by early stopping methods.

Acknowledgements

The authors thank K. Kytölä for many interesting discussions. The second author was supported by the ERC CG CRITICAL.
The last author acknowledges support from the ERC SG Constamis, the NCCR SwissMAP, the Blavatnik Family Foundation and the Latsis Foundation.

References

[1] M. Belkin, S. Ma, and S. Mandal. To understand deep learning we need to understand kernel learning. arXiv preprint, Feb 2018.

[2] Y. Cho and L. K. Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems 22, pages 342–350. Curran Associates, Inc., 2009.

[3] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun. The loss surfaces of multilayer networks. Journal of Machine Learning Research, 38:192–204, Nov 2015.

[4] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2933–2941, Cambridge, MA, USA, 2014. MIT Press.

[5] S. S. Dragomir. Some Gronwall Type Inequalities and Applications. Nova Science Publishers, 2003.

[6] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2672–2680, Jun 2014.

[7] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

[8] R. Karakida, S. Akaho, and S.-i. Amari. Universal statistics of Fisher information in deep neural networks: Mean field approach. arXiv preprint, Jun 2018.

[9] J. H. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein.
Deep neural networks as Gaussian processes. In ICLR, 2018.

[10] M. Leshno, V. Lin, A. Pinkus, and S. Schocken. Multilayer feedforward networks with a non-polynomial activation function can approximate any function. Neural Networks, 6(6):861–867, 1993.

[11] S. Mei, A. Montanari, and P.-M. Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.

[12] R. M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1996.

[13] R. Pascanu, Y. N. Dauphin, S. Ganguli, and Y. Bengio. On the saddle point problem for non-convex optimization. arXiv preprint, 2014.

[14] J. Pennington and Y. Bahri. Geometry of neural network loss surfaces via random matrix theory. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2798–2806, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

[15] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20, pages 1177–1184. Curran Associates, Inc., 2008.

[16] L. Sagun, U. Evci, V. U. Güney, Y. Dauphin, and L. Bottou. Empirical analysis of the Hessian of over-parametrized neural networks. CoRR, abs/1706.04454, 2017.

[17] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.

[18] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, New York, NY, USA, 2004.

[19] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization.
In ICLR, 2017.