{"title": "On the Inductive Bias of Neural Tangent Kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 12893, "page_last": 12904, "abstract": "State-of-the-art neural networks are heavily over-parameterized, making the optimization algorithm a crucial ingredient for learning predictive models with good generalization properties. A recent line of work has shown that in a certain over-parameterized regime, the learning dynamics of gradient descent are governed by a certain kernel obtained at initialization, called the neural tangent kernel. We study the inductive bias of learning in such a regime by analyzing this kernel and the corresponding function space (RKHS). In particular, we study smoothness, approximation, and stability properties of functions with finite norm, including stability to image deformations in the case of convolutional networks, and compare to other known kernels for similar architectures.", "full_text": "On the Inductive Bias of Neural Tangent Kernels\n\nAlberto Bietti\n\nInria\u2217\n\nJulien Mairal\n\nInria\u2217\n\nalberto.bietti@inria.fr\n\njulien.mairal@inria.fr\n\nAbstract\n\nState-of-the-art neural networks are heavily over-parameterized, making the opti-\nmization algorithm a crucial ingredient for learning predictive models with good\ngeneralization properties. A recent line of work has shown that in a certain over-\nparameterized regime, the learning dynamics of gradient descent are governed by\na certain kernel obtained at initialization, called the neural tangent kernel. We\nstudy the inductive bias of learning in such a regime by analyzing this kernel and\nthe corresponding function space (RKHS). 
In particular, we study smoothness, approximation, and stability properties of functions with finite norm, including stability to image deformations in the case of convolutional networks, and compare to other known kernels for similar architectures.

1 Introduction

The large number of parameters in state-of-the-art deep neural networks makes them very expressive, with the ability to approximate large classes of functions [26, 41]. Since many networks can potentially fit a given dataset, the optimization method, typically a variant of gradient descent, plays a crucial role in selecting a model that generalizes well [39].

A recent line of work [2, 16, 20, 21, 27, 30, 54] has shown that when training deep networks in a certain over-parameterized regime, the dynamics of gradient descent behave like those of a linear model on (non-linear) features determined at initialization. In the over-parameterization limit, these features correspond to a kernel known as the neural tangent kernel. In particular, in the case of a regression loss, the obtained model behaves similarly to a minimum norm kernel least squares solution, suggesting that this kernel may play a key role in determining the inductive bias of the learning procedure and its generalization properties. While it is still not clear whether this regime is at play in state-of-the-art deep networks, there is some evidence that this phenomenon of “lazy training” [16], where weights only move very slightly during training, may be relevant for early stages of training and for the outermost layers of deep networks [29, 53], motivating a better understanding of its properties.

In this paper, we study the inductive bias of this regime by analyzing properties of functions in the space associated with the neural tangent kernel for a given architecture (that is, the reproducing kernel Hilbert space, or RKHS).
Such kernels can be defined recursively using certain choices of dot-product kernels at each layer that depend on the activation function. For the convolutional case with rectified linear unit (ReLU) activations and arbitrary patches and linear pooling operations, we show that the NTK can be expressed through kernel feature maps defined in a tree-structured hierarchy.

We study smoothness and stability properties of the kernel mapping for two-layer networks and CNNs, which control the variations of functions in the RKHS. In particular, a useful inductive bias when dealing with natural signals such as images is stability of the output to deformations of the input, such as translations or small rotations. A precise notion of stability to deformations was proposed by Mallat [35], and was later studied in [11] in the context of CNN architectures, showing the benefits of different architectural choices such as small patch sizes.

∗Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In contrast to the kernels studied in [11], which for instance cover the limiting kernels that arise from training only the last layer of a ReLU CNN, we find that the obtained NTK kernel mappings for the ReLU activation lack a desired Lipschitz property which is needed for stability to deformations in the sense of [11, 12, 35]. Instead, we show that a weaker smoothness property similar to Hölder smoothness holds, and this allows us to show that the kernel mapping is stable to deformations, albeit with a different guarantee.

In order to balance our observations on smoothness, we also consider approximation properties for the NTK of two-layer ReLU networks, by characterizing the RKHS using a Mercer decomposition of the kernel in the basis of spherical harmonics [6, 46, 47].
In particular, we study the decay of eigenvalues\nfor this decomposition, which is then related to the regularity of functions in the space, and provides\nrates of approximation for Lipschitz functions [6]. We \ufb01nd that the full NTK has better approximation\nproperties compared to other function classes typically de\ufb01ned for ReLU activations [6, 17, 19],\nwhich arise for instance when only training the weights in the last layer, or when considering Gaussian\nprocess limits of ReLU networks (e.g., [24, 28, 36, 40]).\n\nContributions. Our main contributions can be summarized as follows:\n\n\u2022 We provide a derivation of the NTK for convolutional networks with generic linear operators for\npatch extraction and pooling, and express the corresponding kernel feature map hierarchically\nusing these operators.\n\n\u2022 We study smoothness properties of the kernel mapping for ReLU networks, showing that it is\nnot Lipschitz but satis\ufb01es a weaker H\u00f6lder smoothness property. For CNNs, we then provide a\nguarantee on deformation stability.\n\n\u2022 We characterize the RKHS of the NTK for two-layer ReLU networks by providing a spectral\ndecomposition of the kernel and studying its spectral decay. This leads to improved approximation\nproperties compared to other function classes based on ReLU.\n\nRelated work. Neural tangent kernels were introduced in [27], and similar ideas were used to obtain\nmore quantitative guarantees on the global convergence of gradient descent for over-parameterized\nneural networks [2, 3, 16, 20, 21, 30, 50, 54]. The papers [3, 20, 51] also derive NTKs for convo-\nlutional networks, but focus on simpler architectures. Kernel methods for deep neural networks\nwere studied for instance in [17, 19, 34]. Stability to deformations was originally introduced in the\ncontext of the scattering representation [12, 35], and later extended to neural networks through kernel\nmethods in [11]. 
The inductive bias of optimization in neural network learning was considered, e.g., by [1, 4, 13, 39, 48]. [6, 25, 45, 49] study function spaces corresponding to two-layer ReLU networks. In particular, [25] also analyzes properties of the NTK, but studies a specific high-dimensional limit for generic activations, while we focus on ReLU networks, studying the corresponding eigenvalue decays in finite dimension.

2 Neural Tangent Kernels

In this section, we provide some background on “lazy training” and neural tangent kernels (NTKs), and introduce the kernels that we study in this paper. In particular, we derive the NTK for generic convolutional architectures on ℓ2 signals. For simplicity of exposition, we consider scalar-valued functions, noting that the kernels may be extended to the vector-valued case, as done, e.g., in [27].

2.1 Lazy training and neural tangent kernels

Multiple recent works studying global convergence of gradient descent in neural networks (e.g., [2, 20, 21, 27, 30, 54]) show that when a network is sufficiently over-parameterized, weights remain close to initialization during training. The model is then well approximated by its linearization around initialization. For a neural network f(x; θ) with parameters θ and initialization θ0, we then have:²

f(x; θ) ≈ f(x; θ0) + ⟨θ − θ0, ∇θ f(x; θ0)⟩.   (1)

This regime where weights barely move has also been referred to as “lazy training” [16], in contrast to other situations such as the “mean-field” regime (e.g., [15, 37, 38]), where weights move according to non-linear dynamics.

²While we use gradients in our notations, we note that weak differentiability (e.g., with ReLU activations) is sufficient when studying the limiting NTK [27].
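The linearization (1) can be observed numerically on a wide two-layer ReLU network. The following numpy sketch (sizes, seed, and step sizes are ours, for illustration only) compares the exact network output after a small parameter move with its first-order expansion around initialization; the linearization error shrinks roughly quadratically with the size of the move, while the first-order change itself shrinks only linearly:

```python
import numpy as np

rng = np.random.default_rng(0)
m, p = 20_000, 20                      # width and input dimension (illustrative)
W0 = rng.standard_normal((m, p))       # first-layer weights at initialization
v0 = rng.standard_normal(m)            # second-layer weights at initialization
x = rng.standard_normal(p)

def f(W, v):
    # f(x; theta) = sqrt(2/m) * sum_j v_j * relu(w_j . x)
    return np.sqrt(2.0 / m) * v @ np.maximum(W @ x, 0.0)

# Analytic gradients of f with respect to theta = (W, v) at initialization.
pre = W0 @ x
grad_v = np.sqrt(2.0 / m) * np.maximum(pre, 0.0)
grad_W = np.sqrt(2.0 / m) * (v0 * (pre > 0))[:, None] * x[None, :]

dW = rng.standard_normal((m, p))       # random direction in parameter space
dv = rng.standard_normal(m)

def errors(eps):
    exact = f(W0 + eps * dW, v0 + eps * dv)
    linear = f(W0, v0) + eps * (np.sum(grad_W * dW) + grad_v @ dv)
    return abs(exact - linear), abs(exact - f(W0, v0))

err_big, change_big = errors(1e-2)
err_small, _ = errors(1e-3)
# linearization error is much smaller than the function change and decays
# roughly quadratically with the step size
print(err_big, change_big, err_small)
```

The gradient features x ↦ ∇θ f(x; θ0) computed above are exactly the (finite-width) features whose inner products converge to the NTK.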
Yet, with sufficient over-parameterization, the (non-linear) features x ↦ ∇θ f(x; θ0) of the linearized model (1) become expressive enough to perfectly fit the training data, by approximating a kernel method.

Neural Tangent Kernel (NTK). When the width of the network tends to infinity, assuming an appropriate initialization of the weights, the features of the linearized model tend to a limiting kernel K, called the neural tangent kernel [27]:

⟨∇θ f(x; θ0), ∇θ f(x′; θ0)⟩ → K(x, x′).   (2)

In this limit and under some assumptions, one can show that the weights move very slightly and the kernel remains fixed during training [27], and that gradient descent will then lead to the minimum norm kernel least-squares fit of the training set in the case of the ℓ2 loss (see [27] and [37, Section H.7]). Similar interpolating solutions have been found to perform well for generalization, both in practice [10] and in theory [8, 31]. When the number of neurons is large but finite, one can often show that the kernel only deviates slightly from the limiting NTK, at initialization and throughout training, thus allowing convergence as long as the initial kernel matrix is non-degenerate [3, 16, 20, 21].

NTK for two-layer ReLU networks. Consider a two-layer network of the form f(x; θ) = √(2/m) Σ_{j=1}^m v_j σ(w_j⊤ x), where σ(u) = (u)+ = max(0, u) is the ReLU activation, x ∈ R^p, and θ = (w_1⊤, . . . , w_m⊤, v⊤) are parameters with values initialized as N(0, 1). Practitioners often include the factor √(2/m) in the variance of the initialization of v_j, but we treat it as a scaling factor following [20, 21, 27], noting that this leads to the same predictions.
The factor 2 is simply a normalization constant specific to the ReLU activation and commonly used by practitioners, which avoids vanishing or exploding behavior for deep networks. The corresponding NTK is then given by [16, 21]:

K(x, x′) = 2(x⊤x′) E_{w∼N(0,I)}[1{w⊤x ≥ 0} 1{w⊤x′ ≥ 0}] + 2 E_{w∼N(0,I)}[(w⊤x)+ (w⊤x′)+]
         = ‖x‖‖x′‖ κ(⟨x, x′⟩/(‖x‖‖x′‖)),   (3)

where

κ(u) := u κ0(u) + κ1(u),   (4)

κ0(u) = (1/π)(π − arccos(u)),   κ1(u) = (1/π)(u · (π − arccos(u)) + √(1 − u²)).   (5)

The expressions for κ0 and κ1 follow from standard calculations for arc-cosine kernels of degree 0 and 1 (see [17]). Note that in this two-layer case, the non-linear features obtained for finite neurons correspond to a random features kernel [42], which is known to approximate the full kernel relatively well even with a moderate amount of neurons [7, 42, 43]. One can also extend the derivation to other activation functions, which may lead to explicit expressions for the kernel in some cases [19].

NTK for fully-connected deep ReLU networks. We define a fully-connected neural network by f(x; θ) = √(2/m_n) ⟨w^{n+1}, a^n⟩, with a^1 = σ(W^1 x), and

a^k = σ( √(2/m_{k−1}) W^k a^{k−1} ),   k = 2, . . . , n,

where W^k ∈ R^{m_k × m_{k−1}} and w^{n+1} ∈ R^{m_n} are initialized with i.i.d. N(0, 1) entries, and σ(u) = (u)+ is the ReLU activation, applied element-wise. Following [27], the corresponding NTK is defined recursively by K(x, x′) = K_n(x, x′), with K_0(x, x′) = Σ_0(x, x′) = x⊤x′ and, for k ≥ 1,

Σ_k(x, x′) = 2 E_{(u,v)∼N(0, B_k)}[σ(u) σ(v)]
K_k(x, x′) = Σ_k(x, x′) + 2 K_{k−1}(x, x′) E_{(u,v)∼N(0, B_k)}[σ′(u) σ′(v)],

where B_k = [ Σ_{k−1}(x, x), Σ_{k−1}(x, x′) ; Σ_{k−1}(x, x′), Σ_{k−1}(x′, x′) ]. Using a change of variables and the definitions of arc-cosine kernels of degrees 0 and 1 [17], it is easy to show that

2 E_{(u,v)∼N(0, B_k)}[σ(u) σ(v)] = √(Σ_{k−1}(x, x) Σ_{k−1}(x′, x′)) · κ1( Σ_{k−1}(x, x′) / √(Σ_{k−1}(x, x) Σ_{k−1}(x′, x′)) ),   (6)

2 E_{(u,v)∼N(0, B_k)}[σ′(u) σ′(v)] = κ0( Σ_{k−1}(x, x′) / √(Σ_{k−1}(x, x) Σ_{k−1}(x′, x′)) ),   (7)

where κ0 and κ1 are defined in (5).

Feature maps construction. We now provide a reformulation of the previous kernel in terms of explicit feature maps, which provides a representation of the data and makes our study of stability in Section 4 more convenient. For a given input Hilbert space H, we denote by φ_{H,1} : H → H_1 the kernel mapping into the RKHS H_1 of the kernel (z, z′) ∈ H² ↦ ‖z‖‖z′‖ κ1(⟨z, z′⟩/(‖z‖‖z′‖)), and by φ_{H,0} : H → H_0 the kernel mapping into the RKHS H_0 of the kernel (z, z′) ∈ H² ↦ κ0(⟨z, z′⟩/(‖z‖‖z′‖)). We will abuse notation and hide the input space, simply writing φ1 and φ0.

Lemma 1 (NTK feature map for fully-connected network). The NTK for the fully-connected network can be defined as K(x, x′) = ⟨Φ_n(x), Φ_n(x′)⟩, with Φ_0(x) = Ψ_0(x) = x and, for k ≥ 1,

Ψ_k(x) = φ1(Ψ_{k−1}(x)),
Φ_k(x) = ( φ0(Ψ_{k−1}(x)) ⊗ Φ_{k−1}(x) ; φ1(Ψ_{k−1}(x)) ),

where ⊗ is the tensor product.

2.2 Neural tangent kernel for convolutional networks

In this section we study NTKs for convolutional networks (CNNs) on signals, focusing on the ReLU activation. We consider signals in ℓ2(Z^d, R^{m_0}), that is, signals x[u] with u ∈ Z^d denoting the location, x[u] ∈ R^{m_0}, and Σ_{u∈Z^d} ‖x[u]‖² < ∞ (for instance, d = 2 and m_0 = 3 for RGB images). The infinite support allows us to avoid dealing with boundary conditions when considering deformations and pooling. The precise study of ℓ2 membership is deferred to Section 4.

Patch extraction and pooling operators P_k and A_k. Following [11], we define two linear operators P_k and A_k on ℓ2(Z^d) for extracting patches and performing (linear) pooling at layer k, respectively. For an H-valued signal x[u], P_k is defined by P_k x[u] = |S_k|^{−1/2} (x[u + v])_{v∈S_k} ∈ H^{|S_k|}, where S_k is a finite subset of Z^d defining the patch shape (e.g., a 3x3 box). Pooling is defined as a convolution with a linear filter h_k[u], e.g., a Gaussian filter at scale σ_k as in [11], that is, A_k x[u] = Σ_{v∈Z^d} h_k[u − v] x[v]. In this discrete setting, we can easily include a downsampling operation with factor s_k by changing the definition of A_k to A_k x[u] = Σ_{v∈Z^d} h_k[s_k u − v] x[v] (in particular, if h_k is a Dirac at 0, we obtain a CNN with “strided convolutions”).
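For intuition, the patch extraction and pooling operators can be sketched on 1-D signals. The numpy implementation below is ours: circular shifts stand in for the infinite support of ℓ2(Z), and the sizes and scales are illustrative only:

```python
import numpy as np

def extract_patches(x, S):
    # P_k: x[u] -> |S|^{-1/2} (x[u + v])_{v in S}; circular shifts stand in
    # for the infinite support assumed in the text.
    return np.stack([np.roll(x, -v) for v in S], axis=1) / np.sqrt(len(S))

def gaussian_pool(x, sigma, stride=1):
    # A_k: convolution with an L1-normalized Gaussian filter h_k at scale
    # sigma, followed by downsampling with factor s_k = stride.
    r = int(4 * sigma) + 1
    u = np.arange(-r, r + 1)
    h = np.exp(-u**2 / (2 * sigma**2))
    h /= h.sum()
    return np.convolve(x, h, mode="same")[::stride]

x = np.random.default_rng(0).standard_normal(64)
patches = extract_patches(x, S=[-1, 0, 1])      # 3-tap patch shape S_k
pooled = gaussian_pool(x, sigma=2.0, stride=2)  # pooling + downsampling
print(patches.shape, pooled.shape)
```

With circular shifts, the normalization |S_k|^{-1/2} makes patch extraction exactly norm-preserving, and the L1-normalized filter makes pooling non-expansive (Young's inequality), matching the properties used in Section 4.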
In fact, our NTK derivation supports general linear operators A_k : ℓ2(Z^d) → ℓ2(Z^d) on scalar signals. For defining the NTK feature map, we also introduce the following non-linear point-wise operator M, given for two signals x, y by

M(x, y)[u] = ( φ0(x[u]) ⊗ y[u] ; φ1(x[u]) ),   (8)

where φ0/1 are kernel mappings of arc-cosine 0/1 kernels, as defined in Section 2.1.

CNN definition and NTK. We consider a network f(x; θ) = √(2/m_n) ⟨w^{n+1}, a^n⟩_{ℓ2}, with

ã^k[u] = W^1 P_1 x[u] if k = 1,   ã^k[u] = √(2/m_{k−1}) W^k P_k a^{k−1}[u] if k ∈ {2, . . . , n},
a^k[u] = A_k σ(ã^k)[u],   k = 1, . . . , n,

where W^k ∈ R^{m_k × m_{k−1}|S_k|} and w^{n+1} ∈ ℓ2(Z^d, R^{m_n}) are initialized with N(0, 1) entries, and σ(ã^k) denotes the signal with σ applied element-wise to ã^k. We are now ready to state our result on the NTK for this model.

Proposition 2 (NTK feature map for CNN). The NTK for the above CNN, obtained when the number of feature maps m_1, . . . , m_n → ∞ (sequentially), is given by K(x, x′) = ⟨Φ(x), Φ(x′)⟩_{ℓ2(Z^d)}, with Φ(x)[u] = A_n M(x_n, y_n)[u], where x_n and y_n are defined recursively for a given input x by y_1[u] = x_1[u] = P_1 x[u] and, for k ≥ 2,

x_k[u] = P_k A_{k−1} φ1(x_{k−1})[u]
y_k[u] = P_k A_{k−1} M(x_{k−1}, y_{k−1})[u],

with the abuse of notation φ1(x)[u] = φ1(x[u]) for a signal x.

The proof is given in Appendix A.2, where we also show that in the over-parameterization limit, the pre-activations ã^k_i[u] tend to a Gaussian process with covariance Σ_k(x, u; x′, u′) = ⟨x_k[u], x′_k[u′]⟩ (this is related to recent papers [24, 40] studying Gaussian process limits of Bayesian convolutional networks).
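In the fully-connected case, the recursion (6)-(7) of Section 2.1 reduces the NTK computation to repeated evaluations of κ0 and κ1. A minimal numpy sketch (function names are ours):

```python
import numpy as np

def kappa0(u):
    u = np.clip(u, -1.0, 1.0)
    return (np.pi - np.arccos(u)) / np.pi

def kappa1(u):
    u = np.clip(u, -1.0, 1.0)
    return (u * (np.pi - np.arccos(u)) + np.sqrt(1.0 - u**2)) / np.pi

def ntk_fc(x, xp, n_layers):
    # K_n(x, x') via the recursion: Sigma_k and K_k from Sigma_{k-1}, using
    # the arc-cosine identities (6)-(7).
    Sxx, Sxy, Syy = x @ x, x @ xp, xp @ xp      # Sigma_0 entries
    K = Sxy                                     # K_0(x, x') = x . x'
    for _ in range(n_layers):
        norm = np.sqrt(Sxx * Syy)
        u = Sxy / norm
        K = norm * kappa1(u) + K * kappa0(u)    # K_k = Sigma_k + K_{k-1} kappa0
        Sxy = norm * kappa1(u)                  # Sigma_k(x, x')
        # kappa1(1) = 1, so Sigma_k(x, x) = Sigma_{k-1}(x, x): norms unchanged
    return K

rng = np.random.default_rng(0)
x, xp = rng.standard_normal(5), rng.standard_normal(5)
print(ntk_fc(x, xp, 1), ntk_fc(x, x, 3))
```

For one layer this reproduces the closed-form two-layer NTK (3)-(5), and on the diagonal it gives K_n(x, x) = (n + 1)‖x‖², consistent with the norm growth noted in Lemma 11.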
The proof is by induction and relies on similar arguments to [27] for fully-connected networks, in addition to exploiting linearity of the operators P_k and A_k, as well as recursive feature maps for hierarchical kernels. The recent papers [3, 51] also study NTKs for certain convolutional networks; in contrast to these works, our derivation considers general signals in ℓ2(Z^d), supports intermediate pooling or downsampling by changing A_k, and provides a more intuitive construction through kernel mappings and the operators P_k and A_k. Note that the feature maps x_k are defined independently from the y_k, and in fact correspond to more standard multi-layer deep kernel machines [11, 17, 19, 33] or covariance functions of certain deep Bayesian networks [24, 28, 36, 40]. They can also be seen as the feature maps of the limiting kernel that arises when only training weights in the last layer and fixing other layers at initialization (see, e.g., [19]).

3 Two-Layer Networks

In this section, we study smoothness and approximation properties of the RKHS defined by neural tangent kernels for two-layer networks. For ReLU activations, we show that the NTK kernel mapping is not Lipschitz, but satisfies a weaker smoothness property. In Section 3.2, we characterize the RKHS for ReLU activations and study its approximation properties and benefits. Finally, we comment on the use of other activations in Section 3.3.

3.1 Smoothness of two-layer ReLU networks

Here we study the RKHS H of the NTK for two-layer ReLU networks, defined in (3), focusing on smoothness properties of the kernel mapping, denoted Φ(·). Recall that smoothness of the kernel mapping guarantees smoothness of functions f ∈ H, through the relation

|f(x) − f(y)| ≤ ‖f‖_H ‖Φ(x) − Φ(y)‖_H.   (9)

We begin by showing that the kernel mapping for the NTK is not Lipschitz.
This is in contrast to the kernel κ1 in (5), obtained by fixing the weights in the first layer and training only the second layer weights (κ1 is 1-Lipschitz by [11, Lemma 1]).

Proposition 3 (Non-Lipschitzness). The kernel mapping Φ(·) of the two-layer NTK is not Lipschitz:

sup_{x,y} ‖Φ(x) − Φ(y)‖_H / ‖x − y‖ = +∞.

This is true even when looking only at points x, y on the sphere. It follows that the RKHS H contains unit-norm functions with arbitrarily large Lipschitz constant.

Note that the instability is due to φ0, which comes from gradients w.r.t. first layer weights. We now show that a weaker guarantee holds nevertheless, resembling 1/2-Hölder smoothness.

Proposition 4 (Smoothness for ReLU NTK). We have the following smoothness properties:

1. For x, y such that ‖x‖ = ‖y‖ = 1, the kernel mapping φ0 satisfies ‖φ0(x) − φ0(y)‖ ≤ √‖x − y‖.

2. For general non-zero x, y, we have ‖φ0(x) − φ0(y)‖ ≤ √( ‖x − y‖ / min(‖x‖, ‖y‖) ).

3. The kernel mapping Φ of the NTK then satisfies

‖Φ(x) − Φ(y)‖ ≤ √( min(‖x‖, ‖y‖) ‖x − y‖ ) + 2 ‖x − y‖.

We note that while such smoothness properties apply to the functions in the RKHS of the studied limiting kernels, the neural network functions obtained at finite width and their linearizations around initialization are not in the RKHS and thus may not preserve such smoothness properties, despite preserving good generalization properties, as in random feature models [7, 43]. This discrepancy may be a source of instability to adversarial perturbations.

3.2 Approximation properties for the two-layer ReLU NTK

In the previous section, we found that the NTK κ for two-layer ReLU networks yields weaker smoothness guarantees compared to the kernel κ1 obtained when the first layer is fixed.
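Both the blow-up of Proposition 3 and the Hölder-type bound of Proposition 4 can be seen directly from the closed forms: for unit-norm x, y, ‖φ0(x) − φ0(y)‖² = 2 − 2κ0(⟨x, y⟩) = (2/π) arccos(⟨x, y⟩). A small numerical check (angles chosen by us):

```python
import numpy as np

def phi0_dist(u):
    # ||phi0(x) - phi0(y)|| for unit-norm x, y with <x, y> = u:
    # squared distance is 2 - 2*kappa0(u) = (2/pi) * arccos(u).
    return np.sqrt(2.0 * np.arccos(u) / np.pi)

angles = np.array([1e-1, 1e-3, 1e-5])   # angle between unit vectors x and y
u = np.cos(angles)
xy_dist = np.sqrt(2.0 - 2.0 * u)        # chordal distance ||x - y||

ratio = phi0_dist(u) / xy_dist
print(ratio)   # grows without bound as the angle shrinks: no Lipschitz constant
```

The ratio scales like ‖x − y‖^{-1/2}, so no finite Lipschitz constant exists, while the distance itself always stays below √‖x − y‖ as in item 1 of Proposition 4.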
We now show that the NTK has better approximation properties, by studying the RKHS through a spectral decomposition of the kernel and the decay of the corresponding eigenvalues. This highlights a tradeoff between smoothness and approximation.

The next proposition gives the Mercer decomposition of the NTK κ(⟨x, y⟩) in (4), where x, y lie on the sphere S^{p−1} = {x ∈ R^p : ‖x‖ = 1}. The decomposition is given in the basis of spherical harmonics, as is common for dot-product kernels [46, 47], and our derivation uses results by Bach [6] on similar decompositions of positively homogeneous activations of the form σ_α(u) = (u)^α_+. See Appendix C for background and proofs.

Proposition 5 (Mercer decomposition of ReLU NTK). For any x, y ∈ S^{p−1}, we have the following decomposition of the NTK κ:

κ(⟨x, y⟩) = Σ_{k=0}^∞ μ_k Σ_{j=1}^{N(p,k)} Y_{k,j}(x) Y_{k,j}(y),   (10)

where Y_{k,j}, j = 1, . . . , N(p, k), are spherical harmonic polynomials of degree k, and the non-negative eigenvalues μ_k satisfy μ_0, μ_1 > 0, μ_k = 0 if k = 2j + 1 with j ≥ 1, and otherwise μ_k ∼ C(p) k^{−p} as k → ∞, with C(p) a constant depending only on p. Then, the RKHS is described by:

H = { f = Σ_{k≥0, μ_k≠0} Σ_{j=1}^{N(p,k)} a_{k,j} Y_{k,j}(·)   s.t.   ‖f‖²_H := Σ_{k≥0, μ_k≠0} Σ_{j=1}^{N(p,k)} a²_{k,j}/μ_k < ∞ }.   (11)

The zero eigenvalues prevent certain functions from belonging to the RKHS, namely those with non-zero Fourier coefficients on the corresponding basis elements (note that adding a bias may prevent such zero eigenvalues [9]). Here, a sufficient condition for all such coefficients to be zero is that the function is even [6].
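For p = 3 (the sphere S²), the eigenvalues in Proposition 5 can be estimated numerically: by the addition theorem, the Legendre coefficients of κ equal μ_k up to the multiplicity factor (2k + 1)/(4π), so their signs and relative decay mirror those of μ_k. A sketch using Gauss-Legendre quadrature (quadrature size is ours):

```python
import numpy as np

def kappa(u):
    # NTK kappa(u) = u * kappa0(u) + kappa1(u), as in (4)-(5)
    u = np.clip(u, -1.0, 1.0)
    k0 = (np.pi - np.arccos(u)) / np.pi
    k1 = (u * (np.pi - np.arccos(u)) + np.sqrt(1.0 - u**2)) / np.pi
    return u * k0 + k1

# Project kappa on Legendre polynomials: coef_k = (2k+1)/2 * int kappa P_k.
t, w = np.polynomial.legendre.leggauss(400)
vals = kappa(t)
coef = np.array([(2 * k + 1) / 2.0
                 * np.sum(w * vals * np.polynomial.legendre.Legendre.basis(k)(t))
                 for k in range(8)])
print(np.round(coef, 4))   # odd k >= 3 vanish; even coefficients decay slowly
```

The computed coefficients are positive for k = 0, 1 and for even k, numerically zero for odd k ≥ 3, and decrease along even degrees, as predicted by Proposition 5.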
Note that for the arc-cosine 1 kernel κ1, we have a faster decay μ_k = O(k^{−p−2}), leading to a “smaller” RKHS (see Lemma 17 in Appendix C and [6]). Moreover, the k^{−p} asymptotic equivalent comes from the term u κ0(u) in the definition (4) of κ, which comes from gradients of first layer weights; the second layer gradients yield κ1, whose contribution to μ_k becomes negligible for large k. We use an identity also used in the recent paper [25], which compares similar kernels in a specific high-dimensional limit for generic activations; in contrast to [25], we focus on ReLUs and study eigenvalue decays in finite dimension. We note that our decomposition uses a uniform distribution on the sphere, which allows a precise study of eigenvalues and approximation properties of the RKHS using spherical harmonics. When the data distribution is also uniform on the sphere, or absolutely continuous w.r.t. the uniform distribution, our obtained eigenvalues are closely related to those of integral operators for learning problems, which can determine, e.g., non-parametric rates of convergence (e.g., [14, 23]) as well as degrees-of-freedom quantities for kernel approximation (e.g., [7, 43]). Such quantities often depend on the eigenvalue decay of the integral operator, which can be obtained from μ_k after taking multiplicity into account. This is also related to the rate of convergence of gradient descent in the lazy training regime, which depends on the minimum eigenvalue of the empirical kernel matrix in [16, 20, 21].

We now provide sufficient conditions for a function f : S^{p−1} → R to be in H, as well as rates of approximation of Lipschitz functions on the sphere, adapting results of [6] (specifically Propositions 2 and 3 in [6]) to our NTK setting.

Corollary 6 (Sufficient condition for f ∈ H). Let f : S^{p−1} → R be an even function such that all i-th order derivatives exist and are bounded by η for 0 ≤ i ≤ s, with s ≥ p/2. Then f ∈ H with ‖f‖_H ≤ C(p) η, where C(p) is a constant that only depends on p.

Corollary 7 (Approximation of Lipschitz functions). Let f : S^{p−1} → R be an even function such that f(x) ≤ η and |f(x) − f(y)| ≤ η ‖x − y‖ for all x, y ∈ S^{p−1}. There is a function g ∈ H with ‖g‖_H ≤ δ, where δ is larger than a constant depending only on p, such that

sup_{x∈S^{p−1}} |f(x) − g(x)| ≤ C(p) η (δ/η)^{−1/(p/2−1)} log(δ/η).

For both results, there is an improvement over κ1, for which Corollary 6 requires s ≥ p/2 + 1 bounded derivatives, and Corollary 7 leads to a weaker rate in (δ/η)^{−1/(p/2)} (see [6, Propositions 2 and 3] with α = 1). These results show that in the over-parameterized regime of the NTK, training multiple layers leads to better approximation properties compared to only training the last layer, which corresponds to using κ1 instead of κ. In the different regime of “convex neural networks” (e.g., [6, 45]), where neurons can be selected with a sparsity-promoting penalty, the approximation rates shown in [6] for ReLU networks are also weaker than for the NTK in the worst case (though that regime presents benefits in terms of adaptivity), suggesting that perhaps in some situations the “lazy” regime of the NTK could be preferred over the regime where neurons are selected using sparsity.

Homogeneous case.
When inputs do not lie on the sphere S^{p−1} but in R^p, the NTK for two-layer ReLU networks takes the form of a homogeneous dot-product kernel (3), which defines a different RKHS H̄ that we characterize below in terms of the RKHS H of the NTK on the sphere.

Proposition 8 (RKHS of the homogeneous NTK). The RKHS H̄ of the kernel K(x, x′) = ‖x‖‖x′‖ κ(⟨x, x′⟩/(‖x‖‖x′‖)) on R^p consists of functions of the form f(x) = ‖x‖ g(x/‖x‖) with g ∈ H, where H is the RKHS on the sphere, and we have ‖f‖_H̄ = ‖g‖_H.

Note that while such a restriction to homogeneous functions may be limiting, one may easily obtain non-homogeneous functions by considering an augmented variable z = (x⊤, R)⊤ and defining f(x) = ‖z‖ g(z/‖z‖), where g is now defined on the p-sphere S^p. When inputs are in a ball of radius R, this reformulation preserves regularity properties (see [6, Section 3]).

3.3 Smoothness with other activations

In this section, we look at smoothness of two-layer networks with different activation functions. Following the derivation for the ReLU in Section 2.1, the NTK for a general activation σ is given by

K_σ(x, x′) = ⟨x, x′⟩ E_{w∼N(0,I)}[σ′(⟨w, x⟩) σ′(⟨w, x′⟩)] + E_{w∼N(0,I)}[σ(⟨w, x⟩) σ(⟨w, x′⟩)].

We then have the following result.

Proposition 9 (Lipschitzness for smooth activations). Assume that σ is twice differentiable and that the quantities γ_j := E_{u∼N(0,1)}[(σ^{(j)}(u))²] for j = 0, 1, 2 are bounded, with γ_0 > 0. Then, for x, y on the unit sphere, the kernel mapping Φ_σ of K_σ satisfies

‖Φ_σ(x) − Φ_σ(y)‖ ≤ √( (γ_0 + γ_1) max(1, (2γ_1 + γ_2)/(γ_0 + γ_1)) ) · ‖x − y‖.

The proof uses results from [19] on relationships between activations and the corresponding kernels, as well as smoothness results for dot-product kernels in [11] (see Appendix B.3). If, for instance, we consider the exponential activation σ(u) = e^{u−2}, we have γ_j = 1 for all j (using results from [19]), so that the kernel mapping is Lipschitz with constant √3. For the soft-plus activation σ(u) = log(1 + e^u), we may evaluate the integrals numerically, obtaining (γ_0, γ_1, γ_2) ≈ (2.31, 0.74, 0.11), so that the kernel mapping is Lipschitz with constant ≈ 1.75.

4 Deep Convolutional Networks

In this section, we study smoothness and stability properties of the NTK kernel mapping for convolutional networks with ReLU activations. In order to properly define deformations, we consider continuous signals x(u) in L2(R^d) instead of ℓ2(Z^d) (i.e., we have ‖x‖² := ∫ ‖x(u)‖² du < ∞), following [11, 35]. The goal of deformation stability guarantees is to ensure that the data representation (in this case, the kernel mapping Φ) does not change too much when the input signal is slightly deformed, for instance with a small translation or rotation of an image, a useful inductive bias for natural signals.
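The role of pooling in this inductive bias can be illustrated on a discrete 1-D signal before making it precise: a toy numpy check (scales are ours; a pure translation stands in for a general deformation τ) showing that Gaussian pooling shrinks the representation change caused by a small shift:

```python
import numpy as np

def gaussian_pool(x, sigma):
    # Pooling A: convolution with an L1-normalized Gaussian at scale sigma.
    r = int(4 * sigma) + 1
    u = np.arange(-r, r + 1)
    h = np.exp(-u**2 / (2 * sigma**2))
    return np.convolve(x, h / h.sum(), mode="same")

rng = np.random.default_rng(0)
x = rng.standard_normal(256)
x_shifted = np.roll(x, 2)   # L_tau x for a pure translation tau(u) = -2

raw = np.linalg.norm(x - x_shifted) / np.linalg.norm(x)
pooled = [np.linalg.norm(gaussian_pool(x, s) - gaussian_pool(x_shifted, s))
          / np.linalg.norm(x) for s in (2.0, 8.0)]
print(raw, pooled)   # relative distance shrinks as the pooling scale grows
```

This matches the ‖τ‖_∞/σ_n behavior of the translation term in the guarantee below: larger pooling scales yield more translation-invariant representations.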
For a C^1-diffeomorphism τ : R^d → R^d, denoting by L_τ x(u) = x(u − τ(u)) the action operator of the diffeomorphism, we will show a guarantee of the form

‖Φ(L_τ x) − Φ(x)‖ ≤ ( ω(‖∇τ‖_∞) + C ‖τ‖_∞ ) ‖x‖,

where ‖∇τ‖_∞ is the maximum operator norm of the Jacobian ∇τ(u) over R^d, ‖τ‖_∞ = sup_u |τ(u)|, ω is an increasing function, and C is a positive constant. The second term controls translation invariance, and C typically decreases with the scale of the last pooling layer (σ_n below), while the first term controls deformation stability, since ‖∇τ‖_∞ measures the “size” of deformations. While ω(t) is typically a linear function of t in other settings [11, 35], here we obtain a faster growth of order √t for small t, due to the weaker smoothness that arises from the arc-cosine 0 kernel mappings.

Properties of the operators. In this continuous setup, P_k is now given for a signal x ∈ L2 by P_k x(u) = λ(S_k)^{−1/2} (x(u + v))_{v∈S_k}, where λ is the Lebesgue measure. We then have ‖P_k x‖ = ‖x‖, and considering normalized Gaussian pooling filters, we have ‖A_k x‖ ≤ ‖x‖ by Young's inequality [11]. The non-linear operator M is defined point-wise analogously to (8), and satisfies ‖M(x, y)‖² = ‖x‖² + ‖y‖². We thus have that the feature maps in the continuous analog of the NTK construction in Proposition 2 are in L2 as long as x is in L2. Note that this does not hold for some smooth activations, where ‖M(x, y)(u)‖ may be a positive constant even when x(u) = y(u) = 0, leading to an unbounded L2 norm for M(x, y). The next lemma studies the smoothness of M, extending results from Section 3.1 to signals in L2.

Lemma 10 (Smoothness of operator M).
For two signals x, y ∈ L²(ℝᵈ), we have

$$\|M(x, y) - M(x', y')\| \le \sqrt{\min(\|y\|, \|y'\|)\,\|x - x'\|} + \|x - x'\| + \|y - y'\|. \quad (12)$$

Assumptions on architecture. Following [11], we introduce an initial pooling layer A₀, corresponding to an anti-aliasing filter, which is necessary for stability and is a reasonable assumption given that in practice input signals are discrete, with high frequencies typically filtered by an acquisition device. Thus, we consider the kernel representation Φ_n(x) := Φ(A₀x), with Φ as in Proposition 2. We also assume that patch sizes are controlled by the scale of pooling filters, that is,

$$\sup_{v \in S_k} |v| \le \beta \sigma_{k-1}, \quad (13)$$

for some constant β, where σ_{k−1} is the scale of the pooling operation A_{k−1}, which typically increases exponentially with depth, corresponding to a fixed downsampling factor at each layer in the discrete case. By a simple induction, we can show the following.

Lemma 11 (Norm and smoothness of Φ_n). We have ‖Φ_n(x)‖ ≤ √(n+1) ‖x‖, and

$$\|\Phi_n(x) - \Phi_n(x')\| \le (n+1)\|x - x'\| + O(n^{5/4})\sqrt{\|x\|\,\|x - x'\|}.$$

Deformation stability bound. We now present our main guarantee on deformation stability for the NTK kernel mapping (the proof is given in Appendix B).

Proposition 12 (Stability of NTK). Let Φ_n(x) = Φ(A₀x), and assume ‖∇τ‖∞ ≤ 1/2.
We have the following stability bound:

$$\|\Phi_n(L_\tau x) - \Phi_n(x)\| \le \Big( C(\beta)^{1/2}\, C\, n^{7/4}\, \|\nabla\tau\|_\infty^{1/2} + C(\beta)\, C'\, n^2\, \|\nabla\tau\|_\infty + \sqrt{n+1}\, \frac{C''}{\sigma_n}\, \|\tau\|_\infty \Big)\, \|x\|,$$

where C, C′, C′′ are constants depending only on d, and C(β) also depends on β defined in (13).

Compared to the bound in [11], the first term shows weaker stability due to faster growth with ‖∇τ‖∞, which comes from (12). The dependence on the depth n is also poorer (n² instead of n); however, note that in contrast to [11], the norm and smoothness constants of Φ_n(x) in Lemma 11 grow with n here, partially explaining this gap. We also note that choosing a small β (i.e., small patches in a discrete setting) is more helpful for improving stability than a small number of layers n, given that C(β) increases polynomially with β, while n typically decreases logarithmically with β when one seeks a fixed target level of translation invariance (see [11, Section 3.2]).

(a) CKN with arc-cosine 1 kernels. (b) NTK.

Figure 1: Geometry of kernel mapping for CKN and NTK convolutional kernels, on digit images and their deformations from the Infinite MNIST dataset [32]. The curves show average relative distances of a single digit to its deformations, combinations of translations and deformations, digits of the same label, and digits of any label. See Appendix D for more details on the experimental setup.

By fixing the weights of all layers but the last, we would instead obtain feature maps of the form A_n x_n (using notation from Proposition 2), which satisfy the improved stability guarantee of [11].
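To make the deformation model concrete, the action operator L_τ x(u) = x(u − τ(u)) can be discretized for images by resampling. The sketch below is our own illustration (nearest-neighbor resampling, numpy only), not the procedure used in the paper's experiments:

```python
import numpy as np

def apply_deformation(x, tau):
    """Crude discretization of L_tau x(u) = x(u - tau(u)) for a 2D image x,
    using nearest-neighbor resampling with edge clamping."""
    h, w = x.shape
    u = np.mgrid[0:h, 0:w].astype(float)   # coordinate grid, shape (2, h, w)
    coords = u - tau(u)                    # displaced sampling positions
    i = np.clip(np.rint(coords[0]), 0, h - 1).astype(int)
    j = np.clip(np.rint(coords[1]), 0, w - 1).astype(int)
    return x[i, j]

def sinusoidal_tau(eps, freq):
    """Smooth displacement field with ||tau||_inf = eps and
    ||grad tau||_inf <= 2 * pi * freq * eps."""
    return lambda u: eps * np.sin(2 * np.pi * freq * u)
```

A zero field gives the identity and a constant field a pure translation; shrinking eps or freq shrinks ‖τ‖∞ and ‖∇τ‖∞, so the stability bound predicts that the representations of x and L_τ x get closer.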
The question of approximation for the deep convolutional case is more involved and left for future work, but it is reasonable to expect that the RKHS for the NTK is at least as large as that of the simpler kernel with fixed layers before the last, given that the latter appears as one of the terms in the NTK. This again hints at a tradeoff between stability and approximation, suggesting that one may be able to learn less stable but more discriminative functions in the NTK regime by training all layers.

Numerical experiments. We now study numerically the stability of (exact) kernel mapping representations for convolutional networks with two hidden convolutional layers. We consider both a convolutional kernel network (CKN, [11]) with arc-cosine kernels of degree 1 on patches (corresponding to the kernel obtained when only training the last layer and keeping previous layers fixed) and the corresponding NTK. Figure 1 shows the resulting average distances for collections of digits and deformations thereof. In particular, we find that for small deformations, the distance to the original image tends to grow more quickly for the NTK than for the CKN, as the theory suggests (a square-root growth rate rather than a linear one). Note also that the relative distances are generally larger for the NTK than for the CKN, suggesting that the CKN representation may be smoother.

5 Discussion

In this paper, we have studied the inductive bias of the "lazy training" regime for over-parameterized neural networks, by considering the neural tangent kernel of different architectures and analyzing properties of the corresponding RKHS, which characterizes the functions that can be learned efficiently in this regime.
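As a reference for kernel computations like those in the experiments above, the NTK of a fully-connected ReLU network on unit-norm inputs admits a closed-form recursion in terms of the arc-cosine functions of degrees 0 and 1 (cf. [17, 27]). The sketch below uses a standard version of this recursion; the depth convention is our own:

```python
import numpy as np

def kappa0(u):
    """Normalized arc-cosine kernel of degree 0."""
    u = np.clip(u, -1.0, 1.0)
    return (np.pi - np.arccos(u)) / np.pi

def kappa1(u):
    """Normalized arc-cosine kernel of degree 1."""
    u = np.clip(u, -1.0, 1.0)
    return (np.sqrt(1.0 - u ** 2) + (np.pi - np.arccos(u)) * u) / np.pi

def relu_ntk(rho, depth):
    """NTK of a fully-connected ReLU network on unit-norm inputs with
    inner product rho, via the standard Sigma/Theta recursion."""
    sigma = rho   # covariance kernel at layer 1
    theta = rho   # tangent kernel at layer 1
    for _ in range(depth - 1):
        theta = kappa1(sigma) + theta * kappa0(sigma)
        sigma = kappa1(sigma)
    return theta
```

On identical inputs (rho = 1) the recursion gives relu_ntk(1.0, n) = n, matching the linear growth of the kernel norm with depth noted in Lemma 11.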
We find that the NTK for ReLU networks has better approximation properties compared to other neural network kernels, but weaker smoothness properties, although these can still guarantee a form of stability to deformations for CNN architectures, providing an important inductive bias for natural signals. While these properties may help obtain better performance when large amounts of data are available, they can also lead to a poorer estimation error when data is scarce, a setting in which smoother kernels or better regularization strategies may be helpful.

It should be noted that while our study of functions in the RKHS may determine what target functions can be learned by over-parameterized networks, the obtained networks with finitely many neurons do not belong to the same RKHS, and hence may be less stable than such target functions, at least outside of the training data, due to approximations both in the linearization (1) and between the finite-width and limiting kernels. Additionally, approximation of certain non-smooth functions in this regime may require a very large number of neurons [52]. Finally, we note that while this "lazy" regime is interesting and could partly explain the success of deep learning methods, it does not explain, for instance, the common behavior in early layers where neurons move to select useful features in the data, such as Gabor filters, as pointed out in [16]. In particular, such behavior might provide better statistical efficiency by adapting to simple structures in the data (see, e.g., [6]), something which is not captured in a kernel regime like the NTK.
It would be interesting to study inductive biases in a regime somewhere in between, where neurons may move at least in the first few layers.

Acknowledgments

This work was supported by the ERC grant number 714381 (SOLARIS project), the ANR 3IA MIAI@Grenoble Alpes, and by the MSR-Inria joint centre. The authors thank Francis Bach and Lénaïc Chizat for useful discussions.

References

[1] Z. Allen-Zhu, Y. Li, and Y. Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

[2] Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-parameterization. In Proceedings of the International Conference on Machine Learning (ICML), 2019.

[3] S. Arora, S. S. Du, W. Hu, Z. Li, R. Salakhutdinov, and R. Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

[4] S. Arora, S. S. Du, W. Hu, Z. Li, and R. Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In Proceedings of the International Conference on Machine Learning (ICML), 2019.

[5] K. Atkinson and W. Han. Spherical harmonics and approximations on the unit sphere: an introduction, volume 2044. Springer Science & Business Media, 2012.

[6] F. Bach. Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research (JMLR), 18(19):1–53, 2017.

[7] F. Bach. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research (JMLR), 18(21):1–38, 2017.

[8] P. L. Bartlett, P. M. Long, G. Lugosi, and A. Tsigler.
Benign overfitting in linear regression. arXiv preprint arXiv:1906.11300, 2019.

[9] R. Basri, D. Jacobs, Y. Kasten, and S. Kritchman. The convergence rate of neural networks for learned functions of different frequencies. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

[10] M. Belkin, S. Ma, and S. Mandal. To understand deep learning we need to understand kernel learning. In Proceedings of the International Conference on Machine Learning (ICML), 2018.

[11] A. Bietti and J. Mairal. Group invariance, stability to deformations, and complexity of deep convolutional representations. Journal of Machine Learning Research (JMLR), 20(25):1–49, 2019.

[12] J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 35(8):1872–1886, 2013.

[13] Y. Cao and Q. Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

[14] A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.

[15] L. Chizat and F. Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

[16] L. Chizat, E. Oyallon, and F. Bach. On lazy training in differentiable programming. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

[17] Y. Cho and L. K. Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems (NIPS), 2009.

[18] F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39(1):1–49, 2002.

[19] A. Daniely, R. Frostig, and Y. Singer.
Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems (NIPS), 2016.

[20] S. S. Du, J. D. Lee, H. Li, L. Wang, and X. Zhai. Gradient descent finds global minima of deep neural networks. In Proceedings of the International Conference on Machine Learning (ICML), 2019.

[21] S. S. Du, X. Zhai, B. Poczos, and A. Singh. Gradient descent provably optimizes over-parameterized neural networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.

[22] C. Efthimiou and C. Frye. Spherical harmonics in p dimensions. World Scientific, 2014.

[23] S. Fischer and I. Steinwart. Sobolev norm learning rates for regularized least-squares algorithm. arXiv preprint arXiv:1702.07254, 2017.

[24] A. Garriga-Alonso, L. Aitchison, and C. E. Rasmussen. Deep convolutional networks as shallow Gaussian processes. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.

[25] B. Ghorbani, S. Mei, T. Misiakiewicz, and A. Montanari. Linearized two-layers neural networks in high dimension. arXiv preprint arXiv:1904.12191, 2019.

[26] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

[27] A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

[28] J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein. Deep neural networks as Gaussian processes. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.

[29] J. Lee, L. Xiao, S. S. Schoenholz, Y. Bahri, J. Sohl-Dickstein, and J. Pennington. Wide neural networks of any depth evolve as linear models under gradient descent.
In Advances in Neural Information Processing Systems (NeurIPS), 2019.

[30] Y. Li and Y. Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

[31] T. Liang and A. Rakhlin. Just interpolate: Kernel "ridgeless" regression can generalize. Annals of Statistics, 2019.

[32] G. Loosli, S. Canu, and L. Bottou. Training invariant support vector machines using selective sampling. In Large Scale Kernel Machines, pages 301–320. MIT Press, Cambridge, MA, 2007.

[33] J. Mairal. End-to-end kernel learning with supervised convolutional kernel networks. In Advances in Neural Information Processing Systems (NIPS), 2016.

[34] J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid. Convolutional kernel networks. In Advances in Neural Information Processing Systems (NIPS), 2014.

[35] S. Mallat. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331–1398, 2012.

[36] A. Matthews, M. Rowland, J. Hron, R. E. Turner, and Z. Ghahramani. Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271, 2018.

[37] S. Mei, T. Misiakiewicz, and A. Montanari. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. In Conference on Learning Theory (COLT), 2019.

[38] S. Mei, A. Montanari, and P.-M. Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.

[39] B. Neyshabur, R. Tomioka, and N. Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.

[40] R. Novak, L. Xiao, Y. Bahri, J. Lee, G. Yang, J. Hron, D. A. Abolafia, J. Pennington, and J. Sohl-Dickstein.
Bayesian deep convolutional networks with many channels are Gaussian processes. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.

[41] A. Pinkus. Approximation theory of the MLP model in neural networks. Acta Numerica, 8:143–195, 1999.

[42] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems (NIPS), 2008.

[43] A. Rudi and L. Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems (NIPS), 2017.

[44] S. Saitoh. Integral transforms, reproducing kernels and their applications, volume 369. CRC Press, 1997.

[45] P. Savarese, I. Evron, D. Soudry, and N. Srebro. How do infinite width bounded norm networks look in function space? In Conference on Learning Theory (COLT), 2019.

[46] B. Schölkopf and A. J. Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. 2001.

[47] A. J. Smola, Z. L. Ovari, and R. C. Williamson. Regularization with dot-product kernels. In Advances in Neural Information Processing Systems (NIPS), 2001.

[48] D. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro. The implicit bias of gradient descent on separable data. Journal of Machine Learning Research (JMLR).

[49] F. Williams, M. Trager, C. Silva, D. Panozzo, D. Zorin, and J. Bruna. Gradient dynamics of shallow low-dimensional ReLU networks. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

[50] B. Xie, Y. Liang, and L. Song. Diverse neural network learns true target functions. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

[51] G. Yang. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation.
arXiv preprint arXiv:1902.04760, 2019.

[52] G. Yehudai and O. Shamir. On the power and limitations of random features for understanding neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

[53] C. Zhang, S. Bengio, and Y. Singer. Are all layers created equal? arXiv preprint arXiv:1902.01996, 2019.

[54] D. Zou, Y. Cao, D. Zhou, and Q. Gu. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. Machine Learning, 2019.