{"title": "Limitations of Lazy Training of Two-layers Neural Network", "book": "Advances in Neural Information Processing Systems", "page_first": 9111, "page_last": 9121, "abstract": "We study the supervised learning problem under either of the following two models:\r\n(1) Feature vectors x_i are d-dimensional Gaussian and responses are y_i = f_*(x_i) for f_* an unknown quadratic function;\r\n(2) Feature vectors x_i are distributed as a mixture of two d-dimensional centered Gaussians, and y_i's are the corresponding class labels. \r\nWe use two-layers neural networks with quadratic activations, and compare three different learning regimes: the random features (RF) regime in which we only train the second-layer weights; the neural tangent (NT) regime in which we train a linearization of the neural network around its initialization; the fully trained neural network (NN) regime in which we train all the weights in the network. We prove that, even for the simple quadratic model of point (1), there is a potentially unbounded gap between the prediction risk achieved in these three training regimes, when the number of neurons is smaller than the ambient dimension. 
When the number of neurons is larger than the number of dimensions, the problem is significantly easier and both NT and NN learning achieve zero risk.", "full_text": "Limitations of Lazy Training of Two-layers Neural Networks

Behrooz Ghorbani
Department of Electrical Engineering, Stanford University
ghorbani@stanford.edu

Song Mei
ICME, Stanford University
songmei@stanford.edu

Theodor Misiakiewicz
Department of Statistics, Stanford University
misiakie@stanford.edu

Andrea Montanari
Department of Electrical Engineering and Department of Statistics, Stanford University
montanar@stanford.edu

Abstract

We study the supervised learning problem under either of the following two models:

(1) Feature vectors $x_i$ are $d$-dimensional Gaussian and responses are $y_i = f_*(x_i)$ for $f_*$ an unknown quadratic function;

(2) Feature vectors $x_i$ are distributed as a mixture of two $d$-dimensional centered Gaussians, and the $y_i$'s are the corresponding class labels.

We use two-layers neural networks with quadratic activations, and compare three different learning regimes: the random features (RF) regime, in which we only train the second-layer weights; the neural tangent (NT) regime, in which we train a linearization of the neural network around its initialization; and the fully trained neural network (NN) regime, in which we train all the weights in the network. We prove that, even for the simple quadratic model of point (1), there is a potentially unbounded gap between the prediction risk achieved in these three training regimes when the number of neurons is smaller than the ambient dimension. When the number of neurons is larger than the number of dimensions, the problem is significantly easier and both NT and NN learning achieve zero risk.

1 Introduction

Consider the supervised learning problem in which we are given i.i.d.
data $\{(x_i, y_i)\}_{i \le n}$, where $x_i \sim P$, a probability distribution over $\mathbb{R}^d$, and $y_i = f_*(x_i)$.¹ We would like to learn the unknown function $f_*$ so as to minimize the prediction risk $E\{(f(x) - f_*(x))^2\}$. We will assume throughout that $f_* \in L^2(\mathbb{R}^d, P)$, i.e. $E\{f_*(x)^2\} < \infty$.

The function class of two-layers neural networks (with $N$ neurons) is defined by:

  $F_{NN,N} = \Big\{ f(x) = c + \sum_{i=1}^N a_i \sigma(\langle w_i, x\rangle) \,:\, c, a_i \in \mathbb{R},\ w_i \in \mathbb{R}^d,\ i \in [N] \Big\}\,.$  (1)

Classical universal approximation results [9] imply that any $f_* \in L^2(\mathbb{R}^d, P)$ can be approximated arbitrarily well by an element in $F_{NN} = \cup_N F_{NN,N}$ (under mild conditions). At the same time, we know that such an approximation can be constructed in polynomial time only for a subset of functions $f_*$. Namely, there exist sets of functions $f_*$ for which no algorithm can construct a good approximation in $F_{NN,N}$ in polynomial time [19, 24], even having access to the full distribution $P$ (under certain complexity-theoretic assumptions).

¹ For simplicity, we focus our introductory discussion on the case in which the response $y_i$ is a noiseless function of the feature vector $x_i$: some of our results go beyond this setting.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

These facts lead to the following central question in neural network theory:

  For which subset of functions $F_{tract} \subseteq L^2(\mathbb{R}^d, P)$ can a neural network approximation be learnt efficiently?

Here 'efficiently' can be formalized in multiple ways: in this paper we will focus on learning via stochastic gradient descent.

A significant amount of work has been devoted to two subclasses of $F_{NN,N}$, which we will refer to as the random features model (RF) [22] and the neural tangent model (NT) [18]:

  $F_{RF,N}(W) = \Big\{ f_N(x) = \sum_{i=1}^N a_i \sigma(\langle w_i, x\rangle) \,:\, a_i \in \mathbb{R},\ i \in [N] \Big\}\,,$  (2)

  $F_{NT,N}(W) = \Big\{ f_N(x) = c + \sum_{i=1}^N \sigma'(\langle w_i, x\rangle)\langle a_i, x\rangle \,:\, c \in \mathbb{R},\ a_i \in \mathbb{R}^d,\ i \in [N] \Big\}\,.$  (3)

Here $W = (w_1, \dots, w_N) \in \mathbb{R}^{d \times N}$ are weights which are not optimized and instead drawn at random. Throughout this paper, we will assume $(w_i)_{i \le N} \sim_{iid} N(0, \Gamma)$.²

We can think of RF and NT as tractable inner bounds of the class of neural networks NN:

• Tractable. Both $F_{RF,N}(W)$ and $F_{NT,N}(W)$ are finite-dimensional linear spaces, and minimizing the empirical risk over these classes can be performed efficiently.

• Inner bounds. Indeed $F_{RF,N}(W) \subseteq F_{NN,N}$: the random features model is simply obtained by fixing all the first-layer weights. Further, $F_{NT,N}(W) \subseteq \mathrm{cl}(F_{NN,2N})$ (the closure of the class of neural networks with $2N$ neurons). This follows from $\varepsilon^{-1}[\sigma(\langle w_i + \varepsilon a_i, x\rangle) - \sigma(\langle w_i, x\rangle)] = \langle a_i, x\rangle \sigma'(\langle w_i, x\rangle) + o(1)$ as $\varepsilon \to 0$.

It is possible to show that the class of neural networks NN is significantly more expressive than the two linearizations RF and NT; see e.g. [26, 15].
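This first-order expansion is easy to check numerically. The following sketch (our own illustration, not part of the paper) verifies that, for the quadratic activation $\sigma(u) = u^2$, the finite difference $\varepsilon^{-1}[\sigma(\langle w + \varepsilon a, x\rangle) - \sigma(\langle w, x\rangle)]$ deviates from the NT feature $\langle a, x\rangle\,\sigma'(\langle w, x\rangle)$ by exactly $\varepsilon \langle a, x\rangle^2$, so the error shrinks linearly in $\varepsilon$:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
w, a, x = rng.standard_normal((3, d))

sigma = lambda u: u ** 2      # quadratic activation
dsigma = lambda u: 2 * u      # its derivative sigma'

nt_feature = np.dot(a, x) * dsigma(np.dot(w, x))

# for sigma(u) = u^2 the finite-difference error equals eps * <a, x>^2 exactly
eps_grid = [1e-1, 1e-2, 1e-3]
errors = []
for eps in eps_grid:
    fd = (sigma(np.dot(w + eps * a, x)) - sigma(np.dot(w, x))) / eps
    errors.append(abs(fd - nt_feature))
```

Shrinking the step size recovers the NT feature, which is the sense in which NT is a linearization of the network around its (random) first-layer weights.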
In particular, [15] shows that, if the feature vectors $x_i$ are uniformly random over the $d$-dimensional sphere, and $N, d$ are large with $N = O(d)$, then $F_{RF,N}(W)$ can only capture linear functions, while $F_{NT,N}(W)$ can only capture quadratic functions.

Despite these findings, it could still be that the subset of functions $F_{tract} \subseteq L^2(\mathbb{R}^d, P)$ for which we can efficiently learn a neural network approximation is well described by RF and NT. Indeed, several recent papers show that –in a certain highly overparametrized regime– this description is accurate [12, 11, 20]. A specific counterexample is given in [26]: if the function to be learnt is a single neuron $f_*(x) = \sigma(\langle w_*, x\rangle)$, then gradient descent (in the space of neural networks with $N = 1$ neurons) learns it efficiently [21]; on the other hand, RF and NT require a number of neurons exponential in the dimension to achieve vanishing risk.

1.1 Summary of Main Results

In this paper we explore systematically the gap between RF, NT and NN, by considering two specific data distributions:

(qf) Quadratic functions: feature vectors are distributed according to $x_i \sim N(0, I_d)$ and responses are quadratic functions, $y_i = f_*(x_i) \equiv b_0 + \langle x_i, B x_i\rangle$ with $B \succeq 0$.

(mg) Mixture of Gaussians: $y_i = \pm 1$ with equal probability $1/2$, and $x_i | y_i = +1 \sim N(0, \Sigma^{(1)})$, $x_i | y_i = -1 \sim N(0, \Sigma^{(2)})$.

² Notice that we do not add an offset in the RF model, and will limit ourselves to target functions $f_*$ that are centered: this choice simplifies some calculations without modifying the results.

Figure 1: Left frame: Prediction (test) error of a two-layers neural network in fitting a quadratic function in $d = 450$ dimensions, as a function of the number of neurons $N$.
We consider the large sample (population) limit $n \to \infty$ and compare three training regimes: random features (RF), neural tangent (NT), and fully trained neural networks (NN). Lines are analytical predictions obtained in this paper, and dots are empirical results. Right frame: Evolution of the risk for NT and NN with the number of samples. Dashed lines are our analytic predictions for the large-$n$ limit.

Let us emphasize that the choice of quadratic functions in model (qf) is not arbitrary: in a sense, it is the most favorable case for NT training. Indeed, [15] proves that³ (when $N = O(d)$): (i) third- and higher-order polynomials cannot be approximated nontrivially by $F_{NT,N}(W)$; (ii) linear functions are already well approximated within $F_{RF,N}(W)$.

For clarity, we will first summarize our results for the model (qf), and then discuss generalizations to (mg). The prediction risk achieved within any of the regimes RF, NT, NN is defined by

  $R_{M,N}(f_*) = \min_{\hat f \in F_{M,N}(W)} E\big\{(f_*(x) - \hat f(x))^2\big\}\,, \qquad M \in \{\mathrm{RF}, \mathrm{NT}, \mathrm{NN}\}\,,$  (4)

  $R_{NN,N}(f_*; \ell, \varepsilon) = E\big\{(f_*(x) - \hat f_{SGD}(x; \ell, \varepsilon))^2\big\}\,,$  (5)

where $\hat f_{SGD}(\,\cdot\,; \ell, \varepsilon)$ is the neural network produced by $\ell$ steps of stochastic gradient descent (SGD), where each sample is used once and the stepsize is set to $\varepsilon$ (see Section 2.3 for a complete definition). Notice that the quantities $R_{M,N}(f_*)$, $R_{NN,N}(f_*; \ell, \varepsilon)$ are random variables because of the random weights $W$, and of the additional randomness in SGD.

Our results are summarized by Figure 1, which compares the risk achieved by the three approaches above in the population limit $n \to \infty$, using quadratic activations $\sigma(u) = u^2 + c_0$. We consider the large-network, high-dimensional regime $N, d \to \infty$, with $N/d \to \rho \in (0, \infty)$.
Figure 1 reports the risk achieved by the various approaches in numerical simulations, and compares them with our theoretical predictions for each of the three regimes RF, NT, and NN, which are detailed in the next sections. The agreement between analytical predictions and simulations is excellent but, more importantly, a clear picture emerges. We can highlight a few phenomena that are illustrated in this figure:

Random features do not capture quadratic functions. The random features risk $R_{RF,N}(f_*)$ remains generally bounded away from zero for all values of $\rho = N/d$. It is further highly dependent on the distribution of the weight vectors $w_i \sim N(0, \Gamma)$. Section 2.1 characterizes this dependence explicitly, for general activation functions $\sigma$. For large $\rho = N/d$, the optimal distribution of the weight vectors uses covariance $\Gamma_* \propto B$, but even in this case the risk is bounded away from zero unless $\rho \to \infty$.

The neural tangent model achieves vanishing risk on quadratic functions for $N > d$. However, the risk is bounded away from zero if $N/d \to \rho \in (0, 1)$. Section 2.1 provides explicit expressions for the minimum risk as a function of $\rho$. Roughly speaking, NT fits the quadratic function $f_*$ along a random subspace determined by the random weight vectors $w_i$. For $N \ge d$, these vectors span the whole space $\mathbb{R}^d$ and hence the limiting risk vanishes.

³ Note that [15] considers feature vectors $x_i$ uniformly random over the sphere rather than Gaussian. However, the results of [15] can be generalized, with certain modifications, to the Gaussian case. Roughly speaking, for Gaussian features, NT with $N = O(d)$ neurons can represent quadratic functions, and a low-dimensional subspace of higher-order polynomials.
For $N < d$ only a fraction of the space is spanned, and not the most important one (i.e. not the one spanned by the principal eigendirections of $B$).

Fully trained neural networks achieve vanishing risk on quadratic functions for $N > d$: this is to be expected on the basis of the previous point. For $N/d \to \rho \in (0, 1)$ the risk is generally bounded away from $0$, but its value is smaller than for the neural tangent model. Namely, in Section 2.3 we give an explicit expression for the asymptotic risk (holding for $B \succeq 0$) implying that, for some $\mathrm{GAP}(\rho) > 0$ (independent of $N, d$),

  $\lim_{t\to\infty} \lim_{\varepsilon\to 0} R_{NN,N}(f_*; \ell = t/\varepsilon, \varepsilon) = \inf_{f \in F_{NN,N}} E\{(f(x) - f_*(x))^2\} \le R_{NT,N}(f_*) - \mathrm{GAP}(\rho)\,.$  (6)

We prove this result by showing convergence of SGD to gradient flow in the population risk, and then proving a strict saddle property for the population risk. As a consequence, the limiting risk on the left-hand side coincides with the minimum risk $\inf_{f \in F_{NN,N}} E\{(f(x) - f_*(x))^2\}$ over the whole space of neural networks. We characterize the latter and show that it amounts to fitting $f_*$ along the $N$ principal eigendirections of $B$. This mechanism is very different from the one arising in the NT regime.

The picture emerging from these findings is remarkably simple. The fully trained network learns the most important eigendirections of the quadratic function $f_*(x)$ and fits them, hence surpassing the NT model, which is confined to a random set of directions.

Let us emphasize that the above separation between NT and NN is established only for $N \le d$. It is natural to wonder whether this separation generalizes to $N > d$ for more complicated classes of functions, or if instead it always vanishes for wide networks. We expect the separation to generalize to $N > d$ by considering higher-order polynomials instead of quadratic functions.
Partial evidence in this direction is provided by [15]: for third- or higher-order polynomials, NT does not achieve vanishing risk at any $\rho \in (0, \infty)$. The mechanism unveiled by our analysis of quadratic functions is potentially more general: neural networks are superior to linearized models such as RF or NT because they can learn a good representation of the data.

Our results for quadratic functions are formally presented in Section 2. In order to confirm that the picture we obtain is general, we establish similar results for mixtures of Gaussians in Section 3. More precisely, our results on RF and NT for mixtures of Gaussians are very similar to the quadratic case. In this model, however, we do not prove a convergence result for NN analogous to (6), although we believe it should be possible by the same approach outlined above. On the other hand, we characterize the minimum prediction risk over neural networks, $\inf_{f \in F_{NN,N}} E\{(y - f(x))^2\}$, and prove it is strictly smaller than the minimum achieved by RF and NT. Finally, Section 4 contains background on our numerical experiments.

1.2 Further Related Work

The connection (and differences) between two-layers neural networks and random features models has been the object of several papers since the original work of Rahimi and Recht [22]. An incomplete list of references includes [5, 2, 6, 7, 23]. Our analysis contributes to this line of work by establishing a sharp asymptotic characterization, although for more specific data distributions. Sharp results have recently been proven in [15] for the special case of random weights $w_i$ uniformly distributed over a $d$-dimensional sphere. Here we consider the more general case of anisotropic random features with covariance $\Gamma \not\propto I$. This clarifies a key reason for the suboptimality of random features: the data representation is not adapted to the target function $f_*$.
We focus on the population limit $n \to \infty$. Complementary results characterizing the variance as a function of $n$ are given in [17].

The NT model (3) is much more recent [18]. Several papers show that SGD optimization of the original neural network is well approximated by optimization within the model NT as long as the number of neurons is large compared to a polynomial in the sample size, $N \gg n^{c_0}$ [12, 11, 3, 28]. Empirical evidence in the same direction was presented in [20, 4].

Chizat and Bach [8] clarified that any nonlinear statistical model can be approximated by a linear one in an early (lazy) training regime. The basic argument is quite simple. Given a model $x \mapsto f(x; \theta)$ with parameters $\theta$, we can Taylor-expand around a random initialization $\theta_0$. Setting $\theta = \theta_0 + \beta$, we get

  $f(x; \theta) \approx f(x; \theta_0) + \beta^T \nabla_\theta f(x; \theta_0) \approx \beta^T \nabla_\theta f(x; \theta_0)\,.$  (7)

Here the second approximation holds since, for many random initializations, $f(x; \theta_0) \approx 0$ because of random cancellations. The resulting model $\beta^T \nabla_\theta f(x; \theta_0)$ is linear, with random features.

Our objective is complementary to this literature: we prove that RF and NT have limited approximation power, and that a significant gain can be achieved by full training.

Finally, our analysis of fully trained networks connects to the ample literature on non-convex statistical estimation. For two-layers neural networks with quadratic activations, Soltanolkotabi, Javanmard and Lee [25] showed that, as long as the number of neurons satisfies $N \ge 2d$, there are no spurious local minimizers. Du and Lee [10] showed that the same holds as long as $N \ge d \wedge \sqrt{2n}$, where $n$ is the sample size. Zhong et al. [27] established local convexity properties around global optima.
Further related landscape results include [14, 16, 13].

2 Main Results: Quadratic Functions

As mentioned in the previous section, our results for quadratic functions (qf) assume $x_i \sim N(0, I_d)$ and $y_i = f_*(x_i)$, where

  $f_*(x) \equiv b_0 + \langle x, Bx\rangle\,.$  (8)

2.1 Random Features

We consider the random features model with first-layer weights $(w_i)_{i \le N} \sim N(0, \Gamma)$. We make the following assumptions:

A1. The activation function $\sigma$ verifies $\sigma(u)^2 \le c_0 \exp(c_1 u^2/2)$ for some constants $c_0, c_1$ with $c_1 < 1$. Further, it is nonlinear (i.e. there are no $a_0, a_1 \in \mathbb{R}$ such that $\sigma(u) = a_0 + a_1 u$ almost everywhere).

A2. We fix the weights' normalization by requiring $E\{\|w_i\|_2^2\} = \mathrm{Tr}(\Gamma) = 1$. We assume the operator norm $\|d \cdot \Gamma\|_{op} \le C$ for some constant $C$, and that the empirical spectral distribution of $d \cdot \Gamma$ converges weakly, as $d \to \infty$, to a probability distribution $D$ over $\mathbb{R}_{\ge 0}$.

Theorem 1. Let $f_*$ be a quadratic function as per Eq. (8), with $E(f_*) = 0$. Assume conditions A1 and A2 to hold. Denote by $\lambda_k = E_{G \sim N(0,1)}[\sigma(G)\,\mathrm{He}_k(G)]$ the $k$-th Hermite coefficient of $\sigma$ and assume $\lambda_0 = 0$. Define $\tilde\lambda = E_{G \sim N(0,1)}[\sigma(G)^2] - \lambda_1^2$.
Let $\psi > 0$ be the unique solution of

  $-\tilde\lambda = -\frac{\rho}{\psi} + \int \frac{\lambda_1^2\, t}{1 + \lambda_1^2\, t\, \psi}\, D(dt)\,.$  (9)

Then, the following holds as $N, d \to \infty$ with $N/d \to \rho$:

  $R_{RF,N}(f_*) = \|f_*\|_{L^2}^2 \left( 1 - \frac{2d\,\psi\lambda_2^2\, \langle \Gamma, B\rangle^2}{\|B\|_F^2 \big(2 + 2d\,\psi\lambda_2^2\, \|\Gamma\|_F^2\big)} \right) + o_{d,P}(1)\,.$  (10)

Moreover, assuming $\langle \Gamma, B\rangle^2/(\|\Gamma\|_F^2 \|B\|_F^2)$ to have a limit as $d \to \infty$, (10) simplifies as follows for $\rho \to \infty$:

  $\lim_{\rho\to\infty}\ \lim_{d\to\infty,\, N/d\to\rho} \frac{R_{RF,N}(f_*)}{\|f_*\|_{L^2}^2} = \lim_{d\to\infty} \left( 1 - \frac{\langle \Gamma, B\rangle^2}{\|\Gamma\|_F^2 \|B\|_F^2} \right)\,.$  (11)

Notice that $R_{RF,N}(f_*)/\|f_*\|_{L^2}^2$ is the RF risk normalized by the risk of the trivial predictor $f(x) = 0$. The asymptotic result in (11) is remarkably simple. By Cauchy-Schwarz, the normalized risk is bounded away from zero even as the number of neurons per dimension diverges, $\rho = N/d \to \infty$, unless $\Gamma \propto B$, i.e. unless the random features are perfectly aligned with the function to be learned. For isotropic random features, the right-hand side of Eq. (11) reduces to $1 - \mathrm{Tr}(B)^2/(d\|B\|_F^2)$. In particular, RF performs very poorly when $\mathrm{Tr}(B) \ll \sqrt{d}\,\|B\|_F$, and no better than the trivial predictor $f(x) = 0$ if $\mathrm{Tr}(B) = 0$.

Notice that the above result applies to quite general activation functions. The formulas simplify significantly for quadratic activations.

Corollary 1. Under the assumptions of Theorem 1, further assume $\sigma(x) = x^2 - 1$.
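As a concrete illustration (our own sketch, not part of the paper's analysis), the fixed-point equation for $\psi$ can be solved by simple iteration. For isotropic weights $\Gamma = I/d$, the limiting measure $D$ is a point mass at $t = 1$; for $\sigma(x) = x^2 - 1$ one has $\lambda_1 = 0$ and $\tilde\lambda = E[(G^2-1)^2] = 2$, so the fixed point collapses to $\psi = \rho/2$. The non-quadratic coefficient values below are hypothetical placeholders:

```python
import numpy as np

def solve_psi(rho, lam1_sq, lam_tilde, iters=200):
    """Solve -lam_tilde = -rho/psi + lam1_sq / (1 + lam1_sq * psi),
    i.e. the fixed point when D is a point mass at t = 1 (Gamma = I/d),
    by rearranging into psi = rho / (lam_tilde + lam1_sq / (1 + lam1_sq * psi))
    and iterating."""
    psi = 1.0
    for _ in range(iters):
        psi = rho / (lam_tilde + lam1_sq / (1.0 + lam1_sq * psi))
    return psi

# quadratic activation sigma(x) = x^2 - 1: lambda_1 = 0, tilde-lambda = 2
psi_quad = solve_psi(rho=0.5, lam1_sq=0.0, lam_tilde=2.0)   # equals rho / 2

# hypothetical activation with lambda_1^2 = 0.25 and tilde-lambda = 0.1
psi_gen = solve_psi(rho=1.0, lam1_sq=0.25, lam_tilde=0.1)
residual = -0.1 - (-1.0 / psi_gen + 0.25 / (1.0 + 0.25 * psi_gen))
```

The residual check confirms that the iteration has converged to a solution of the fixed-point equation.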
Then we have, as $N, d \to \infty$ with $N/d \to \rho$:

  $R_{RF,N}(f_*) = \|f_*\|_{L^2}^2 \left( 1 - \frac{\rho d\, \langle B, \Gamma\rangle^2}{\big(1 + \rho d\, \|\Gamma\|_F^2\big)\, \|B\|_F^2} \right) + o_{d,P}(1)\,.$  (12)

The right-hand side of Eq. (12) is plotted in Fig. 1 for isotropic features $\Gamma = I/d$, and for optimal features $\Gamma = \Gamma_* \propto B$.

2.2 Neural Tangent

For the NT regime, we focus on quadratic activations and isotropic weights $w_i \sim N(0, I_d/d)$.

Theorem 2. Let $f_*$ be a quadratic function as per Eq. (8), with $E(f_*) = 0$, and assume $\sigma(x) = x^2$. Then, we have for $N, d \to \infty$ with $N/d \to \rho$:

  $E[R_{NT,N}(f_*)] = \|f_*\|_{L^2}^2 \left\{ (1-\rho)_+^2 \left(1 - \frac{\mathrm{Tr}(B)^2}{d\|B\|_F^2}\right) + (1-\rho)_+ \frac{\mathrm{Tr}(B)^2}{d\|B\|_F^2} \right\} + o_d(1)\,,$

where the expectation is taken over $w_i \sim_{i.i.d.} N(0, I_d/d)$.

As for the case of random features, the NT risk depends on the target function $f_*(x)$ only through the ratio $\mathrm{Tr}(B)^2/(d\|B\|_F^2)$. However, the normalized risk is always smaller than the baseline: by Cauchy-Schwarz, $E[R_{NT,N}(f_*)] \le (1-\rho)_+ \|f_*\|_{L^2}^2 + o_d(1)$, with this worst case achieved when $B \propto I$. In particular, $E[R_{NT,N}(f_*)]$ vanishes asymptotically for $\rho \ge 1$. This comes at the price of a larger number of parameters to be fitted, namely $Nd$ instead of $N$.

2.3 Neural Network

For the analysis of SGD-trained neural networks, we assume $f_*$ to be a quadratic function as per Eq. (8), but we now restrict to the positive semidefinite case $B \succeq 0$.
We consider quadratic activations $\sigma(x) = x^2$, and we fix the second-layer weights to be $1$:

  $\hat f(x; W, c) = \sum_{i=1}^N \langle w_i, x\rangle^2 + c\,.$

Notice that we use an explicit offset to account for the mismatch in means between $f_*$ and $\hat f$. It is useful to introduce the population risk, as a function of the network parameters $W, c$:

  $L(W, c) = E[(f_*(x) - \hat f(x; W, c))^2] = E\Big[\big(\langle xx^T, B - WW^T\rangle + b_0 - c\big)^2\Big]\,.$

Here expectation is with respect to $x \sim N(0, I_d)$. We study a one-pass version of SGD, whereby at each iteration $k$ we perform a stochastic gradient step with respect to a fresh sample $(x_k, f_*(x_k))$:

  $(W^{k+1}, c^{k+1}) = (W^k, c^k) - \varepsilon\, \nabla_{W,c} \big(f_*(x_k) - \hat f(x_k; W^k, c^k)\big)^2\,,$

and define

  $R_{NN,N}(f_*; \ell, \varepsilon) \equiv L(W^\ell, c^\ell) = E_{x \sim N(0, I_d)}[(f_*(x) - \hat f(x; W^\ell, c^\ell))^2]\,.$

Notice that this is the risk with respect to a new sample, independent from the ones used to train $W^\ell, c^\ell$: it is the test error. Also notice that $\ell$ is the number of SGD steps but also (because of the one-pass assumption) the sample size. Our next theorem characterizes the asymptotic risk achieved by SGD. This prediction is reported in Figure 1.

Theorem 3. Let $f_*$ be a quadratic function as per Eq. (8), with $B \succeq 0$. Consider SGD with initialization $(W^0, c^0)$ whose distribution is absolutely continuous with respect to the Lebesgue measure.
Let $R_{NN,N}(f_*; \ell, \varepsilon)$ be the test prediction error after $\ell$ SGD steps with step size $\varepsilon$. Then we have, for any $\delta > 0$ (probability is over the initialization $(W^0, c^0)$ and the samples),

  $\lim_{t\to\infty} \lim_{\varepsilon\to 0} P\Big(\Big| R_{NN,N}(f_*; \ell = t/\varepsilon, \varepsilon) - \inf_{W,c} L(W, c) \Big| \ge \delta\Big) = 0\,, \qquad \inf_{W,c} L(W, c) = 2 \sum_{i=N+1}^{d} \lambda_i(B)^2\,,$

where $\lambda_1(B) \ge \lambda_2(B) \ge \cdots \ge \lambda_d(B)$ are the ordered eigenvalues of $B$.

Figure 2: Left frame: Prediction (test) error of a two-layers neural network in fitting a mixture of Gaussians in $d = 450$ dimensions, as a function of the number of neurons $N$, within the three regimes RF, NT, NN. Lines are analytical predictions obtained in this paper, and dots are empirical results (both in the population limit). The dotted line is the Bayes error. Right frame: Evolution of the risk for NT and NN with the number of samples.

The proof of this theorem depends on the following proposition concerning the landscape of the population risk, which is of independent interest.

Proposition 1. Let $f_*$ be a quadratic function as per Eq. (8), with $B \succeq 0$. For any sub-level set of the risk function $\Omega(B_0) = \{x = (W, c) : L(W, c) \le B_0\}$, there exist constants $\varepsilon, \delta > 0$ such that $L$ is $(\varepsilon, \delta)$-strict saddle in the region $\Omega(B_0)$. Namely, for any $x \in \Omega(B_0)$ with $\|\nabla L(x)\|_2 \le \varepsilon$, we have $\lambda_{min}(\nabla^2 L(x)) < -\delta$.

We can now compare the risk achieved within the regimes RF, NT and NN.
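First, the variational characterization $\inf_{W,c} L(W, c) = 2\sum_{i > N} \lambda_i(B)^2$ of Theorem 3 can be checked directly in finite dimension: for $x \sim N(0, I_d)$ and symmetric $M$, $\langle xx^T, M\rangle$ has mean $\mathrm{Tr}(M)$ and variance $2\|M\|_F^2$, so $L(W, c) = 2\|B - WW^T\|_F^2 + (\mathrm{Tr}(B - WW^T) + b_0 - c)^2$, which is minimized by taking $WW^T$ to be the top-$N$ truncation of the eigendecomposition of $B$. A sketch of this check (our own, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)
d, N = 8, 3

A = rng.standard_normal((d, d))
B = A @ A.T / d                      # random positive semidefinite target
evals, evecs = np.linalg.eigh(B)     # eigh returns ascending eigenvalues
evals, evecs = evals[::-1], evecs[:, ::-1]   # reorder descending
b0 = -np.trace(B)                    # makes f_* centered

def population_risk(W, c):
    # closed form of L(W, c): <xx^T, M> has mean Tr(M) and variance 2 ||M||_F^2
    M = B - W @ W.T
    return 2 * np.linalg.norm(M, 'fro') ** 2 + (np.trace(M) + b0 - c) ** 2

# candidate optimum: W W^T = top-N eigenvalue truncation of B, matching offset
W_opt = evecs[:, :N] * np.sqrt(evals[:N])
c_opt = np.trace(B - W_opt @ W_opt.T) + b0
risk = population_risk(W_opt, c_opt)
theory = 2 * np.sum(evals[N:] ** 2)
```

The achieved risk matches $2\sum_{i > N}\lambda_i(B)^2$, i.e. the network fits the $N$ principal eigendirections of $B$ exactly.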
Gathering the results of Corollary 1 and Theorems 2 and 3 (using $w_i \sim N(0, I/d)$ for RF and NT), we obtain

  $\frac{R_{M,N}(f_*)}{\|f_*\|_{L^2}^2} \approx \begin{cases} 1 - \frac{\rho}{1+\rho}\cdot\frac{\mathrm{Tr}(B)^2}{d\|B\|_F^2} & \text{for } M = \mathrm{RF},\\ (1-\rho)_+^2 + \rho(1-\rho)_+\cdot\frac{\mathrm{Tr}(B)^2}{d\|B\|_F^2} & \text{for } M = \mathrm{NT},\\ 1 - \frac{1}{\|B\|_F^2}\sum_{i=1}^{d\wedge N} \lambda_i(B)^2 & \text{for } M = \mathrm{NN}. \end{cases}$  (13)

As anticipated, NN learns the most important directions in $f_*$, while RF and NT do not.

3 Main Results: Mixture of Gaussians

In this section, we consider the mixture of Gaussians setting (mg): $y_i = \pm 1$ with equal probability $1/2$, and $x_i | y_i = +1 \sim N(0, \Sigma^{(1)})$, $x_i | y_i = -1 \sim N(0, \Sigma^{(2)})$. We parametrize the covariances as $\Sigma^{(1)} = \Sigma - \Delta$ and $\Sigma^{(2)} = \Sigma + \Delta$, and will make the following assumptions:

M1. There exist constants $0 < c_1 < c_2$ such that $c_1 I_d \preceq \Sigma \preceq c_2 I_d$;

M2. $\|\Delta\|_{op} = \Theta_d(1/\sqrt{d})$.

The scaling in assumption M2 ensures the signal-to-noise ratio to be of order one. If the eigenvalues of $\Delta$ are much larger than $1/\sqrt{d}$, then it is easy to distinguish the two classes with high probability (they are asymptotically mutually singular). If $\|\Delta\|_{op} = o_d(1/\sqrt{d})$, then no non-trivial classifier exists.

We denote by $P_{\Sigma,\Delta}$ the joint distribution of $(y, x)$ under the (mg) model, and by $E_{\Sigma,\Delta}$ or $E_{(y,x)}$ the corresponding expectation.
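The three asymptotic curves in (13) are straightforward to tabulate. The sketch below (our own illustration, with hypothetical eigenvalues for $B$) evaluates them on a small strictly decreasing spectrum, checking that the NN risk never exceeds the NT risk and that both vanish at $\rho = 1$:

```python
import numpy as np

def nn_risk(lams, N):
    """Normalized NN risk in (13): fit the top min(N, d) eigendirections of B."""
    lam = np.sort(np.asarray(lams, dtype=float))[::-1]
    return 1.0 - np.sum(lam[: min(N, len(lam))] ** 2) / np.sum(lam ** 2)

def nt_risk(lams, N):
    """Normalized NT risk in (13): depends on B only via Tr(B)^2 / (d ||B||_F^2)."""
    lam = np.asarray(lams, dtype=float)
    d = len(lam)
    rho = N / d
    T = np.sum(lam) ** 2 / (d * np.sum(lam ** 2))
    return max(1.0 - rho, 0.0) ** 2 + rho * max(1.0 - rho, 0.0) * T

# hypothetical eigenvalues of B, sorted in decreasing order
lams = [1.0, 0.5, 0.25, 0.125]
pairs = [(nn_risk(lams, N), nt_risk(lams, N)) for N in range(1, 5)]
```

For a flat spectrum ($B \propto I$) the two formulas coincide; a decaying spectrum is exactly the situation where the fully trained network pulls ahead.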
The minimum prediction risk within any of the regimes RF, NT, NN is defined by

  $R_{M,N}(P) = \inf_{f \in F_{M,N}} E_{(y,x)}\{(y - f(x))^2\}\,, \qquad M \in \{\mathrm{RF}, \mathrm{NT}, \mathrm{NN}\}\,.$

As mentioned in the introduction, the picture emerging from our analysis of the (mg) model is aligned with the results obtained in the previous section. We will limit ourselves to stating the results, without repeating comments that were made above. Our results are compared with simulations in Figure 2. Notice that, in this case, the Bayes error (MMSE) is not achieved even for very wide networks $N/d \gg 1$, either by NT or by NN.

3.1 Random Features

As in the previous section, we generate random first-layer weights $(w_i)_{i \le N} \sim N(0, \Gamma)$. We consider a general activation function satisfying condition A1. We make the following assumption on $\Gamma, \Sigma$:

B2. We fix the weights' normalization by requiring $E\{\langle w_i, \Sigma w_i\rangle\} = \mathrm{Tr}(\Gamma\Sigma) = 1$. We assume that there exists a constant $C$ such that $\|d \cdot \Gamma\|_{op} \le C$, and that the empirical spectral distribution of $d \cdot (\Gamma^{1/2}\Sigma\Gamma^{1/2})$ converges weakly, as $d \to \infty$, to a probability distribution $D$ over $\mathbb{R}_{\ge 0}$.

Theorem 4. Consider the (mg) distribution, with $\Sigma$ and $\Delta$ satisfying conditions M1 and M2. Assume conditions A1 and B2 to hold. Define $\lambda_k = E_{G \sim N(0,1)}[\sigma(G)\,\mathrm{He}_k(G)]$ to be the $k$-th Hermite coefficient of $\sigma$ and assume, without loss of generality, $\lambda_0 = 0$. Define $\tilde\lambda = E[\sigma(G)^2] - \lambda_1^2$. Let $\psi > 0$ be the unique solution of

  $-\tilde\lambda = -\frac{\rho}{\psi} + \int \frac{\lambda_1^2\, t}{1 + \lambda_1^2\, t\, \psi}\, D(dt)\,.$  (14)

Define $\zeta_1(d) \equiv d\, \mathrm{Tr}(\Sigma\Gamma\Sigma\Gamma)/2$ and $\zeta_2(d) \equiv d\, \mathrm{Tr}(\Delta\Gamma)^2/4$. Then, the following holds as $N, d \to \infty$ with $N/d \to \rho$:

  $R_{RF,N}(P_{\Sigma,\Delta}) = \frac{1 + \zeta_1(d)\,\lambda_2^2\,\psi}{1 + \big(\zeta_1(d) + \zeta_2(d)\big)\,\lambda_2^2\,\psi} + o_{d,P}(1)\,.$  (15)

Moreover, assume $\zeta_1(d), \zeta_2(d)$ to have limits as $d \to \infty$, i.e. $\lim_{d\to\infty} \zeta_j(d) = \zeta_{j,*}$ for $j = 1, 2$. Then the following holds as $\rho \to \infty$:

  $\lim_{\rho\to\infty}\ \lim_{d\to\infty,\, N/d\to\rho} R_{RF,N}(P_{\Sigma,\Delta}) = \frac{\zeta_{1,*}}{\zeta_{1,*} + \zeta_{2,*}}\,.$  (16)

3.2 Neural Tangent

For the NT model, we first state our theorem for general $\Sigma$ and $w_i \sim N(0, \Gamma)$, and then give an explicit concentration result in the case $\Sigma = I$ and isotropic weights $w_i \sim N(0, I/d)$.

Theorem 5. Let $P_{\Sigma,\Delta}$ be the mixture of Gaussians distribution, with $\Sigma$ and $\Delta$ satisfying conditions M1 and M2. Further assume $\sigma(x) = x^2$. Then, the following holds for almost every $W \in \mathbb{R}^{d\times N}$ (with respect to the Lebesgue measure):

  $R_{NT,N}(P_{\Sigma,\Delta}) = \frac{2}{2 + \|\tilde\Delta\|_F^2 - \|P^{\perp} \tilde\Delta P^{\perp}\|_F^2} + o_d(1)\,,$

where $\tilde\Delta = \Sigma^{-1/2}\Delta\Sigma^{-1/2}$ and $P^{\perp} = I - \Sigma^{1/2}W(W^T\Sigma W)^{-1}W^T\Sigma^{1/2}$ is the projection perpendicular to $\mathrm{span}(\Sigma^{1/2}W)$.

Assuming further that $\Sigma = I$ and $w_i \sim_{i.i.d.}$
$N(0, I_d/d)$, we have as $N, d \to \infty$ with $N/d \to \rho$:

  $R_{NT,N}(P_{I,\Delta}) = \frac{2}{2 + \kappa(\rho, \Delta)\,\|\Delta\|_F^2} + o_{d,P}(1)\,, \qquad \kappa(\rho, \Delta) = 1 - (1-\rho)_+^2 \left(1 - \frac{\mathrm{Tr}(\Delta)^2}{d\|\Delta\|_F^2}\right) - (1-\rho)_+ \frac{\mathrm{Tr}(\Delta)^2}{d\|\Delta\|_F^2}\,.$

In particular, for $\rho \ge 1$, we have (for almost every $W$)

  $R_{NT,N}(P_{I,\Delta}) = \frac{1}{1 + \|\Delta\|_F^2/2} + o_{d,P}(1)\,.$

3.3 Neural Network

We consider quadratic activations with general offset and coefficients, $\hat f(x; W, a, c) = \sum_{i=1}^N a_i \langle w_i, x\rangle^2 + c$. This is optimized over $(a_i, w_i)_{i \le N}$ and $c$.

Theorem 6. Let $P_{\Sigma,\Delta}$ be the mixture of Gaussians distribution, with $\Sigma$ and $\Delta$ satisfying conditions M1 and M2. Then, the following holds:

  $R_{NN,N}(P_{\Sigma,\Delta}) = \frac{2}{2 + \sum_{i=1}^{N\wedge d} \lambda_i(\tilde\Delta)^2} + o_d(1)\,,$

where $\tilde\Delta = \Sigma^{-1/2}\Delta\Sigma^{-1/2}$ and $\lambda_1(\tilde\Delta) \ge \lambda_2(\tilde\Delta) \ge \cdots \ge \lambda_d(\tilde\Delta)$ are the singular values of $\tilde\Delta$. In particular, for $\rho \ge 1$, we have

  $R_{NN,N}(P_{I,\Delta}) = \frac{1}{1 + \|\tilde\Delta\|_F^2/2} + o_d(1)\,.$

Let us emphasize that, for this setting, we do not have a convergence result for SGD as for the model (qf), cf. Theorem 3. However, because of certain analogies between the two models, we expect a similar result to hold for mixtures of Gaussians.

We can now compare the risks achieved within the regimes RF, NT and NN.
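The $\Sigma = I$ formulas of Theorems 5 and 6 are easy to evaluate numerically. The following sketch (our own illustration) uses a diagonal $\Delta$ with entries of order $1/\sqrt{d}$, similar to the choice in the experiments of Section 4, and checks that the NN risk lies below the NT risk for $N < d$ while the two coincide for $N = d$:

```python
import numpy as np

def nt_risk_mg(delta_diag, N):
    """NT risk from Theorem 5 (Sigma = I, isotropic weights), diagonal Delta."""
    delta = np.asarray(delta_diag, dtype=float)
    d = len(delta)
    rho = N / d
    fro2 = np.sum(delta ** 2)
    T = np.sum(delta) ** 2 / (d * fro2)
    kappa = 1.0 - max(1 - rho, 0.0) ** 2 * (1 - T) - max(1 - rho, 0.0) * T
    return 2.0 / (2.0 + kappa * fro2)

def nn_risk_mg(delta_diag, N):
    """NN risk from Theorem 6 (Sigma = I): keep the top min(N, d) singular values."""
    lam = np.sort(np.abs(np.asarray(delta_diag, dtype=float)))[::-1]
    return 2.0 / (2.0 + np.sum(lam[: min(N, len(lam))] ** 2))

# diagonal Delta with entries in {1, 1.5, 2} / sqrt(d), as in the experiments
d = 100
delta = np.array([1.0, 1.5, 2.0] * 34)[:d] / np.sqrt(d)

nt30, nn30 = nt_risk_mg(delta, 30), nn_risk_mg(delta, 30)       # rho = 0.3
nt_full, nn_full = nt_risk_mg(delta, d), nn_risk_mg(delta, d)   # rho = 1
```

At $\rho = 1$ the factor $\kappa$ equals one and the two closed forms agree; below that, NN keeps only the largest entries of $\Delta$ and does strictly better.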
Gathering the results of Theorems 4, 5 and 6 for Σ = I and σ(x) = x² − 1 (using wᵢ ∼ N(0, I/d) for RF and NT), we obtain
$$ R_{M,N}(P_{I,\Delta}) \;\approx\; \begin{cases} \Big(1 + \frac{\rho}{1+2\rho} \cdot \frac{\mathrm{Tr}(\Delta)^2}{2d}\Big)^{-1} & \text{for } M = \mathrm{RF},\\[4pt] \Big(1 + \kappa(\rho,\Delta)\, \|\Delta\|_F^2/2\Big)^{-1} & \text{for } M = \mathrm{NT},\\[4pt] \Big(1 + \sum_{i=1}^{N\wedge d} \lambda_i(\Delta)^2/2\Big)^{-1} & \text{for } M = \mathrm{NN}. \end{cases} \tag{17} $$
We recover a behavior similar to the case of the (qf) model: NN learns the most important directions of Δ, while RF and NT do not. Note that the Bayes error is not achieved in this model.

4 Numerical Experiments

For the experiments illustrated in Figures 1 and 2, we use feature dimension d = 450 and number of hidden units N ∈ {45, ..., 4500}. The NT and NN models are trained with SGD in TensorFlow [1]. We run a total of 2 × 10⁵ SGD steps for each (qf) model and 1.4 × 10⁵ steps for each (mg) model. The SGD batch size is fixed at 100 and the step size is chosen from the grid {0.001, ..., 0.03}; the hyper-parameter that achieves the best fit is used for the figures. The RF models are fitted directly by solving the KKT conditions with 5 × 10⁵ observations. After fitting each model, the test error is evaluated on 10⁴ fresh samples. In our figures, each RF data point corresponds to the test error averaged over 10 models with independent realizations of W.

For the (qf) experiments, we choose B to be diagonal with diagonal elements drawn i.i.d. from the standard exponential distribution with parameter 1. For the (mg) experiments, Δ is also diagonal, with the diagonal elements chosen uniformly from the set {2/√d, 1.5/√d, 1/√d}.
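To make the gap in (17) concrete, the three limiting risks can be evaluated directly. The following sketch is our own illustration (formulas transcribed from (17), with Σ = I and the diagonal Δ of the experiments above); it prints the approximate RF, NT and NN risks at a few width ratios ρ = N/d.

```python
# Illustrative sketch (not code from the paper): evaluate the three limiting
# risks of Eq. (17) for Sigma = I and a diagonal Delta as in the experiments.
import numpy as np

d = 450
lam = np.random.default_rng(1).choice([2.0, 1.5, 1.0], size=d) / np.sqrt(d)
F2 = np.sum(lam**2)            # ||Delta||_F^2
trace2 = np.sum(lam) ** 2      # Tr(Delta)^2
T = trace2 / (d * F2)
lam_desc = np.sort(lam)[::-1]  # eigenvalues of Delta, decreasing

def risk(regime, rho):
    """Approximate prediction risk from Eq. (17) at width ratio rho = N/d."""
    if regime == "RF":
        return 1.0 / (1.0 + rho / (1.0 + 2.0 * rho) * trace2 / (2.0 * d))
    if regime == "NT":
        b = max(1.0 - rho, 0.0)  # (1 - rho)_+
        kappa = 1.0 - b**2 * (1.0 - T) - b * T
        return 1.0 / (1.0 + kappa * F2 / 2.0)
    if regime == "NN":           # NN captures the N largest directions of Delta
        return 1.0 / (1.0 + np.sum(lam_desc[: min(int(rho * d), d)] ** 2) / 2.0)

for rho in (0.1, 0.5, 1.0):
    print(rho, {m: round(risk(m, rho), 3) for m in ("RF", "NT", "NN")})
```

For ρ < 1 the NN risk falls visibly below NT, which in turn is below RF, while at ρ = 1 the NT and NN expressions coincide, in line with the discussion of the under- and over-parameterized regimes.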
Experiments with non-diagonal Δ are presented in the appendix.

While we are only able to provide theory for NN and NT when the activations are quadratic, we have performed extensive experiments examining the behavior of these models with other nonlinearities. These results are reported in the appendix. In general, the phenomena we observe in the case of quadratic activations persist when other activations are used. In particular, the positive gap between NN and NT is still present when N < d.

Acknowledgements

This work was partially supported by grants NSF DMS-1613091, CCF-1714305, IIS-1741162, ONR N00014-18-1-2729, NSF DMS-1418362, and NSF DMS-1407813.

References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.

[2] Ahmed El Alaoui and Michael W. Mahoney. Fast randomized kernel ridge regression with statistical guarantees. In Advances in Neural Information Processing Systems, pages 775–783, 2015.

[3] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. arXiv:1811.03962, 2018.

[4] Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. arXiv:1904.11955, 2019.

[5] Francis Bach. Sharp analysis of low-rank kernel matrix approximations. In Conference on Learning Theory, pages 185–209, 2013.

[6] Francis Bach. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research, 18(1):629–681, 2017.

[7] Francis Bach.
On the equivalence between kernel quadrature rules and random feature expansions. The Journal of Machine Learning Research, 18(1):714–751, 2017.

[8] Lenaic Chizat and Francis Bach. A note on lazy training in supervised differentiable programming. arXiv:1812.07956, 2018.

[9] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.

[10] Simon S. Du and Jason D. Lee. On the power of over-parametrization in neural networks with quadratic activation. arXiv:1803.01206, 2018.

[11] Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv:1811.03804, 2018.

[12] Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv:1810.02054, 2018.

[13] Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In Proceedings of the 34th International Conference on Machine Learning, pages 1233–1242, 2017.

[14] Rong Ge, Jason D. Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. arXiv:1711.00501, 2017.

[15] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Linearized two-layers neural networks in high dimension. arXiv:1904.12191, 2019.

[16] Benjamin Haeffele, Eric Young, and Rene Vidal. Structured low-rank matrix factorization: Optimality, algorithm, and applications to image processing. In International Conference on Machine Learning, pages 2007–2015, 2014.

[17] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. arXiv:1903.08560, 2019.

[18] Arthur Jacot, Franck Gabriel, and Clément Hongler.
Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8571–8580, 2018.

[19] Adam Klivans and Pravesh Kothari. Embedding hard learning problems into Gaussian space. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2014). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2014.

[20] Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv:1902.06720, 2019.

[21] Song Mei, Yu Bai, and Andrea Montanari. The landscape of empirical risk for nonconvex losses. The Annals of Statistics, 46(6A):2747–2774, 2018.

[22] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.

[23] Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems, pages 3215–3225, 2017.

[24] Ohad Shamir. Distribution-specific hardness of learning neural networks. The Journal of Machine Learning Research, 19(1):1135–1163, 2018.

[25] Mahdi Soltanolkotabi, Adel Javanmard, and Jason D. Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory, 65(2):742–769, 2019.

[26] Gilad Yehudai and Ohad Shamir. On the power and limitations of random features for understanding neural networks. arXiv:1904.00687, 2019.

[27] Kai Zhong, Zhao Song, Prateek Jain, Peter L. Bartlett, and Inderjit S. Dhillon. Recovery guarantees for one-hidden-layer neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 4140–4149. JMLR.
org, 2017.

[28] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv:1811.08888, 2018.