{"title": "Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup", "book": "Advances in Neural Information Processing Systems", "page_first": 6981, "page_last": 6991, "abstract": "Deep neural networks achieve stellar generalisation even when they have enough\nparameters to easily fit all their training data. We study this phenomenon by\nanalysing the dynamics and the performance of over-parameterised two-layer\nneural networks in the teacher-student setup, where one network, the student,\nis trained on data generated by another network, called the teacher. We show\nhow the dynamics of stochastic gradient descent (SGD) is captured by a set of\ndifferential equations and prove that this description is asymptotically exact\nin the limit of large inputs. Using this framework, we calculate the final\ngeneralisation error of student networks that have more parameters than their\nteachers. We find that the final generalisation error of the student increases\nwith network size when training only the first layer, but stays constant or\neven decreases with size when training both layers. We show that these\ndifferent behaviours have their root in the different solutions SGD finds for\ndifferent activation functions. Our results indicate that achieving good\ngeneralisation in neural networks goes beyond the properties of SGD alone and\ndepends on the interplay of at least the algorithm, the model architecture,\nand the data set.", "full_text": "Dynamics of stochastic gradient descent for two-layer\n\nneural networks in the teacher-student setup\n\nSebastian Goldt1, Madhu S. Advani2, Andrew M. 
Saxe3\n\nFlorent Krzakala4, Lenka Zdeborov\u00e11\n\n1 Institut de Physique Th\u00e9orique, CNRS, CEA, Universit\u00e9 Paris-Saclay, Saclay, France\n\n2 Center for Brain Science, Harvard University, Cambridge, MA 02138, USA\n\n3 Department of Experimental Psychology, University of Oxford, Oxford, United Kingdom\n\n4 Laboratoire de Physique Statistique, Sorbonne Universit\u00e9s,\n\nUniversit\u00e9 Pierre et Marie Curie Paris 6, Ecole Normale Sup\u00e9rieure, 75005 Paris, France\n\nAbstract\n\nDeep neural networks achieve stellar generalisation even when they have enough\nparameters to easily \ufb01t all their training data. We study this phenomenon by\nanalysing the dynamics and the performance of over-parameterised two-layer\nneural networks in the teacher-student setup, where one network, the student, is\ntrained on data generated by another network, called the teacher. We show how the\ndynamics of stochastic gradient descent (SGD) is captured by a set of differential\nequations and prove that this description is asymptotically exact in the limit of\nlarge inputs. Using this framework, we calculate the \ufb01nal generalisation error of\nstudent networks that have more parameters than their teachers. We \ufb01nd that the\n\ufb01nal generalisation error of the student increases with network size when training\nonly the \ufb01rst layer, but stays constant or even decreases with size when training\nboth layers. We show that these different behaviours have their root in the different\nsolutions SGD \ufb01nds for different activation functions. Our results indicate that\nachieving good generalisation in neural networks goes beyond the properties of\nSGD alone and depends on the interplay of at least the algorithm, the model\narchitecture, and the data set.\n\nDeep neural networks behind state-of-the-art results in image classi\ufb01cation and other domains\nhave one thing in common: their size. 
In many applications, the free parameters of these models outnumber the samples in their training set by up to two orders of magnitude 1,2. Statistical learning theory suggests that such heavily over-parameterised networks generalise poorly without further regularisation 3–9, yet empirical studies consistently find that increasing the size of networks to the point where they can easily fit their training data and beyond does not impede their ability to generalise well, even without any explicit regularisation 10–12. Resolving this paradox is arguably one of the big challenges in the theory of deep learning.

One tentative explanation for the success of large networks has focused on the properties of stochastic gradient descent (SGD), the algorithm routinely used to train these networks. In particular, it has been proposed that SGD has an implicit regularisation mechanism that ensures that solutions found by SGD generalise well irrespective of the number of parameters involved, for models as diverse as (over-parameterised) neural networks 10,13, logistic regression 14 and matrix factorisation models 15,16.

In this paper, we analyse the dynamics of one-pass (or online) SGD in two-layer neural networks. We focus in particular on the influence of over-parameterisation on the final generalisation error. We use the teacher-student framework 17,18, where a training data set is generated by feeding random inputs through a two-layer neural network with M hidden units called the teacher. Another neural network, the student, is then trained using SGD on that data set. The generalisation error is defined as the mean squared error between teacher and student outputs, averaged over all of input space. We will focus on student networks that have a larger number of hidden units K ≥ M than their teacher.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
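To make the setup concrete, here is a minimal NumPy sketch of a teacher-student pair and a Monte Carlo estimate of the generalisation error. This is an illustrative mock-up, not the paper's reference implementation: all names and parameter values are our own choices, while the Gaussian inputs, the erf activation, and the additive output noise follow the definitions given in Sec. 1.

```python
import numpy as np
from math import erf

# Sigmoidal activation used in the paper, g(x) = erf(x / sqrt(2))
g = np.vectorize(lambda z: erf(z / np.sqrt(2)))

rng = np.random.default_rng(0)
N, M, K, sigma = 500, 2, 4, 0.01  # input dim, teacher/student hidden units, noise level

# Teacher theta* = (v*, w*): fixed random first layer, second layer set to ones
w_star = rng.standard_normal((M, N))
v_star = np.ones(M)

# Over-parameterised student with K >= M hidden units
w = rng.standard_normal((K, N))
v = np.ones(K)

def output(X, first_layer, second_layer):
    """phi(x, theta) = sum_k v_k g(w_k . x / sqrt(N)), for a batch of inputs."""
    return g(X @ first_layer.T / np.sqrt(N)) @ second_layer

# One training example: i.i.d. Gaussian input, noisy teacher label
x = rng.standard_normal((1, N))
y = output(x, w_star, v_star)[0] + sigma * rng.standard_normal()

# Monte Carlo estimate of eps_g = 0.5 <(phi_student - phi_teacher)^2>
X = rng.standard_normal((2000, N))
eps_g = 0.5 * np.mean((output(X, w, v) - output(X, w_star, v_star)) ** 2)
```

At random initialisation the student's error is of order one; training with online SGD, as described next, drives it down towards its asymptotic value.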
This means that the student can express much more complex functions than the teacher function they have to learn; the students are thus over-parameterised with respect to the generative model of the training data in a way that is simple to quantify. We find this definition of over-parameterisation cleaner in our setting than the oft-used comparison of the number of parameters in the model with the number of samples in the training set, which is not well justified for non-linear functions. Furthermore, these two numbers surely cannot fully capture the complexity of the function learned in practical applications.

The teacher-student framework is also interesting in the wake of the need to understand the effectiveness of neural networks and the limitations of the classical approaches to generalisation 11. Traditional approaches to learning and generalisation are data agnostic and seek worst-case type bounds 19. On the other hand, there has been a considerable body of theoretical work calculating the generalisation ability of neural networks for data arising from a probabilistic model, particularly within the framework of statistical mechanics 17,18,20–22. Revisiting and extending the results that have emerged from this perspective is currently experiencing a surge of interest 23–28.

In this work we consider two-layer networks with a large input layer and a finite, but arbitrary, number of hidden neurons. Other limits of two-layer neural networks have received a lot of attention recently. A series of papers 29–32 studied the mean-field limit of two-layer networks, where the number of neurons in the hidden layer is very large, and proved various general properties of SGD based on a description in terms of a limiting partial differential equation.
Another set of works, operating in a different limit, have shown that infinitely wide over-parameterised neural networks trained with gradient-based methods effectively solve a kernel regression 33–38, without any feature learning. Both the mean-field and the kernel regime crucially rely on having an infinite number of nodes in the hidden layer, and the performance of the networks strongly depends on the detailed scaling used 38,39. Furthermore, a very wide hidden layer makes it hard to have a student that is larger than the teacher in a quantifiable way. This leads us to consider the opposite limit of large input dimension and finite number of hidden units.

Our main contributions are as follows:

(i) The dynamics of SGD (online) learning by two-layer neural networks in the teacher-student setup was studied in a series of classic papers 40–44 from the statistical physics community, leading to a heuristic derivation of a set of coupled ordinary differential equations (ODE) that describe the typical time-evolution of the generalisation error. We provide a rigorous foundation of the ODE approach to analysing the generalisation dynamics in the limit of large input size by proving their correctness.

(ii) These works focused on training only the first layer, mainly in the case where the teacher network has the same number of hidden units as the student network, K = M. We generalise their analysis to the case where the student's expressivity is considerably larger than that of the teacher in order to investigate the over-parameterised regime K > M.

(iii) We provide a detailed analysis of the dynamics of learning and of the generalisation when only the first layer is trained. We derive a reduced set of coupled ODE that describes the generalisation dynamics for any K ≥ M and obtain analytical expressions for the asymptotic generalisation error of networks with linear and sigmoidal activation functions.
Crucially, we find that with all other parameters equal, the final generalisation error increases with the size of the student network. In this case, SGD alone thus does not seem to be enough to regularise larger student networks.

(iv) We finally analyse the dynamics when learning both layers. We give an analytical expression for the final generalisation error of sigmoidal networks and find evidence that suggests that SGD finds solutions which amount to performing an effective model average, thus improving the generalisation error upon over-parameterisation. In linear and ReLU networks, we experimentally find that the generalisation error does not change as a function of K when training both layers. However, there exist student networks with better performance that are fixed points of the SGD dynamics, but are not reached when starting SGD from initial conditions with small, random weights.

Crucially, we find this range of different behaviours while keeping the training algorithm (SGD) the same, changing only the activation functions of the networks and the parts of the network that are trained. Our results clearly indicate that the implicit regularisation of neural networks in our setting goes beyond the properties of SGD alone. Instead, a full understanding of the generalisation properties of even very simple neural networks requires taking into account the interplay of at least the algorithm, the network architecture, and the data set used for training, setting up a formidable research programme for the future.

Reproducibility — We have packaged the implementation of our experiments and our ODE integrator into a user-friendly library with example programs at https://github.com/sgoldt/nn2pp.
All plots were generated with these programs, and we give the necessary parameter values beneath each plot.

1 Online learning in teacher-student neural networks

We consider a supervised regression problem with training set D = {(x^µ, y^µ)} with µ = 1, ..., P. The components of the inputs x^µ ∈ R^N are i.i.d. draws from the standard normal distribution N(0, 1). The scalar labels y^µ are given by the output of a network with M hidden units, a non-linear activation function g: R → R and fixed weights θ* = (v* ∈ R^M, w* ∈ R^{M×N}), called the teacher, with an additive output noise ζ^µ ∼ N(0, 1) (see also Fig. 1a):

    y^µ ≡ φ(x^µ, θ*) + σζ^µ,   where   φ(x, θ*) = Σ_{m=1}^{M} v*_m g(w*_m x / √N) = Σ_m v*_m g(ρ_m),   (1)

where w*_m is the mth row of w*, and the local field of the mth teacher node is ρ_m ≡ w*_m x/√N. We will analyse three different network types: sigmoidal with g(x) = erf(x/√2), ReLU with g(x) = max(x, 0), and linear networks where g(x) = x.

A second two-layer network with K hidden units and weights θ = (v ∈ R^K, w ∈ R^{K×N}), called the student, is then trained using SGD on the quadratic training loss E(θ) ∝ Σ_{µ=1}^{P} [φ(x^µ, θ) − y^µ]². We emphasise that the student network may have a larger number of hidden units K ≥ M than the teacher and thus be over-parameterised with respect to the generative model of its training data.

The SGD algorithm defines a Markov process X^µ ≡ [v*, w*, v^µ, w^µ] with update rule given by the coupled SGD recursion relations

    w^{µ+1}_k = w^µ_k − (η_w/√N) v^µ_k g′(λ^µ_k) Δ^µ x^µ,   (2)
    v^{µ+1}_k = v^µ_k − (η_v/N) g(λ^µ_k) Δ^µ.   (3)

We can choose different learning rates η_v and η_w for the two layers and denote by g′(λ^µ_k) the derivative of the activation function evaluated at the local field of the student's kth hidden unit λ^µ_k ≡ w_k x^µ/√N, and we defined the error term Δ^µ ≡ Σ_k v^µ_k g(λ^µ_k) − Σ_m v*_m g(ρ^µ_m) − σζ^µ. We will use the indices i, j, k, ... to refer to student nodes, and n, m, ... to denote teacher nodes. We take initial weights at random from N(0, 1) for sigmoidal networks, while initial weights have variance 1/√N for ReLU and linear networks.

The key quantity in our approach is the generalisation error of the student with respect to the teacher:

    ϵ_g(θ, θ*) ≡ ½ ⟨[φ(x, θ) − φ(x, θ*)]²⟩,   (4)

where the angled brackets ⟨·⟩ denote an average over the input distribution. We can make progress by realising that ϵ_g(θ, θ*) can be expressed as a function of a set of macroscopic variables, called order parameters in statistical physics 21,40,41,

    Q^µ_{ik} ≡ w^µ_i w^µ_k / N,   R^µ_{in} ≡ w^µ_i w*_n / N,   and   T_{nm} ≡ w*_n w*_m / N,   (5)

together with the second-layer weights v* and v^µ. Intuitively, the teacher-student overlaps R^µ = [R^µ_{in}] measure the similarity between the weights of the ith student node and the nth teacher node. The matrix Q_{ik} quantifies the overlap of the weights of different student nodes with each other, and the corresponding overlaps of the teacher nodes are collected in the matrix T_{nm}.
We will find it convenient to collect all order parameters in a single vector

    m^µ ≡ (R^µ, Q^µ, T, v*, v^µ),   (6)

and we write the full expression for ϵ_g(m^µ) in the SM, Eq. (S31).

In a series of classic papers, Biehl, Schwarze, Saad, Solla and Riegler 40–44 derived a closed set of ordinary differential equations for the time evolution of the order parameters m (see SM Sec. B). Together with the expression for the generalisation error ϵ_g(m^µ), these equations give a complete description of the generalisation dynamics of the student, which they analysed for the special case K = M when only the first layer is trained 42,44. Our first contribution is to provide a rigorous foundation for these results under the following assumptions:

(A1) Both the sequences x^µ and ζ^µ, µ = 1, 2, ..., are i.i.d. random variables; x^µ is drawn from a normal distribution with mean 0 and covariance matrix I_N, while ζ^µ is a Gaussian random variable with mean zero and unity variance;

(A2) The function g(x) is bounded and its derivatives up to and including the second order exist and are bounded, too;

(A3) The initial macroscopic state m^0 is deterministic and bounded by a constant;

(A4) The constants σ, K, M, η_w and η_v are all finite.

The correctness of the ODE description is then established by the following theorem:

Theorem 1.1. Choose T > 0 and define α ≡ µ/N. Under assumptions (A1)–(A4), and for any α > 0, the macroscopic state m^µ satisfies

    max_{0 ≤ µ ≤ NT} E ||m^µ − m(α)|| ≤ C(T)/√N,   (7)

where C(T) is a constant depending on T, but not on N, and m(α) is the unique solution of the ODE

    dm(α)/dα = f(m(α))   (8)

with initial condition m^0. In particular, we have

    dR_in/dα ≡ f_R(m(α)) = −η_w v_i ⟨Δ g′(λ_i) ρ_n⟩,   (9a)
    dQ_ik/dα ≡ f_Q(m(α)) = −η_w v_i ⟨Δ g′(λ_i) λ_k⟩ − η_w v_k ⟨Δ g′(λ_k) λ_i⟩
                           + η_w² v_i v_k ⟨Δ² g′(λ_i) g′(λ_k)⟩ + η_w² v_i v_k σ² ⟨g′(λ_i) g′(λ_k)⟩,   (9b)
    dv_i/dα ≡ f_v(m(α)) = −η_v ⟨Δ g(λ_i)⟩,   (9c)

where all f(m(α)) are uniformly Lipschitz continuous in m(α). We are able to close the equations because we can express the averages in Eq. (9) in terms of only m(α).

We prove Theorem 1.1 using the theory of convergence of stochastic processes and a coupling trick introduced recently by Wang et al. 45 in Sec. A of the SM. The content of the theorem is illustrated in Fig. 1b, where we plot ϵ_g(α) obtained by numerically integrating (9) (solid) and from a single run of SGD (2) (crosses) for sigmoidal students and varying K, which are in very good agreement.

Given a set of non-linear, coupled ODE such as Eqns. (9), finding the asymptotic fixed points analytically to compute the generalisation error would seem to be impossible. In the following, we will therefore focus on analysing the asymptotic fixed points found by numerically integrating the equations of motion. The form of these fixed points will reveal a drastically different dependence of the test error on the over-parameterisation of neural networks with different activation functions in the different setups we consider, despite them all being trained by SGD. This highlights the fact that good generalisation goes beyond the properties of just the algorithm.
Second, knowledge of these fixed points allows us to make analytical and quantitative predictions for the asymptotic performance of the networks which agree well with experiments. We also note that several recent theorems 29–31 about the global convergence of SGD do not apply in our setting because we have a finite number of hidden units.

Figure 1: The analytical description of the generalisation dynamics of sigmoidal networks matches experiments. (a) We consider two-layer neural networks with a very large input layer. (b) We plot the learning dynamics ϵ_g(α) obtained by integration of the ODEs (9) (solid) and from a single run of SGD (2) (crosses) for students with different numbers of hidden units K. The insets show the values of the teacher-student overlaps R_in (5) for a student with K = 4 at the two times indicated by the arrows. N = 784, M = 4, η = 0.2.

2 Asymptotic generalisation error of Soft Committee Machines

We will first study networks where the second-layer weights are fixed at v*_m = v_k = 1. These networks are called a Soft Committee Machine (SCM) in the statistical physics literature 18,27,40–42,44. One notable feature of ϵ_g(α) in SCMs is the existence of a long plateau with sub-optimal generalisation error during training. During this period, all student nodes have roughly the same overlap with all the teacher nodes, R_in = const. (left inset in Fig. 1b). As training continues, the student nodes "specialise" and each of them becomes strongly correlated with a single teacher node (right inset), leading to a sharp decrease in ϵ_g. This effect is well-known for both batch and online learning 18 and will be key for our analysis.

Let us now use the equations of motion (9) to analyse the asymptotic generalisation error ϵ*_g of neural networks after training has converged, and in particular its scaling with L = K − M. Our first contribution is to reduce the remaining K(K + M) equations of motion to a set of eight coupled differential equations for any combination of K and M in Sec. C. This enables us to obtain a closed-form expression for ϵ*_g as follows.

In the absence of output noise (σ = 0), the generalisation error of a student with K ≥ M will asymptotically tend to zero as α → ∞. On the level of the order parameters, this corresponds to reaching a stable fixed point of (9) with ϵ_g = 0. In the presence of small output noise σ > 0, this fixed point becomes unstable and the order parameters instead converge to another, nearby fixed point m* with ϵ_g(m*) > 0. The values of the order parameters at that fixed point can be obtained by perturbing Eqns. (9) to first order in σ, and the corresponding generalisation error ϵ_g(m*) turns out to be in excellent agreement with the generalisation error obtained when training a neural network using (2) from random initial conditions, which we show in Fig. 2a.

Sigmoidal networks. We have performed this calculation for teacher and student networks with g(x) = erf(x/√2). We relegate the details to Sec. C.2, and content ourselves here to state the asymptotic value of the generalisation error to first order in σ²,

    ϵ*_g = (σ²η / 2π) f(M, L, η) + O(σ³),   (10)

where f(M, L, η) is a lengthy rational function of its variables. We plot our result in Fig. 2a together with the final generalisation error obtained in a single run of SGD (2) for a neural network with initial weights drawn i.i.d. from N(0, 1) and find excellent agreement, which we confirmed for a range of values for η, σ, and L.

One notable feature of Fig.
2a is that with all else being equal, SGD alone fails to regularise the student networks of increasing size in our setup, instead yielding students whose generalisation error increases linearly with L. One might be tempted to mitigate this effect by simultaneously decreasing the learning rate η for larger students. However, lowering the learning rate incurs longer training times, which requires more data for online learning. This trade-off is also found in statistical learning theory, where models with more parameters (higher L) and thus a higher complexity class (e.g. VC dimension or Rademacher complexity 4) generalise just as well as smaller ones when given more data. In practice, however, more data might not be readily available, and we show in Fig. S2 of the SM that even when choosing η = 1/K, the generalisation error still increases with L before plateauing at a constant value.

Figure 2: The asymptotic generalisation error of Soft Committee Machines increases with the network size. N = 784, η = 0.05, σ = 0.01. (a) Our theoretical predictions for ϵ*_g/σ² for sigmoidal (solid) and linear (dashed) networks, Eqns. (10) and (12), agree perfectly with the result obtained from a single run of SGD (2) starting from random initial weights (crosses). (b) The final overlap matrices Q and R (5) at the end of an experiment with M = 2, K = 5. Networks with sigmoidal activation function (top) show clear signs of specialisation as described in Sec. 2. ReLU networks (bottom) instead converge to solutions where all of the student's nodes have finite overlap with teacher nodes.

We can gain some intuition for the scaling of ϵ*_g by considering the asymptotic overlap matrices Q and R shown in the left half of Fig. 2b.
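For reference, the order parameters of Eq. (5) that Fig. 2b visualises are simply rescaled Gram matrices of the first-layer weights. A minimal sketch, with random weights standing in for trained ones and all variable names our own:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, K = 784, 2, 5  # dimensions matching the experiment in Fig. 2b

w_star = rng.standard_normal((M, N))  # teacher first-layer weights
w = rng.standard_normal((K, N))       # student first-layer weights

# Order parameters of Eq. (5): weight overlaps rescaled by the input dimension N
Q = w @ w.T / N            # student-student overlaps, K x K
R = w @ w_star.T / N       # teacher-student overlaps, K x M
T = w_star @ w_star.T / N  # teacher-teacher overlaps, M x M

# A specialised sigmoidal student (top of Fig. 2b) would show one dominant entry
# per specialised row of R; at initialisation, the cross-overlaps are O(1/sqrt(N)).
```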
In the over-parameterised case, L = K − M student nodes are effectively trying to specialise to teacher nodes which do not exist, or equivalently, have weights zero. These L student nodes do not carry any information about the teacher's output, but they pick up fluctuations from the output noise and thus increase ϵ*_g. This intuition is borne out by an expansion of ϵ*_g in the limit of small learning rate η, which yields

    ϵ*_g = (σ²η / 2π) (L + M/√3) + O(η²),   (11)

which is indeed the sum of the error of M independent hidden units that are specialised to a single teacher hidden unit, and L = K − M superfluous units contributing each the error of a hidden unit that is "learning" from a hidden unit with zero weights w*_m = 0 (see also Sec. D of the SM).

Linear networks. Two possible explanations for the scaling ϵ*_g ∼ L in sigmoidal networks may be the specialisation of the hidden units or the fact that teacher and student network can implement functions of different range if K ≠ M. To test these hypotheses, we calculated ϵ*_g for linear neural networks 46,47 with g(x) = x. Linear networks lack a specialisation transition 27 and their output range is set by the magnitude of their weights, rather than their number of hidden units. Following the same steps as before, a perturbative calculation in the limit of small noise variance σ² yields

    ϵ*_g = ησ²(L + M) / (4 − 2η(L + M)) + O(σ³).   (12)

This result is again in perfect agreement with experiments, as we demonstrate in Fig. 2a. In the limit of small learning rates η, Eq.
(12) simplifies to yield the same scaling as for sigmoidal networks,

    ϵ*_g = ¼ ησ²(L + M) + O(η²).   (13)

This shows that the scaling ϵ*_g ∼ L is not just a consequence of either specialisation or the mismatched range of the networks' output functions. The optimal number of hidden units for linear networks is K = 1 for all M, because linear networks implement an effective linear transformation with an effective matrix W = Σ_k w_k. Adding hidden units to a linear network hence does not augment the class of functions it can implement, but it adds redundant parameters which pick up fluctuations from the teacher's output noise, increasing ϵ_g.

Figure 3: The performance of sigmoidal networks improves with network size when training both layers with SGD. (a) Generalisation dynamics observed experimentally for students with increasing K, with all other parameters being equal. (N = 500, M = 2, η = 0.05, σ = 0.01, v* = 4). (b) Overlap matrices Q, R, and second-layer weights v_k of the student at the end of the run with K = 5 shown in (a). (c) Theoretical prediction for ϵ*_g (solid) against ϵ*_g observed after integration of the ODE (9) until convergence (crosses) (σ = 0.01, η = 0.2, v* = 2).

ReLU networks. The analytical calculation of ϵ*_g described above poses some additional technical challenges for ReLU networks, so we resort to experiments to investigate this case. We found that the asymptotic generalisation error of a ReLU student learning from a ReLU teacher has the same scaling as the one we found analytically for networks with sigmoidal and linear activation functions: ϵ*_g ∼ ησ²L (see Fig. S3). Looking at the final overlap matrices Q and R for ReLU networks in the bottom half of Fig.
2b, we see that instead of the one-to-one specialisation of sigmoidal networks, all student nodes have a finite overlap with some teacher node. This is a consequence of the fact that it is much simpler to re-express the sum of M ReLU units with K ≠ M ReLU units. However, there are still a lot of redundant degrees of freedom in the student, which all pick up fluctuations from the teacher's output noise and increase ϵ*_g.

Discussion. The key result of this section has been that the generalisation error of SCMs scales as

    ϵ*_g ∼ ησ²L.   (14)

Before moving on to the full two-layer network, we discuss a number of experiments that we performed to check the robustness of this result (details can be found in Sec. G of the SM). A standard regularisation method is adding weight decay to the SGD updates (2). However, we did not find a scenario in our experiments where weight decay improved the performance of a student with L > 0. We also made sure that our results persist when performing SGD with mini-batches. We investigated the impact of higher-order correlations in the inputs by replacing Gaussian inputs with MNIST images, with all other aspects of our setup the same, and found the same ϵ_g-L curve as for Gaussian inputs. Finally, we analysed the impact of having a finite training set. The behaviour of linear networks and of non-linear networks with large but finite training sets did not change qualitatively. However, as we reduced the size of the training set, we found that the lowest asymptotic generalisation error was obtained with networks that have K > M.

3 Training both layers: Asymptotic generalisation error of a neural network

We now study the performance of two-layer neural networks when both layers are trained according to the SGD updates (2) and (3).
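The coupled updates (2) and (3) can be sketched directly in NumPy. This is an illustrative implementation under the paper's definitions, assuming the erf activation; the function and variable names are our own:

```python
import numpy as np
from math import erf

g = np.vectorize(lambda z: erf(z / np.sqrt(2)))  # g(x) = erf(x / sqrt(2))

def g_prime(z):
    # derivative of erf(z / sqrt(2)) is sqrt(2/pi) * exp(-z^2 / 2)
    return np.sqrt(2.0 / np.pi) * np.exp(-z ** 2 / 2)

def sgd_step(w, v, w_star, v_star, x, noise, eta_w, eta_v, sigma):
    """One online SGD step on both layers, following Eqs. (2) and (3)."""
    N = len(x)
    lam = w @ x / np.sqrt(N)       # student local fields lambda_k
    rho = w_star @ x / np.sqrt(N)  # teacher local fields rho_m
    # error term Delta = student output - teacher output - noise
    delta = v @ g(lam) - v_star @ g(rho) - sigma * noise
    w_new = w - (eta_w / np.sqrt(N)) * np.outer(v * g_prime(lam) * delta, x)
    v_new = v - (eta_v / N) * g(lam) * delta
    return w_new, v_new

# One step on a fresh (online) example
rng = np.random.default_rng(0)
N, M, K = 200, 2, 3
w_star, v_star = rng.standard_normal((M, N)), np.ones(M)
w, v = rng.standard_normal((K, N)), np.ones(K)
x, z = rng.standard_normal(N), rng.standard_normal()
w2, v2 = sgd_step(w, v, w_star, v_star, x, z, eta_w=0.2, eta_v=0.2, sigma=0.01)
```

Iterating this step over a stream of fresh Gaussian examples reproduces the one-pass dynamics whose typical behaviour the ODEs (9) describe.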
We set all the teacher weights equal to a constant value, v*_m = v*, to ensure comparability between experiments. However, we train all K second-layer weights of the student independently and do not rely on the fact that all second-layer teacher weights have the same value. Note that learning the second layer is not needed from the point of view of statistical learning: the networks from the previous section are already expressive enough to capture the teacher, and we are thus slightly increasing the over-parameterisation even further. Yet, we will see that the generalisation properties will be significantly enhanced.

Sigmoidal networks. We plot the generalisation dynamics of students with increasing K trained on a teacher with M = 2 in Fig. 3a. Our first observation is that increasing the student size K ≥ M decreases the asymptotic generalisation error ϵ*_g, with all other parameters being equal, in stark contrast to the SCMs of the previous section.

A look at the order parameters after convergence in the experiments from Fig. 3a reveals the intriguing pattern of specialisation of the student's hidden units behind this behaviour, shown for K = 5 in Fig. 3b. First, note that all the hidden units of the student have non-negligible weights (Q_ii > 0). Two student nodes (k = 1, 2) have specialised to the first teacher node, i.e. their weights are very close to the weights of the first teacher node (R_10 ≈ R_20 ≈ 0.85). The corresponding second-layer weights approximately fulfil v_1 + v_2 ≈ v*.
Summing the output of these two student hidden units is thus approximately equivalent to an empirical average of two estimates of the output of the teacher node. The remaining three student nodes all specialised to the second teacher node, and their outgoing weights approximately sum to v*. This pattern suggests that SGD has found a set of weights for both layers where the student's output is a weighted average of several estimates of the output of the teacher's nodes. We call this the denoising solution and note that it resembles the solutions found in the mean-field limit of an infinite hidden layer 29,31, where the neurons become redundant and follow a distribution dynamics (in our case, a simple one with few peaks, as e.g. Fig. 1 in 31).

We confirmed this intuition by using an ansatz for the order parameters that corresponds to a denoising solution to solve the equations of motion (9) perturbatively in the limit of small noise to calculate ϵ*_g for sigmoidal networks after training both layers, similarly to the approach in Sec. 2. While this approach can be extended to any K and M, we focused on the case where K = ZM to obtain manageable expressions; see Sec. E of the SM for details on the derivation. While the final expression is again too long to be given here, we plot it with solid lines in Fig. 3c. The crosses in the same plot are the asymptotic generalisation error obtained by integration of the ODE (9) starting from random initial conditions, and show very good agreement.

While our result holds for any M, we note from Fig. 3c that the curves for different M are qualitatively similar.
We \ufb01nd a particular simple result for M = 1 in the limit of small learning rates, where:\n\n\u2217\ng =\n\n\u0001\n\n\u03b7(\u03c3v\u2217)2\n\u221a\n3K\u03c0\n2\n\n+ O(\u03b7\u03c32) .\n\n(15)\n\nThis result should be contrasted with the \u0001g \u223c K behaviour found for SCM.\nExperimentally, we robustly observed that training both layers of the network yields better per-\nformance than training only the \ufb01rst layer with the second layer weights \ufb01xed to v\u2217. However,\nconvergence to the denoising solution can be dif\ufb01cult for large students which might get stuck on a\nlong plateau where their nodes are not evenly distributed among the teacher nodes. While it is easy to\ncheck that such a network has a higher value of \u0001g than the denoising solution, the difference is small,\nand hence the driving force that pushes the student out of the corresponding plateaus is small, too.\nThese observations demonstrate that in our setup, SGD does not always \ufb01nd the solution with the\nlowest generalisation error in \ufb01nite time.\n\nReLU and linear networks. We found experimentally that \u0001\u2217\ng remains constant with increasing K\nin ReLU and in linear networks when training both layers. We plot a typical learning curve in green\nfor linear networks in Fig. 4, but note that the \ufb01gure shows qualitatively similar features for ReLU\nnetworks (Fig. S4). This behaviour was also observed in linear networks trained by batch gradient\ndescent, starting from small initial weights 48. While this scaling of \u0001\u2217\ng with K is an improvement\nover its increase with K for the SCM, (blue curve), this is not the 1/K decay that we observed for\nsigmoidal networks. A possible explanation is the lack of specialisation in linear and ReLU networks\n(see Sec. 2), without which the denoising solution found in sigmoidal networks is not possible. 
We also considered normalised SCMs, where we train only the first layer and fix the second-layer weights at v*_m = 1/M and v_k = 1/K. The asymptotic error of the normalised SCM decreases with K (orange curve in Fig. 4), because the second-layer weights v_k = 1/K effectively reduce the learning rate, as can easily be seen from the SGD updates (2), and we know from our analysis of linear SCMs in Sec. 2 that ε_g ∼ η. In SM Sec. F we show analytically how an imbalance in the norms of the first- and second-layer weights can lead to a larger effective learning rate. Normalised SCMs also beat the performance of students where we trained both layers, starting from small initial weights in both cases. This is surprising, because we checked experimentally that the weights of a normalised SCM after training are a fixed point of the SGD dynamics when training both layers. However, we confirmed experimentally that SGD does not find this fixed point when starting from random initial weights.

Discussion. The qualitative difference between training both or only the first layer of neural networks is particularly striking for linear networks, where fixing one layer does not change the class of functions the model can implement, but makes a dramatic difference to their asymptotic performance. This observation highlights two important points: first, the performance of a network is not just determined by the number of additional parameters, but also by how the additional parameters are arranged in the model. Second, the non-linear dynamics of SGD means that changing which weights are trainable can alter the training dynamics in unexpected ways.
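For linear networks, the effective-learning-rate effect of fixed second-layer weights can be made explicit. A sketch, assuming the standard online form of the first-layer update; the exact prefactors of update (2) are a convention assumed here:

```latex
% Linear student: \hat{y} = u \cdot x / \sqrt{N} with u = \sum_k v_k w_k.
\Delta w_k = -\frac{\eta}{\sqrt{N}}\, v_k \,(\hat{y} - y)\, x
\quad\Longrightarrow\quad
\Delta u = \sum_k v_k \,\Delta w_k
         = -\frac{\eta \sum_k v_k^2}{\sqrt{N}}\,(\hat{y} - y)\, x .
```

With fixed v_k = 1/K this gives an effective learning rate η_eff = η Σ_k v_k² = η/K, and since ε_g ∼ η for the linear SCM, the asymptotic error of the normalised SCM decreases with K, as in Fig. 4.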
We saw this for two-layer linear networks, where SGD did not find the optimal fixed point, and in the non-linear sigmoidal networks, where training the second layer allowed the student to decrease its final error with every additional hidden unit instead of increasing it as in the SCM.

Figure 4: Asymptotic performance of linear two-layer networks. Error bars indicate one standard deviation over five runs. Parameters: N = 100, M = 4, v* = 1, η = 0.01, σ = 0.01.

Acknowledgements

SG and LZ acknowledge funding from the ERC under the European Union's Horizon 2020 Research and Innovation Programme Grant Agreement 714608-SMiLe. MA thanks the Swartz Program in Theoretical Neuroscience at Harvard University for support. AS acknowledges funding by the European Research Council, grant 725937 NEUROABSTRACTION. FK acknowledges support from "Chaire de recherche sur les modèles et sciences des données", Fondation CFM pour la Recherche-ENS, and from the French National Research Agency (ANR) grant PAIL.

References

[1] Y. LeCun, Y. Bengio, and G.E. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[2] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In International Conference on Learning Representations, 2015.

[3] P.L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(3):463–482, 2003.

[4] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2012.

[5] B. Neyshabur, R. Tomioka, and N. Srebro. Norm-Based Capacity Control in Neural Networks. In Conference on Learning Theory, 2015.

[6] N. Golowich, A. Rakhlin, and O. Shamir. Size-independent sample complexity of neural networks. Information and Inference: A Journal of the IMA, 2019.

[7] G.K. Dziugaite and D.M. Roy.
Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, 2017.

[8] S. Arora, R. Ge, B. Neyshabur, and Y. Zhang. Stronger generalization bounds for deep nets via a compression approach. In 35th International Conference on Machine Learning, ICML 2018, pages 390–418, 2018.

[9] Z. Allen-Zhu, Y. Li, and Y. Liang. Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers. arXiv:1811.04918, 2018.

[10] B. Neyshabur, R. Tomioka, and N. Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. In ICLR, 2015.

[11] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.

[12] D. Arpit, S. Jastrzębski, M.S. Kanwal, T. Maharaj, A. Fischer, A. Courville, and Y. Bengio. A Closer Look at Memorization in Deep Networks. In Proceedings of the 34th International Conference on Machine Learning, 2017.

[13] P. Chaudhari and S. Soatto. On the inductive bias of stochastic gradient descent. In International Conference on Learning Representations, 2018.

[14] D. Soudry, E. Hoffer, and N. Srebro. The implicit bias of gradient descent on separable data. In International Conference on Learning Representations, 2018.

[15] S. Gunasekar, B. Woodworth, S. Bhojanapalli, B. Neyshabur, and N. Srebro. Implicit Regularization in Matrix Factorization. In Advances in Neural Information Processing Systems 30, pages 6151–6159, 2017.

[16] Y. Li, T. Ma, and H. Zhang. Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations. In Conference on Learning Theory, pages 2–47, 2018.

[17] H.S. Seung, H. Sompolinsky, and N. Tishby.
Statistical mechanics of learning from examples. Physical Review A, 45(8):6056–6091, 1992.

[18] A. Engel and C. Van den Broeck. Statistical Mechanics of Learning. Cambridge University Press, 2001.

[19] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.

[20] E. Gardner and B. Derrida. Three unfinished works on the optimal storage capacity of networks. Journal of Physics A: Mathematical and General, 22(12):1983–1994, 1989.

[21] W. Kinzel and P. Ruján. Improving a Network Generalization Ability by Selecting Examples. EPL (Europhysics Letters), 13(5):473–477, 1990.

[22] T.L.H. Watkin, A. Rau, and M. Biehl. The statistical mechanics of learning a rule. Reviews of Modern Physics, 65(2):499–556, 1993.

[23] L. Zdeborová and F. Krzakala. Statistical physics of inference: thresholds and algorithms. Adv. Phys., 65(5):453–552, 2016.

[24] M.S. Advani and S. Ganguli. Statistical mechanics of optimal convex inference in high dimensions. Physical Review X, 6(3):1–16, 2016.

[25] P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina. Entropy-SGD: Biasing Gradient Descent Into Wide Valleys. In ICLR, 2017.

[26] M.S. Advani and A.M. Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv:1710.03667, 2017.

[27] B. Aubin, A. Maillard, J. Barbier, F. Krzakala, N. Macris, and L. Zdeborová. The committee machine: Computational to statistical gaps in learning a two-layers neural network. In Advances in Neural Information Processing Systems 31, pages 3227–3238, 2018.

[28] M. Baity-Jesi, L. Sagun, M. Geiger, S. Spigler, G.B. Arous, C. Cammarota, Y. LeCun, M. Wyart, and G. Biroli. Comparing Dynamics: Deep Neural Networks versus Glassy Systems. In Proceedings of the 35th International Conference on Machine Learning, 2018.

[29] S. Mei, A.
Montanari, and P. Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.

[30] G.M. Rotskoff and E. Vanden-Eijnden. Parameters as interacting particles: long time convergence and asymptotic error scaling of neural networks. In Advances in Neural Information Processing Systems 31, pages 7146–7155, 2018.

[31] L. Chizat and F. Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in Neural Information Processing Systems 31, pages 3040–3050, 2018.

[32] J. Sirignano and K. Spiliopoulos. Mean field analysis of neural networks: A central limit theorem. Stochastic Processes and their Applications, 2019.

[33] A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems 31, pages 8571–8580, 2018.

[34] S.S. Du, X. Zhai, B. Poczos, and A. Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2019.

[35] Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018.

[36] Y. Li and Y. Liang. Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data. In Advances in Neural Information Processing Systems 31, 2018.

[37] D. Zou, Y. Cao, D. Zhou, and Q. Gu. Stochastic gradient descent optimizes over-parameterized deep relu networks. Machine Learning, pages 1–26, 2019.

[38] L. Chizat, E. Oyallon, and F. Bach. On lazy training in differentiable programming. In Advances in Neural Information Processing Systems 32, 2019.

[39] S. Mei, T. Misiakiewicz, and A. Montanari.
Mean-\ufb01eld theory of two-layers neural networks:\n\ndimension-free bounds and kernel limit. arXiv preprint arXiv:1902.06015, 2019.\n\n[40] M. Biehl and H. Schwarze. Learning by on-line gradient descent. J. Phys. A. Math. Gen.,\n\n28(3):643\u2013656, 1995.\n\n[41] D. Saad and S.A. Solla. Exact Solution for On-Line Learning in Multilayer Neural Networks.\n\nPhys. Rev. Lett., 74(21):4337\u20134340, 1995.\n\n[42] D. Saad and S.A. Solla. On-line learning in soft committee machines. Phys. Rev. E, 52(4):4225\u2013\n\n4243, 1995.\n\n[43] P. Riegler and M. Biehl. On-line backpropagation in two-layered neural networks. Journal of\n\nPhysics A: Mathematical and General, 28(20), 1995.\n\n[44] D. Saad and S.A. Solla. Learning with Noise and Regularizers Multilayer Neural Networks. In\n\nAdvances in Neural Information Processing Systems 9, pages 260\u2013266, 1997.\n\n[45] C. Wang, Hong Hu, and Yue M. Lu. A Solvable High-Dimensional Model of GAN.\n\narXiv:1805.08349, 2018.\n\n[46] A. Krogh and J. A. Hertz. Generalization in a linear perceptron in the presence of noise. Journal\n\nof Physics A: Mathematical and General, 25(5):1135\u20131147, 1992.\n\n[47] A.M. Saxe, James L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of\n\nlearning in deep linear neural networks. In ICLR, 2014.\n\n[48] A.K. Lampinen and S. Ganguli. An analytic theory of generalization dynamics and transfer\nlearning in deep linear networks. 
In International Conference on Learning Representations, 2019.