{"title": "Nonlinear random matrix theory for deep learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2637, "page_last": 2646, "abstract": "Neural network configurations with random weights play an important role in the analysis of deep learning. They define the initial loss landscape and are closely related to kernel and random feature methods. Despite the fact that these networks are built out of random matrices, the vast and powerful machinery of random matrix theory has so far found limited success in studying them. A main obstacle in this direction is that neural networks are nonlinear, which prevents the straightforward utilization of many of the existing mathematical results. In this work, we open the door for direct applications of random matrix theory to deep learning by demonstrating that the pointwise nonlinearities typically applied in neural networks can be incorporated into a standard method of proof in random matrix theory known as the moments method. The test case for our study is the Gram matrix $Y^TY$, $Y=f(WX)$, where $W$ is a random weight matrix, $X$ is a random data matrix, and $f$ is a pointwise nonlinear activation function. We derive an explicit representation for the trace of the resolvent of this matrix, which defines its limiting spectral distribution. We apply these results to the computation of the asymptotic performance of single-layer random feature methods on a memorization task and to the analysis of the eigenvalues of the data covariance matrix as it propagates through a neural network. 
As a byproduct of our analysis, we identify an intriguing new class of activation functions with favorable properties.", "full_text": "Nonlinear random matrix theory for deep learning

Jeffrey Pennington
Google Brain
jpennin@google.com

Pratik Worah
Google Research
pworah@google.com

Abstract

Neural network configurations with random weights play an important role in the analysis of deep learning. They define the initial loss landscape and are closely related to kernel and random feature methods. Despite the fact that these networks are built out of random matrices, the vast and powerful machinery of random matrix theory has so far found limited success in studying them. A main obstacle in this direction is that neural networks are nonlinear, which prevents the straightforward utilization of many of the existing mathematical results. In this work, we open the door for direct applications of random matrix theory to deep learning by demonstrating that the pointwise nonlinearities typically applied in neural networks can be incorporated into a standard method of proof in random matrix theory known as the moments method. The test case for our study is the Gram matrix $Y^T Y$, $Y = f(WX)$, where $W$ is a random weight matrix, $X$ is a random data matrix, and $f$ is a pointwise nonlinear activation function. We derive an explicit representation for the trace of the resolvent of this matrix, which defines its limiting spectral distribution. We apply these results to the computation of the asymptotic performance of single-layer random feature networks on a memorization task and to the analysis of the eigenvalues of the data covariance matrix as it propagates through a neural network.
As a byproduct of our analysis, we identify an intriguing new class of activation functions with favorable properties.

1 Introduction

The list of successful applications of deep learning is growing at a staggering rate. Image recognition (Krizhevsky et al., 2012), audio synthesis (Oord et al., 2016), translation (Wu et al., 2016), and speech recognition (Hinton et al., 2012) are just a few of the recent achievements. Our theoretical understanding of deep learning, on the other hand, has progressed at a more modest pace. A central difficulty in extending our understanding stems from the complexity of neural network loss surfaces, which are highly non-convex functions, often of millions or even billions (Shazeer et al., 2017) of parameters.

In the physical sciences, progress in understanding large complex systems has often come by approximating their constituents with random variables; for example, statistical physics and thermodynamics are based in this paradigm. Since modern neural networks are undeniably large complex systems, it is natural to consider what insights can be gained by approximating their parameters with random variables. Moreover, such random configurations play at least two privileged roles in neural networks: they define the initial loss surface for optimization, and they are closely related to random feature and kernel methods. Therefore it is not surprising that random neural networks have attracted significant attention in the literature over the years.

Another useful technique for simplifying the study of large complex systems is to approximate their size as infinite. For neural networks, the concept of size has at least two axes: the number of samples and the number of parameters.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
It is common, particularly in the statistics literature, to consider the mean performance of a finite-capacity model against a given data distribution. From this perspective, the number of samples, $m$, is taken to be infinite relative to the number of parameters, $n$, i.e. $n/m \to 0$. An alternative perspective is frequently employed in the study of kernel or random feature methods. In this case, the number of parameters is taken to be infinite relative to the number of samples, i.e. $n/m \to \infty$. In practice, however, most successful modern deep learning architectures tend to have both a large number of samples and a large number of parameters, often of roughly the same order of magnitude. (One simple explanation for this scaling may just be that the other extremes tend to produce over- or under-fitting.) Motivated by this observation, in this work we explore the infinite size limit in which both the number of samples and the number of parameters go to infinity at the same rate, i.e. $n, m \to \infty$ with $n/m = \phi$ for some finite constant $\phi$. This perspective puts us squarely in the regime of random matrix theory.

An abundance of matrices are of practical and theoretical interest in the context of random neural networks. For example, the output of the network, its Jacobian, and the Hessian of the loss function with respect to the weights are all interesting objects of study. In this work we focus on the computation of the eigenvalues of the matrix $M \equiv \frac{1}{m} Y^T Y$, where $Y = f(WX)$, $W$ is a Gaussian random weight matrix, $X$ is a Gaussian random data matrix, and $f$ is a pointwise activation function. In many ways, $Y$ is a basic primitive whose understanding is necessary for attacking more complicated cases; for example, $Y$ appears in the expressions for all three of the matrices mentioned above.
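The ensemble just described is easy to instantiate numerically. The following sketch builds $Y = f(WX)$ and the Gram matrix and extracts its eigenvalues; the particular dimensions, $\sigma_w = \sigma_x = 1$, and the choice $f = \tanh$ are illustrative assumptions on our part, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n0, n1, m = 200, 200, 400               # psi = n0/n1 = 1, phi = n0/m = 1/2 (illustrative)
W = rng.normal(0.0, 1.0 / np.sqrt(n0), size=(n1, n0))   # W_ij ~ N(0, sigma_w^2 / n0)
X = rng.normal(0.0, 1.0, size=(n0, m))                  # X_imu ~ N(0, sigma_x^2)

Y = np.tanh(W @ X)                      # pointwise nonlinearity; tanh is odd, so it
                                        # has zero Gaussian mean as required below
M = (Y @ Y.T) / m                       # Gram matrix whose spectrum the paper studies

eigs = np.linalg.eigvalsh(M)            # empirical spectrum; a histogram estimates rho_M
```

A histogram of `eigs` at growing width is exactly the kind of empirical spectral density that the theory developed below characterizes in the limit.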
But studying $Y$ is also quite interesting in its own right, with several interesting applications to machine learning that we will explore in Section 4.

1.1 Our contribution

The nonlinearity of the activation function prevents us from leveraging many of the existing mathematical results from random matrix theory. Nevertheless, most of the basic tools for computing spectral densities of random matrices still apply in this setting. In this work, we show how to overcome some of the technical hurdles that have prevented explicit computations of this type in the past. In particular, we employ the so-called moments method, deducing the spectral density of $M$ from the traces $\mathrm{tr}\, M^k$. Evaluating the traces involves computing certain multi-dimensional integrals, which we show how to evaluate, and enumerating a certain class of graphs, for which we derive a generating function. The result of our calculation is a quartic equation which is satisfied by the trace of the resolvent of $M$, $G(z) = -\mathbb{E}[\mathrm{tr}(M - zI)^{-1}]$. It depends on two parameters that together capture the only relevant properties of the nonlinearity $f$: $\eta$, the Gaussian mean of $f^2$, and $\zeta$, the square of the Gaussian mean of $f'$. Overall, the techniques presented here pave the way for studying other types of nonlinear random matrices relevant for the theoretical understanding of neural networks.

1.2 Applications of our results

We show that the training loss of a ridge-regularized single-layer random-feature least-squares memorization problem with regularization parameter $\gamma$ is related to $-\gamma^2 G'(-\gamma)$. We observe increased memorization capacity for certain types of nonlinearities relative to others. In particular, for a fixed value of $\gamma$, the training loss is lower if $\eta/\zeta$ is large, a condition satisfied by a large class of activation functions, for example when $f$ is close to an even function.
We believe this observation could have an important practical impact in designing next-generation activation functions.

We also examine the eigenvalue density of $M$ and observe that if $\zeta = 0$ the distribution collapses to the Marchenko-Pastur distribution (Marčenko & Pastur, 1967), which describes the eigenvalues of the Wishart matrix $X^T X$. We therefore make the surprising observation that there exist functions $f$ such that $f(WX)$ has the same singular value distribution as $X$. Said another way, the eigenvalues of the data covariance matrix are unchanged in distribution after passing through a single nonlinear layer of the network. We conjecture that this property is actually satisfied through arbitrary layers of the network, and find supporting numerical evidence. This conjecture may be regarded as a claim about the universality of our results with respect to the distribution of $X$. Note that preserving the first moment of this distribution is also an effect achieved through batch normalization (Ioffe & Szegedy, 2015), although higher moments are not necessarily preserved. We therefore offer the hypothesis that choosing activation functions with $\zeta = 0$ might lead to improved training performance, in the same way that batch normalization does, at least early in training.

1.3 Related work

The study of random neural networks has a relatively long history, with much of the initial work focusing on approaches from statistical physics and the theory of spin glasses. For example, Amit et al. (1985) analyze the long-time behavior of certain dynamical models of neural networks in terms of an Ising spin-glass Hamiltonian, and Gardner & Derrida (1988) examine the storage capacity of neural networks by studying the density of metastable states of a similar spin-glass system. More recently, Choromanska et al.
(2015) studied the critical points of random loss surfaces, also by examining an associated spin-glass Hamiltonian, and Schoenholz et al. (2017) developed an exact correspondence between random neural networks and statistical field theory.

In a somewhat tangential direction, random neural networks have also been investigated through their relationship to kernel methods. The correspondence between infinite-dimensional neural networks and Gaussian processes was first noted by Neal (1994a,b). In the finite-dimensional setting, the approximate correspondence to kernel methods led to the development of random feature methods that can accelerate the training of kernel machines (Rahimi & Recht, 2007). More recently, a duality between random neural networks with general architectures and compositional kernels was explored by Daniely et al. (2016).

In the last several years, random neural networks have been studied from many other perspectives. Saxe et al. (2014) examined the effect of random initialization on the dynamics of learning in deep linear networks. Schoenholz et al. (2016) studied how information propagates through random networks, and how that affects learning. Poole et al. (2016) and Raghu et al. (2016) investigated various measures of expressivity in the context of deep random neural networks.

Despite this extensive literature related to random neural networks, there has been relatively little research devoted to studying random matrices with nonlinear dependencies. The main focus in this direction has been kernel random matrices and robust statistics models (El Karoui et al., 2010; Cheng & Singer, 2013). In a closely-related contemporaneous work, Louart et al.
(2017) examined the resolvent of the Gram matrix $Y Y^T$ in the case where $X$ is deterministic.

2 Preliminaries

Throughout this work we will be relying on a number of basic concepts from random matrix theory. Here we provide a lightning overview of the essentials, but refer the reader to the more pedagogical literature for background (Tao, 2012).

2.1 Notation

Let $X \in \mathbb{R}^{n_0 \times m}$ be a random data matrix with i.i.d. elements $X_{i\mu} \sim \mathcal{N}(0, \sigma_x^2)$ and $W \in \mathbb{R}^{n_1 \times n_0}$ be a random weight matrix with i.i.d. elements $W_{ij} \sim \mathcal{N}(0, \sigma_w^2/n_0)$. As discussed in Section 1, we are interested in the regime in which both the row and column dimensions of these matrices are large and approach infinity at the same rate. In particular, we define

$\phi \equiv \frac{n_0}{m}, \qquad \psi \equiv \frac{n_0}{n_1}$   (1)

to be fixed constants as $n_0, n_1, m \to \infty$. In what follows, we will frequently consider the limit that $n_0 \to \infty$ with the understanding that $n_1 \to \infty$ and $m \to \infty$, so that eqn. (1) is satisfied.

We denote the matrix of pre-activations by $Z = WX$. Let $f: \mathbb{R} \to \mathbb{R}$ be a function with zero mean and finite moments,

$\int \frac{dz}{\sqrt{2\pi}} e^{-\frac{z^2}{2}} f(\sigma_w \sigma_x z) = 0, \qquad \left|\int \frac{dz}{\sqrt{2\pi}} e^{-\frac{z^2}{2}} f(\sigma_w \sigma_x z)^k\right| < \infty \ \text{for}\ k > 1$,   (2)

and denote the matrix of post-activations $Y = f(Z)$, where $f$ is applied pointwise. We will be interested in the Gram matrix,

$M = \frac{1}{m} Y Y^T \in \mathbb{R}^{n_1 \times n_1}$.   (3)

2.2 Spectral density and the Stieltjes transform

The empirical spectral density of $M$ is defined as,

$\rho_M(t) = \frac{1}{n_1} \sum_{j=1}^{n_1} \delta\big(t - \lambda_j(M)\big)$,   (4)

where $\delta$ is the Dirac delta function, and the $\lambda_j(M)$, $j = 1, \ldots, n_1$, denote the $n_1$ eigenvalues of $M$, including multiplicity. The limiting spectral density is defined as the limit of eqn. (4) as $n_1 \to \infty$, if it exists.

For $z \in \mathbb{C} \setminus \mathrm{supp}(\rho_M)$ the Stieltjes transform $G$ of $\rho_M$ is defined as,

$G(z) = \int \frac{\rho_M(t)}{z - t}\, dt = -\frac{1}{n_1} \mathbb{E}\big[\mathrm{tr}(M - z I_{n_1})^{-1}\big]$,   (5)

where the expectation is with respect to the random variables $W$ and $X$. The quantity $(M - z I_{n_1})^{-1}$ is the resolvent of $M$. The spectral density can be recovered from the Stieltjes transform using the inversion formula,

$\rho_M(\lambda) = -\frac{1}{\pi} \lim_{\epsilon \to 0^+} \mathrm{Im}\, G(\lambda + i\epsilon)$.   (6)

2.3 Moment method

One of the main tools for computing the limiting spectral distributions of random matrices is the moment method, which, as the name suggests, is based on computations of the moments of $\rho_M$. The asymptotic expansion of eqn. (5) for large $z$ gives the Laurent series,

$G(z) = \sum_{k=0}^{\infty} \frac{m_k}{z^{k+1}}$,   (7)

where $m_k$ is the $k$th moment of the distribution $\rho_M$,

$m_k = \int dt\, \rho_M(t)\, t^k = \frac{1}{n_1} \mathbb{E}\big[\mathrm{tr}\, M^k\big]$.   (8)

If one can compute $m_k$, then the density $\rho_M$ can be obtained via eqns. (7) and (6). The idea behind the moment method is to compute $m_k$ by expanding out powers of $M$ inside the trace as,

$\frac{1}{n_1} \mathbb{E}\big[\mathrm{tr}\, M^k\big] = \frac{1}{n_1} \mathbb{E}\Big[\sum_{i_1, \ldots, i_k \in [n_1]} M_{i_1 i_2} M_{i_2 i_3} \cdots M_{i_{k-1} i_k} M_{i_k i_1}\Big]$,   (9)

and evaluating the leading contributions to the sum as the matrix dimensions go to infinity, i.e. as $n_0 \to \infty$. Determining the leading contributions involves a complicated combinatorial analysis, combined with the evaluation of certain nontrivial high-dimensional integrals. In the next section and the supplementary material, we provide an outline for how to tackle these technical components of the computation.

3 The Stieltjes transform of M

3.1 Main result

The following theorem characterizes $G$ as the solution to a quartic polynomial equation.

Theorem 1.
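Both the Stieltjes transform and the inversion formula are easy to check numerically on a finite matrix, which is a useful sanity check before working with the asymptotic theory. A minimal sketch (finite $n$ in place of the $n_1 \to \infty$ limit; the toy spectrum and the value of $\epsilon$ are our illustrative choices):

```python
import numpy as np

def stieltjes(eigs, z):
    # Empirical Stieltjes transform: G(z) = (1/n) * sum_j 1 / (z - lambda_j),
    # a finite-n stand-in for eqn. (5).
    return np.mean(1.0 / (z - eigs))

def density(eigs, lam, eps=1e-3):
    # Inversion formula, eqn. (6): rho(lam) = -(1/pi) Im G(lam + i*eps), eps -> 0+.
    return -stieltjes(eigs, lam + 1j * eps).imag / np.pi

# Smoothed density of a toy spectrum; at finite eps each eigenvalue becomes a
# narrow Lorentzian, so the density integrates to ~1 over a wide grid.
eigs = np.array([0.5, 1.0, 1.5])
grid = np.linspace(-2.0, 4.0, 4001)
rho = np.array([density(eigs, x, eps=1e-2) for x in grid])
mass = np.trapz(rho, grid)
```

Shrinking `eps` sharpens the peaks around the eigenvalues, mirroring the $\epsilon \to 0^+$ limit in eqn. (6).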
For $M$, $\phi$, $\psi$, $\sigma_w$, and $\sigma_x$ defined as in Section 2.1, and constants $\eta$ and $\zeta$ defined as,

$\eta = \int dz\, \frac{e^{-z^2/2}}{\sqrt{2\pi}}\, f(\sigma_w \sigma_x z)^2 \quad \text{and} \quad \zeta = \left[\sigma_w \sigma_x \int dz\, \frac{e^{-z^2/2}}{\sqrt{2\pi}}\, f'(\sigma_w \sigma_x z)\right]^2$,   (10)

the Stieltjes transform of the spectral density of $M$ satisfies,

$G(z) = \frac{\psi}{z} P\!\left(\frac{1}{z\psi}\right) + \frac{1 - \psi}{z}$,   (11)

where,

$P = 1 + (\eta - \zeta)\, t\, P_\phi P_\psi + \frac{P_\phi P_\psi\, t\, \zeta}{1 - P_\phi P_\psi\, t\, \zeta}$,   (12)

and

$P_\phi = 1 + (P - 1)\phi, \qquad P_\psi = 1 + (P - 1)\psi$.   (13)

The proof of Theorem 1 is relatively long and complicated, so it's deferred to the supplementary material. The main idea underlying the proof is to translate the calculation of the moments in eqn. (7) into two subproblems, one of enumerating certain connected outer-planar graphs, and another of evaluating integrals that correspond to cycles in those graphs. The complexity resides both in characterizing which outer-planar graphs contribute at leading order to the moments, and also in computing those moments explicitly. A generating function encapsulating these results ($P$ from Theorem 1) is shown to satisfy a relatively simple recurrence relation. Satisfying this recurrence relation requires that $P$ solve eqn. (12). Finally, some bookkeeping relates $G$ to $P$.

3.2 Limiting cases

3.2.1 $\eta = \zeta$

In Section 3 of the supplementary material, we use a Hermite polynomial expansion of $f$ to show that $\eta = \zeta$ if and only if $f$ is a linear function. In this case, $M = ZZ^T$, where $Z = WX$ is a product of Gaussian random matrices. Therefore we expect $G$ to reduce to the Stieltjes transform of a so-called product Wishart matrix. In (Dupic & Castillo, 2014), a cubic equation defining the Stieltjes transform of such matrices is derived. Although eqn. (11) is generally quartic, the coefficient of the quartic term vanishes when $\eta = \zeta$ (see Section 4 of the supplementary material).
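Theorem 1 can be evaluated numerically: for $z$ off the real axis, one way to solve the recurrence (12) for $P$ is plain fixed-point iteration, after which $G(z)$ is assembled via eqn. (11). This is a sketch under our own assumptions — the iteration converges for $z$ sufficiently far from the support of the spectrum and may need damping near it, and the paper itself works with the quartic polynomial rather than an iterative solver:

```python
import numpy as np

def G_theory(z, eta, zeta, phi, psi, iters=500):
    # Solve the recurrence of eqn. (12) for P at t = 1/(z*psi) by fixed-point
    # iteration, then assemble G(z) from eqn. (11).
    t = 1.0 / (z * psi)
    P = 1.0 + 0.0j
    for _ in range(iters):
        P_phi = 1.0 + (P - 1.0) * phi          # eqn. (13)
        P_psi = 1.0 + (P - 1.0) * psi
        PPt = P_phi * P_psi * t
        P = 1.0 + (eta - zeta) * PPt + PPt * zeta / (1.0 - PPt * zeta)
    return (psi / z) * P + (1.0 - psi) / z

# Far from the spectrum, G(z) ~ 1/z + eta/z^2, matching the first moments of rho_M.
z = 50.0 + 1.0j
G = G_theory(z, eta=1.0, zeta=0.5, phi=0.5, psi=0.5)
```

Scanning $z = \lambda + i\epsilon$ along the real axis and applying the inversion formula (6) to the result traces out the limiting spectral density.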
The resulting cubic polynomial is in agreement with the results in (Dupic & Castillo, 2014).

3.2.2 $\zeta = 0$

Another interesting limit is when $\zeta = 0$, which significantly simplifies the expression in eqn. (12). Without loss of generality, we can take $\eta = 1$ (the general case can be recovered by rescaling $z$). The resulting equation is,

$\frac{\phi}{\psi}\, z\, G^2 + \left(1 - \frac{\phi}{\psi} - z\right) G + 1 = 0$,   (14)

which is precisely the equation satisfied by the Stieltjes transform of the Marchenko-Pastur distribution with shape parameter $\phi/\psi$. Notice that when $\psi = 1$, the latter is the limiting spectral distribution of $XX^T$, which implies that $Y Y^T$ and $XX^T$ have the same limiting spectral distribution. Therefore we have identified a novel type of isospectral nonlinear transformation. We investigate this observation in Section 4.1.

4 Applications

4.1 Data covariance

Consider a deep feedforward neural network with $l$th-layer post-activation matrix given by,

$Y^l = f(W^l Y^{l-1}), \qquad Y^0 = X$.   (15)

The matrix $Y^l (Y^l)^T$ is the $l$th-layer data covariance matrix. The distribution of its eigenvalues (or the singular values of $Y^l$) determines the extent to which the input signals become distorted or stretched as they propagate through the network. Highly skewed distributions indicate strong anisotropy in the embedded feature space, which is a form of poor conditioning that is likely to derail or impede learning. A variety of techniques have been developed to alleviate this problem, the most popular of which is batch normalization. In batch normalization, the variance of individual activations across the batch (or dataset) is rescaled to equal one.
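The $\zeta = 0$ isospectrality is easy to probe numerically. The sketch below uses the shifted, rescaled absolute-value activation — its derivative is $\mathrm{sign}(z)$, whose Gaussian mean vanishes, so $\zeta = 0$ while $\eta = 1$ — and compares the spectrum of $\frac{1}{m} f(WX) f(WX)^T$ with that of $\frac{1}{m} X X^T$. The dimensions and tolerances are our illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 400, 800                            # psi = 1, phi = 1/2 (illustrative)
W = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))
X = rng.normal(size=(n, m))

def f(z):
    # |z| shifted to zero Gaussian mean and scaled so eta = E[f(Z)^2] = 1;
    # f'(z) = sign(z) has zero Gaussian mean, hence zeta = 0.
    return (np.abs(z) - np.sqrt(2.0 / np.pi)) / np.sqrt(1.0 - 2.0 / np.pi)

Y = f(W @ X)
eigs_Y = np.linalg.eigvalsh(Y @ Y.T / m)
eigs_X = np.linalg.eigvalsh(X @ X.T / m)   # Marchenko-Pastur with ratio n/m

# The two sorted spectra should nearly coincide at this width.
gap = np.abs(np.sort(eigs_Y) - np.sort(eigs_X)).mean()
```

Repeating the comparison with, say, relu in place of this activation (so $\zeta \neq 0$) makes `gap` stop shrinking with width, which is the contrast explored in Section 4.1.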
The covariance is often ignored — variants that attempt to fully whiten the activations can be very slow. So one aspect of batch normalization, as it is used in practice, is that it preserves the trace of the covariance matrix (i.e. the first moment of its eigenvalue distribution) as the signal propagates through the network, but it does not control higher moments of the distribution. A consequence is that there may still be a large imbalance in singular values.

(a) L = 1   (b) L = 10

Figure 1: Distance between the (a) first-layer and (b) tenth-layer empirical eigenvalue distributions of the data covariance matrices and our theoretical prediction for the first-layer limiting distribution $\bar\rho_1$, as a function of network width $n_0$. Plots are for shape parameters $\phi = 1$ and $\psi = 3/2$. The different curves correspond to different piecewise linear activation functions parameterized by $\alpha$: $\alpha = -1$ is linear, $\alpha = 0$ is (shifted) relu, and $\alpha = 1$ is (shifted) absolute value. [Legend: $\alpha = 1$ ($\zeta = 0$), $\alpha = 1/4$ ($\zeta = 0.498$), $\alpha = 0$ ($\zeta = 0.733$), $\alpha = -1/4$ ($\zeta = 0.884$), $\alpha = -1$ ($\zeta = 1$).] In (a), for all $\alpha$, we see good convergence of the empirical distribution $\rho_1$ to our asymptotic prediction $\bar\rho_1$. In (b), in accordance with our conjecture, we find good agreement between $\bar\rho_1$ and the tenth-layer empirical distribution when $\zeta = 0$, but not for other values of $\zeta$. This provides evidence that when $\zeta = 0$ the eigenvalue distribution is preserved by the nonlinear transformations.

An interesting question, therefore, is whether there exist efficient techniques that could preserve or approximately preserve the full singular value spectrum of the activations as they propagate through the network. Inspired by the results of Section 3.2.2, we hypothesize that choosing an activation function with $\zeta = 0$ may be one way to approximately achieve this behavior, at least early in training. From a mathematical perspective, this hypothesis is similar to asking whether our results in eqn. (11) are universal with respect to the distribution of $X$. We investigate this question empirically.

Let $\rho_l$ be the empirical eigenvalue density of $Y^l (Y^l)^T$, and let $\bar\rho_1$ be the limiting density determined by eqn. (11) (with $\eta = 1$). We would like to measure the distance between $\bar\rho_1$ and $\rho_l$ in order to see whether the eigenvalues propagate without getting distorted. There are many options that would suffice, but we choose to track the following metric,

$d(\bar\rho_1, \rho_l) \equiv \int d\lambda\, \left|\bar\rho_1(\lambda) - \rho_l(\lambda)\right|$.   (16)

To observe the effect of varying $\zeta$, we utilize a variant of the relu activation function with non-zero slope for negative inputs,

$f_\alpha(x) = \dfrac{[x]_+ + \alpha[-x]_+ - \frac{1+\alpha}{\sqrt{2\pi}}}{\sqrt{\frac{1}{2}(1 + \alpha^2) - \frac{1}{2\pi}(1 + \alpha)^2}}$.   (17)

One may interpret $\alpha$ as (the negative of) the ratio of the slope for negative $x$ to the slope for positive $x$. It is straightforward to check that $f_\alpha$ has zero Gaussian mean and that,

$\eta = 1, \qquad \zeta = \dfrac{(1 - \alpha)^2}{2(1 + \alpha^2) - \frac{2}{\pi}(1 + \alpha)^2}$,   (18)

so we can adjust $\zeta$ (without affecting $\eta$) by changing $\alpha$.

(a) $\phi = \frac{1}{2}$, $\psi = \frac{1}{2}$   (b) $\phi = \frac{1}{2}$, $\psi = \frac{3}{4}$

Figure 2: Memorization performance of random feature networks versus ridge regularization parameter $\gamma$. Theoretical curves are solid lines and numerical solutions to eqn. (19) are points. $\beta \equiv \log_{10}(\eta/\zeta - 1)$ distinguishes classes of nonlinearities, with $\beta = -\infty$ corresponding to a linear network. [Legend: $\beta \in \{-\infty, -8, -6, -4, -2, 0, 2\}$; axes: $\log_{10}(\gamma/\eta)$ vs. $E_{\mathrm{train}}$.] Each numerical simulation is done with a different randomly-chosen function $f$ and the specified $\beta$. The good agreement confirms that no details about $f$ other than $\beta$ are relevant. In (a), there are more random features than data points, allowing for perfect memorization unless the function $f$ is linear, in which case the model is rank constrained. In (b), there are fewer random features than data points, and even the nonlinear models fail to achieve perfect memorization. For a fixed amount of regularization $\gamma$, curves with larger values of $\beta$ (smaller values of $\zeta$) have lower training loss and hence increased memorization capacity.

Fig. 1(a) shows that for any value of $\alpha$ (and thus $\zeta$) the distance between $\bar\rho_1$ and $\rho_1$ approaches zero as the network width increases. This offers numerical evidence that eqn. (11) is in fact the correct asymptotic limit. It also shows how quickly the asymptotic behavior sets in, which is useful for interpreting Fig. 1(b), which shows the distance between $\bar\rho_1$ and $\rho_{10}$. Observe that if $\zeta = 0$, $\rho_{10}$ approaches $\bar\rho_1$ as the network width increases. This provides evidence for the conjecture that the eigenvalues are in fact preserved as they propagate through the network, but only when $\zeta = 0$, since we see the distances level off at some finite value when $\zeta \neq 0$. We also note that small non-zero values of $\zeta$ may not distort the eigenvalues too much.

These observations suggest a new method of tuning the network for fast optimization. Recent work (Pennington et al., 2017) found that inducing dynamical isometry, i.e. equilibrating the singular value distribution of the input-output Jacobian, can greatly speed up training. In our context, by choosing an activation function with $\zeta \approx 0$, we can induce a similar type of isometry, not of the input-output Jacobian, but of the data covariance matrix as it propagates through the network.
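The closed forms in eqn. (18) are straightforward to verify by Monte Carlo over $Z \sim \mathcal{N}(0,1)$. This standalone check, including the sample size and tolerances, is our own construction, not the paper's:

```python
import numpy as np

def f_alpha(x, a):
    # Piecewise-linear activation of eqn. (17): slope 1 for x > 0, slope -a for
    # x < 0, shifted to zero Gaussian mean and normalized so that eta = 1.
    num = np.maximum(x, 0.0) + a * np.maximum(-x, 0.0) - (1.0 + a) / np.sqrt(2.0 * np.pi)
    den = np.sqrt(0.5 * (1.0 + a * a) - (1.0 + a) ** 2 / (2.0 * np.pi))
    return num / den

def zeta_closed(a):
    # eqn. (18)
    return (1.0 - a) ** 2 / (2.0 * (1.0 + a * a) - (2.0 / np.pi) * (1.0 + a) ** 2)

rng = np.random.default_rng(0)
z = rng.normal(size=1_000_000)
results = {}
for a in (1.0, 0.25, 0.0, -0.25, -1.0):
    mean = f_alpha(z, a).mean()              # should be ~0 (zero Gaussian mean)
    eta = (f_alpha(z, a) ** 2).mean()        # should be ~1
    # Monte Carlo estimate of zeta = (E[f'])^2: f' = 1/den on z>0, -a/den on z<0.
    den = np.sqrt(0.5 * (1.0 + a * a) - (1.0 + a) ** 2 / (2.0 * np.pi))
    zeta_mc = np.where(z > 0, 1.0, -a).mean() ** 2 / den ** 2
    results[a] = (mean, eta, zeta_mc)
```

At $\alpha = 0$ this reproduces the relu value $\zeta = 1/(2 - 2/\pi) \approx 0.733$ quoted for Figure 1, and $\alpha = 1$ (absolute value) gives $\zeta = 0$ exactly.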
We conjecture that inducing this additional isometry may lead to further training speed-ups, but we leave further investigation of these ideas to future work.

4.2 Asymptotic performance of random feature methods

Consider the ridge-regularized least squares loss function defined by,

$L(W_2) = \frac{1}{2 n_2 m} \|\mathcal{Y} - W_2^T Y\|_F^2 + \gamma \|W_2\|_F^2, \qquad Y = f(WX)$,   (19)

where $X \in \mathbb{R}^{n_0 \times m}$ is a matrix of $m$ $n_0$-dimensional features, $\mathcal{Y} \in \mathbb{R}^{n_2 \times m}$ is a matrix of regression targets, $W \in \mathbb{R}^{n_1 \times n_0}$ is a matrix of random weights and $W_2 \in \mathbb{R}^{n_1 \times n_2}$ is a matrix of parameters to be learned. The matrix $Y$ is a matrix of random features¹. The optimal parameters are,

$W_2^* = \frac{1}{m} Y Q \mathcal{Y}^T, \qquad Q = \left(\frac{1}{m} Y^T Y + \gamma I_m\right)^{-1}$.   (20)

¹We emphasize that we are using an unconventional notation for the random features — we call them $Y$ in order to make contact with the previous sections.

Our problem setup and analysis are similar to that of (Louart et al., 2017), but in contrast to that work, we are interested in the memorization setting in which the network is trained on random input-output pairs. Performance on this task is then a measure of the capacity of the model, or the complexity of the function class it belongs to. In this context, we take the data $X$ and the targets $\mathcal{Y}$ to be independent Gaussian random matrices. From eqns. (19) and (20), the expected training loss is given by,

$E_{\mathrm{train}} = \mathbb{E}_{W,X,\mathcal{Y}}[L(W_2^*)] = \mathbb{E}_{W,X,\mathcal{Y}}\Big[\frac{\gamma^2}{m} \mathrm{tr}\,\mathcal{Y}^T \mathcal{Y} Q^2\Big] = \mathbb{E}_{W,X}\Big[\frac{\gamma^2}{m} \mathrm{tr}\, Q^2\Big] = -\frac{\gamma^2}{m} \frac{\partial}{\partial \gamma} \mathbb{E}_{W,X}[\mathrm{tr}\, Q]$.   (21)

It is evident from eqn. (5) and the definition of $Q$ that $\mathbb{E}_{W,X}[\mathrm{tr}\, Q]$ is related to $G(-\gamma)$.
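Eqn. (20) can be sanity-checked numerically: the $m \times m$ form of the solution must agree with the equivalent $n_1 \times n_1$ "push-through" form, and it must be a stationary point of the ridge objective. The dimensions, $\gamma$, and $f = \tanh$ below are arbitrary illustrative choices, and since the paper's normalization constants are loose, the gradient check uses the convention $\frac{1}{2m}\|\mathcal{Y} - W_2^T Y\|_F^2 + \frac{\gamma}{2}\|W_2\|_F^2$, for which eqn. (20) is the exact stationary point:

```python
import numpy as np

rng = np.random.default_rng(2)
n0, n1, n2, m, gamma = 30, 40, 5, 50, 0.1
W = rng.normal(0.0, 1.0 / np.sqrt(n0), size=(n1, n0))
X = rng.normal(size=(n0, m))
Ytgt = rng.normal(size=(n2, m))           # random regression targets ("script Y")

Y = np.tanh(W @ X)                         # random features
Q = np.linalg.inv(Y.T @ Y / m + gamma * np.eye(m))          # eqn. (20)
W2 = (Y @ Q @ Ytgt.T) / m                                   # optimal parameters

# Push-through identity: Y (Y^T Y / m + g I_m)^{-1} = (Y Y^T / m + g I_n1)^{-1} Y.
W2_alt = np.linalg.solve(Y @ Y.T / m + gamma * np.eye(n1), Y @ Ytgt.T) / m

# Stationarity of (1/(2m))||Ytgt - W2^T Y||_F^2 + (gamma/2)||W2||_F^2 at W2:
grad = Y @ (W2.T @ Y - Ytgt).T / m + gamma * W2
```

The push-through form is also the cheaper one whenever $n_1 < m$, since it inverts an $n_1 \times n_1$ matrix instead of an $m \times m$ one.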
However, our results from the previous section cannot be used directly because $Q$ contains the matrix $Y^T Y$, whereas $G$ was computed with respect to $Y Y^T$. Thankfully, the two matrices differ only by a finite number of zero eigenvalues. Some simple bookkeeping shows that

$\frac{1}{m} \mathbb{E}_{W,X}[\mathrm{tr}\, Q] = \frac{1 - \phi/\psi}{\gamma} - \frac{\phi}{\psi} G(-\gamma)$.   (22)

From eqn. (11) and its total derivative with respect to $z$, an equation for $G'(z)$ can be obtained by computing the resultant of the two polynomials and eliminating $G(z)$. An equation for $E_{\mathrm{train}}$ follows; see Section 4 of the supplementary material for details. An analysis of this equation shows that it is homogeneous in $\gamma$, $\eta$, and $\zeta$, i.e., for any $\lambda > 0$,

$E_{\mathrm{train}}(\gamma, \eta, \zeta) = E_{\mathrm{train}}(\lambda\gamma, \lambda\eta, \lambda\zeta)$.   (23)

In fact, this homogeneity is entirely expected from eqn. (19): an increase in the regularization constant $\gamma$ can be compensated by a decrease in scale of $W_2$, which, in turn, can be compensated by increasing the scale of $Y$, which is equivalent to increasing $\eta$ and $\zeta$. Owing to this homogeneity, we are free to choose $\lambda = 1/\eta$. For simplicity, we set $\eta = 1$ and examine the two-variable function $E_{\mathrm{train}}(\gamma, 1, \zeta)$. The behavior when $\gamma = 0$ is a measure of the capacity of the model with no regularization and depends on the value of $\zeta$,

$E_{\mathrm{train}}(0, 1, \zeta) = \begin{cases} [1 - \phi]_+ & \text{if } \zeta = 1 \text{ and } \psi < 1, \\ [1 - \phi/\psi]_+ & \text{otherwise.} \end{cases}$   (24)

As discussed in Section 3.2, when $\eta = \zeta = 1$, the function $f$ reduces to the identity. With this in mind, the various cases in eqn. (24) are readily understood by considering the effective rank of the random feature matrix $Y$.

In Fig. 2, we compare our theoretical predictions for $E_{\mathrm{train}}$ to numerical simulations of solutions to eqn. (19). The different curves explore various values of $\beta \equiv \log_{10}(\eta/\zeta - 1)$ and therefore probe different classes of nonlinearities.
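The $\gamma \to 0$ capacity result, eqn. (24), reflects simple rank-counting, which can be checked directly with an unregularized least-squares fit. The specific sizes, the use of relative squared error, and the choice of $\tanh$ versus the identity are our illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n0, n1, n2, m = 30, 40, 50, 60             # phi = 1/2, psi = 3/4, phi/psi = n1/m = 2/3
W = rng.normal(0.0, 1.0 / np.sqrt(n0), size=(n1, n0))
X = rng.normal(size=(n0, m))
Ytgt = rng.normal(size=(n2, m))

def rel_train_err(Y):
    # Unregularized least squares of the targets on the features Y (gamma -> 0);
    # for random targets the expected relative error is 1 - rank(Y)/m.
    coef, *_ = np.linalg.lstsq(Y.T, Ytgt.T, rcond=None)
    resid = Ytgt.T - Y.T @ coef
    return (resid ** 2).sum() / (Ytgt ** 2).sum()

err_nonlin = rel_train_err(np.tanh(W @ X))  # rank n1 = 40: expect ~ 1 - phi/psi = 1/3
err_lin = rel_train_err(W @ X)              # rank n0 = 30: expect ~ 1 - phi = 1/2
```

The nonlinearity lifts the rank of the feature matrix from $n_0$ to $\min(n_1, m)$, which is exactly the gap between the two cases of eqn. (24).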
For each numerical simulation, we choose a random quintic polynomial $f$ with the specified value of $\beta$ (for details on this choice, see Section 3 of the supplementary material). The excellent agreement between theory and simulations confirms that $E_{\mathrm{train}}$ depends only on $\beta$ and not on any other details of $f$. The black curves correspond to the performance of a linear network. The results show that for $\zeta$ very close to $\eta$, the models are unable to utilize their nonlinearity unless the regularization parameter is very small. Conversely, for $\zeta$ close to zero, the models exploit the nonlinearity very efficiently and absorb large amounts of regularization without a significant drop in performance. This suggests that small $\zeta$ might provide an interesting class of nonlinear functions with enhanced expressive power. See Fig. 3 for some examples of activation functions with this property.

5 Conclusions

In this work we studied the Gram matrix $M = \frac{1}{m} Y^T Y$, where $Y = f(WX)$ and $W$ and $X$ are random Gaussian matrices. We derived a quartic polynomial equation satisfied by the trace of the resolvent of $M$, which defines its limiting spectral density. In obtaining this result, we demonstrated

Figure 3: Examples of activation functions and their derivatives for which $\eta = 1$ and $\zeta = 0$. In red, $f^{(1)}(x) = c_1\big({-1} + \sqrt{5}\, e^{-2x^2}\big)$; in green, $f^{(2)}(x) = c_2\big(\sin(2x) + \cos(3x/2) - 2e^{-2}x - e^{-9/8}\big)$; in orange, $f^{(3)}(x) = c_3\big(|x| - \sqrt{2/\pi}\big)$; and in blue, $f^{(4)}(x) = c_4\big(x - \tfrac{\sqrt{3\pi}}{2}\,\mathrm{erf}(x)\big)$. If we let $\sigma_w = \sigma_x = 1$, then eqn. (2) is satisfied and $\zeta = 0$ for all cases.
We choose the normalization constants $c_i$ so that $\eta = 1$.

that pointwise nonlinearities can be incorporated into a standard method of proof in random matrix theory known as the moments method, thereby opening the door for future study of other nonlinear random matrices appearing in neural networks.

We applied our results to a memorization task in the context of random feature methods and obtained an explicit characterization of the training error as a function of a ridge regression parameter. The training error depends on the nonlinearity only through two scalar quantities, $\eta$ and $\zeta$, which are certain Gaussian integrals of $f$. We observe that functions with small values of $\zeta$ appear to have increased capacity relative to those with larger values of $\zeta$.

We also make the surprising observation that for $\zeta = 0$, the singular value distribution of $f(WX)$ is the same as the singular value distribution of $X$. In other words, the eigenvalues of the data covariance matrix are constant in distribution when passing through a single nonlinear layer of the network. We conjectured and found numerical evidence that this property actually holds when passing the signal through multiple layers. Therefore, we have identified a class of activation functions that maintains approximate isometry at initialization, which could have important practical consequences for training speed.

Both of our applications suggest that functions with $\zeta \approx 0$ are a potentially interesting class of activation functions. This is a large class of functions, as evidenced in Fig. 3, among which are many types of nonlinearities that have not been thoroughly explored in practical applications. It would be interesting to investigate these nonlinearities in future work.

References

Amit, Daniel J, Gutfreund, Hanoch, and Sompolinsky, Haim.
Spin-glass models of neural networks.\n\nPhysical Review A, 32(2):1007, 1985.\n\nCheng, Xiuyuan and Singer, Amit. The spectrum of random inner-product kernel matrices. Random\n\nMatrices: Theory and Applications, 2(04):1350010, 2013.\n\nChoromanska, Anna, Henaff, Mikael, Mathieu, Michael, Arous, G\u00e9rard Ben, and LeCun, Yann. The\n\nloss surfaces of multilayer networks. In AISTATS, 2015.\n\nDaniely, A., Frostig, R., and Singer, Y. Toward Deeper Understanding of Neural Networks: The\n\nPower of Initialization and a Dual View on Expressivity. arXiv:1602.05897, 2016.\n\nDupic, Thomas and Castillo, Isaac P\u00e9rez. Spectral density of products of wishart dilute random\n\nmatrices. part i: the dense case. arXiv preprint arXiv:1401.7802, 2014.\n\nEl Karoui, Noureddine et al. The spectrum of kernel random matrices. The Annals of Statistics, 38\n\n(1):1\u201350, 2010.\n\nGardner, E and Derrida, B. Optimal storage properties of neural network models. Journal of Physics\n\nA: Mathematical and general, 21(1):271, 1988.\n\n9\n\n\fHinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George E., Mohamed, Abdel-rahman, Jaitly, Navdeep,\nSenior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara N, et al. Deep neural networks\nfor acoustic modeling in speech recognition: The shared views of four research groups. IEEE\nSignal Processing Magazine, 29(6):82\u201397, 2012.\n\nIoffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by\nreducing internal covariate shift. In Proceedings of The 32nd International Conference on Machine\nLearning, pp. 448\u2013456, 2015.\n\nKrizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E.\n\nImagenet classi\ufb01cation with deep\nconvolutional neural networks. In Advances in neural information processing systems, pp. 1097\u2013\n1105, 2012.\n\nLouart, Cosme, Liao, Zhenyu, and Couillet, Romain. 
A random matrix approach to neural networks.\n\narXiv preprint arXiv:1702.05419, 2017.\n\nMar\u02c7cenko, Vladimir A and Pastur, Leonid Andreevich. Distribution of eigenvalues for some sets of\n\nrandom matrices. Mathematics of the USSR-Sbornik, 1(4):457, 1967.\n\nNeal, Radford M. Priors for in\ufb01nite networks (tech. rep. no. crg-tr-94-1). University of Toronto,\n\n1994a.\n\nNeal, Radford M. Bayesian Learning for Neural Networks. PhD thesis, University of Toronto, Dept.\n\nof Computer Science, 1994b.\n\nOord, Aaron van den, Dieleman, Sander, Zen, Heiga, Simonyan, Karen, Vinyals, Oriol, Graves, Alex,\nKalchbrenner, Nal, Senior, Andrew, and Kavukcuoglu, Koray. Wavenet: A generative model for\nraw audio. arXiv preprint arXiv:1609.03499, 2016.\n\nPennington, J, Schoenholz, S, and Ganguli, S. Resurrecting the sigmoid in deep learning through\ndynamical isometry: theory and practice. In Advances in neural information processing systems,\n2017.\n\nPoole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., and Ganguli, S. Exponential expressivity in deep\n\nneural networks through transient chaos. arXiv:1606.05340, June 2016.\n\nRaghu, M., Poole, B., Kleinberg, J., Ganguli, S., and Sohl-Dickstein, J. On the expressive power of\n\ndeep neural networks. arXiv:1606.05336, June 2016.\n\nRahimi, Ali and Recht, Ben. Random features for large-scale kernel machines.\n\nInfomration Processing Systems, 2007.\n\nIn In Neural\n\nSaxe, A. M., McClelland, J. L., and Ganguli, S. Exact solutions to the nonlinear dynamics of learning\n\nin deep linear neural networks. International Conference on Learning Representations, 2014.\n\nSchoenholz, S. S., Gilmer, J., Ganguli, S., and Sohl-Dickstein, J. Deep Information Propagation.\n\nArXiv e-prints, November 2016.\n\nSchoenholz, S. S., Pennington, J., and Sohl-Dickstein, J. A Correspondence Between Random Neural\n\nNetworks and Statistical Field Theory. 
ArXiv e-prints, 2017.\n\nShazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously\nICLR, 2017. URL\n\nlarge neural language models using sparsely gated mixtures of experts.\nhttp://arxiv.org/abs/1701.06538.\n\nTao, Terence. Topics in random matrix theory, volume 132. American Mathematical Society\n\nProvidence, RI, 2012.\n\nWu, Yonghui, Schuster, Mike, Chen, Zhifeng, Le, Quoc V., Norouzi, Mohammad, Macherey,\nWolfgang, Krikun, Maxim, Cao, Yuan, Gao, Qin, Macherey, Klaus, et al. Google\u2019s neural\nmachine translation system: Bridging the gap between human and machine translation. arXiv\npreprint arXiv:1609.08144, 2016.\n\n10\n\n\f", "award": [], "sourceid": 1512, "authors": [{"given_name": "Jeffrey", "family_name": "Pennington", "institution": "Google Brain"}, {"given_name": "Pratik", "family_name": "Worah", "institution": "Google"}]}
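The defining properties of the Fig. 3 activations can be checked numerically. The sketch below is not the authors' code; it assumes the natural reading of the two Gaussian integrals at σw = σx = 1, namely η = E[f(Z)²] and ζ = (E[Z f(Z)])² for Z ~ N(0, 1) (by Stein's lemma, E[Z f(Z)] = E[f′(Z)], so ζ = 0 means the Gaussian-averaged slope of f vanishes). It estimates both expectations by quadrature on a dense grid for the four functions in the Fig. 3 caption.

```python
# Sketch (assumed definitions of eta and zeta, not the paper's code):
# check that the four Fig. 3 activations give eta = E[f(Z)^2] = 1 after
# normalization and zeta = (E[Z f(Z)])^2 = 0, with Z ~ N(0, 1).
import math
import numpy as np

z = np.linspace(-10.0, 10.0, 200001)                 # uniform quadrature grid
dz = z[1] - z[0]
gauss = np.exp(-z**2 / 2) / math.sqrt(2 * math.pi)   # standard normal density
erf = np.vectorize(math.erf)                         # elementwise error function

def expect(values):
    # Gaussian expectation E[g(Z)] via a Riemann sum on the grid; the
    # integrands decay like the Gaussian, so truncation at |z| = 10 is safe.
    return float(np.sum(values * gauss) * dz)

# Unnormalized f^(1)..f^(4), as reconstructed in the Fig. 3 caption.
fs = [
    lambda x: -1 + math.sqrt(5) * np.exp(-2 * x**2),
    lambda x: (np.sin(2 * x) + np.cos(1.5 * x)
               - 2 * math.exp(-2) * x - math.exp(-9 / 8)),
    lambda x: np.abs(x) - math.sqrt(2 / math.pi),
    lambda x: (1 - (4 / math.sqrt(3)) * np.exp(-x**2 / 2)) * erf(x),
]

etas, zetas = [], []
for f in fs:
    c = 1.0 / math.sqrt(expect(f(z) ** 2))   # normalization constant c_i
    g = c * f(z)
    etas.append(expect(g ** 2))              # eta: 1 after normalizing
    zetas.append(expect(z * g) ** 2)         # zeta: vanishes for all four
```

Each function also has zero Gaussian mean, which is why subtracting the constants √(2/π), e^(−9/8), etc. appears in the caption; the multiplicative erf(x) form of f(4) makes it odd, so its mean is zero automatically.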