{"title": "On Learning Over-parameterized Neural Networks: A Functional Approximation Perspective", "book": "Advances in Neural Information Processing Systems", "page_first": 2641, "page_last": 2650, "abstract": "We consider training over-parameterized two-layer neural networks with Rectified Linear Unit (ReLU) using gradient descent (GD) method. Inspired by a recent line of work, we study the evolutions of network prediction errors across GD iterations, which can be  neatly described in a matrix form. When the network is sufficiently over-parameterized, these matrices individually approximate {\\em an} integral operator which is determined by the feature vector distribution $\\rho$ only. Consequently, GD method can be viewed as {\\em approximately} applying the powers of this integral operator on the underlying/target function $f^*$ that generates the responses/labels. \n \nWe show that if $f^*$ admits a low-rank approximation with respect to the eigenspaces of this integral operator, then the empirical risk decreases to this low rank approximation error at a linear rate which is determined by $f^*$ and $\\rho$ only, i.e., the rate is independent of the sample size $n$. Furthermore, if $f^*$ has zero low-rank approximation error, then, as long as the width of the neural network is $\\Omega(n\\log n)$, the empirical risk decreases to $\\Theta(1/\\sqrt{n})$. To the best of our knowledge, this is the first result showing the sufficiency of nearly-linear network over-parameterization.  We provide an application of our general results to the setting where $\\rho$ is the uniform distribution on the spheres and $f^*$ is a polynomial. 
Throughout this paper, we consider the scenario where the input dimension $d$ is fixed.", "full_text": "On Learning Over-parameterized Neural Networks:\n\nA Functional Approximation Perspective\n\nLili Su\n\nCSAIL, MIT\nlilisu@mit.edu\n\nPengkun Yang\n\nDepartment of Electrical Engineering\n\nPrinceton University\n\npengkuny@princeton.edu\n\nAbstract\n\nWe consider training over-parameterized two-layer neural networks with Recti\ufb01ed\nLinear Unit (ReLU) using gradient descent (GD) method. Inspired by a recent\nline of work, we study the evolutions of network prediction errors across GD\niterations, which can be neatly described in a matrix form. When the network\nis suf\ufb01ciently over-parameterized, these matrices individually approximate an\nintegral operator which is determined by the feature vector distribution \u21e2 only.\nConsequently, GD method can be viewed as approximately applying the powers of\nthis integral operator on the underlying function f\u21e4 that generates the responses.\nWe show that if f\u21e4 admits a low-rank approximation with respect to the eigenspaces\nof this integral operator, then the empirical risk decreases to this low-rank approxi-\nmation error at a linear rate which is determined by f\u21e4 and \u21e2 only, i.e., the rate is\nindependent of the sample size n. Furthermore, if f\u21e4 has zero low-rank approx-\nimation error, then, as long as the width of the neural network is \u2326(n log n), the\nempirical risk decreases to \u21e5(1/pn). To the best of our knowledge, this is the \ufb01rst\nresult showing the suf\ufb01ciency of nearly-linear network over-parameterization. We\nprovide an application of our general results to the setting where \u21e2 is the uniform\ndistribution on the spheres and f\u21e4 is a polynomial. 
Throughout this paper, we\nconsider the scenario where the input dimension d is \ufb01xed.\n\n1\n\nIntroduction\n\nNeural networks have been successfully applied in many real-world machine learning applications.\nHowever, a thorough understanding of the theory behind their practical success, even for two-layer\nneural networks, is still lacking. For example, despite learning optimal neural networks is provably\nNP-complete [BG17, BR89], in practice, even the neural networks found by the simple \ufb01rst-order\nmethods perform well [KSH12]. Additionally, in sharp contrast to traditional learning theory, over-\nparameterized neural networks (more parameters than the size of the training dataset) are observed\nto enjoy smaller training and even smaller generalization errors [ZBH+16]. In this paper, we focus\non training over-parameterized two-layer neural networks with Recti\ufb01ed Linear Unit (ReLU) using\ngradient descent (GD) method. Our results can be extended to other activation functions that satisfy\nsome regularity conditions; see [GMMM19, Theorem 2] for an example. The techniques derived and\ninsights obtained in this paper might be applied to deep neural networks as well, for which similar\nmatrix representation exists [DZPS18].\nSigni\ufb01cant progress has been made in understanding the role of over-parameterization in training\nneural networks with \ufb01rst-order methods [AZLL18, DZPS18, ADH+19, OS19, MMN18, LL18,\nZCZG18, DLL+18, AZLS18, CG19]; with proper random network initialization, (stochastic) GD\nconverges to a (nearly) global minimum provided that the width of the network m is polynomially\nlarge in the size of the training dataset n. However, neural networks seem to interpolate the training\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fdata as soon as the number of parameters exceed the size of the training dataset by a constant\nfactor [ZBH+16, OS19]. 
To the best of our knowledge, a provable justification of why such mild over-parametrization is sufficient for successful gradient-based training is still lacking. Moreover, the convergence rates derived in many existing works approach 0 as $n \to \infty$; see Section A in the Supplementary Material for details. In many applications the volumes of the datasets are huge: the ImageNet dataset [DDS+09] has 14 million images. For those applications, a non-diminishing (i.e., constant w.r.t. $n$) convergence rate is more desirable. In this paper, our goal is to characterize a constant (w.r.t. $n$) convergence rate while improving the sufficiency guarantee of network over-parameterization. Throughout this paper, we focus on the setting where the dimension of the feature vector $d$ is fixed, leaving the high-dimensional regime as one future direction.

Inspired by a recent line of work [DZPS18, ADH+19], we focus on characterizing the evolutions of the neural network prediction errors under the GD method. This focus is motivated by the fact that the neural network representation/approximation of a given function might not be unique [KB18], and this focus is also validated by experimental neuroscience [MG06, ASCC18].

Contributions It turns out that the evolution of the network prediction error can be neatly described in a matrix form. When the network is sufficiently over-parameterized, the matrices involved individually approximate an integral operator which is determined by the feature vector distribution $\rho$ only. Consequently, the GD method can be viewed as approximately applying the powers of this integral operator on the underlying/target function $f^*$ that generates the responses/labels.
The advantages of taking such a functional approximation perspective are three-fold:

• We show in Theorem 2 and Corollary 1 that the existing rate characterizations in the influential line of work [DZPS18, ADH+19, DLL+18] approach zero (i.e., $\to 0$) as $n \to \infty$. This is because the spectra of these matrices, as $n$ diverges, concentrate on the spectrum of the integral operator, whose only limit point of eigenvalues is zero.

• We show in Theorem 4 that the training convergence rate is determined by how $f^*$ can be decomposed into the eigenspaces of an integral operator. This observation is also validated by a couple of empirical observations: (1) the spectrum of the MNIST data concentrates on the first few eigenspaces [LBB+98]; and (2) the training is slowed down if labels are partially corrupted [ZBH+16, ADH+19].

• We show in Corollary 2 that if $f^*$ can be decomposed into a finite number of eigenspaces of the integral operator, then $m = \Theta(n \log n)$ is sufficient for the training error to converge to $\Theta(1/\sqrt{n})$ with a constant convergence rate. To the best of our knowledge, this is the first result showing the sufficiency of nearly-linear network over-parameterization.

Notations For any $n, m \in \mathbb{N}$, let $[n] := \{1, \cdots, n\}$ and $[m] := \{1, \cdots, m\}$. For any $d \in \mathbb{N}$, denote the unit sphere as $\mathbb{S}^{d-1} := \{x : x \in \mathbb{R}^d, \|x\| = 1\}$, where $\|\cdot\|$ is the standard $\ell_2$ norm when it is applied to a vector. We also use $\|\cdot\|$ for the spectral norm when it is applied to a matrix. The Frobenius norm of a matrix is denoted by $\|\cdot\|_F$. Let $L^2(\mathbb{S}^{d-1}, \rho)$ denote the space of functions with finite norm, where the inner product $\langle \cdot, \cdot \rangle_\rho$ and the norm $\|\cdot\|_\rho^2$ are defined as $\langle f, g \rangle_\rho := \int_{\mathbb{S}^{d-1}} f(x) g(x) \, d\rho(x)$ and $\|f\|_\rho^2 := \int_{\mathbb{S}^{d-1}} f^2(x) \, d\rho(x) < \infty$. We use standard Big-O notations, e.g., for any sequences $\{a_r\}$ and $\{b_r\}$, we say $a_r = O(b_r)$ or $a_r \lesssim$
br if there is an absolute constant c > 0 such that ar\nbr \uf8ff c,\nwe say ar = \u2326(br) or ar & br if br = O(ar) and we say ar = !(br) if limr!1 |ar/br| = 1.\n\n2 Problem Setup and Preliminaries\n\nStatistical learning We are given a training dataset {(xi, yi) : 1 \uf8ff i \uf8ff n} which consists of n\ntuples (xi, yi), where xi\u2019s are feature vectors that are identically and independently generated from a\ncommon but unknown distribution \u21e2 on Rd, and yi = f\u21e4(xi). We consider the problem of learning\nthe unknown function f\u21e4 with respect to the square loss. We refer to f\u21e4 as a target function. For\nsimplicity, we assume xi 2 S d1 and yi 2 [1, 1]. In this paper, we restrict ourselves to the family\nof \u21e2 that is absolutely continuous with respect to Lebesgue measure. We are interested in \ufb01nding\na neural network to approximate f\u21e4. In particular, we focus on two-layer fully-connected neural\n\n2\n\n\fnetworks with ReLU activation, i.e.,\n\nfW ,a(x) =\n\n1\npm\n\nmXj=1\n\naj [hx, wji]+ , 8 x 2 S d1,\n\n(1)\n\nwhere m is the number of hidden neurons and is assumed to be even, W = (w1,\u00b7\u00b7\u00b7 , wm) 2 Rd\u21e5m\nare the weight vectors in the \ufb01rst layer, a = (a1,\u00b7\u00b7\u00b7 , am) with aj 2 {1, 1} are the weights in the\nsecond layer, and [\u00b7]+ := max{\u00b7, 0} is the ReLU activation function.\nMany authors assume f\u21e4 is also a neural network [MMN18, AZLL18, SS96, LY17, Tia16]. Despite\nthis popularity, a target function f\u21e4 is not necessarily a neural network. One advantage of working\nwith f\u21e4 directly is, as can be seen later, certain properties of f\u21e4 are closely related to whether f\u21e4 can\nbe learned quickly by GD method or not. Throughout this paper, for simplicity, we do not consider\nthe scaling in d and treat d as a constant.\n\nEmpirical risk minimization via gradient descent For each k = 1,\u00b7\u00b7\u00b7 , m/2:\nInitialize\n2. 
Initialize $w_{2k-1} \sim \mathcal{N}(0, I)$, and $a_{2k-1} = 1$ with probability $\frac{1}{2}$, and $a_{2k-1} = -1$ with probability $\frac{1}{2}$; initialize $w_{2k} = w_{2k-1}$ and $a_{2k} = -a_{2k-1}$. All randomnesses in this initialization are independent, and are independent of the dataset. This initialization is chosen to guarantee zero output at initialization. Similar initialization is adopted in [CB18, Section 3] and [WGL+19].¹ We fix the second layer $a$ and optimize the first layer $W$ through GD on the empirical risk w.r.t. the square loss²:

$$L_n(W) := \frac{1}{2n} \sum_{i=1}^n \left(y_i - f_W(x_i)\right)^2. \quad (2)$$

For notational convenience, we drop the subscript $a$ in $f_{W,a}$. The weight matrix $W$ is updated as

$$W^{t+1} = W^t - \eta \, \frac{\partial L_n(W^t)}{\partial W^t}, \quad (3)$$

where $\eta > 0$ is the stepsize/learning rate, and $W^t$ is the weight matrix at the end of iteration $t$, with $W^0$ denoting the initial weight matrix. For ease of exposition, let

$$\hat{y}_i(t) := f_{W^t}(x_i) = \frac{1}{\sqrt{m}} \sum_{j=1}^m a_j \left[\langle w_j^t, x_i \rangle\right]_+, \quad \forall\, i = 1, \cdots, n. \quad (4)$$

Notably, $\hat{y}_i(0) = 0$ for $i = 1, \cdots, n$. It can be easily deduced from (3) that $w_j$ is updated as

$$w_j^{t+1} = w_j^t + \frac{\eta a_j}{n \sqrt{m}} \sum_{i=1}^n \left(y_i - \hat{y}_i(t)\right) x_i \mathbf{1}\{\langle w_j^t, x_i \rangle > 0\}. \quad (5)$$

Matrix representation Let $y \in \mathbb{R}^n$ be the vector that stacks the responses of $\{(x_i, y_i)\}_{i=1}^n$, and let $\hat{y}(t)$ be the vector that stacks $\hat{y}_i(t)$ for $i = 1, \cdots, n$ at iteration $t$. Additionally, let $A := \{j : a_j = 1\}$ and $B := \{j : a_j = -1\}$. The evolution of $(y - \hat{y}(t))$ can be neatly described in a matrix form. Define matrices $H^+, \widetilde{H}^+$ and $H^-, \widetilde{H}^-$ in $\mathbb{R}^{n \times n}$ as: for $t \ge 0$ and $i, i' \in [n]$,

$$H^+_{ii'}(t+1) = \frac{1}{nm} \langle x_i, x_{i'} \rangle \sum_{j \in A} \mathbf{1}\{\langle w_j^t, x_{i'} \rangle > 0\}\, \mathbf{1}\{\langle w_j^t, x_i \rangle > 0\}, \quad (6)$$

$$\widetilde{H}^+_{ii'}(t+1) = \frac{1}{nm} \langle x_i, x_{i'} \rangle \sum_{j \in A} \mathbf{1}\{\langle w_j^t, x_{i'} \rangle > 0\}\, \mathbf{1}\{\langle w_j^{t+1}, x_i \rangle > 0\}, \quad (7)$$

and $H^-_{ii'}(t+1), \widetilde{H}^-_{ii'}(t+1)$ are defined similarly by replacing the summation over all the hidden neurons in $A$ in (6) and (7) by the summation over $B$.
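Before turning to the dynamics of these matrices, the training procedure in (1)-(5) can be sketched in a few lines. The following is a minimal NumPy illustration of ours (the helper names and the toy linear target are our choices, not from the paper): paired initialization so the network outputs zero at $t = 0$, the prediction rule (4), and the per-neuron GD update (5) with the second layer held fixed.

```python
import numpy as np

def init_network(m, d, rng):
    """Paired (antisymmetric) initialization: w_{2k} = w_{2k-1}, a_{2k} = -a_{2k-1},
    which guarantees the network outputs exactly zero at iteration t = 0."""
    half = rng.standard_normal((d, m // 2))
    W = np.concatenate([half, half], axis=1)
    a_half = rng.choice([-1.0, 1.0], size=m // 2)
    a = np.concatenate([a_half, -a_half])
    return W, a

def predict(W, a, X):
    """f_{W,a}(x) = (1/sqrt(m)) sum_j a_j [<x, w_j>]_+ , eq. (1)/(4); X is (n, d)."""
    m = W.shape[1]
    return np.maximum(X @ W, 0.0) @ a / np.sqrt(m)

def gd_step(W, a, X, y, eta):
    """One GD step on L_n(W) = (1/2n) sum_i (y_i - f_W(x_i))^2, second layer fixed."""
    n, m = X.shape[0], W.shape[1]
    active = (X @ W > 0.0)            # indicator 1{<w_j^t, x_i> > 0}, shape (n, m)
    resid = y - predict(W, a, X)      # y_i - yhat_i(t)
    # Update (5): w_j^{t+1} = w_j^t + (eta a_j / (n sqrt(m))) sum_i resid_i x_i 1{...}
    return W + (eta / (n * np.sqrt(m))) * (X.T @ (resid[:, None] * active)) * a

rng = np.random.default_rng(0)
n, d, m = 50, 5, 200
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # features on the unit sphere
y = X[:, 0]                                     # toy linear target (our choice)
W, a = init_network(m, d, rng)
assert np.allclose(predict(W, a, X), 0.0)       # zero output at initialization
for t in range(200):
    W = gd_step(W, a, X, y, eta=1.0)
```

On this toy instance the residual norm $\|y - \hat{y}(t)\|$ shrinks across iterations, consistent with the PSD dynamics described next.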
It is easy to see that both $H^+$ and $H^-$ are positive semi-definite. The only difference between $H^+_{ii'}(t+1)$ (or $H^-_{ii'}(t+1)$) and $\widetilde{H}^+_{ii'}(t+1)$ (or $\widetilde{H}^-_{ii'}(t+1)$) is that $\mathbf{1}\{\langle w_j^t, x_i \rangle > 0\}$ is used in the former, whereas $\mathbf{1}\{\langle w_j^{t+1}, x_i \rangle > 0\}$ is adopted in the latter. When a neural network is sufficiently over-parameterized (in particular, $m = \Omega(\mathrm{poly}(n))$), the sign changes of the hidden neurons are sparse; see [AZLL18, Lemma 5.4] and [ADH+19, Lemma C.2] for details. The sparsity in sign changes suggests that both $\widetilde{H}^+(t) \approx H^+(t)$ and $\widetilde{H}^-(t) \approx H^-(t)$ are approximately PSD.

Theorem 1. For any iteration $t \ge 0$ and any stepsize $\eta > 0$, it is true that
$$\left(I - \eta\left(\widetilde{H}^+(t+1) + H^-(t+1)\right)\right)(y - \hat{y}(t)) \le (y - \hat{y}(t+1)) \le \left(I - \eta\left(H^+(t+1) + \widetilde{H}^-(t+1)\right)\right)(y - \hat{y}(t)),$$
where the inequalities are entry-wise.

Theorem 1 says that when the sign changes are sparse, the dynamics of $(y - \hat{y}(t))$ are governed by a sequence of PSD matrices. A similar observation is made in [DZPS18, ADH+19].

¹Our analysis might be adapted to other initialization schemes, such as He initialization, with $m = \Omega(n^2)$. Nevertheless, the more stringent requirement on $m$ might only be an artifact of our analysis.

²The simplifying assumption that the second layer is fixed is also adopted in [DZPS18, ADH+19]. A similar frozen assumption is adopted in [ZCZG18, AZLS18]. We do agree this assumption might restrict the applicability of our results. Nevertheless, even this setting is not well understood despite the recent intensive efforts.

3 Main Results

We first show (in Section 3.1) that the existing convergence rates that are derived based on minimum eigenvalues approach 0 as the sample size $n$ grows.
Then, towards a non-diminishing convergence\nrate, we characterize (in Section 3.2) how the target function f\u21e4 affects the convergence rate.\n\n3.1 Convergence rates based on minimum eigenvalues\nLet H := H +(1) + H(1). It has been shown in [DZPS18] that when the neural networks are\n\nconvergence rates with high probability can be upper bounded as 3\n\nky by(t)k \uf8ff (1  \u2318min(H))t ky by(0)k = exp\u2713t log\n\nsuf\ufb01ciently over-parameterized m = \u2326(n6), the convergence of ky by(t)k and the associated\nwhere min(H) is the smallest eigenvalue of H. Equality (8) holds because ofby(0) = 0. In this\npaper, we refer to log\n1\u2318min(H) as convergence rate. The convergence rate here is quite appealing\nat \ufb01rst glance as it is independent of the target function f\u21e4. Essentially (8) says that no matter how\nthe training data is generated, via GD, we can always \ufb01nd an over-parameterized neural network that\nperfectly \ufb01ts/memorizes all the training data tuples exponentially fast! Though the spectrum of the\nrandom matrix H can be proved to concentrate as n grows, we observe that min(H) converges to 0\nas n diverges, formally shown in Theorem 2.\nTheorem 2. For any data distribution \u21e2, there exists a sequence of non-negative real numbers\n1  2  . . . (independent of n) satisfying limi!1 i = 0 such that, with probability 1  ,\n\n1  \u2318min(H)\u25c6kyk ,\n\n(8)\n\n1\n\n1\n\nwheree1  \u00b7\u00b7\u00b7 en are the spectrum of H. In addition, if m = !(log n), we have\nwhere P! denotes convergence in probability.\nA numerical illustration of the decay of min(H) in n is presented in Fig. 1a. Theorem 2 is proved in\nAppendix D. By Theorem 2, the convergence rate in (8) approaches zero as n ! 1.\nCorollary 1. For any \u2318 = O(1), it is true that log\n\nmin(H) P! 0,\n\nas n ! 1,\n\n(10)\n\n1\n\n1\u2318min(H) ! 0 as n ! 
1.\n\n3 Though a re\ufb01ned analysis of that in [DZPS18] is given by [ADH+19, Theorem 4.1], the analysis crucially\n\nrelies on the convergence rate in (8).\n\n4\n\nsup\n\ni\n\n|i ei| \uf8ffr log(4n2/)\n\nm\n\n+r 8 log(4/)\n\nn\n\n.\n\n(9)\n\n\f(a) The minimum eigenvalues of one realization of H\nunder different n and d, with network width m = 2n.\n\n(b) The spectrum of K with d = 10, n = 500 concen-\ntrates around that of LK.\n\nFigure 1: The spectra of H, K, and LK when \u21e2 is the uniform distribution over S d1.\n\nIn Corollary 1, we restrict our attention to \u2318 = O(1). This is because the general analysis of GD\n[Nes18] adopted by [ADH+19, DZPS18] requires that (1  \u2318max(H)) > 0, and by the spectrum\nconcentration given in Theorem 2, the largest eigenvalue of H concentrates on some strictly positive\nvalue as n diverges, i.e., max(H) = \u21e5(1). Thus, if \u2318 = !(1), then (1  \u2318max(H)) < 0 for any\nsuf\ufb01ciently large n, violating the condition assumed in [ADH+19, DZPS18].\nTheorem 2 essentially follows from two observations. Let K = E [H], where the expectation is\ntaken with respect to the randomness in the network initialization. It is easy to see that by standard\nconcentration argument, for a given dataset, the spectrum of K and H are close with high probability.\nIn addition, the spectrum of K, as n increases, concentrates on the spectrum of the following integral\noperator LK on L2(S d1, \u21e2),\n\n(LKf )(x) :=ZSd1 K(x, s)f (s)d\u21e2,\n\n(11)\n\nwith the kernel function:\n\nK(x, s) := hx, si\n\n2\u21e1\n\n(\u21e1  arccoshx, si) 8 x, s 2 S d1,\n\n(12)\nwhich is bounded over S d1 \u21e5 S d1. In fact, 1  2  \u00b7\u00b7\u00b7 in Theorem 2 are the eigenvalues\nof LK. As supx,s2Sd1 K(x, s) \uf8ff 1\n2, it is true that i \uf8ff 1 for all i  1. Notably, by de\ufb01nition,\nnK(xi, xi0) is the empirical kernel matrix on the feature vectors of the given\nKii0 = E [Hii0] = 1\ndataset {(xi, yi) : 1 \uf8ff i \uf8ff n}. 
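The decay of $\lambda_{\min}(H)$ described above can be reproduced numerically through its expectation $K$. The following sketch (ours; the helper names are our choices) evaluates the kernel function (12) on uniformly sampled spherical features and tracks the smallest eigenvalue of the normalized empirical kernel matrix $K_{ii'} = \frac{1}{n}K(x_i, x_{i'})$ as $n$ grows.

```python
import numpy as np

def kernel_matrix(X):
    """Kernel of eq. (12): K(x, s) = <x, s> (pi - arccos<x, s>) / (2 pi),
    evaluated pairwise; rows of X lie on the unit sphere."""
    G = np.clip(X @ X.T, -1.0, 1.0)   # Gram matrix of inner products
    return G * (np.pi - np.arccos(G)) / (2.0 * np.pi)

def min_eigenvalue(n, d, rng):
    """lambda_min of K = E[H], where K_{ii'} = K(x_i, x_{i'}) / n."""
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    return float(np.linalg.eigvalsh(kernel_matrix(X) / n).min())

rng = np.random.default_rng(1)
eigs = {n: min_eigenvalue(n, d=5, rng=rng) for n in (50, 100, 200, 400)}
# lambda_min is strictly positive for each fixed n but shrinks as n grows,
# mirroring Fig. 1a and the convergence to 0 asserted in Theorem 2.
assert all(v > 0 for v in eigs.values())
assert eigs[400] < eigs[50]
```

Since the trace of $K$ is at most $\frac{1}{2}$ regardless of $n$, the $n$ eigenvalues must crowd toward zero as $n$ grows, which is the mechanism behind Theorem 2.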
A numerical illustration of the spectrum concentration of $K$ is given in Fig. 1b; see also [XLS17]. Though a generalization bound is given in [ADH+19, Theorem 5.1 and Corollary 5.2], it is unclear how this bound scales in $n$. In fact, if we do not care about the structure of the target function $f^*$ and allow $y$ to be arbitrary, this generalization bound might not decrease to zero as $n \to \infty$. A detailed argument and a numerical illustration can be found in Appendix B.

3.2 Constant convergence rates

Recall that $f^*$ denotes the underlying function that generates output labels/responses (i.e., $y$'s) given input features (i.e., $x$'s). For example, $f^*$ could be a constant function or a linear function. Clearly, the difficulty in learning $f^*$ via training neural networks should crucially depend on the properties of $f^*$ itself. We observe that the training convergence rate might be determined by how $f^*$ can be decomposed into the eigenspaces of the integral operator defined in (11). This observation is also validated by a couple of existing empirical observations: (1) the spectrum of the MNIST data [LBB+98] concentrates on the first few eigenspaces; and (2) the training is slowed down if labels are partially corrupted [ZBH+16, ADH+19].
Compared with [ADH+19], we use spectral projection concentration to show how the random eigenvalues and the random projections in [ADH+19, Eq. (8) in Theorem 4.1] are controlled by $f^*$ and $\rho$.

We first present a sufficient condition for the convergence of $\|y - \hat{y}(t)\|$.
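The quantity $\frac{1}{\sqrt{n}}\|(I - \eta K)^t y\|$ appearing in the condition (13) of Theorem 3 is cheap to evaluate numerically, since it only involves the empirical kernel matrix and not the trained network. A short sketch of ours (toy linear target, our choice of parameters):

```python
import numpy as np

def kernel_matrix(X):
    """K(x, s) = <x, s> (pi - arccos<x, s>) / (2 pi), eq. (12), evaluated pairwise."""
    G = np.clip(X @ X.T, -1.0, 1.0)
    return G * (np.pi - np.arccos(G)) / (2.0 * np.pi)

rng = np.random.default_rng(2)
n, d, eta = 200, 5, 1.0
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = X[:, 0]                       # toy linear target (our choice)
K = kernel_matrix(X) / n          # K_{ii'} = K(x_i, x_{i'}) / n

# Track ||(I - eta K)^t y|| / sqrt(n): if this decays like (1 - eta c0)^t + c1,
# Theorem 3 transfers the same rate (up to constants) to the GD residual.
r = y.copy()
norms = []
for t in range(100):
    norms.append(np.linalg.norm(r) / np.sqrt(n))
    r = r - eta * (K @ r)
assert norms[-1] < norms[0]
```

For this linear target, which is well aligned with the top eigenspaces of $K$, the sequence decays rapidly, illustrating why the condition (13) is natural to verify.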
Both the proof of Theorem 3 and the proofs\nin [DZPS18, ADH+19, AZLL18] are based on the observation that when the network is suf\ufb01ciently\nover-parameterized, the sign changes (activation pattern changes) of the hidden neurons are sparse.\nDifferent from [DZPS18, ADH+19], our proof does not use min(K); see Appendix E for details.\nIt remains to show, with high probability, (13) in Theorem 3 holds with properly chosen c0 and c1.\nBy the spectral theorem [DS63, Theorem 4, Chapter X.3] and [RBV10], LK has a spectrum with\ndistinct eigenvalues \u00b51 > \u00b52 > \u00b7\u00b7\u00b7 4 such that\n\n1\u2318c0\n\n12\n\nn\n\n1\n\n1\n\nLK =Xi1\n\n\u00b5iP\u00b5i, with P\u00b5i :=\n\n(I  LK)1d,\n\n1\n\n2\u21e1iZ\u00b5i\n\nwhere P\u00b5i : L2(S d1, \u21e2) ! L2(S d1, \u21e2) is the orthogonal projection operator onto the eigenspace\nassociated with eigenvalue \u00b5i; here (1) i is the imaginary unit, and (2) the integral can be taken over\nany closed simple recti\ufb01able curve (with positive direction) \u00b5i containing \u00b5i only and no other\ndistinct eigenvalue. In other words, P\u00b5if is the function obtained by projecting function f onto the\neigenspaces of the integral operator LK associated with \u00b5i.\nGiven an ` 2 N, let m` be the sum of the multiplicities of the \ufb01rst ` nonzero top eigenvalues of LK.\nThat is, m1 is the multiplicity of \u00b51 and (m2  m1) is the multiplicity of \u00b52. By de\ufb01nition,\n\nm` = \u00b5` 6= \u00b5`+1 = m`+1, 8 `.\n\nTheorem 4. For any `  1 such that \u00b5i > 0, for i \uf8ff `, let\nf\u21e4(x)  ( X1\uf8ffi\uf8ff`\n\n\u270f(f\u21e4, `) := sup\n\nx2Sd1\n\nP\u00b5if\u21e4)(x)\n\nbe the approximation error of the span of the eigenspaces associated with the \ufb01rst ` dis-\n(m`m`+1)2 and\ntinct eigenvalues.\n4 The sequence of distinct eigenvalues can possibly be of \ufb01nite length. 
In addition, the sequences of \u00b5i\u2019s and\n\nThen given  2 (0, 1\n\n4 ) and T > 0,\n\nif n >\n\n256 log 2\n\n\ni\u2019s (in Theorem 2) are different, the latter of which consists of repetitions.\n\n6\n\n\fc0\n\nc0\n\nc2\n\n+ 4 log 4n\n\n+ 2\u2318T c1\u23182\u25c6 with c0 = 3\n \u21e3 1\n+ 2\u2318T c1\u23184\nm  32\nwith probability  (1  3), for all t \uf8ff T :\n16p2qlog 2\n(y by(t)) \uf8ff\u27131 \n\u2318m`\u25c6t\n3\n(m`  m`+1)pn\n4\n\n1\u2713\u21e3 1\n\n\n1\npn\n\n+\n\n\n\n4 ` and c1 = \u270f(f\u21e4, `), then\n\n+ 2p2\u270f(f\u21e4, `).\n\n1\n\n/ log\n\n1\n\n/ log\n\nc1\n\n/ log\n\n1\u2318c0\n\n1\u2318c0\n\n1\u2318c0\n\nSince m` is determined by f\u21e4 and \u21e2 only, with \u2318 = 1, the convergence rate log\nw. r. t. n.\nRemark 1 (Early stopping). In Theorems 3 and 4, the derived lower bounds of m grow in\nT . To control m, we need to terminate the GD training at some \u201creasonable\u201d T . Fortunately,\nT is typically small. To see this, note that \u2318, c0, and c1 are independent of t. By (13) and\n) iterations provided that\n\n(15) we know 1pn (y by(t)) decreases to \u21e5(c1) in (log 1\n\n1\n1 3\n4 m`\n\nis constant\n\n). Similar to us, early stopping is adopted in\n\nlog n\n log(1 3\n(m`m`+1)2n24\n\n) \uf8ff T . Thus, to guarantee  1pn (y by(t)) = O(c1), it is enough to ter-\n\n(log 1\nc1\nminate GD at iteration T = \u21e5(log 1\nc1\n[AZLL18, LSO19], and is commonly adopted in practice.\nCorollary 2 (zero\u2013approximation error). Suppose there exists ` such that \u00b5i > 0, for i \uf8ff `, and\n4 m` ). For a given  2 (0, 1\n\u270f(f\u21e4, `) = 0. 
Then let \u2318 = 1 and T =\n(m`m`+1)2\nm`\u2318, then with probability  (1  3), for all t \uf8ff T :\nand m & (n log n)\u21e3 1\n16p2 log 2/\n\npn (m`  m`+1)\n\nCorollary 2 says that for \ufb01xed f\u21e4 and \ufb01xed distribution \u21e2, nearly-linear network over-parameterization\nm = \u21e5(n log n) is enough for GD method to converge exponentially fast as long as 1\n = O(poly(n)).\nCorollary 2 follow immediately from Theorem 4 by specifying the relevant parameters such as \u2318 and\nT . To the best of our knowledge, this is the \ufb01rst result showing suf\ufb01ciency of nearly-linear network\nover-parameterization. Note that (m`  m`+1) > 0 is the eigengap between the `\u2013th and (` + 1)\u2013th\nlargest distinct eigenvalues of the integral operator, and is irrelevant to n. Thus, for \ufb01xed f\u21e4 and \u21e2,\nc1 = \u21e5\u21e3qlog 1\n\n(y by(t)) \uf8ff (1 \n\n /n\u2318.\n\n4 ), if n >\n\nlog4 n log2 1\n\n\n1\npn\n\n256 log 2\n\n\n3m`\n\n)t +\n\n+\n\n4\n\nm`\n\n1\n\n4\n\n.\n\n4 Application to Uniform Distribution and Polynomials\n\nWe illustrate our general results by applying them to the setting where the target functions are\npolynomials and the feature vectors are uniformly distributed on the sphere S d1.\nUp to now, we implicitly incorporate the bias bj in wj by augmenting the original wj; correspondingly,\nthe data feature vector is also augmented. In this section, as we are dealing with distribution on the\noriginal feature vector, we explicitly separate out the bias from wj. In particular, let b0\nj \u21e0 N (0, 1).\nFor ease of exposition, with a little abuse of notation, we use d to denote the dimension of the wj\nand x before the above mentioned augmentation. 
With bias, (1) can be rewritten as fW ,b(x) =\nj=1 aj [hx, wji + bj]+ ,where b = (b1,\u00b7\u00b7\u00b7 , bm) are the bias of the hidden neurons, and the\n\n1pmPm\n\nkernel function in (12) becomes\nK(x, s) = hx, si + 1\n\n2\u21e1\n\n\u2713\u21e1  arccos\u2713 1\n\n2\n\n(hx, si + 1)\u25c6\u25c6 8 x, s 2 S d1.\n\n(16)\n\nFrom Theorem 4 we know the convergence rate is determined by the eigendecomposition of the\ntarget function f\u21e4 w. r. t. the eigenspaces of LK. When \u21e2 is the uniform distribution on S d1, the\neigenspaces of LK are the spaces of homogeneous harmonic polynomials, denoted by H` for `  0.\nSpeci\ufb01cally, LK = P`0 `P`, where P` (for `  0) is the orthogonal projector onto H` and\n` = \u21b5`\n> 0 is the associated eigenvalue \u2013 \u21b5` is the coef\ufb01cient of K(x, s) in the expansion into\n\nd2\n`+ d2\n\n2\n\n2\n\n7\n\n\f(a) Plot of ` with ` under different d. Here, the ` is\nmonotonically decreasing in `.\n\n(b) Training with f\u21e4 being randomly generated linear\nor quadratic functions with n = 1000, m = 2000.\n\nFigure 2: Application to uniform distribution and polynomials.\n\nGegenbauer polynomials. Note that H` and H`0 are orthogonal when ` 6= `0. See appendix G for\nrelevant backgrounds on harmonic analysis on spheres.\nExplicit expression of eigenvalues ` > 0 is available; see Fig. 2a for an illustration of `. In\nfact, there is a line of work on ef\ufb01cient computation of the coef\ufb01cients of Gegenbauer polynomials\nexpansion [CI12].\nIf the target function f\u21e4 is a standard polynomial of degree `\u21e4, by [Wan, Theorem 7.4], we know f\u21e4\ncan be perfectly projected onto the direct sum of the spaces of homogeneous harmonic polynomials\nup to degree `\u21e4. The following corollary follows immediately from Corollary 2.\nCorollary 3. Suppose f\u21e4 is a degree `\u21e4 polynomial, and the feature vector xi\u2019s are i.i.d. generated\nfrom the uniform distribution over S d1. Let \u2318 = 1, and T = \u21e5(log n). 
For a given  2 (0, 1\n4 ), if\nn = \u21e5log 1\n ), then with probability at least 1  , for all t \uf8ff T :\n+ \u21e5(r log 1/\n\n\n and m = \u21e5(n log n log2 1\n(y by(t)) \uf8ff\u27131 \n4 \u25c6t\n\n), where c0 = min{`\u21e4, `\u21e4+1} .\n\n1\npn\n\n3c0\n\nn\n\nFor ease of exposition, in the above corollary, \u21e5(\u00b7) hides dependence on quantities such as eigengaps\n\u2013 as they do not depend on n, m, and . Corollary 3 and ` in Fig. 2a together suggest that the\nconvergence rate decays with both the dimension d and the polynomial degree `. This is validated in\nFig. 2a. It might be unfair to compare the absolute values of training errors since f\u21e4 are different.\nNevertheless, the convergence rates can be read from slope in logarithmic scale. We see that the\nconvergence slows down as d increases, and learning a quadratic function is slower than learning a\nlinear function.\nNext we present the explicit expression of `. For ease of exposition, let h(u) := K(x, s) where\nu = hx, si. By [CI12, Eq. (2.1) and Theorem 2], we know\n2`+2kk! d2\n\nwhere h` := h(`)(0) is the `\u2013th order derivative of h at zero, and the Pochhammer symbol (a)k is\nde\ufb01ned recursively as (a)0 = 1, (a)k = (a + k  1)(a)k1 for k 2 N. By a simple induction, it can\nbe shown that h0 = h(0)(0) = 1/3, and for k  1,\n\n2 `+k+1\n\n1Xk=0\n\nd  2\n2\n\nh`+2k\n\n` =\n\n,\n\n(17)\n\n(18)\n\nhk =\n\n1\n2\n\n1{k=1} \n\n1\n\n\u21e12k\u21e3k (arccos 0.5)(k1) + 0.5 (arccos 0.5)(k)\u2318 ,\n\nwhere the computation of the higher-order derivative of arccos is standard. It follows from (17) and\n(18) that ` > 0, and 2` > 2(`+1) and 2`+1 > 2`+3 for all `  0. However, an analytic order\namong ` is unclear, and we would like to explore this in the future.\n\n8\n\n246810degree`109107105103101`d=5d=10d=15d=20\fReferences\n[ADH+19] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. 
Fine-grained\nanalysis of optimization and generalization for overparameterized two-layer neural\nnetworks. arXiv:1901.08584, 2019.\n\n[ASCC18] Vivek R Athalye, Fernando J Santos, Jose M Carmena, and Rui M Costa. Evidence for\n\na neural law of effect. Science, 359(6379):1024\u20131029, 2018.\n\n[AZLL18] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in\noverparameterized neural networks, going beyond two layers. arXiv:1811.04918, 2018.\n\n[AZLS18] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning\n\nvia over-parameterization. arXiv preprint arXiv:1811.03962, 2018.\n\n[BG17] Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet\nwith gaussian inputs. In Proceedings of the 34th International Conference on Machine\nLearning-Volume 70, pages 605\u2013614. JMLR. org, 2017.\n\n[BR89] Avrim Blum and Ronald L Rivest. Training a 3-node neural network is np-complete.\n\nIn Advances in neural information processing systems, pages 494\u2013501, 1989.\n\n[CB18] Lenaic Chizat and Francis Bach. A note on lazy training in supervised differentiable\n\nprogramming. arXiv preprint arXiv:1812.07956, 2018.\n\n[CG19] Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for\n\nwide and deep neural networks. arXiv preprint arXiv:1905.13210, 2019.\n\n[CI12] Mar\u00eda Jos\u00e9 Cantero and Arieh Iserles. On rapid computation of expansions in ultras-\n\npherical polynomials. SIAM Journal on Numerical Analysis, 50(1):307\u2013327, 2012.\n\n[DDS+09] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A\nlarge-scale hierarchical image database. In 2009 IEEE conference on computer vision\nand pattern recognition, pages 248\u2013255. Ieee, 2009.\n\n[DLL+18] Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent\n\n\ufb01nds global minima of deep neural networks. 
arXiv:1811.03804, 2018.

[DS63] Nelson Dunford and Jacob T Schwartz. Linear Operators, Part II: Spectral Theory: Self Adjoint Operators in Hilbert Space. Interscience Publishers, 1963.

[DX13] Feng Dai and Yuan Xu. Approximation Theory and Harmonic Analysis on Spheres and Balls. Springer, 2013.

[DZPS18] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv:1810.02054, 2018.

[GMMM19] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Linearized two-layers neural networks in high dimension. arXiv:1904.12191, 2019.

[JGH18] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8571–8580, 2018.

[KB18] Jason M Klusowski and Andrew R Barron. Approximation by combinations of ReLU and squared ReLU ridge functions with ℓ1 and ℓ0 controls. 2018.

[KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[LBB+98] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[LL18] Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8157–8166, 2018.

[LSO19] Mingchen Li, Mahdi Soltanolkotabi, and Samet Oymak. Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. arXiv:1903.11680, 2019.

[LY17] Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with ReLU activation.
In Advances in Neural Information Processing Systems, pages 597–607, 2017.

[MG06] Eve Marder and Jean-Marc Goaillard. Variability, compensation and homeostasis in neuron and network function. Nature Reviews Neuroscience, 7(7):563, 2006.

[MMN18] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layers neural networks. arXiv:1804.06561, 2018.

[Nes18] Yurii Nesterov. Lectures on Convex Optimization, volume 137. Springer, 2018.

[OS19] Samet Oymak and Mahdi Soltanolkotabi. Towards moderate overparameterization: global convergence guarantees for training shallow neural networks. arXiv:1902.04674, 2019.

[RBV10] Lorenzo Rosasco, Mikhail Belkin, and Ernesto De Vito. On learning with integral operators. Journal of Machine Learning Research, 11(Feb):905–934, 2010.

[SS96] David Saad and Sara A Solla. Dynamics of on-line gradient descent learning for multilayer neural networks. In Advances in Neural Information Processing Systems, pages 302–308, 1996.

[Sze75] G. Szegő. Orthogonal Polynomials. American Mathematical Society, Providence, RI, 4th edition, 1975.

[Tia16] Yuandong Tian. Symmetry-breaking convergence analysis of certain two-layered neural networks with ReLU nonlinearity. 2016.

[VW18] Santosh Vempala and John Wilmes. Gradient descent for one-hidden-layer neural networks: Polynomial convergence and SQ lower bounds. arXiv:1805.02677, 2018.

[Wan] Yi Wang. Harmonic analysis and isoperimetric inequalities. Lecture notes.

[WGL+19] Blake Woodworth, Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Kernel and deep regimes in overparametrized models. arXiv:1906.05827, 2019.

[XLS17] Bo Xie, Yingyu Liang, and Le Song. Diverse neural network learns true target functions. In Artificial Intelligence and Statistics, pages 1216–1224, 2017.

[YS19] Gilad Yehudai and Ohad Shamir.
On the power and limitations of random features for understanding neural networks. arXiv:1904.00687, 2019.

[ZBH+16] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv:1611.03530, 2016.

[ZCZG18] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv:1811.08888, 2018.

[ZSJ+17] Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon. Recovery guarantees for one-hidden-layer neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 4140–4149. JMLR.org, 2017.
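The eigenvalue series (17) and the Pochhammer recursion it relies on can be evaluated numerically by truncating the sum. A minimal sketch in Python: the names `pochhammer` and `lambda_ell`, the truncation level `terms`, and the callable `h` supplying the derivative sequence $h_k$ are all illustrative assumptions, not part of the paper; in particular, the toy sequences passed to `h` below are placeholders rather than the true $h_k$ from (18).

```python
def pochhammer(a, k):
    """Pochhammer symbol (a)_k, per the recursion in the text:
    (a)_0 = 1 and (a)_k = (a + k - 1)(a)_{k-1}, i.e. a(a+1)...(a+k-1)."""
    result = 1.0
    for j in range(k):
        result *= a + j
    return result

def lambda_ell(ell, d, h, terms=50):
    """Truncated version of the series (17):
    lambda_ell = (d-2)/2 * sum_k h_{ell+2k} / (2^{ell+2k} k! ((d-2)/2)_{ell+k+1}).
    `h` is a callable returning h_k for a given k (hypothetical interface)."""
    total = 0.0
    fact = 1.0  # running k!
    for k in range(terms):
        if k > 0:
            fact *= k
        denom = 2.0 ** (ell + 2 * k) * fact * pochhammer((d - 2) / 2, ell + k + 1)
        total += h(ell + 2 * k) / denom
    return (d - 2) / 2 * total
```

With a positive placeholder sequence such as `h = lambda k: 1.0`, the computed values decrease in $\ell$, consistent with the fast decay of $\lambda_\ell$ visible in Fig. 2a.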