{"title": "Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 10836, "page_last": 10846, "abstract": "We study the training and generalization of deep neural networks (DNNs) in the over-parameterized regime, where the network width (i.e., number of hidden nodes per layer) is much larger than the number of training data points. We show that the expected $0$-$1$ loss of a wide enough ReLU network trained with stochastic gradient descent (SGD) and random initialization can be bounded by the training loss of a random feature model induced by the network gradient at initialization, which we call a \\textit{neural tangent random feature} (NTRF) model. For data distributions that can be classified by the NTRF model with sufficiently small error, our result yields a generalization error bound in the order of $\\tilde{\\mathcal{O}}(n^{-1/2})$ that is independent of the network width. Our result is more general and sharper than many existing generalization error bounds for over-parameterized neural networks. In addition, we establish a strong connection between our generalization error bound and the neural tangent kernel (NTK) proposed in recent work.", "full_text": "Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks

Yuan Cao
Department of Computer Science
University of California, Los Angeles
CA 90095, USA
yuancao@cs.ucla.edu

Quanquan Gu
Department of Computer Science
University of California, Los Angeles
CA 90095, USA
qgu@cs.ucla.edu

Abstract

We study the training and generalization of deep neural networks (DNNs) in the over-parameterized regime, where the network width (i.e., number of hidden nodes per layer) is much larger than the number of training data points.
We show that the expected 0-1 loss of a wide enough ReLU network trained with stochastic gradient descent (SGD) and random initialization can be bounded by the training loss of a random feature model induced by the network gradient at initialization, which we call a neural tangent random feature (NTRF) model. For data distributions that can be classified by the NTRF model with sufficiently small error, our result yields a generalization error bound of order Õ(n^{-1/2}) that is independent of the network width. Our result is more general and sharper than many existing generalization error bounds for over-parameterized neural networks. In addition, we establish a strong connection between our generalization error bound and the neural tangent kernel (NTK) proposed in recent work.

1 Introduction

Deep learning has achieved great success in a wide range of applications including image processing [20], natural language processing [17] and reinforcement learning [34]. Most of the deep neural networks used in practice are highly over-parameterized, such that the number of parameters is much larger than the number of training data. One of the mysteries in deep learning is that, even in an over-parameterized regime, neural networks trained with stochastic gradient descent can still give small test error and do not overfit. In fact, a famous empirical study by Zhang et al. [38] shows the following phenomena:
• Even if one replaces the real labels of a training data set with purely random labels, an over-parameterized neural network can still fit the training data perfectly.
However, since the labels are independent of the input, the resulting neural network does not generalize to the test dataset.
• If the same over-parameterized network is trained with real labels, it not only achieves small training loss, but also generalizes well to the test dataset.

While a series of recent work has theoretically shown that a sufficiently over-parameterized (i.e., sufficiently wide) neural network can fit random labels [12, 2, 11, 39], the reason why it can generalize well when trained with real labels is less understood. Existing generalization bounds for deep neural networks [29, 6, 27, 15, 13, 5, 24, 35, 28] based on uniform convergence usually cannot provide non-vacuous bounds [21, 13] in the over-parameterized regime. In fact, the empirical observation by Zhang et al. [38] indicates that in order to understand deep learning, it is important to distinguish the true data labels from random labels when studying generalization. In other words, it is essential to quantify the "classifiability" of the underlying data distribution, i.e., how difficult it is to classify.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Certain effort has been made to take the "classifiability" of the data distribution into account in the generalization analysis of neural networks. Brutzkus et al. [7] showed that stochastic gradient descent (SGD) can learn an over-parameterized two-layer neural network with good generalization for linearly separable data. Li and Liang [25] proved that, if the data satisfy certain structural assumptions, SGD can learn an over-parameterized two-layer network with fixed second layer weights and achieve a small generalization error. Allen-Zhu et al.
[1] studied the generalization performance of SGD and its variants for learning two-layer and three-layer networks, and used the risk of smaller two-layer or three-layer networks with smooth activation functions to characterize the classifiability of the data distribution. There is another line of studies on algorithm-dependent generalization bounds of neural networks in the over-parameterized regime [10, 4, 8, 37, 14], which quantify the classifiability of the data with a reference function class defined by random features [31, 32] or kernels¹. Specifically, Daniely [10] showed that a neural network of large enough size is competitive with the best function in the conjugate kernel class of the network. Arora et al. [4] gave a generalization error bound for two-layer ReLU networks with fixed second layer weights based on a ReLU kernel function. Cao and Gu [8] showed that deep ReLU networks trained with gradient descent can achieve a small generalization error if the data can be separated by a certain random feature model [32] with a margin. Yehudai and Shamir [37] used the expected loss of a similar random feature model to quantify the generalization error of two-layer neural networks with smooth activation functions. A similar generalization error bound was also given by E et al. [14], where the authors studied the optimization and generalization of two-layer networks trained with gradient descent. However, all the aforementioned results are still far from satisfactory: they are either limited to two-layer networks, or restricted to very simple and special reference function classes.

In this paper, we aim at providing a sharper and generic analysis of the generalization of deep ReLU networks trained by SGD. In detail, we base our analysis upon the key observations that near random initialization, the neural network function is almost a linear function of its parameters and the loss function is locally almost convex.
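This near-linearity can be checked numerically. The sketch below compares a two-layer ReLU network at a perturbed hidden weight matrix against its first-order Taylor expansion at initialization (a minimal illustration; the width, input dimension, and perturbation sizes are arbitrary choices for the demonstration, not the constants appearing in the analysis, and only the hidden-layer weights are perturbed):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 10, 2048                                   # illustrative input dimension and width
x = rng.normal(size=d)
x /= np.linalg.norm(x)                            # unit-norm input
W1 = rng.normal(0.0, np.sqrt(2.0 / m), (m, d))    # hidden layer at random initialization
w2 = rng.normal(0.0, np.sqrt(1.0 / m), m)         # output layer, held fixed here

def f(W):
    """Two-layer ReLU network f(x) = sqrt(m) * w2^T relu(W x), viewed as a function of W."""
    return float(np.sqrt(m) * (w2 @ np.maximum(0.0, W @ x)))

def grad_f(W):
    """Gradient of f w.r.t. W: sqrt(m) * (w2 * 1{Wx > 0}) x^T."""
    active = (W @ x > 0).astype(float)
    return np.sqrt(m) * np.outer(w2 * active, x)

errs = {}
for omega in (0.1, 0.01):
    Delta = rng.normal(size=(m, d))
    Delta *= omega / np.linalg.norm(Delta)        # perturbation with Frobenius norm omega
    first_order = float(np.sum(grad_f(W1) * Delta))
    errs[omega] = abs(f(W1 + Delta) - f(W1) - first_order)
    print(omega, errs[omega])                     # linearization error, tiny for small omega
```

At large width, only the few hidden units whose preactivation changes sign contribute to the linearization error, which is why the error shrinks much faster than the perturbation radius.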
This enables us to prove a cumulative loss bound of SGD, which further leads to a generalization bound by online-to-batch conversion [9]. The main contributions of our work are summarized as follows:
• We give a bound on the expected 0-1 error of deep ReLU networks trained by SGD with random initialization. Our result relates the generalization bound of an over-parameterized ReLU network to a random feature model defined by the network gradients, which we call the neural tangent random feature (NTRF) model. It also suggests an algorithm-dependent generalization error bound of order Õ(n^{-1/2}), which is independent of the network width, if the data can be classified by the NTRF model with small enough error.
• Our analysis is general enough to cover recent generalization error bounds for neural networks with random feature based reference function classes, and provides better bounds. Our expected 0-1 error bound directly covers the result by Cao and Gu [8], and gives a tighter sample complexity when reduced to their setting, i.e., Õ(1/ε²) versus Õ(1/ε⁴), where ε is the target generalization error. Compared with the recent results by Yehudai and Shamir [37] and E et al. [14], who only studied two-layer networks, our bound not only works for deep networks, but also uses a larger reference function class when reduced to the two-layer setting, and is therefore sharper.
• Our result has a direct connection to the neural tangent kernel studied in Jacot et al. [18]. When interpreted in the language of kernel methods, our result gives a generalization bound of the form Õ(L · √(y^⊤(Θ^(L))^{-1}y/n)), where y is the training label vector and Θ^(L) is the neural tangent kernel matrix defined on the training input data. This form of generalization bound is similar to, but more general and tighter than, the bound given by Arora et al.
[4].

Notation We use lower case, lower case bold face, and upper case bold face letters to denote scalars, vectors and matrices respectively. For a vector v = (v_1, . . . , v_d)^⊤ ∈ R^d and a number 1 ≤ p < ∞, let ‖v‖_p = (∑_{i=1}^d |v_i|^p)^{1/p}. We also define ‖v‖_∞ = max_i |v_i|. For a matrix A = (A_{i,j})_{m×n}, we use ‖A‖_0 to denote the number of non-zero entries of A, and denote ‖A‖_F = (∑_{i,j} A_{i,j}²)^{1/2} and ‖A‖_p = max_{‖v‖_p=1} ‖Av‖_p for p ≥ 1. For two matrices A, B ∈ R^{m×n}, we define ⟨A, B⟩ = Tr(A^⊤B). We denote A ⪰ B if A − B is positive semidefinite. In addition, we define the asymptotic notations O(·), Õ(·), Ω(·) and Ω̃(·) as follows. Suppose that a_n and b_n are two sequences. We write a_n = O(b_n) if lim sup_{n→∞} |a_n/b_n| < ∞, and a_n = Ω(b_n) if lim inf_{n→∞} |a_n/b_n| > 0. We use Õ(·) and Ω̃(·) to hide the logarithmic factors in O(·) and Ω(·).

¹Since random feature models and kernel methods are highly related [31, 32], we group them into the same category. More details are discussed in Section 3.2.

2 Problem Setup

In this section we introduce the basic problem setup. Following the same standard setup implemented in the line of recent work [2, 11, 39, 8], we consider fully connected neural networks with width m, depth L and input dimension d. Such a network is defined by its weight matrices at each layer: for L ≥ 2, let W_1 ∈ R^{m×d}, W_l ∈ R^{m×m}, l = 2, . . . , L − 1, and W_L ∈ R^{1×m} be the weight matrices of the network. Then the neural network with input x ∈ R^d is defined as

f_W(x) = √m · W_L σ(W_{L−1} σ(W_{L−2} · · · σ(W_1 x) · · · )),    (2.1)

where σ(·) is the entry-wise activation function.
In this paper, we only consider the ReLU activation function σ(z) = max{0, z}, which is the most commonly used activation function in applications. It is also arguably one of the most difficult activation functions to analyze, due to its non-smoothness. We remark that our result can be generalized to many other Lipschitz continuous and smooth activation functions. For simplicity, we follow Allen-Zhu et al. [2], Du et al. [11] and assume that the widths of the hidden layers are all the same. Our result can be easily extended to the setting where the widths of the layers are not equal but of the same order, as discussed in Zou et al. [39], Cao and Gu [8].

When L = 1, the neural network reduces to a linear function, which has been well studied. Therefore, for notational simplicity we focus on the case L ≥ 2, where the parameter space is defined as

W := R^{m×d} × (R^{m×m})^{L−2} × R^{1×m}.

We also use W = (W_1, . . . , W_L) ∈ W to denote the collection of weight matrices for all layers. For W, W′ ∈ W, we define their inner product as ⟨W, W′⟩ := ∑_{l=1}^L Tr(W_l^⊤ W′_l). The goal of neural network learning is to minimize the expected risk, i.e.,

min_W L_D(W) := E_{(x,y)∼D} L_{(x,y)}(W),    (2.2)

where L_{(x,y)}(W) = ℓ[y · f_W(x)] is the loss defined on any example (x, y), and ℓ(z) is the loss function. Without loss of generality, we consider the cross-entropy loss in this paper, which is defined as ℓ(z) = log[1 + exp(−z)]. We would like to emphasize that our results also hold for most convex and Lipschitz continuous loss functions, such as the hinge loss. We now introduce the stochastic gradient descent based training algorithm for minimizing the expected risk in (2.2).
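For concreteness, the network (2.1) with ReLU activation and the cross-entropy loss can be sketched in a few lines of NumPy (a minimal illustration; the sizes d, m, L below are arbitrary toy choices, and the Gaussian weight variances match the initialization scheme discussed next):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def f_W(weights, x):
    """f_W(x) = sqrt(m) * W_L relu(W_{L-1} ... relu(W_1 x) ...), cf. (2.1)."""
    h = x
    for W_l in weights[:-1]:
        h = relu(W_l @ h)
    m = weights[-1].shape[1]                 # width of the last hidden layer
    return np.sqrt(m) * (weights[-1] @ h).item()

def cross_entropy(z):
    """l(z) = log(1 + exp(-z)); the per-example loss is L_{(x,y)}(W) = l(y * f_W(x))."""
    return float(np.log1p(np.exp(-z)))

d, m, L = 4, 64, 3                           # toy sizes, illustrative only
rng = np.random.default_rng(0)
weights = [rng.normal(0.0, np.sqrt(2.0 / m), (m, d))]
weights += [rng.normal(0.0, np.sqrt(2.0 / m), (m, m)) for _ in range(L - 2)]
weights += [rng.normal(0.0, np.sqrt(1.0 / m), (1, m))]

x = rng.normal(size=d)
x /= np.linalg.norm(x)                       # unit-norm input
y = 1.0
print(cross_entropy(y * f_W(weights, x)))    # loss of one labeled example
```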
The detailed algorithm is given in Algorithm 1.

Algorithm 1 SGD for DNNs starting at Gaussian initialization
Input: Number of iterations n, step size η.
Generate each entry of W_l^(0) independently from N(0, 2/m), l ∈ [L − 1].
Generate each entry of W_L^(0) independently from N(0, 1/m).
for i = 1, 2, . . . , n do
  Draw (x_i, y_i) from D.
  Update W^(i) = W^(i−1) − η · ∇_W L_{(x_i,y_i)}(W^(i−1)).
end for
Output: Randomly choose Ŵ uniformly from {W^(0), . . . , W^(n−1)}.

The initialization scheme for W^(0) given in Algorithm 1 generates each entry of the weight matrices from a zero-mean independent Gaussian distribution, whose variance is determined by the rule that the expected length of the output vector in each hidden layer is equal to the length of the input. This initialization method is also known as He initialization [16]. Here the last layer parameter is initialized with variance 1/m instead of 2/m since the last layer is not associated with the ReLU activation function.

3 Main Results

In this section we present the main results of this paper. In Section 3.1 we give an expected 0-1 error bound against a neural tangent random feature reference function class. In Section 3.2, we discuss the connection between our result and the neural tangent kernel proposed in Jacot et al. [18].

3.1 An Expected 0-1 Error Bound

In this section we give a bound on the expected 0-1 error L_D^{0-1}(W) := E_{(x,y)∼D}[1{y · f_W(x) < 0}] obtained by Algorithm 1. Our result is based on the following assumption.

Assumption 3.1. The data inputs are normalized: ‖x‖_2 = 1 for all (x, y) ∈ supp(D).

Assumption 3.1 is a standard assumption made in almost all previous work on optimization and generalization of over-parameterized neural networks [12, 2, 11, 39, 30, 14].
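To make Algorithm 1 concrete, the sketch below implements its initialization scheme and update rule for the network (2.1), with a manually derived gradient (all sizes, the step size, and the toy data stream are illustrative choices, not values prescribed by the theory):

```python
import numpy as np

def init_weights(d, m, L, rng):
    """Gaussian initialization of Algorithm 1: N(0, 2/m) entries for layers 1..L-1, N(0, 1/m) for layer L."""
    Ws = [rng.normal(0.0, np.sqrt(2.0 / m), (m, d))]
    Ws += [rng.normal(0.0, np.sqrt(2.0 / m), (m, m)) for _ in range(L - 2)]
    Ws += [rng.normal(0.0, np.sqrt(1.0 / m), (1, m))]
    return Ws

def forward_backward(Ws, x, y):
    """Loss l(y * f_W(x)) for the network (2.1) and its gradients w.r.t. every W_l (manual backprop)."""
    m = Ws[0].shape[0]
    hs, pre = [x], []
    for W in Ws[:-1]:
        pre.append(W @ hs[-1])
        hs.append(np.maximum(0.0, pre[-1]))      # ReLU
    f = np.sqrt(m) * (Ws[-1] @ hs[-1]).item()    # f_W(x)
    z = y * f
    loss = float(np.log1p(np.exp(-z)))           # cross-entropy loss
    g = -y / (1.0 + np.exp(z))                   # d loss / d f
    grads = [None] * len(Ws)
    grads[-1] = g * np.sqrt(m) * hs[-1][None, :]
    delta = g * np.sqrt(m) * Ws[-1].ravel()      # d loss / d h_{L-1}
    for l in range(len(Ws) - 2, -1, -1):
        delta = delta * (pre[l] > 0)             # back through the ReLU
        grads[l] = np.outer(delta, hs[l])
        if l > 0:
            delta = Ws[l].T @ delta
    return loss, grads

# The SGD loop of Algorithm 1 on a toy data stream.
rng = np.random.default_rng(0)
d, m, L, n, eta = 4, 128, 3, 50, 0.01
Ws = init_weights(d, m, L, rng)
iterates = [Ws]
for _ in range(n):
    x = rng.normal(size=d)
    x /= np.linalg.norm(x)                       # unit-norm inputs (Assumption 3.1)
    y = 1.0 if x[0] > 0 else -1.0                # a toy distribution D
    _, grads = forward_backward(Ws, x, y)
    Ws = [W - eta * G for W, G in zip(Ws, grads)]
    iterates.append(Ws)
W_hat = iterates[rng.integers(n)]                # output: uniform draw from {W^(0), ..., W^(n-1)}
```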
As is mentioned in Cao and Gu [8], this assumption can be relaxed to c_1 ≤ ‖x‖_2 ≤ c_2 for all (x, y) ∈ supp(D), where c_2 > c_1 > 0 are absolute constants.

For any W ∈ W, we define its ω-neighborhood as

B(W, ω) := {W′ ∈ W : ‖W′_l − W_l‖_F ≤ ω, l ∈ [L]}.

Below we introduce the neural tangent random feature function class, which serves as a reference function class to measure the "classifiability" of the data, i.e., how easily it can be classified.

Definition 3.2 (Neural Tangent Random Feature). Let W^(0) be generated via the initialization scheme in Algorithm 1. The neural tangent random feature (NTRF) function class is defined as

F(W^(0), R) = { f(·) = f_{W^(0)}(·) + ⟨∇_W f_{W^(0)}(·), W⟩ : W ∈ B(0, R · m^{−1/2}) },

where R > 0 measures the size of the function class, and m is the width of the neural network.

The name "neural tangent random feature" is inspired by the neural tangent kernel proposed by Jacot et al. [18], because the random features are the gradients of the neural network with random weights. Connections between the neural tangent random features and the neural tangent kernel will be discussed in Section 3.2.

We are ready to present our main result on the expected 0-1 error bound of Algorithm 1.

Theorem 3.3.
For any δ ∈ (0, e^{−1}] and R > 0, there exists

m*(δ, R, L, n) = Õ(poly(R, L) · n⁷ · log(1/δ))

such that if m ≥ m*(δ, R, L, n), then with probability at least 1 − δ over the randomness of W^(0), the output of Algorithm 1 with step size η = κ · R/(m√n) for some small enough absolute constant κ satisfies

E[L_D^{0-1}(Ŵ)] ≤ inf_{f ∈ F(W^(0), R)} { (4/n) ∑_{i=1}^n ℓ[y_i · f(x_i)] } + O(LR/√n) + O(√(log(1/δ)/n)),    (3.1)

where the expectation is taken over the uniform draw of Ŵ from {W^(0), . . . , W^(n−1)}.

The expected 0-1 error bound given by Theorem 3.3 consists of two terms: the first term in (3.1) relates the expected 0-1 error achieved by Algorithm 1 to a reference function class, namely the NTRF function class in Definition 3.2. The second term in (3.1) is a standard large-deviation error term. As long as R = Õ(1), this term matches the standard Õ(n^{−1/2}) rate in PAC learning bounds [33].

Remark 3.4. The parameter R in Theorem 3.3 is from the NTRF class and introduces a trade-off in the bound: when R is small, the corresponding NTRF class F(W^(0), R) is small, making the first term in (3.1) large, while the second term in (3.1) is small. When R is large, the corresponding function class F(W^(0), R) is large, so the first term in (3.1) is small, and the second term will be large. In particular, if we set R = Õ(1), the second term in (3.1) will be Õ(n^{−1/2}). In this case, the "classifiability" of the underlying data distribution D is determined by how well its i.i.d. samples can be classified by F(W^(0), Õ(1)). In other words, Theorem 3.3 suggests that if the data can be classified by a function in the NTRF function class F(W^(0), Õ(1)) with a small training error, the over-parameterized ReLU network learnt by Algorithm 1 will have a small generalization error.

Remark 3.5. The expected 0-1 error bound given by Theorem 3.3 is in a very general form. It directly covers the result given by Cao and Gu [8]. In Appendix A.1, we show that under the same assumptions made in Cao and Gu [8], to achieve ε expected 0-1 error, our result requires a sample complexity of order Õ(ε^{−2}), which outperforms the result in Cao and Gu [8] by a factor of ε^{−2}.

Remark 3.6. Our generalization bound can also be compared with two recent results [37, 14] for two-layer neural networks. When L = 2, the NTRF function class F(W^(0), Õ(1)) can be written as

{ f_{W^(0)}(·) + ⟨∇_{W_1} f_{W^(0)}(·), W_1⟩ + ⟨∇_{W_2} f_{W^(0)}(·), W_2⟩ : ‖W_1‖_F, ‖W_2‖_F ≤ Õ(m^{−1/2}) }.

In contrast, the reference function classes studied by Yehudai and Shamir [37] and E et al. [14] are contained in the following random feature class:

F = { f_{W^(0)}(·) + ⟨∇_{W_2} f_{W^(0)}(·), W_2⟩ : ‖W_2‖_F ≤ Õ(m^{−1/2}) },

where W^(0) = (W_1^(0), W_2^(0)) ∈ R^{m×d} × R^{1×m} are the random weights generated by the initialization schemes in Yehudai and Shamir [37], E et al. [14]². Evidently, our NTRF function class F(W^(0), Õ(1)) is richer than F: it also contains the features corresponding to the first-layer gradient of the network at random initialization, i.e., ∇_{W_1} f_{W^(0)}(·). As a result, our generalization bound is sharper than those in Yehudai and Shamir [37], E et al. [14] in the sense that we can show that neural networks trained with SGD can compete with the best function in a larger reference function class.

As previously mentioned, the result of Theorem 3.3 can be easily extended to the setting where the widths of different layers are different.
We should expect that the result remains almost the same, except that we would assume the widths of the hidden layers to be all larger than or equal to m*(δ, R, L, n). We would also like to point out that although this paper considers the cross-entropy loss, the proof of Theorem 3.3 offers a general framework based on the fact that near initialization, the neural network function is almost linear in terms of its weights. We believe that this proof framework can potentially be applied to most practically useful loss functions: whenever ℓ(·) is convex/Lipschitz continuous/smooth, near initialization, L_i(W) is also almost convex/Lipschitz continuous/smooth in W for all i ∈ [n], and therefore standard online optimization analysis can be invoked with online-to-batch conversion to provide a generalization bound. We refer to Section 4 for more details.

3.2 Connection to Neural Tangent Kernel

Besides quantifying the classifiability of the data with the NTRF function class F(W^(0), Õ(1)), an alternative way to apply Theorem 3.3 is to check how large the parameter R needs to be in order to make the first term in (3.1) small enough (e.g., smaller than n^{−1/2}). In this subsection, we show that this type of analysis connects Theorem 3.3 to the neural tangent kernel proposed in Jacot et al. [18] and later studied by Yang [36], Lee et al. [23], Arora et al. [3]. Specifically, we provide an expected 0-1 error bound in terms of the neural tangent kernel matrix defined over the training data. We first define the neural tangent kernel matrix for the neural network function in (2.1).

Definition 3.7 (Neural Tangent Kernel Matrix).
For any i, j ∈ [n], define

Σ^(1)_{i,j} = ⟨x_i, x_j⟩,    A^(l)_{i,j} = [ Σ^(l)_{i,i}, Σ^(l)_{i,j} ; Σ^(l)_{i,j}, Σ^(l)_{j,j} ],
Σ^(l+1)_{i,j} = 2 · E_{(u,v)∼N(0, A^(l)_{i,j})}[σ(u)σ(v)],
Θ̃^(1)_{i,j} = Σ^(1)_{i,j},    Θ̃^(l+1)_{i,j} = Θ̃^(l)_{i,j} · 2 · E_{(u,v)∼N(0, A^(l)_{i,j})}[σ′(u)σ′(v)] + Σ^(l+1)_{i,j}.

Then we call Θ^(L) = [(Θ̃^(L)_{i,j} + Σ^(L)_{i,j})/2]_{n×n} the neural tangent kernel matrix of an L-layer ReLU network on the training inputs x_1, . . . , x_n.

Definition 3.7 is the same as the original definition in Jacot et al. [18] when restricting the kernel function to {x_1, . . . , x_n}, except that there is an extra coefficient 2 in the second and third lines. This extra factor is due to the difference in initialization schemes: in our paper the entries of the hidden layer matrices are randomly generated with variance 2/m, while in Jacot et al. [18] the variance of the random initialization is 1/m. We remark that this extra factor 2 in Definition 3.7 will remove the exponential dependence on the network depth L in the kernel matrix, which is appealing. In fact, it is easy to check that under our scaling, the diagonal entries of Σ^(L) are all 1's, and the diagonal entries of Θ̃^(L) are all L's.

²Normalizing weights to the same scale is necessary for a proper comparison. See Appendix A.2 for details.

The following lemma is a summary of Theorem 1 and Proposition 2 in Jacot et al. [18], which ensures that Θ^(L) is the infinite-width limit of the Gram matrix (m^{−1}⟨∇_W f_{W^(0)}(x_i), ∇_W f_{W^(0)}(x_j)⟩)_{n×n}, and is positive-definite as long as no two training inputs are parallel.

Lemma 3.8 (Jacot et al. [18]).
For an L-layer ReLU network with parameter set W^(0) initialized as in Algorithm 1, as the network width m → ∞³, it holds that

m^{−1}⟨∇_W f_{W^(0)}(x_i), ∇_W f_{W^(0)}(x_j)⟩ →_P Θ^(L)_{i,j},

where the convergence in probability is over the randomness of W^(0). Moreover, as long as each pair of inputs among x_1, . . . , x_n ∈ S^{d−1} are not parallel, Θ^(L) is positive-definite.

Remark 3.9. Lemma 3.8 clearly shows the difference between our neural tangent kernel matrix Θ^(L) in Definition 3.7 and the Gram matrix K^(L) defined in Definition 5.1 of Du et al. [11]. For any i, j ∈ [n], by Lemma 3.8 we have

Θ^(L)_{i,j} = lim_{m→∞} m^{−1} ∑_{l=1}^L ⟨∇_{W_l} f_{W^(0)}(x_i), ∇_{W_l} f_{W^(0)}(x_j)⟩.

In contrast, the corresponding entry in K^(L) is

K^(L)_{i,j} = lim_{m→∞} m^{−1} ⟨∇_{W_{L−1}} f_{W^(0)}(x_i), ∇_{W_{L−1}} f_{W^(0)}(x_j)⟩.

It can be seen that our definition of the kernel matrix takes all layers into consideration, while Du et al. [11] only considered the last hidden layer (i.e., the second to last layer). Moreover, it is clear that Θ^(L) ⪰ K^(L). Since the smallest eigenvalue of the kernel matrix plays a key role in the analysis of optimization and generalization of over-parameterized neural networks [12, 11, 4], our neural tangent kernel matrix can potentially lead to better bounds than the Gram matrix studied in Du et al. [11].

Corollary 3.10. Let y = (y_1, . . . , y_n)^⊤ and λ_0 = λ_min(Θ^(L)). For any δ ∈ (0, e^{−1}], there exists m̃*(δ, L, n, λ_0) that only depends on δ, L, n and λ_0, such that if m ≥ m̃*(δ, L, n, λ_0), then with probability at least 1 − δ over the randomness of W^(0), the output of Algorithm 1 with step size η = κ · inf_{ỹ: ỹ_i y_i ≥ 1} √(ỹ^⊤(Θ^(L))^{−1}ỹ)/(m√n) for some small enough absolute constant κ satisfies

E[L_D^{0-1}(Ŵ)] ≤ Õ( L · inf_{ỹ: ỹ_i y_i ≥ 1} √(ỹ^⊤(Θ^(L))^{−1}ỹ/n) ) + O(√(log(1/δ)/n)),

where the expectation is taken over the uniform draw of Ŵ from {W^(0), . . . , W^(n−1)}.

Remark 3.11. Corollary 3.10 gives an algorithm-dependent generalization error bound for over-parameterized L-layer neural networks trained with SGD. It is worth noting that recently Arora et al. [4] gave a generalization bound Õ(√(y^⊤(H^∞)^{−1}y/n)) for two-layer networks with fixed second layer weights, where H^∞ is defined as

H^∞_{i,j} = ⟨x_i, x_j⟩ · E_{w∼N(0,I)}[σ′(w^⊤x_i)σ′(w^⊤x_j)].

Our result in Corollary 3.10 can be specialized to two-layer neural networks by choosing L = 2, which yields a bound Õ(√(y^⊤(Θ^(2))^{−1}y/n)), where

Θ^(2)_{i,j} = H^∞_{i,j} + 2 · E_{w∼N(0,I)}[σ(w^⊤x_i)σ(w^⊤x_j)].

Here the extra term 2 · E_{w∼N(0,I)}[σ(w^⊤x_i)σ(w^⊤x_j)] corresponds to the training of the second layer: it is the limit of (1/m)⟨∇_{W_2} f_{W^(0)}(x_i), ∇_{W_2} f_{W^(0)}(x_j)⟩. Since we have Θ^(2) ⪰ H^∞, our bound is sharper than theirs. This comparison also shows that our result generalizes the result in Arora et al. [4] from two-layer networks with fixed second layer to deep networks with all parameters being trained.

³The original result by Jacot et al. [18] requires that the widths of different layers go to infinity sequentially. Their result was later improved by Yang [36] such that the widths of different layers can go to infinity simultaneously.

Remark 3.12. Corollary 3.10 is based on the asymptotic convergence result in Lemma 3.8, which does not show how wide the network needs to be in order to make the Gram matrix close enough to the NTK matrix. Very recently, Arora et al.
[3] provided a non-asymptotic convergence result for the Gram matrix, and showed the equivalence between an infinitely wide network trained by gradient flow and a kernel regression predictor using the neural tangent kernel, which suggests that the generalization of deep neural networks trained by gradient flow can potentially be measured by the corresponding NTK. Utilizing this non-asymptotic convergence result, one can potentially specify the detailed dependency of m̃*(δ, L, n, λ_0) on δ, L, n and λ_0 in Corollary 3.10.

Remark 3.13. Corollary 3.10 demonstrates that the generalization bound given by Theorem 3.3 does not increase with the network width m, as long as m is large enough. Moreover, it provides a clear characterization of the classifiability of the data. In fact, the √(ỹ^⊤(Θ^(L))^{−1}ỹ) factor in the generalization bound given in Corollary 3.10 is exactly the NTK-induced RKHS norm of the kernel regression classifier on the data {(x_i, ỹ_i)}_{i=1}^n. Therefore, if y = f*(x) for some f*(·) with bounded norm in the NTK-induced reproducing kernel Hilbert space (RKHS), then over-parameterized neural networks trained with SGD generalize well. In Appendix E, we provide some numerical evaluation of the leading terms in the generalization bounds in Theorem 3.3 and Corollary 3.10 to demonstrate that they are very informative on real-world datasets.

4 Proof of Main Theory

In this section we provide the proofs of Theorem 3.3 and Corollary 3.10, and explain the intuition behind them. For notational simplicity, for i ∈ [n] we denote L_i(W) = L_{(x_i,y_i)}(W).

4.1 Proof of Theorem 3.3

Before giving the proof of Theorem 3.3, we first introduce several lemmas. The following lemma states that near initialization, the neural network function is almost linear in terms of its weights.

Lemma 4.1.
There exists an absolute constant κ such that, with probability at least 1 − O(nL²) · exp[−Ω(mω^{2/3}L)] over the randomness of W^(0), for all i ∈ [n] and W, W′ ∈ B(W^(0), ω) with ω ≤ κL^{−6}[log(m)]^{−3/2}, it holds uniformly that

|f_{W′}(x_i) − f_W(x_i) − ⟨∇f_W(x_i), W′ − W⟩| ≤ O(ω^{1/3}L² √(m log(m))) · ∑_{l=1}^{L−1} ‖W′_l − W_l‖_2.

Since the cross-entropy loss ℓ(·) is convex, given Lemma 4.1, we can show in the following lemma that near initialization, L_i(W) is also almost a convex function of W for any i ∈ [n].

Lemma 4.2. There exists an absolute constant κ such that, with probability at least 1 − O(nL²) · exp[−Ω(mω^{2/3}L)] over the randomness of W^(0), for any ε > 0, i ∈ [n] and W, W′ ∈ B(W^(0), ω) with ω ≤ κL^{−6}m^{−3/8}[log(m)]^{−3/2}ε^{3/4}, it holds uniformly that

L_i(W′) ≥ L_i(W) + ⟨∇_W L_i(W), W′ − W⟩ − ε.

The locally almost convex property of the loss function given by Lemma 4.2 implies that the dynamics of Algorithm 1 is similar to the dynamics of convex optimization. We can therefore derive a bound on the cumulative loss. The result is given in the following lemma.

Lemma 4.3.
For any ε, δ, R > 0, there exists

m*(ε, δ, R, L) = Õ(poly(R, L) · ε^{−14} · log(1/δ))

such that if m ≥ m*(ε, δ, R, L), then with probability at least 1 − δ over the randomness of W^(0), for any W* ∈ B(W^(0), Rm^{−1/2}), Algorithm 1 with η = νε/(Lm) and n = L²R²/(2νε²) for some small enough absolute constant ν has the following cumulative loss bound:

∑_{i=1}^n L_i(W^(i−1)) ≤ ∑_{i=1}^n L_i(W*) + 3nε.

We now finalize the proof by applying an online-to-batch conversion argument [9], and use Lemma 4.1 to relate the neural network function to a function in the NTRF function class.
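The mechanism behind Lemma 4.3 is the standard online convex optimization argument: for online gradient descent on convex losses, the cumulative loss of the iterates exceeds that of any fixed comparator by at most an explicit regret term. A minimal numerical sketch with a linear model and logistic loss (the dimensions, step size, and comparator below are illustrative choices, not quantities from the lemma):

```python
import numpy as np

def loss(w, x, y):
    """Logistic loss l(y * <w, x>) with l(z) = log(1 + exp(-z)); convex in w."""
    return float(np.log1p(np.exp(-y * float(w @ x))))

def grad(w, x, y):
    z = y * float(w @ x)
    return (-y / (1.0 + np.exp(z))) * x

rng = np.random.default_rng(0)
d, n, eta = 5, 200, 0.1
w_star = rng.normal(size=d)            # fixed comparator, playing the role of W*
w = np.zeros(d)                        # iterate, started at the "initialization"
cum_alg = cum_cmp = grad_sq = 0.0
for _ in range(n):
    x = rng.normal(size=d)
    x /= np.linalg.norm(x)
    y = 1.0 if rng.random() < 0.5 else -1.0
    g = grad(w, x, y)
    cum_alg += loss(w, x, y)           # loss charged at the current iterate, as in Lemma 4.3
    cum_cmp += loss(w_star, x, y)
    grad_sq += float(g @ g)
    w = w - eta * g                    # online gradient step
# Regret bound for online gradient descent on convex losses:
#   sum_i l_i(w^(i-1)) <= sum_i l_i(w*) + ||w* - w^(0)||^2 / (2 eta) + (eta / 2) * sum_i ||g_i||^2
bound = cum_cmp + float(w_star @ w_star) / (2 * eta) + (eta / 2) * grad_sq
print(cum_alg, bound)
```

The inequality in the last comment holds deterministically for convex losses; in our setting Lemma 4.2 supplies the (approximate) convexity and contributes the extra 3nε slack.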
By definition, we have $\frac{1}{n}\sum_{i=1}^n L^{0\text{-}1}_{\mathcal{D}}(\mathbf{W}^{(i-1)}) = \mathbb{E}\big[L^{0\text{-}1}_{\mathcal{D}}(\widehat{\mathbf{W}})\big]$. Therefore combining (4.1) and (4.2) and applying a union bound, we obtain that with probability at least $1 - 2\delta$,
$$\mathbb{E}\big[L^{0\text{-}1}_{\mathcal{D}}(\widehat{\mathbf{W}})\big] \le \frac{4}{n}\sum_{i=1}^n L_i(\mathbf{W}^*) + \frac{12}{\sqrt{2\nu}} \cdot \frac{LR}{\sqrt{n}} + \sqrt{\frac{2\log(1/\delta)}{n}} \quad (4.3)$$
for all $\mathbf{W}^* \in \mathcal{B}(\mathbf{W}^{(0)}, Rm^{-1/2})$. We now compare the neural network function $f_{\mathbf{W}^*}(\mathbf{x}_i)$ with the function $F_{\mathbf{W}^{(0)}, \mathbf{W}^*}(\mathbf{x}_i) := f_{\mathbf{W}^{(0)}}(\mathbf{x}_i) + \langle \nabla f_{\mathbf{W}^{(0)}}(\mathbf{x}_i), \mathbf{W}^* - \mathbf{W}^{(0)} \rangle \in \mathcal{F}(\mathbf{W}^{(0)}, R)$. We have
$$L_i(\mathbf{W}^*) \le \ell[y_i \cdot F_{\mathbf{W}^{(0)}, \mathbf{W}^*}(\mathbf{x}_i)] + O\big((Rm^{-1/2})^{1/3} L^2 \sqrt{m \log(m)}\big) \cdot \sum_{l=1}^{L-1} \big\|\mathbf{W}^*_l - \mathbf{W}^{(0)}_l\big\|_2$$
$$\le \ell[y_i \cdot F_{\mathbf{W}^{(0)}, \mathbf{W}^*}(\mathbf{x}_i)] + O\big(L^3 R^{4/3} m^{-2/3} \sqrt{m \log(m)}\big)$$
$$\le \ell[y_i \cdot F_{\mathbf{W}^{(0)}, \mathbf{W}^*}(\mathbf{x}_i)] + LRn^{-1/2},$$
where the first inequality is by the 1-Lipschitz continuity of $\ell(\cdot)$ and Lemma 4.1, the second inequality is by $\mathbf{W}^* \in \mathcal{B}(\mathbf{W}^{(0)}, Rm^{-1/2})$, and the last inequality holds as long as $m \ge C_1 R^2 L^{12} [\log(m)]^3 n^3$ for some large enough absolute constant $C_1$. Plugging the inequality above into (4.3) gives
$$\mathbb{E}\big[L^{0\text{-}1}_{\mathcal{D}}(\widehat{\mathbf{W}})\big] \le \frac{4}{n}\sum_{i=1}^n \ell[y_i \cdot F_{\mathbf{W}^{(0)}, \mathbf{W}^*}(\mathbf{x}_i)] + 2\Big(1 + \frac{12}{\sqrt{2\nu}}\Big) \cdot \frac{LR}{\sqrt{n}} + \sqrt{\frac{2\log(1/\delta)}{n}}.$$
Taking the infimum over $\mathbf{W}^* \in \mathcal{B}(\mathbf{W}^{(0)}, Rm^{-1/2})$ and rescaling $\delta$ finishes the proof.

4.2 Proof of Corollary 3.10

In this subsection we prove Corollary 3.10. The following lemma shows that at initialization, with high probability, the neural network function values at all the training inputs are of order $\tilde{O}(1)$.

Lemma 4.4.
For any \u03b4 \u0105 0, if m \u011b KL logpnL{\u03b4q for a large enough absolute constant K, then\nlogpn{\u03b4qq for all i P rns.\nwith probability at least 1 \u00b4 \u03b4, |fWp0qpxiq| \u010f Op\n\nhigh probability, the neural network function value at all the training inputs are of order rOp1q.\nWe now present the proof of Corollary 3.10. The idea is to construct suitable target valuespy1, . . . ,pyn,\nand then bound the norm of the solution of the linear equationspyi \u201c x\u2207fWp0qpxiq, Wy, i P rns. In\nspeci\ufb01c, for anyry withryiyi \u011b 1, we examine the minimum distance solution to Wp0q that \ufb01t the\n`aryJp\u0398pLqq\u00b41ry\n\u02d8\u02d8\ndata tpxi,ryiqun\n.\na\nProof of Corollary 3.10. Set B \u201c logt1{rexppn\u00b41{2q \u00b4 1su \u201c Oplogpnqq, then for cross-entropy\nlogpn{\u03b4qq for all i P rns. For anyry with\nloss we have (cid:96)pzq \u010f n\u00b41{2 for z \u011b B. Moreover, let B1 \u201c maxiPrns |fWp0qpxiq|. Then by\nLemma 4.4, with probability at least 1 \u00b4 \u03b4, B1 \u010f Op\n\nryiyi \u011b 1, let B \u201c B ` B1 andpy \u201c B \u00a8ry, then it holds that for any i P rns,\n\nyi \u00a8 rpyi ` fWp0qpxiqs \u201c yi \u00a8pyi ` yi \u00a8 fWp0qpxiq \u011b B ` B1 \u00b4 B1 \u011b B,\n\ni\u201c1 well and use it to construct a speci\ufb01c function in F\n\nWp0q, rO\n\n`\n\n8\n\n\fand therefore\n\n(cid:96)tyi \u00a8 rpyi ` fWp0qpxiqsu \u010f n\u00b41{2, i P rns.\n\nwe haveryJp\u0398pLqq\u00b41ry \u011b n\u00b41L\u00b41}ry}2\n\n(4.4)\nDenote F \u201c m\u00b41{2 \u00a8pvecr\u2207fWp0qpx1qs, . . . , vecr\u2207fWp0qpxnqsq P Rrmd`m`m2pL\u00b42qs\u02c6n. Note that\nentries of \u0398pLq are all bounded by L. Therefore, the largest eigenvalue of \u0398pLq is at most nL, and\n2 \u201c L\u00b41. 
By Lemma 3.8 and a standard matrix perturbation bound, there exists $m^*(\delta, L, n, \lambda_0)$ such that, if $m \ge m^*(\delta, L, n, \lambda_0)$, then with probability at least $1 - \delta$, $\mathbf{F}^\top \mathbf{F}$ is strictly positive definite and
$$\|(\mathbf{F}^\top \mathbf{F})^{-1} - (\mathbf{\Theta}^{(L)})^{-1}\|_2 \le \inf_{\tilde{y}_i y_i \ge 1} \tilde{\mathbf{y}}^\top (\mathbf{\Theta}^{(L)})^{-1} \tilde{\mathbf{y}}/n. \quad (4.5)$$
Let $\mathbf{F} = \mathbf{P}\mathbf{\Lambda}\mathbf{Q}^\top$ be the singular value decomposition of $\mathbf{F}$, where $\mathbf{P} \in \mathbb{R}^{[md+m+m^2(L-2)] \times n}$ and $\mathbf{Q} \in \mathbb{R}^{n \times n}$ have orthonormal columns, and $\mathbf{\Lambda} \in \mathbb{R}^{n \times n}$ is a diagonal matrix. Let $\mathbf{w}_{\mathrm{vec}} = \mathbf{P}\mathbf{\Lambda}^{-1}\mathbf{Q}^\top \hat{\mathbf{y}}$; then we have
$$\mathbf{F}^\top \mathbf{w}_{\mathrm{vec}} = (\mathbf{Q}\mathbf{\Lambda}\mathbf{P}^\top)(\mathbf{P}\mathbf{\Lambda}^{-1}\mathbf{Q}^\top \hat{\mathbf{y}}) = \hat{\mathbf{y}}. \quad (4.6)$$
Moreover, by direct calculation we have
$$\|\mathbf{w}_{\mathrm{vec}}\|_2^2 = \|\mathbf{P}\mathbf{\Lambda}^{-1}\mathbf{Q}^\top \hat{\mathbf{y}}\|_2^2 = \|\mathbf{\Lambda}^{-1}\mathbf{Q}^\top \hat{\mathbf{y}}\|_2^2 = \hat{\mathbf{y}}^\top \mathbf{Q}\mathbf{\Lambda}^{-2}\mathbf{Q}^\top \hat{\mathbf{y}} = \hat{\mathbf{y}}^\top (\mathbf{F}^\top \mathbf{F})^{-1} \hat{\mathbf{y}}.$$
Therefore by (4.5) and the fact that $\|\hat{\mathbf{y}}\|_2^2 \le \bar{B}^2 n$, we have
$$\|\mathbf{w}_{\mathrm{vec}}\|_2^2 = \hat{\mathbf{y}}^\top [(\mathbf{F}^\top \mathbf{F})^{-1} - (\mathbf{\Theta}^{(L)})^{-1}] \hat{\mathbf{y}} + \hat{\mathbf{y}}^\top (\mathbf{\Theta}^{(L)})^{-1} \hat{\mathbf{y}} \le \bar{B}^2 \cdot n \cdot \|(\mathbf{F}^\top \mathbf{F})^{-1} - (\mathbf{\Theta}^{(L)})^{-1}\|_2 + \bar{B}^2 \cdot \tilde{\mathbf{y}}^\top (\mathbf{\Theta}^{(L)})^{-1} \tilde{\mathbf{y}} \le 2\bar{B}^2 \cdot \tilde{\mathbf{y}}^\top (\mathbf{\Theta}^{(L)})^{-1} \tilde{\mathbf{y}}.$$
Let $\bar{\mathbf{W}} \in \mathcal{W}$ be the parameter collection reshaped from $m^{-1/2}\mathbf{w}_{\mathrm{vec}}$. Then clearly
$$\|\bar{\mathbf{W}}_l\|_F \le m^{-1/2}\|\mathbf{w}_{\mathrm{vec}}\|_2 \le \tilde{O}\big(\sqrt{\tilde{\mathbf{y}}^\top (\mathbf{\Theta}^{(L)})^{-1} \tilde{\mathbf{y}}} \cdot m^{-1/2}\big),$$
and therefore $\mathbf{W}^{(0)} + \bar{\mathbf{W}} \in \mathcal{B}\big(\mathbf{W}^{(0)}, \tilde{O}\big(\sqrt{\tilde{\mathbf{y}}^\top (\mathbf{\Theta}^{(L)})^{-1} \tilde{\mathbf{y}}} \cdot m^{-1/2}\big)\big)$. Moreover, by (4.6), we have $\hat{y}_i = \langle \nabla_{\mathbf{W}} f_{\mathbf{W}^{(0)}}(\mathbf{x}_i), \bar{\mathbf{W}} \rangle$. Plugging this into (4.4) then gives
$$\ell\big\{y_i \cdot \big[f_{\mathbf{W}^{(0)}}(\mathbf{x}_i) + \langle \nabla_{\mathbf{W}} f_{\mathbf{W}^{(0)}}(\mathbf{x}_i), \bar{\mathbf{W}} \rangle\big]\big\} \le n^{-1/2}.$$
Since $\hat{f}(\cdot) = f_{\mathbf{W}^{(0)}}(\cdot) + \langle \nabla_{\mathbf{W}} f_{\mathbf{W}^{(0)}}(\cdot), \bar{\mathbf{W}} \rangle \in \mathcal{F}\big(\mathbf{W}^{(0)}, \tilde{O}\big(\sqrt{\tilde{\mathbf{y}}^\top (\mathbf{\Theta}^{(L)})^{-1} \tilde{\mathbf{y}}}\big)\big)$, applying Theorem 3.3 and taking the infimum over $\tilde{\mathbf{y}}$ completes the proof.

5 Conclusions and Future Work

In this paper we provide an expected 0-1 error bound for wide and deep ReLU networks trained with SGD.
This generalization error bound is measured by the NTRF function class. The connection to the neural tangent kernel studied in Jacot et al. [18] is also discussed. Our result covers a series of recent generalization bounds for wide enough neural networks, and provides sharper bounds. An important future direction is to improve the over-parameterization conditions in Theorem 3.3 and Corollary 3.10. Other future directions include proving sample complexity lower bounds in the over-parameterized regime, applying the results in Jain et al. [19] to obtain last-iterate bounds for SGD, and establishing uniform convergence based generalization bounds for over-parameterized neural networks with methods developed in Bartlett et al. [6], Neyshabur et al. [27], Long and Sedghi [26].

Acknowledgement

We would like to thank Peter Bartlett for a valuable discussion, and Simon S. Du for pointing out a related work [3]. We also thank the anonymous reviewers and area chair for their helpful comments. This research was sponsored in part by the National Science Foundation CAREER Award IIS-1906169, IIS-1903202, and Salesforce Deep Learning Research Award. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.

References

[1] ALLEN-ZHU, Z., LI, Y. and LIANG, Y. (2018). Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918.

[2] ALLEN-ZHU, Z., LI, Y. and SONG, Z. (2018). A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962.

[3] ARORA, S., DU, S. S., HU, W., LI, Z., SALAKHUTDINOV, R. and WANG, R. (2019). On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955.

[4] ARORA, S., DU, S. S., HU, W., LI, Z. and WANG, R. (2019).
Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584.

[5] ARORA, S., GE, R., NEYSHABUR, B. and ZHANG, Y. (2018). Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296.

[6] BARTLETT, P. L., FOSTER, D. J. and TELGARSKY, M. J. (2017). Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems.

[7] BRUTZKUS, A., GLOBERSON, A., MALACH, E. and SHALEV-SHWARTZ, S. (2017). SGD learns over-parameterized networks that provably generalize on linearly separable data. arXiv preprint arXiv:1710.10174.

[8] CAO, Y. and GU, Q. (2019). A generalization theory of gradient descent for learning over-parameterized deep ReLU networks. arXiv preprint arXiv:1902.01384.

[9] CESA-BIANCHI, N., CONCONI, A. and GENTILE, C. (2004). On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory 50 2050–2057.

[10] DANIELY, A. (2017). SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems.

[11] DU, S. S., LEE, J. D., LI, H., WANG, L. and ZHAI, X. (2018). Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804.

[12] DU, S. S., ZHAI, X., POCZOS, B. and SINGH, A. (2018). Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054.

[13] DZIUGAITE, G. K. and ROY, D. M. (2017). Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008.

[14] E, W., MA, C., WU, L. ET AL. (2019). A comparative analysis of the optimization and generalization property of two-layer neural network and random feature models under gradient descent dynamics.
arXiv preprint arXiv:1904.04326.

[15] GOLOWICH, N., RAKHLIN, A. and SHAMIR, O. (2017). Size-independent sample complexity of neural networks. arXiv preprint arXiv:1712.06541.

[16] HE, K., ZHANG, X., REN, S. and SUN, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision.

[17] HINTON, G., DENG, L., YU, D., DAHL, G. E., MOHAMED, A.-R., JAITLY, N., SENIOR, A., VANHOUCKE, V., NGUYEN, P., SAINATH, T. N. ET AL. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29 82–97.

[18] JACOT, A., GABRIEL, F. and HONGLER, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. arXiv preprint arXiv:1806.07572.

[19] JAIN, P., NAGARAJ, D. and NETRAPALLI, P. (2019). Making the last iterate of SGD information theoretically optimal. arXiv preprint arXiv:1904.12443.

[20] KRIZHEVSKY, A., SUTSKEVER, I. and HINTON, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems.

[21] LANGFORD, J. and CARUANA, R. (2002). (Not) bounding the true error. In Advances in Neural Information Processing Systems.

[22] LECUN, Y., BOTTOU, L., BENGIO, Y., HAFFNER, P. ET AL. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 2278–2324.

[23] LEE, J., XIAO, L., SCHOENHOLZ, S. S., BAHRI, Y., SOHL-DICKSTEIN, J. and PENNINGTON, J. (2019). Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720.

[24] LI, X., LU, J., WANG, Z., HAUPT, J. and ZHAO, T. (2018). On tighter generalization bound for deep neural networks: CNNs, ResNets, and beyond. arXiv preprint arXiv:1806.05159.

[25] LI, Y. and LIANG, Y. (2018).
Learning overparameterized neural networks via stochastic gradient descent on structured data. arXiv preprint arXiv:1808.01204.

[26] LONG, P. M. and SEDGHI, H. (2019). Size-free generalization bounds for convolutional neural networks. arXiv preprint arXiv:1905.12600.

[27] NEYSHABUR, B., BHOJANAPALLI, S., MCALLESTER, D. and SREBRO, N. (2017). A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564.

[28] NEYSHABUR, B., LI, Z., BHOJANAPALLI, S., LECUN, Y. and SREBRO, N. (2018). The role of over-parametrization in generalization of neural networks.

[29] NEYSHABUR, B., TOMIOKA, R. and SREBRO, N. (2015). Norm-based capacity control in neural networks. In Conference on Learning Theory.

[30] OYMAK, S. and SOLTANOLKOTABI, M. (2019). Towards moderate overparameterization: Global convergence guarantees for training shallow neural networks. arXiv preprint arXiv:1902.04674.

[31] RAHIMI, A. and RECHT, B. (2008). Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems.

[32] RAHIMI, A. and RECHT, B. (2009). Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems.

[33] SHALEV-SHWARTZ, S. and BEN-DAVID, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.

[34] SILVER, D., HUANG, A., MADDISON, C. J., GUEZ, A., SIFRE, L., VAN DEN DRIESSCHE, G., SCHRITTWIESER, J., ANTONOGLOU, I., PANNEERSHELVAM, V., LANCTOT, M. ET AL. (2016). Mastering the game of Go with deep neural networks and tree search. Nature 529 484–489.

[35] WEI, C., LEE, J. D., LIU, Q. and MA, T. (2018). On the margin theory of feedforward neural networks. arXiv preprint arXiv:1810.05369.

[36] YANG, G. (2019).
Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760.

[37] YEHUDAI, G. and SHAMIR, O. (2019). On the power and limitations of random features for understanding neural networks. arXiv preprint arXiv:1904.00687.

[38] ZHANG, C., BENGIO, S., HARDT, M., RECHT, B. and VINYALS, O. (2016). Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.

[39] ZOU, D., CAO, Y., ZHOU, D. and GU, Q. (2018). Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888.