{"title": "Algorithm-Dependent Generalization Bounds for Overparameterized Deep Residual Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 14797, "page_last": 14807, "abstract": "The skip-connections used in residual networks have become a standard architecture choice in deep learning due to the increased generalization and stability of networks with this architecture, although there have been limited theoretical guarantees for this improved performance. In this work, we analyze overparameterized deep residual networks trained by gradient descent following random initialization, and demonstrate that (i) the class of networks learned by gradient descent constitutes a small subset of the entire neural network function class, and (ii) this subclass of networks is sufficiently large to guarantee small training error. By showing (i) we are able to demonstrate that deep residual networks trained with gradient descent have a small generalization gap between training and test error, and together with (ii) this guarantees that the test error will be small. Our optimization and generalization guarantees require overparameterization that is only logarithmic in the depth of the network, which helps explain why residual networks are preferable to fully connected ones.", "full_text": "Algorithm-Dependent Generalization Bounds for\n\nOverparameterized Deep Residual Networks\n\nSpencer Frei\u2217 and Yuan Cao\u2020 and Quanquan Gu\u2021\n\nAbstract\n\nThe skip-connections used in residual networks have become a standard architec-\nture choice in deep learning due to the increased training stability and generalization\nperformance with this architecture, although there has been limited theoretical un-\nderstanding for this improvement. 
In this work, we analyze overparameterized deep residual networks trained by gradient descent following random initialization, and demonstrate that (i) the class of networks learned by gradient descent constitutes a small subset of the entire neural network function class, and (ii) this subclass of networks is sufficiently large to guarantee small training error. By showing (i) we are able to demonstrate that deep residual networks trained with gradient descent have a small generalization gap between training and test error, and together with (ii) this guarantees that the test error will be small. Our optimization and generalization guarantees require overparameterization that is only logarithmic in the depth of the network, while all known generalization bounds for deep non-residual networks have overparameterization requirements that are at least polynomial in the depth. This provides an explanation for why residual networks are preferable to non-residual ones.

1 Introduction

Deep learning has seen an incredible amount of success in a variety of settings over the past eight years, from image recognition [15] to audio recognition [20] and more. Compared with its rapid and widespread adoption, the theoretical understanding of why deep learning works so well has lagged significantly. This is particularly the case in the common setup of an overparameterized network, where the number of parameters in the network greatly exceeds the number of training examples and the input dimension. In this setting, networks have the capacity to perfectly fit training data, regardless of whether it is labeled with real labels or random ones [25]. However, when trained on real data, these networks also have the capacity to truly learn patterns in the data, as evidenced by the impressive performance of overparameterized networks on a variety of benchmark datasets.
This suggests the presence of certain mechanisms underlying the data, neural network architectures, and training algorithms which enable the generalization performance of neural networks. A theoretical analysis that seeks to explain why neural networks work so well would therefore benefit from careful attention to the specific properties that neural networks have when trained under common optimization techniques.

*Department of Statistics, University of California, Los Angeles, CA 90095, USA; e-mail: spencerfrei@ucla.edu
†Department of Computer Science, University of California, Los Angeles, CA 90095, USA; e-mail: yuancao@cs.ucla.edu
‡Department of Computer Science, University of California, Los Angeles, CA 90095, USA; e-mail: qgu@cs.ucla.edu

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Many recent attempts at uncovering the generalization ability of deep learning have focused on general properties of neural network function classes with fixed weights and training losses. For instance, Bartlett et al. [4] proved a spectrally normalized margin bound for deep fully connected networks in terms of the spectral norms of the weights at each layer. Neyshabur et al. [18] proved a similar bound using a PAC-Bayesian approach. Arora et al. [2] developed a compression-based framework for the generalization of deep fully connected and convolutional networks, and also provided an explicit comparison of recent generalization bounds in the literature. All of these studies involved algorithm-independent analyses of neural network generalization, with resultant generalization bounds that involve quantities that make the bound looser with increased overparameterization.

An important recent development in the practical deployment of neural networks has been the introduction of skip connections between layers, leading to a class of architectures known as residual networks.
Residual networks were first introduced by He et al. [13] to much fanfare, quickly becoming a standard architecture choice for state-of-the-art neural network classifiers. The motivation for residual networks came from the poor behavior of very deep traditional fully connected networks: although deeper fully connected networks can clearly express any function that a shallower one can, in practice (i.e., using gradient descent) it can be difficult to choose hyperparameters that result in small training error. Deep residual networks, on the other hand, are remarkably stable in practice, in the sense that they avoid getting stuck at initialization or having unpredictable oscillations in training and validation error, two common occurrences when training deep non-residual networks. Moreover, deep residual networks have been shown to generalize with better performance and far fewer parameters than non-residual networks [22, 7, 14]. We note that much of the recent neural network generalization literature has focused on non-residual architectures [4, 18, 2, 12, 5], with bounds for the generalization gap that grow exponentially as the depth of the network increases. Li et al. [16] recently studied a class of residual networks and proved algorithm-independent bounds for the generalization gap that become larger as the depth of the network increases, with a dependence on the depth that is somewhere between sublinear and exponential (a precise characterization requires further assumptions and/or analysis). We note that verifying the non-vacuousness of algorithm-independent generalization bounds relies on empirical arguments about what values the quantities that appear in the bounds generally take in practical networks (i.e.,
norms of weight matrices and interlayer activations), while algorithm-dependent generalization bounds such as the ones we provide in this paper can be understood without relying on experiments.

1.1 Our Contributions

In this work, we consider fully connected deep ReLU residual networks and study optimization and generalization properties of such networks when trained with discrete-time gradient descent following Gaussian initialization.

We consider binary classification under the cross-entropy loss and focus on data that come from distributions D for which there exists a function f from a large function class F satisfying y · f(x) ≥ γ > 0 for all (x, y) ∈ supp(D) (see Assumption 3.2). By analyzing the trajectory of the parameters of the network during gradient descent, for any error threshold ε > 0, we are able to show:

1. Under the cross-entropy loss, we can study an analogous surrogate error and bound the true classification error by the true surrogate error. This method was introduced by Cao and Gu [5].

2. If m* = Õ(poly(γ^{-1})) · max(d, ε^{-2}), then provided every layer of the network has at least m ≥ m* units, gradient descent with a small enough step size finds a point with empirical surrogate error at most ε in at most Õ(poly(γ^{-1}) · ε^{-1}) steps with high probability. Here, Õ(·) hides logarithmic factors that may depend on the depth L of the network, the margin γ, the number of samples n, the error threshold ε, and the probability level δ.

3.
Provided m* = Õ(poly(γ^{-1}, ε^{-1})) and n = Õ(poly(γ^{-1}, ε^{-1})), the difference between the empirical surrogate error and the true surrogate error is at most ε with high probability, and therefore the above provides a bound on the true classification error of the learned network.

We emphasize that our guarantees above come with at most logarithmic dependence on the depth of the network. Our methods are adapted from those used for the fully connected architecture by Cao and Gu [5] to the residual network architecture. The main proof idea is that overparameterization forces gradient descent-trained networks to stay in a small neighborhood of initialization where the learned networks (i) are guaranteed to find small surrogate training error, and (ii) come from a sufficiently small hypothesis class to guarantee a small generalization gap between the training and test errors. By showing that these competing phenomena occur simultaneously, we are able to derive the test error guarantees of Corollary 3.7. The key insight of our analysis is that the Lipschitz constant of the network output for deep residual networks, as well as the semismoothness property (Lemma 4.2), have at most logarithmic dependence on the depth, while the known analogues for non-residual architectures all have polynomial dependence on the depth.

1.2 Additional Related Work

In the last year there have been a variety of works developing algorithm-dependent guarantees for neural network optimization and generalization [17, 1, 28, 9, 3, 5, 27, 6]. Li and Liang [17] were among the first to theoretically analyze the properties of overparameterized fully connected neural networks trained with Gaussian random initialization, focusing on a two-layer (one hidden layer) model under a data separability assumption.
Their work provided two significant insights into the training process of overparameterized ReLU neural networks: (1) the weights stay close to their initial values throughout the optimization trajectory, and (2) the ReLU activation patterns for a given example do not change much throughout the optimization trajectory. These insights were the backbone of the authors' strong generalization result for stochastic gradient descent (SGD) in the two-layer case. The insights of Li and Liang [17] provided a basis for various subsequent studies. Du et al. [9] analyzed a two-layer model using a method based on the Gram matrix, taking inspiration from kernel methods, and showed that gradient descent following Gaussian initialization finds zero training loss solutions at a linear rate. Zou et al. [28] and Allen-Zhu et al. [1] extended the results of Li and Liang to the fully connected case with an arbitrary number L of hidden layers, again considering (stochastic) gradient descent trained from random initialization. Both works showed that, provided the networks are sufficiently wide, arbitrarily deep networks converge to a zero training loss solution at a linear rate, under an assumption about separability of the data. Recently, Zou and Gu [27] provided an improved analysis of the global convergence of gradient descent and SGD for training deep neural networks, which enjoys a milder overparameterization condition and better iteration complexity than previous work. Under the same data separability assumption, Zhang et al. [26] showed that deep residual networks can achieve zero training loss for the squared loss at a linear rate with overparameterization essentially independent of the depth of the network. We note that Zhang et al.
[26] studied optimization for the regression problem rather than classification, and their results do not distinguish the case with random labels from that with true labels; hence, it is not immediately clear how to translate their analysis to a generalization bound for classification under the cross-entropy loss as we are able to do in this paper.

The above results provide a concrete answer to the question of why overparameterized deep neural networks can achieve zero training loss using gradient descent. However, the theoretical tools of Du et al. [9], Allen-Zhu et al. [1], Zou et al. [28], and Zou and Gu [27] apply to data with random labels as well as true labels, and thus do not explain the generalization to unseen data observed experimentally. Dziugaite and Roy [10] optimized PAC-Bayes bounds for the generalization error of a class of stochastic neural networks that are perturbations of standard neural networks trained by SGD. Cao and Gu [5] proved a guarantee of arbitrarily small generalization error for classification in deep fully connected neural networks trained with gradient descent using random initialization. The same authors recently provided an improved result for deep fully connected networks trained by stochastic gradient descent using a different approach that relies on the neural tangent kernel and online-to-batch conversion [6]. E et al. [11] recently developed algorithm-dependent generalization bounds for a special residual network architecture with many different kinds of skip connections by using kernel methods.

2 Network Architecture and Optimization Problem

We begin with the notation of the paper. We denote vectors by lowercase letters and matrices by uppercase letters, with the convention that a vector v is a column vector and its transpose v^⊤ is a row vector.
We use the standard O(·), Ω(·), Θ(·) complexity notations to ignore universal constants, with Õ(·), Ω̃(·) additionally ignoring logarithmic factors. For n ∈ ℕ, we write [n] = {1, 2, . . . , n}.

Denote the number of hidden units at layer l as m_l, l = 1, . . . , L + 1. Let the l-th layer weights be W_l ∈ ℝ^{m_{l-1} × m_l}, and concatenate all of the layer weights into a vector W = (W_1, . . . , W_{L+1}). Denote by w_{l,j} the j-th column of W_l. Let σ(x) = max(0, x) be the ReLU nonlinearity, and let θ be a constant scaling parameter. We consider a class of residual networks defined by the following architecture:

x_1 = σ(W_1^⊤ x),
x_l = x_{l-1} + θ σ(W_l^⊤ x_{l-1}), l = 2, . . . , L,
x_{L+1} = σ(W_{L+1}^⊤ x_L).

Above, we denote x_l as the l-th hidden layer activations of input x ∈ ℝ^d, with x_0 := x. In order for this network to be defined, it is necessary that m_1 = m_2 = · · · = m_L. We are free to choose m_{L+1}, as long as m_{L+1} = Θ(m_1) (see Assumption 3.4). We define a constant, non-trainable vector v = (1, 1, . . . , 1, -1, -1, . . . , -1)^⊤ ∈ ℝ^{m_{L+1}} with equal parts +1's and -1's that determines the network output,

f_W(x) = v^⊤ x_{L+1}.

We note that our methods can be extended to the case of trainable top layer weights v by choosing the appropriate scale of initialization for v. We choose to fix the top layer weights in this paper for simplicity of exposition.

We will find it useful to consider the matrix multiplication form of the ReLU activations, which we describe below. Let 1(A) denote the indicator function of a set A, and define diagonal matrices Σ_l(x) ∈ ℝ^{m_l × m_l} by [Σ_l(x)]_{j,j} = 1(w_{l,j}^⊤ x_{l-1} > 0), l = 1, . . . , L + 1. By convention we denote products of matrices ∏_{i=a}^{b} M_i by M_b · M_{b-1} · .
. . · M_a when a ≤ b, and by the identity matrix when a > b. With this convention, we can introduce notation for the l-to-l′ interlayer activations H_l^{l′}(x) of the network. For 2 ≤ l ≤ l′ ≤ L and input x ∈ ℝ^d we denote

H_l^{l′}(x) := ∏_{r=l}^{l′} ( I + θ Σ_r(x) W_r^⊤ ), (2 ≤ l ≤ l′ ≤ L).    (1)

If l = 1 < l′, we denote H_1^{l′}(x) = H_2^{l′}(x) Σ_1(x) W_1^⊤, and if l′ = L + 1 > l, we denote H_l^{L+1}(x) = Σ_{L+1}(x) W_{L+1}^⊤ H_l^L(x). Using this notation, we can write the output of the neural network as f_W(x) = v^⊤ H_{l+1}^{L+1}(x) x_l for any l ∈ {0} ∪ [L + 1] and x ∈ ℝ^d. For notational simplicity, we will denote Σ_l(x) by Σ_l and H_l^{l′}(x) by H_l^{l′} when the dependence on the input is clear.

We assume we have i.i.d. samples (x_i, y_i)_{i=1}^n ~ D from a distribution D, where x_i ∈ ℝ^d and y_i ∈ {±1}. We note the abuse of notation in the above, where x_l ∈ ℝ^{m_l} refers to the l-th hidden layer activations of an arbitrary input x ∈ ℝ^d while x_i refers to the i-th sample x_i ∈ ℝ^d. We shall use x_{l,i} ∈ ℝ^{m_l} when referring to the l-th hidden layer activations of a sample x_i ∈ ℝ^d (where i ∈ [n] and l ∈ [L + 1]), while x_l ∈ ℝ^{m_l} shall refer to the l-th hidden layer activations of an arbitrary input x ∈ ℝ^d.

Let ℓ(x) = log(1 + exp(-x)) be the cross-entropy loss.
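To make the architecture concrete, the following is a minimal NumPy sketch (our own illustration, not the authors' code) of Gaussian initialization per Assumption 3.3 together with the forward pass x_1 = σ(W_1^⊤ x), x_l = x_{l-1} + θ σ(W_l^⊤ x_{l-1}) for l = 2, . . . , L, x_{L+1} = σ(W_{L+1}^⊤ x_L), and f_W(x) = v^⊤ x_{L+1}; all hidden widths are set to a common m, and the function names are ours.

```python
import numpy as np

def init_weights(d, m, L, rng):
    """Gaussian initialization: entries of W_l ~ N(0, 2/m_l), as in
    Assumption 3.3; here m_0 = d and m_1 = ... = m_{L+1} = m."""
    dims = [d] + [m] * (L + 1)
    return [rng.normal(0.0, np.sqrt(2.0 / dims[l + 1]),
                       size=(dims[l], dims[l + 1])) for l in range(L + 1)]

def forward(W, x, theta):
    """Forward pass of the residual architecture and output f_W(x) = v^T x_{L+1},
    with v a fixed vector of equal parts +1's and -1's."""
    relu = lambda z: np.maximum(z, 0.0)
    h = relu(W[0].T @ x)                 # x_1: first layer, no skip connection
    for Wl in W[1:-1]:                   # residual blocks l = 2, ..., L
        h = h + theta * relu(Wl.T @ h)
    top = relu(W[-1].T @ h)              # x_{L+1}: last layer, no skip connection
    m_out = top.shape[0]
    v = np.concatenate([np.ones(m_out // 2), -np.ones(m_out - m_out // 2)])
    return v @ top
```

The residual scaling θ would be taken of order 1/L, matching the assumption θ = 1/Ω(L) used throughout the paper.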
We consider the empirical risk minimization problem optimized by constant step size gradient descent,

min_W L_S(W) := (1/n) ∑_{i=1}^{n} ℓ(y_i · f_W(x_i)),    W_l^{(k+1)} = W_l^{(k)} - η · ∇_{W_l} L_S(W^{(k)})    (l ∈ [L + 1]).

We shall see below that a key quantity for studying the trajectory of the weights in the above optimization regime is a surrogate loss defined by the derivative of the cross-entropy loss. We denote the empirical and true surrogate loss by

E_S(W) := -(1/n) ∑_{i=1}^{n} ℓ′(y_i · f_W(x_i)),    E_D(W) := E_{(x,y)~D}[-ℓ′(y · f_W(x))],

respectively. The empirical surrogate loss was first introduced by Cao and Gu [5] for the study of deep non-residual networks. Finally, we note here a formula for the gradient of the output of the network with respect to the different layer weights:

∇_{W_l} f_W(x) = θ^{1(2 ≤ l ≤ L)} x_{l-1} v^⊤ H_{l+1}^{L+1} Σ_l(x),    (1 ≤ l ≤ L + 1).    (2)

3 Main Theory

We first go over the assumptions necessary for our proof and then discuss our main results. Our assumptions align with those made by Cao and Gu [5] in the fully connected case. The first main assumption is that the input data is normalized.

Assumption 3.1. Input data are normalized: supp(D_x) ⊂ S^{d-1} = {x ∈ ℝ^d : ‖x‖_2 = 1}.

Data normalization is common in the statistical learning theory literature, from linear models up to and including recent work on neural networks [17, 28, 9, 1, 3, 5], and can easily be satisfied for arbitrary training data by mapping samples x ↦ x/‖x‖_2.

The next assumption is on the data generating distribution.
Because overparameterized networks can memorize data, any hope of demonstrating that neural networks have a small generalization gap must restrict the class of data distributions to one where some type of learning is possible.

Assumption 3.2. Let p(u) denote the density of a standard d-dimensional Gaussian vector. Define

F = { ∫_{ℝ^d} c(u) σ(u^⊤ x) p(u) du : ‖c(·)‖_∞ ≤ 1 }.

Assume there exists f(·) ∈ F and a constant γ > 0 such that y · f(x) ≥ γ for all (x, y) ∈ supp(D).

Assumption 3.2 was introduced by Cao and Gu [5] for the analysis of fully connected networks and is applicable to distributions where samples can be perfectly classified by the random kitchen sinks model of Rahimi and Recht [19]. One can view a function from this class as the infinite width limit of a one-hidden-layer neural network with regularizer given by a function c(·) with bounded ℓ_∞-norm. As pointed out by Cao and Gu [5], this assumption includes the linearly separable case.

Our next assumption concerns the scaling of the weights at initialization.

Assumption 3.3 (Gaussian initialization). We say that the weight matrices W_l ∈ ℝ^{m_{l-1} × m_l} are generated via Gaussian initialization if each of the entries of W_l is generated independently from N(0, 2/m_l).

This assumption is common to much of the recent theoretical analysis of neural networks [17, 28, 1, 9, 3, 5] and is known as He initialization due to its usage in the first ResNet paper by He et al.
[13]. This assumption guarantees that the spectral norms of the weights are controlled at initialization.

Our last assumption concerns the widths of the networks we consider and allows us to exclude pathological dependencies between the width and the other parameters that define the architecture and optimization problem.

Assumption 3.4 (Widths are of the same order). We assume m_{L+1} = Θ(m_L). We call m = m_L ∧ m_{L+1} the width of the network.

Our first theorem shows that provided we have sufficient overparameterization and a sufficiently small step size, the iterates W^{(k)} of gradient descent stay within a small neighborhood of their initialization. Additionally, the empirical surrogate error can be bounded by a term that decreases as we increase the width m of the network.

Theorem 3.5. Suppose W^{(0)} is generated via Gaussian initialization and that the residual scaling parameter satisfies θ = 1/Ω(L). For τ > 0, denote a τ-neighborhood of the weights W^{(0)} = (W_1^{(0)}, . . . , W_{L+1}^{(0)}) at initialization by
W(W^{(0)}, τ) := { W = (W_1, . . . , W_{L+1}) : ‖W_l - W_l^{(0)}‖_F ≤ τ for all l ∈ [L + 1] }.

There exist absolute constants ν, ν′, ν′′, C, C′ > 0 such that for any δ > 0, provided τ ≤ ν γ^{12} (log m)^{-3/2}, η ≤ ν′ (τ m^{-1/2} ∧ γ^4 m^{-1}), Kη ≤ ν′′ τ^2 γ^4 (log(n/δ))^{-1/2}, and the width of the network satisfies

m ≥ C′ ( τ^{-4/3} d log(m/(τδ)) ∨ d log(mL/δ) ∨ τ^{-2/3} (log m)^{-1} log(L/δ) ∨ γ^{-2} ( d log(1/γ) ∨ log(L/δ) ∨ log(n/δ) ) ),

then with probability at least 1 - δ, gradient descent starting at W^{(0)} with step size η generates K iterates W^{(1)}, . . . , W^{(K)} that satisfy:

(i) W^{(k)} ∈ W(W^{(0)}, τ) for all k ∈ [K].

(ii) There exists k ∈ {0, . . . , K - 1} with E_S(W^{(k)}) ≤ C · m^{-1/2} · (Kη)^{-1/2} · (log(n/δ))^{1/4} · γ^{-2}.

This theorem allows us to restrict our attention from the large class of all deep residual neural networks to the reduced complexity class of those with weights that satisfy W ∈ W(W^{(0)}, τ). Our analysis provides a characterization of the radius of this reduced complexity class in terms of parameters that define the network architecture and optimization problem.
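The quantities appearing in Theorem 3.5 can be illustrated on a toy instance of the architecture. The sketch below is our own illustration: it runs constant-step-size gradient descent on the cross-entropy loss L_S, tracks the empirical surrogate loss E_S(W) = -(1/n) ∑_i ℓ′(y_i f_W(x_i)), and records the largest Frobenius distance max_l ‖W_l - W_l^{(0)}‖_F from initialization, i.e. the smallest radius τ for which the final iterate lies in W(W^{(0)}, τ). Gradients are taken by finite differences purely for brevity (the paper works with the analytic gradient of eq. (2)), and all sizes and the toy labeling are our own arbitrary choices.

```python
import numpy as np

def forward(W, x, theta, v):
    """f_W(x) for the residual architecture of Section 2."""
    relu = lambda z: np.maximum(z, 0.0)
    h = relu(W[0].T @ x)                      # x_1 (no skip connection)
    for Wl in W[1:-1]:                        # residual blocks l = 2, ..., L
        h = h + theta * relu(Wl.T @ h)
    return v @ relu(W[-1].T @ h)              # v^T x_{L+1}

def ce_loss(W, X, y, theta, v):
    """Empirical risk L_S(W) = (1/n) sum_i log(1 + exp(-y_i f_W(x_i)))."""
    f = np.array([forward(W, xi, theta, v) for xi in X])
    return float(np.mean(np.log1p(np.exp(-y * f))))

def surrogate(W, X, y, theta, v):
    """Empirical surrogate loss E_S(W) = -(1/n) sum_i l'(y_i f_W(x_i)),
    where -l'(z) = 1/(1 + exp(z)) for the cross-entropy loss l."""
    f = np.array([forward(W, xi, theta, v) for xi in X])
    return float(np.mean(1.0 / (1.0 + np.exp(y * f))))

def num_grad(W, X, y, theta, v, eps=1e-5):
    """Central-difference gradient of L_S (illustration only)."""
    grads = []
    for Wl in W:
        G = np.zeros_like(Wl)
        for idx in np.ndindex(*Wl.shape):
            old = Wl[idx]
            Wl[idx] = old + eps; up = ce_loss(W, X, y, theta, v)
            Wl[idx] = old - eps; dn = ce_loss(W, X, y, theta, v)
            Wl[idx] = old
            G[idx] = (up - dn) / (2 * eps)
        grads.append(G)
    return grads

# Toy problem: d = 3, width m = 8, L = 2 residual blocks, theta = 1/(2L).
rng = np.random.default_rng(1)
d, m, L, theta, eta, K = 3, 8, 2, 0.25, 0.2, 15
dims = [d] + [m] * (L + 1)
W0 = [rng.normal(0.0, np.sqrt(2.0 / dims[l + 1]), (dims[l], dims[l + 1]))
      for l in range(L + 1)]                  # Assumption 3.3: N(0, 2/m_l)
v = np.concatenate([np.ones(m // 2), -np.ones(m // 2)])
X = rng.normal(size=(8, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.sign(X[:, 0] + 0.1)                    # an arbitrary toy labeling

W = [Wl.copy() for Wl in W0]
loss0 = ce_loss(W, X, y, theta, v)
for _ in range(K):
    W = [Wl - eta * Gl for Wl, Gl in zip(W, num_grad(W, X, y, theta, v))]
lossK = ce_loss(W, X, y, theta, v)

# smallest tau with the final W in W(W^(0), tau): max_l ||W_l - W_l^(0)||_F
tau = max(np.linalg.norm(Wl - W0l) for Wl, W0l in zip(W, W0))
print(f"L_S: {loss0:.3f} -> {lossK:.3f}, "
      f"E_S: {surrogate(W, X, y, theta, v):.3f}, tau: {tau:.3f}")
```

On a toy run like this one expects the loss to decrease while the iterates remain close to initialization, mirroring conclusions (i) and (ii), though of course nothing at this scale verifies the theorem's rates.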
Additionally, this theorem allows us to translate the optimization problem over the empirical loss L_S(W) into one for the empirical surrogate loss E_S(W^{(k)}), a quantity that is simply related to the classification error (the classification error under the 0-1 loss is bounded by a constant multiple of its expectation; see Appendix A.2).

Our next theorem characterizes the Rademacher complexity of the class of residual networks with weights in a τ-neighborhood of the initialization. Additionally, it connects the test accuracy with the empirical surrogate loss and the Rademacher complexity.

Theorem 3.6. Let W^{(0)} denote the weights at Gaussian initialization and suppose the residual scaling parameter satisfies θ = 1/Ω(L). Suppose τ ≤ 1. Then there exist absolute constants C_1, C_2, C_3 > 0 such that for any δ > 0, provided

m ≥ C_1 ( τ^{-2/3} (log m)^{-1} log(L/δ) ∨ τ^{-4/3} d log(m/(τδ)) ∨ d log(mL/δ) ),

then with probability at least 1 - δ, we have the following bound on the Rademacher complexity,

R_n( {f_W : W ∈ W(W^{(0)}, τ)} ) ≤ C_2 ( τ^{4/3} √(m log m) + τ √m / √n ),

so that for all W ∈ W(W^{(0)}, τ),

P_{(x,y)~D}(y · f_W(x) < 0) ≤ 2 E_S(W) + C_2 ( τ^{4/3} √(m log m) + τ √m / √n ) + C_3 √(log(1/δ)/n).    (3)

We shall see in Section 4 that we are able to derive the above bound on the Rademacher complexity by using a semi-smoothness property of the neural network output and an upper bound on the gradient of the network output.
Standard arguments from statistical learning theory provide the first and third terms in (3).

The missing ingredients needed to realize the result of Theorem 3.6 for networks trained by gradient descent are supplied by Theorem 3.5, which gives (i) control of the growth of the empirical surrogate error E_S along the gradient descent trajectory, and (ii) the distance τ from initialization within which we are guaranteed to find small empirical surrogate error. Putting these together yields Corollary 3.7.

Corollary 3.7. Suppose that the residual scaling parameter satisfies θ = 1/Ω(L). Let ε, δ > 0 be fixed. Suppose that m* = Õ(poly(γ^{-1})) · max(d, ε^{-14}) · log(1/δ) and n = Õ(poly(γ^{-1})) · ε^{-4}. Then for any m ≥ m*, with probability at least 1 - δ over the initialization and training sample, there is an iterate k ∈ {0, . . . , K - 1} with K = Õ(poly(γ^{-1})) · ε^{-2} such that gradient descent with Gaussian initialization and step size η = O(γ^4 · m^{-1}) satisfies

P_{(x,y)~D}[y · f_{W^{(k)}}(x) < 0] ≤ ε.

This corollary shows that for deep residual networks, provided we have sufficient overparameterization, gradient descent is guaranteed to find networks that have arbitrarily high classification accuracy. In comparison with the results of Cao and Gu [5], the width m, number of samples n, step size η, and number of iterates K required for the guarantees for residual networks given in Theorem 3.5 and Corollary 3.7 all have (at most) logarithmic dependence on L, as opposed to the exponential dependence in the corresponding results for the non-residual architecture.
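The passage from surrogate error to classification error used in these results rests on a pointwise comparison between -ℓ′ and the 0-1 loss; the factor 2 in (3) comes from the inequality (1/2) · 1{z < 0} ≤ -ℓ′(z). A quick numerical check of that comparison (our own sketch):

```python
import numpy as np

# For the cross-entropy loss l(z) = log(1 + exp(-z)) we have
# -l'(z) = 1/(1 + exp(z)), and pointwise (1/2) * 1{z < 0} <= -l'(z).
# Taking expectations over z = y * f_W(x) turns this into
# P(y * f_W(x) < 0) <= 2 * E[-l'(y * f_W(x))], the shape of eq. (3).
z = np.linspace(-10.0, 10.0, 2001)
neg_lprime = 1.0 / (1.0 + np.exp(z))
assert np.all(0.5 * (z < 0) <= neg_lprime)

# -l' is also strictly between 0 and 1 and decreasing in the margin,
# so a small surrogate error forces large positive margins y * f_W(x).
assert np.all((neg_lprime > 0.0) & (neg_lprime < 1.0))
```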
Additionally, we note that the step size and number of iterations required for our guarantees are independent of the depth; this is due to the advantage of the residual architecture. Our analysis shows that the presence of skip connections in the network architecture removes the complications relating to the depth that traditionally arise in the analysis of non-residual architectures, for a variety of reasons. The first is a technical one from the proof, in which we show that the Lipschitz constant of the network output and the semismoothness of the network depend at most logarithmically on the depth, so that the network width does not blow up as the depth increases (see Lemmas 4.1 and 4.2 below). Second, the presence of skip-connections allows representations that are learned in the first layer to be directly passed to later layers without needing to use a wider network to relearn those representations. This property was key to our proof of the gradient lower bound of Lemma 4.3 and has been used in previous approximation results for deep residual networks, e.g., Yarotsky [24].

4 Proof Sketch of the Main Theory

In this section we provide a proof sketch of Theorems 3.5 and 3.6 and Corollary 3.7, following the proof technique of Cao and Gu [5]. We first collect the key lemmas needed for their proofs, leaving the proofs of these lemmas to Appendix B. We shall assume throughout this section that the residual scaling parameter satisfies θ = 1/Ω(L), which we note is a common assumption in the literature on residual network analysis [8, 1, 26].

Our first key lemma shows that the interlayer activations defined in (1) are uniformly bounded in x and l provided the network is sufficiently wide.

Lemma 4.1 (Hidden layer and interlayer activations are bounded). Suppose that W_1, . . . , W_{L+1} are generated via Gaussian initialization.
Then there exist absolute constants C_0, C_1, C_2 > 0 such that if m ≥ C_0 d log(mL/δ), then with probability at least 1 - δ, for any l, l′ = 1, . . . , L + 1 with l ≤ l′ and x ∈ S^{d-1}, we have C_1 ≤ ‖x_l‖_2 ≤ C_2 and ‖H_l^{l′}‖_2 ≤ C_2.

Due to the scaling of θ, we are able to get bounds on the interlayer and hidden layer activations that do not grow with L. As we shall see, this will be key for the sublinear dependence on L in the results of Theorems 3.5 and 3.6. The fully connected architecture studied by Cao and Gu [5] had additional polynomial terms in L in the upper bounds for both ‖x_l‖_2 and ‖H_l^{l′}‖_2.

Our next lemma describes a semi-smoothness property of the neural network output f_W and the empirical loss L_S.

Lemma 4.2 (Semismoothness of network output and objective loss). Let W_1, . . . , W_{L+1} be generated via Gaussian initialization, and let τ ≤ 1.
Define

h(Ŵ, W̃) := ‖Ŵ_1 - W̃_1‖_2 + θ ∑_{l=2}^{L} ‖Ŵ_l - W̃_l‖_2 + ‖Ŵ_{L+1} - W̃_{L+1}‖_2.

There exist absolute constants C, C′ > 0 such that if

m ≥ C ( τ^{-2/3} (log m)^{-1} log(L/δ) ∨ τ^{-4/3} d log(m/(τδ)) ∨ d log(mL/δ) ),

then with probability at least 1 - δ, we have for all x ∈ S^{d-1} and Ŵ, W̃ ∈ W(W, τ),

f_Ŵ(x) - f_W̃(x) ≤ C′ τ^{1/3} √(m log m) · h(Ŵ, W̃) + C′ √m · h(Ŵ, W̃)^2 + ∑_{l=1}^{L+1} tr[ (Ŵ_l - W̃_l)^⊤ ∇_{W_l} f_W̃(x) ]

and

L_S(Ŵ) - L_S(W̃) ≤ C′ τ^{1/3} √(m log m) · h(Ŵ, W̃) · E_S(W̃) + C′ m · h(Ŵ, W̃)^2 + ∑_{l=1}^{L+1} tr[ (Ŵ_l - W̃_l)^⊤ ∇_{W_l} L_S(W̃) ].

The semismoothness of the neural network output function f_W will be used in the analysis of generalization by Rademacher complexity arguments. For the objective loss L_S, we apply this lemma to weights along the trajectory of gradient descent. Since the difference in the weights of two consecutive steps of gradient descent satisfies W_l^{(k+1)} - W_l^{(k)} = -η ∇_{W_l} L_S(W^{(k)}), the last term in the bound for the objective loss L_S will take the form -η ∑_{l=1}^{L+1} ‖∇_{W_l} L_S(W^{(k)})‖_F^2.
Thus, by simultaneously demonstrating (i) a lower bound on the gradient for at least one of the layers and (ii) an upper bound on the gradient at all layers (and hence an upper bound for $h(W^{(k+1)}, W^{(k)})$), we can connect the empirical surrogate loss $E_S(W^{(k)})$ at iteration $k$ with the objective loss $L_S(W^{(k)})$, which will lead us to Theorem 3.5. Compared with the fully connected architecture of Cao and Gu [5], our bounds do not have any polynomial terms in $L$.

Thus the only remaining key items needed for our proof are upper and lower bounds for the gradient of the objective loss, described in the following two lemmas.

Lemma 4.3. Let $W = (W_1, \ldots, W_{L+1})$ be weights at Gaussian initialization. There exist absolute constants $C, C', \nu$ such that for any $\delta > 0$, provided $\tau \leq \nu \gamma^3$ and $m \geq C \gamma^{-2} \left( d \log \gamma^{-1} + \log(L/\delta) \right) \vee C' \log(n/\delta)$, then with probability at least $1 - \delta$, for all $\widetilde{W} \in \mathcal{W}(W, \tau)$, we have
$$\big\| \nabla_{W_{L+1}} L_S(\widetilde{W}) \big\|_F^2 \geq C \cdot m_{L+1} \cdot \gamma^4 \cdot E_S(\widetilde{W})^2.$$

Lemma 4.4. Let $W = (W_1, \ldots, W_{L+1})$ be weights at Gaussian initialization. There exists an absolute constant $C > 0$ such that for any $\delta > 0$, provided $m \geq C (d \vee \log(L/\delta))$ and $\tau \leq 1$, we have for all $\widetilde{W} \in \mathcal{W}(W, \tau)$ and all $l$,
$$\big\| \nabla_{W_l} L_S(\widetilde{W}) \big\|_F \leq \theta^{\mathbb{1}(2 \leq l \leq L)} \cdot C \sqrt{m} \cdot E_S(\widetilde{W}).$$

Note that we provide only a lower bound for the gradient at the last layer.
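The layer dependence of the gradient upper bound can also be seen empirically: at random initialization, the gradients of the intermediate layers carry the extra factor of $\theta$ while those of the first and last layers do not. The snippet below measures per-layer gradient Frobenius norms on a toy residual network under squared loss; the architecture, initialization scales, and $\theta = 1/L$ are illustrative assumptions, and the manual backpropagation mirrors the skip-connection structure rather than reproducing the paper's exact gradient formula.

```python
import numpy as np

# Per-layer gradient norms at random init for a toy residual net
#   h_1 = relu(W1 x),  h_l = h_{l-1} + theta * relu(W_l h_{l-1}),  f = v^T h_L.
# The middle layers' gradients should be roughly a factor theta smaller.
rng = np.random.default_rng(1)
d, m, L, n = 4, 64, 8, 20
theta = 1.0 / L

X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # inputs on the unit sphere
y = rng.standard_normal(n)

W1 = rng.standard_normal((m, d)) * np.sqrt(2.0 / m)
Ws = [rng.standard_normal((m, m)) * np.sqrt(2.0 / m) for _ in range(L - 1)]
v = rng.standard_normal(m) / np.sqrt(m)

grads = {"W1": np.zeros_like(W1), "v": np.zeros_like(v)}
grads.update({f"W{l}": np.zeros_like(W) for l, W in enumerate(Ws, start=2)})

for x, yi in zip(X, y):
    # forward pass, caching block inputs and pre-activation signs
    h = np.maximum(W1 @ x, 0.0)
    hs, masks = [h], []
    for W in Ws:
        z = W @ h
        masks.append(z > 0)
        h = h + theta * np.maximum(z, 0.0)       # skip connection
        hs.append(h)
    err = (v @ h - yi) / n                       # d(loss)/d(output), squared loss
    # backward pass through the skip connections
    grads["v"] += err * h
    g = err * v
    for l in range(L - 1, 0, -1):                # residual blocks W_2, ..., W_L
        gm = g * masks[l - 1]
        grads[f"W{l + 1}"] += theta * np.outer(gm, hs[l - 1])
        g = g + theta * (Ws[l - 1].T @ gm)
    grads["W1"] += np.outer(g * (W1 @ x > 0), x)

norms = {k: np.linalg.norm(g) for k, g in grads.items()}
mid = np.mean([norms[f"W{l}"] for l in range(2, L + 1)])
print(f"first layer : {norms['W1']:.4f}")
print(f"middle avg  : {mid:.4f}  (~theta x smaller)")
print(f"output layer: {norms['v']:.4f}")
```

With $\theta = 1/L$ the middle-layer norms are visibly suppressed relative to the first and last layers, matching the $\theta^{\mathbb{1}(2 \leq l \leq L)}$ factor in the upper bound above.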
It may be possible to improve the degrees of the polynomial terms in the results of Theorems 3.5 and 3.6 by deriving lower bounds for the other layers as well.

With all of the key lemmas in place, we can proceed with a proof sketch of Theorems 3.5 and 3.6. The complete proofs can be found in Appendix A.

Proof of Theorem 3.5. Consider $h_k = h(W^{(k+1)}, W^{(k)})$, a quantity that measures the distance of the weights between gradient descent iterations. It takes the form
$$h_k = \eta \left[ \big\| \nabla_{W_1} L_S(W^{(k)}) \big\|_2 + \theta \sum_{l=2}^{L} \big\| \nabla_{W_l} L_S(W^{(k)}) \big\|_2 + \big\| \nabla_{W_{L+1}} L_S(W^{(k)}) \big\|_2 \right].$$
By Lemma 4.4 we can show that $h_k \leq C \eta \sqrt{m}\, E_S(W^{(k)})$. The gradient lower bound of Lemma 4.3 substituted into Lemma 4.2 shows that the dominating term in the semismoothness bound comes from the gradient lower bound, so that we have, for any $k$,
$$L_S(W^{(k+1)}) - L_S(W^{(k)}) \leq -C \cdot \eta \cdot m_{L+1} \cdot \gamma^4 \cdot E_S(W^{(k)})^2.$$
We can telescope the above over $k$ to bound the loss at iteration $k$ in terms of the right-hand side and the loss at initialization. A simple concentration argument shows that the loss at initialization is small under mild overparameterization. Letting $k^* = \operatorname{argmin}_{k \in [K-1]} E_S(W^{(k)})^2$, we can thus show
$$E_S(W^{(k^*)}) \leq C_3 (K\eta \cdot m)^{-1/2} L_S(W^{(0)})^{1/2} \cdot \gamma^{-2} \leq C_3 (K\eta \cdot m)^{-1/2} \left( \log \frac{n}{\delta} \right)^{1/4} \cdot \gamma^{-2}.$$

We provide below a proof sketch of the bound on the Rademacher complexity given in Theorem 3.6, leaving the rest for Appendix A.2.

Proof of Theorem 3.6. Let $\xi_i$ be independent Rademacher random variables.
We consider a first-order approximation to the network output at initialization,
$$F_{W^{(0)}, W}(x) := f_{W^{(0)}}(x) + \sum_{l=1}^{L+1} \mathrm{tr}\left[ \big( W_l - W_l^{(0)} \big)^\top \nabla_{W_l} f_{W^{(0)}}(x) \right],$$
and bound the Rademacher complexity by two terms,
$$\widehat{\mathcal{R}}_S[\mathcal{F}(W^{(0)}, \tau)] \leq \mathbb{E}_\xi \left[ \sup_{W \in \mathcal{W}(W^{(0)}, \tau)} \frac{1}{n} \sum_{i=1}^{n} \xi_i \big( f_W(x_i) - F_{W^{(0)}, W}(x_i) \big) \right] + \mathbb{E}_\xi \left[ \sup_{W \in \mathcal{W}(W^{(0)}, \tau)} \frac{1}{n} \sum_{i=1}^{n} \xi_i \sum_{l=1}^{L+1} \mathrm{tr}\left[ \big( W_l - W_l^{(0)} \big)^\top \nabla_{W_l} f_{W^{(0)}}(x_i) \right] \right].$$
For the first term, taking $\widetilde{W} = W^{(0)}$ in Lemma 4.2 results in $|f_W(x) - F_{W^{(0)}, W}(x)| \leq C_3 \tau^{4/3} \sqrt{m \log m}$. For the second term, since $\|AB\|_F \leq \|A\|_F \|B\|_2$, we reduce this term to a product of two terms. The first involves the norm of the distance of the weights from initialization, which is $\tau$. The second is the norm of the gradient at initialization, which can be handled by using Cauchy–Schwarz and the gradient formula (2) to get $\|\nabla_{W_l} f_{W^{(0)}}\|_F \leq C_2 \theta^{\mathbb{1}(2 \leq l \leq L)} \sqrt{m}$. A standard application of Jensen's inequality gives the $1/\sqrt{n}$ term.

Finally, we can put together Theorems 3.5 and 3.6 by appropriately choosing the scale of $\tau$, $\eta$, and $K$ to get Corollary 3.7. We leave the detailed algebraic calculations for Appendix A.3.

Proof of Corollary 3.7. We need only specify conditions on $\tau$, $\eta$, $K\eta$, and $m$ such that the results of Theorems 3.5 and 3.6 hold, and ensure that each of the four terms in (3) is of the same scale.
This can be satisfied by imposing the condition $K\eta = \nu'' \gamma^4 \tau^2 (\log(n/\delta))^{-1/2}$ and
$$C_3 (K\eta m)^{-1/2} \left( \log \frac{n}{\delta} \right)^{1/4} \cdot \gamma^{-2} = C_2 \tau^{4/3} \sqrt{m \log m} = C_2 \tau \sqrt{m/n} = C_3 \sqrt{\log(1/\delta)/n} = \varepsilon / 4.$$

5 Conclusions

In this paper, we derived algorithm-dependent optimization and generalization results for overparameterized deep residual networks trained with random initialization using gradient descent. We showed that this class of networks is both small enough to ensure a small generalization gap and large enough to achieve a small training loss. Central to our analysis is the insight that the introduction of skip connections allows us to essentially ignore the depth as a complicating factor in the analysis, in contrast with the well-known difficulty of achieving nonvacuous generalization bounds for deep non-residual networks. This provides a theoretical explanation for the increased stability and generalization of deep residual networks over non-residual ones observed in practice.

Acknowledgement

We would like to thank the anonymous reviewers for their helpful comments. This research was sponsored in part by the National Science Foundation IIS-1903202 and IIS-1906169. QG is also partially supported by the Salesforce Deep Learning Research Grant. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.

References

[1] Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-parameterization. arXiv preprint, arXiv:1811.03962, 2018.

[2] S. Arora, R. Ge, B. Neyshabur, and Y. Zhang. Stronger generalization bounds for deep nets via a compression approach. In ICML, volume 80 of Proceedings of Machine Learning Research, pages 254–263. PMLR, 2018.

[3] S. Arora, S. S. Du, W. Hu, Z.
Li, and R. Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint, arXiv:1901.08584, 2019.

[4] P. L. Bartlett, D. J. Foster, and M. J. Telgarsky. Spectrally-normalized margin bounds for neural networks. In NeurIPS, pages 6241–6250, 2017.

[5] Y. Cao and Q. Gu. A generalization theory of gradient descent for learning over-parameterized deep ReLU networks. arXiv preprint, arXiv:1902.01384, 2019.

[6] Y. Cao and Q. Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. In Conference on Neural Information Processing Systems, 2019.

[7] S. Choi, S. Seo, B. Shin, H. Byun, M. Kersner, B. Kim, D. Kim, and S. Ha. Temporal convolution for real-time keyword spotting on mobile devices. arXiv preprint, arXiv:1904.03814, 2019.

[8] S. S. Du, J. D. Lee, H. Li, L. Wang, and X. Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint, arXiv:1811.03804, 2018.

[9] S. S. Du, X. Zhai, B. Póczos, and A. Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint, arXiv:1810.02054, 2018.

[10] G. K. Dziugaite and D. M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, UAI 2017, Sydney, Australia, August 11-15, 2017, 2017. URL http://auai.org/uai2017/proceedings/papers/173.pdf.

[11] W. E, C. Ma, Q. Wang, and L. Wu. Analysis of the gradient descent algorithm for a deep neural network model with skip-connections. arXiv preprint, arXiv:1904.05263, 2019.

[12] N. Golowich, A. Rakhlin, and O. Shamir. Size-independent sample complexity of neural networks. In COLT, volume 75 of Proceedings of Machine Learning Research, pages 297–299. PMLR, 2018.

[13] K.
He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778. IEEE Computer Society, 2016.

[14] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. arXiv preprint, arXiv:1602.07360, 2016.

[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. Commun. ACM, 60(6):84–90, 2017.

[16] X. Li, J. Lu, Z. Wang, J. D. Haupt, and T. Zhao. On tighter generalization bound for deep neural networks: CNNs, ResNets, and beyond. arXiv preprint, arXiv:1806.05159, 2018.

[17] Y. Li and Y. Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In NeurIPS, pages 8168–8177, 2018.

[18] B. Neyshabur, S. Bhojanapalli, and N. Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In ICLR. OpenReview.net, 2018.

[19] A. Rahimi and B. Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In NeurIPS, pages 1313–1320. Curran Associates, Inc., 2008.

[20] T. N. Sainath and C. Parada. Convolutional neural networks for small-footprint keyword spotting. In INTERSPEECH, pages 1478–1482. ISCA, 2015.

[21] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New York, NY, USA, 2014. ISBN 9781107057135.

[22] R. Tang and J. Lin. Deep residual learning for small-footprint keyword spotting. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018, Calgary, AB, Canada, April 15-20, 2018, pages 5484–5488, 2018. doi: 10.1109/ICASSP.2018.8462688.

[23] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices.
arXiv preprint, arXiv:1011.3027, 2010.

[24] D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017. doi: 10.1016/j.neunet.2017.07.002.

[25] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In ICLR. OpenReview.net, 2017.

[26] H. Zhang, D. Yu, W. Chen, and T. Liu. Training over-parameterized deep ResNet is almost as easy as training a two-layer network. arXiv preprint, arXiv:1903.07120, 2019.

[27] D. Zou and Q. Gu. An improved analysis of training over-parameterized deep neural networks. In Conference on Neural Information Processing Systems, 2019.

[28] D. Zou, Y. Cao, D. Zhou, and Q. Gu. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint, arXiv:1811.08888, 2018.