{"title": "An Improved Analysis of Training Over-parameterized Deep Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 2055, "page_last": 2064, "abstract": "A recent line of research has shown that gradient-based algorithms with random initialization can converge to the global minima of the training loss for over-parameterized (i.e., sufficiently wide) deep neural networks. However, the condition on the width of the neural network to ensure the global convergence is very stringent, which is often a high-degree polynomial in the training sample size $n$ (e.g., $O(n^{24})$). In this paper, we provide an improved analysis of the global convergence of (stochastic) gradient descent for training deep neural networks, which only requires a milder over-parameterization condition than previous work in terms of the training sample size and other problem-dependent parameters. The main technical contributions of our analysis include (a) a tighter gradient lower bound that leads to a faster convergence of the algorithm, and (b) a sharper characterization of the trajectory length of the algorithm. By specializing our result to two-layer (i.e., one-hidden-layer) neural networks, it also provides a milder over-parameterization condition than the best-known result in prior work.", "full_text": "An Improved Analysis of Training\n\nOver-parameterized Deep Neural Networks\n\nDifan Zou\n\nQuanquan Gu\n\nDepartment of Computer Science\n\nUniversity of California, Los Angeles\n\nDepartment of Computer Science\n\nUniversity of California, Los Angeles\n\nLos Angeles, CA 90095\nknowzou@cs.ucla.edu\n\nLos Angeles, CA 90095\n\nqgu@cs.ucla.edu\n\nAbstract\n\nA recent line of research has shown that gradient-based algorithms with random\ninitialization can converge to the global minima of the training loss for over-\nparameterized (i.e., suf\ufb01ciently wide) deep neural networks. 
However, the condition on the width of the neural network to ensure the global convergence is very stringent, which is often a high-degree polynomial in the training sample size n (e.g., O(n^{24})). In this paper, we provide an improved analysis of the global convergence of (stochastic) gradient descent for training deep neural networks, which only requires a milder over-parameterization condition than previous work in terms of the training sample size and other problem-dependent parameters. The main technical contributions of our analysis include (a) a tighter gradient lower bound that leads to a faster convergence of the algorithm, and (b) a sharper characterization of the trajectory length of the algorithm. By specializing our result to two-layer (i.e., one-hidden-layer) neural networks, it also provides a milder over-parameterization condition than the best-known result in prior work.

1 Introduction

A recent study [20] has revealed that deep neural networks trained by gradient-based algorithms can fit training data with random labels and achieve zero training error. Since the loss landscape of training deep neural networks is highly nonconvex or even nonsmooth, conventional optimization theory cannot explain why gradient descent (GD) and stochastic gradient descent (SGD) can find the global minimum of the loss function (i.e., achieve zero training error). To better understand the training of neural networks, there is a line of research [18, 5, 10, 16, 23, 8, 22, 12] studying two-layer (i.e., one-hidden-layer) neural networks, which assumes there exists a teacher network (i.e., an underlying ground-truth network) generating the output given the input, and casts neural network learning as weight matrix recovery for the teacher network.
However, these studies not only make strong assumptions on the training data (existence of a ground-truth network with the same architecture as the learned network), but also need special initialization methods that are very different from the initialization method [13] commonly used in practice. Li and Liang [15] and Du et al. [11] advanced this line of research by proving that, under much milder assumptions on the training data, (stochastic) gradient descent can attain global convergence for training over-parameterized (i.e., sufficiently wide) two-layer ReLU networks with the widely used random initialization method [13]. More recently, Allen-Zhu et al. [2], Du et al. [9] and Zou et al. [24] generalized the global convergence results from two-layer networks to deep neural networks. However, there is a huge gap between theory and practice, since all these works [15, 11, 2, 9, 24] require unrealistic over-parameterization conditions on the width of neural networks, especially for deep networks. Specifically, in order to establish the global convergence for training two-layer ReLU networks, Du et al. [11] require the network width, i.e., the number of hidden nodes, to be at least Ω(n^6/λ_0^4), where n is the training sample size and λ_0 is the smallest eigenvalue of the so-called Gram matrix defined in Du et al. [11], which is essentially the neural tangent kernel [14, 7] on the training data. Under the same assumption on the training data, Wu et al. [19] improved the iteration complexity of GD in Du et al. [11] from O(n^2 log(1/ε)/λ_0^2) to O(n log(1/ε)/λ_0). Oymak and Soltanolkotabi [17] improved the over-parameterization condition to Ω(n‖X‖_2^6/λ_0^4), where ε is the target error and X ∈ R^{n×d} is the input data matrix. For deep ReLU networks, the best known result was established in Allen-Zhu et al. [2], which requires the network width to be at least Ω̃(kn^{24}L^{12}φ^{-8})¹ to ensure the global convergence of GD and SGD, where L is the number of hidden layers, φ is the minimum data separation distance, and k is the output dimension.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

This paper continues this line of research, and improves both the over-parameterization condition and the global convergence rate of (stochastic) gradient descent for training deep neural networks. Specifically, under the same setting as in Allen-Zhu et al. [2], we prove faster global convergence rates for both GD and SGD under a significantly milder condition on the neural network width. Furthermore, when specializing our result to two-layer ReLU networks, it also outperforms the best-known result proved in Oymak and Soltanolkotabi [17]. The improvement in our result is due to the following two innovative proof techniques: (a) a tighter gradient lower bound, which leads to a faster rate of convergence for GD/SGD; and (b) a sharper characterization of the trajectory length for GD/SGD until convergence.

We highlight our main contributions as follows:

• We show that, with Gaussian random initialization [13] on each layer, when the number of hidden nodes per layer is Ω̃(kn^8 L^{12} φ^{-4}), GD can achieve ε training loss within Õ(n^2 L^2 log(1/ε) φ^{-1}) iterations, where L is the number of hidden layers, φ is the minimum data separation distance, n is the number of training examples, and k is the output dimension. Compared with the state-of-the-art result [2], our over-parameterization condition is milder by a factor of Ω̃(n^{16} φ^{-4}), and our iteration complexity is better by a factor of Õ(n^4 φ^{-1}).

• We also prove a similar convergence result for SGD. We show that, with Gaussian random initialization [13] on each layer, when the number of hidden nodes per layer is Ω̃(kn^{17} L^{12} B^{-4} φ^{-8}), SGD can achieve ε expected training loss within Õ(n^5 log(1/ε) B^{-1} φ^{-2}) iterations, where B is the minibatch size of SGD. Compared with the corresponding results in Allen-Zhu et al. [2], our results are strictly better by factors of Ω̃(n^7 B^5) and Õ(n^2) respectively regarding the over-parameterization condition and the iteration complexity.

• When specializing our results for training deep ReLU networks with GD to two-layer ReLU networks, they also outperform the corresponding results [11, 19, 17]. In addition, for training two-layer ReLU networks with SGD, we are able to show a much better result than for training deep ReLU networks with SGD.

For ease of comparison, we summarize the best-known results [11, 2, 9, 19, 17] for training over-parameterized neural networks with GD, and compare with them in terms of over-parameterization condition and iteration complexity in Table 1. We will show in Section 3 that, under the assumption that all training data points have unit ℓ_2 norm, which is the common assumption made in all these works [11, 2, 9, 19, 17], λ_0 > 0 is equivalent to the fact that all training data are separated by some distance φ, and we have λ_0 = O(n^{-2}φ) [17]. Substituting λ_0 = Ω(n^{-2}φ) into Table 1, it is evident that our result outperforms all the other results under the same assumptions.

Notation. For scalars, vectors and matrices, we use lower case, lower case bold face, and upper case bold face letters to denote them, respectively. For a positive integer k, we denote by [k] the set {1, . . . , k}. For a vector x = (x_1, . . .
, x_d)^⊤ and a positive integer p, we denote by ‖x‖_p = (Σ_{i=1}^d |x_i|^p)^{1/p} the ℓ_p norm of x. In addition, we denote by ‖x‖_∞ = max_{i=1,...,d} |x_i| the ℓ_∞ norm of x, and by ‖x‖_0 = |{x_i : x_i ≠ 0, i = 1, . . . , d}| the ℓ_0 norm of x. For a matrix A ∈ R^{m×n}, we denote by ‖A‖_F the Frobenius norm of A, ‖A‖_2 the spectral norm (maximum singular value), λ_min(A) the smallest singular value, ‖A‖_0 the number of nonzero entries, and ‖A‖_{2,∞} the maximum ℓ_2 norm over all row vectors, i.e., ‖A‖_{2,∞} = max_{i=1,...,m} ‖A_{i*}‖_2. For a collection of matrices W = {W_1, . . . , W_L}, we denote ‖W‖_F = (Σ_{l=1}^L ‖W_l‖_F^2)^{1/2}, ‖W‖_2 = max_{l∈[L]} ‖W_l‖_2, and ‖W‖_{2,∞} = max_{l∈[L]} ‖W_l‖_{2,∞}. Given two collections of matrices W̃ = {W̃_1, . . . , W̃_L} and Ŵ = {Ŵ_1, . . . , Ŵ_L}, we define their inner product as ⟨W̃, Ŵ⟩ = Σ_{l=1}^L ⟨W̃_l, Ŵ_l⟩. For two sequences {a_n} and {b_n}, we use a_n = O(b_n) to denote that a_n ≤ C_1 b_n for some absolute constant C_1 > 0, and use a_n = Ω(b_n) to denote that a_n ≥ C_2 b_n for some absolute constant C_2 > 0. In addition, we use Õ(·) and Ω̃(·) to hide logarithmic factors.

¹Here Ω̃(·) hides constants and the logarithmic dependencies on problem-dependent parameters except ε.

Table 1: Over-parameterization conditions and iteration complexities of GD for training over-parameterized neural networks. K^{(L)} is the Gram matrix for the L-hidden-layer neural network [9]. Note that the dimension of the output is k = 1 in Du et al. [11, 9], Wu et al. [19], Oymak and Soltanolkotabi [17].

                               | Over-param. condition                | Iteration complexity                        | Deep? | ReLU?
Du et al. [11]                 | Ω(n^6/λ_0^4)                         | O(n^2 log(1/ε)/λ_0^2)                       | no    | yes
Wu et al. [19]                 | Ω(n^6/λ_0^4)                         | O(n log(1/ε)/λ_0)                           | no    | yes
Oymak and Soltanolkotabi [17]  | Ω(n‖X‖_2^6/λ_0^4)                    | O(‖X‖_2^2 log(1/ε)/λ_0)                     | no    | yes
Du et al. [9]                  | Ω(2^{O(L)}·n^4/λ_min^4(K^{(L)}))     | O(2^{O(L)}·n^2 log(1/ε)/λ_min^2(K^{(L)}))   | yes   | no
Allen-Zhu et al. [2]           | Ω̃(kn^{24}L^{12}/φ^8)                 | O(n^6 L^2 log(1/ε)/φ^2)                     | yes   | yes
This paper                     | Ω̃(kn^8 L^{12}/φ^4)                   | O(n^2 L^2 log(1/ε)/φ)                       | yes   | yes

2 Problem setup and algorithms

In this section, we introduce the problem setup and the training algorithms.

Following Allen-Zhu et al. [2], we consider the training of an L-hidden-layer fully connected neural network, which takes x ∈ R^d as input and outputs y ∈ R^k. Specifically, the neural network is a vector-valued function f_W : R^d → R^k, defined as

    f_W(x) = V σ(W_L σ(W_{L−1} ··· σ(W_1 x) ···)),

where W_1 ∈ R^{m×d}, W_2, . . .
, W_L ∈ R^{m×m} denote the weight matrices of the hidden layers, V ∈ R^{k×m} denotes the weight matrix of the output layer, and σ(x) = max{0, x} is the entry-wise ReLU activation function. In addition, we denote by σ′(x) = 1{x > 0} the derivative of the ReLU activation function, and by w_{l,j} the weight vector of the j-th node in the l-th layer.

Given a training set {(x_i, y_i)}_{i=1,...,n} with x_i ∈ R^d and y_i ∈ R^k, the empirical loss function for training the neural network is defined as

    L(W) := (1/n) Σ_{i=1}^n ℓ(ŷ_i, y_i),    (2.1)

where ℓ(·,·) is the loss function and ŷ_i = f_W(x_i). In this paper, for ease of exposition, we follow Allen-Zhu et al. [2], Du et al. [11, 9], Oymak and Soltanolkotabi [17] and consider the square loss

    ℓ(ŷ_i, y_i) = (1/2)‖y_i − ŷ_i‖_2^2,

where ŷ_i = f_W(x_i) ∈ R^k denotes the output of the neural network given input x_i. It is worth noting that our result can be easily extended to other loss functions, such as the cross-entropy loss [24], as well.

We will study both gradient descent and stochastic gradient descent as training algorithms, which are displayed in Algorithm 1. For gradient descent, we update the weight matrix W_l^{(t)} using the full partial gradient ∇_{W_l} L(W^{(t)}). For stochastic gradient descent, we update the weight matrix W_l^{(t)} using the stochastic partial gradient (1/B) Σ_{s∈B^{(t)}} ∇_{W_l} ℓ(f_{W^{(t)}}(x_s), y_s), where B^{(t)} with |B^{(t)}| = B denotes the minibatch of training examples at the t-th iteration.
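To make the setup concrete, here is a minimal NumPy sketch of the network f_W(x) = Vσ(W_L σ(··· σ(W_1 x) ···)), the empirical squared loss (2.1), and the Gaussian initialization used in Algorithm 1. The sizes d, m, k, L, n and the synthetic data are illustrative assumptions, not values from the paper, and the (x_i)_d = µ part of Assumption 3.1 is not enforced here.

```python
import numpy as np

def relu(z):
    # Entry-wise ReLU activation sigma(z) = max{0, z}.
    return np.maximum(z, 0.0)

def forward(W_list, V, x):
    # f_W(x) = V sigma(W_L sigma(... sigma(W_1 x) ...)).
    h = x
    for W in W_list:
        h = relu(W @ h)
    return V @ h

def empirical_loss(W_list, V, X, Y):
    # L(W) = (1/n) sum_i (1/2) ||y_i - f_W(x_i)||_2^2, cf. (2.1).
    n = X.shape[0]
    return sum(0.5 * np.linalg.norm(Y[i] - forward(W_list, V, X[i])) ** 2
               for i in range(n)) / n

# Illustrative sizes (assumptions; far below the paper's over-parameterization scale).
rng = np.random.default_rng(0)
d, m, k, L, n = 5, 32, 3, 2, 4
# Gaussian initialization as in Algorithm 1: rows of W_l ~ N(0, (2/m)I), rows of V ~ N(0, I/k).
W_list = [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, d))]
W_list += [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m)) for _ in range(L - 1)]
V = rng.normal(0.0, np.sqrt(1.0 / k), size=(k, m))
# Unit-norm inputs (Assumption 3.1) and arbitrary targets.
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
Y = rng.normal(size=(n, k))
print(empirical_loss(W_list, V, X, Y))
```

The row-wise variance 2/m matches the He-style initialization [13] referenced throughout the paper.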
Both algorithms are initialized in the same way as Allen-Zhu et al. [2], which is essentially the initialization method [13] widely used in practice.

Algorithm 1 (Stochastic) Gradient descent with Gaussian random initialization
1: input: Training data {(x_i, y_i)}_{i∈[n]}, step size η, total number of iterations T, minibatch size B.
2: initialization: For all l ∈ [L], each row of the weight matrix W_l^{(0)} is independently generated from N(0, (2/m)I); each row of V is independently generated from N(0, I/k).
Gradient Descent:
3: for t = 0, . . . , T do
4:   W_l^{(t+1)} = W_l^{(t)} − η ∇_{W_l} L(W^{(t)}) for all l ∈ [L]
5: end for
6: output: {W_l^{(T)}}_{l∈[L]}
Stochastic Gradient Descent:
7: for t = 0, . . . , T do
8:   Uniformly sample a minibatch of training data B^{(t)} ⊆ [n]
9:   W_l^{(t+1)} = W_l^{(t)} − (η/B) Σ_{s∈B^{(t)}} ∇_{W_l} ℓ(f_{W^{(t)}}(x_s), y_s) for all l ∈ [L]
10: end for
11: output: {W_l^{(T)}}_{l∈[L]}

In the remainder of this paper, we denote by ∇L(W^{(t)}) = {∇_{W_l} L(W^{(t)})}_{l∈[L]} the collection of all partial gradients of L(W^{(t)}), and by ∇ℓ(f_{W^{(t)}}(x_i), y_i) = {∇_{W_l} ℓ(f_{W^{(t)}}(x_i), y_i)}_{l∈[L]} the collection of all partial gradients of ℓ(f_{W^{(t)}}(x_i), y_i).

3 Main theory

In this section, we present our main theoretical results. We make the following assumptions on the training data.

Assumption 3.1. For any x_i, it holds that ‖x_i‖_2 = 1 and (x_i)_d = µ, where µ is a positive constant.

The same assumption has been made in all previous work along this line [9, 2, 24, 17]. Note that requiring the norm of all training examples to be 1 is not essential; this assumption can be relaxed to requiring that ‖x_i‖_2 is lower and upper bounded by some constants.

Assumption 3.2.
For any two different training data points x_i and x_j, there exists a positive constant φ > 0 such that ‖x_i − x_j‖_2 ≥ φ.

This assumption has also been made in Allen-Zhu et al. [3, 2], and it is essential to guarantee zero training error for deep neural networks. It is a quite mild assumption for the regression problem studied in this paper. Note that Du et al. [9] made a different assumption on the training data, which requires the Gram matrix K^{(L)} (see their paper for details) defined on the L-hidden-layer network to be positive definite. However, their assumption is not easy to verify for neural networks with more than two layers.

Based on Assumptions 3.1 and 3.2, we are able to establish the global convergence rates of GD and SGD for training deep ReLU networks. We start with the result of GD for L-hidden-layer networks.

3.1 Training L-hidden-layer ReLU networks with GD

The global convergence of GD for training deep neural networks is stated in the following theorem.

Theorem 3.3. Under Assumptions 3.1 and 3.2, suppose the number of hidden nodes per layer satisfies

    m = Ω(kn^8 L^{12} log^3(m)/φ^4).    (3.1)

Then, if the step size is set to η = O(k/(L^2 m)), with probability at least 1 − O(n^{-1}), gradient descent is able to find a point that achieves ε training loss within

    T = O(n^2 L^2 log(1/ε)/φ)

iterations.

Remark 3.4. The state-of-the-art results for training deep ReLU networks are provided by Allen-Zhu et al. [2], where the authors showed that GD can achieve ε training loss within O(n^6 L^2 log(1/ε)/φ^2) iterations if the neural network width satisfies m = Ω̃(kn^{24} L^{12}/φ^8). As a clear comparison, our result on the iteration complexity is better than theirs by a factor of O(n^4/φ), and our over-parameterization condition is milder than theirs by a factor of Ω̃(n^{16}/φ^4). Du et al. [9] also proved the global convergence of GD for training deep neural networks with smooth activation functions. As shown in Table 1, the over-parameterization condition and iteration complexity in Du et al. [9] have an exponential dependency on L, which is much worse than the polynomial dependency on L in Allen-Zhu et al. [2] and in our result.

We now specialize our result in Theorem 3.3 to two-layer networks by removing the dependency on the number of hidden layers L. We state this result in the following corollary.

Corollary 3.5. Under the same assumptions as in Theorem 3.3, for training two-layer ReLU networks, if the number of hidden nodes is set to m = Ω(kn^8 log^3(m)/φ^4) and the step size to η = O(k/m), then with probability at least 1 − O(n^{-1}), GD is able to find a point that achieves ε training loss within T = O(n^2 log(1/ε)/φ) iterations.

For training two-layer ReLU networks, Du et al. [11] made a different assumption on the training data to establish the global convergence of GD. Specifically, Du et al. [11] defined a Gram matrix, which is also known as the neural tangent kernel [14], based on the training data {x_i}_{i=1,...,n}, and assumed that the smallest eigenvalue of this Gram matrix is strictly positive. In fact, for two-layer neural networks, their assumption is equivalent to Assumption 3.2, as shown in the following proposition.

Proposition 3.6. Under Assumption 3.1, define the Gram matrix H ∈ R^{n×n} as

    H_{ij} = E_{w∼N(0,I)}[x_i^⊤ x_j σ′(w^⊤ x_i) σ′(w^⊤ x_j)].

Then the assumption λ_0 = λ_min(H) > 0 is equivalent to Assumption 3.2. In addition, there exists a sufficiently small constant C such that λ_0 ≥ Cφn^{-2}.

Remark 3.7.
According to Proposition 3.6, we can make a direct comparison between our convergence results for two-layer ReLU networks in Corollary 3.5 and those in Du et al. [11], Oymak and Soltanolkotabi [17]. Specifically, as shown in Table 1, the iteration complexity and over-parameterization condition proved in Du et al. [11] can be translated to O(n^6 log(1/ε)/φ^2) and Ω(n^{14}/φ^4) respectively under Assumption 3.2. Oymak and Soltanolkotabi [17] improved the result in Du et al. [11], and their improved iteration complexity and over-parameterization condition can be translated to O(n^2 ‖X‖_2^2 log(1/ε)/φ)² and Ω(n^9 ‖X‖_2^6/φ^4) respectively, where X = [x_1, . . . , x_n]^⊤ ∈ R^{n×d} is the input data matrix. Our iteration complexity for two-layer ReLU networks is better than that in Oymak and Soltanolkotabi [17] by a factor of O(‖X‖_2^2)³, and our over-parameterization condition is also strictly milder than that in Oymak and Soltanolkotabi [17] by a factor of O(n‖X‖_2^6).

3.2 Extension to training L-hidden-layer ReLU networks with SGD

We now extend the convergence results of GD to SGD in the following theorem.

Theorem 3.8. Under Assumptions 3.1 and 3.2, suppose the number of hidden nodes per layer satisfies

    m = Ω(kn^{17} L^{12} log^3(m)/(B^4 φ^8)).    (3.2)

Then, if the step size is set to η = O(kBφ/(n^3 m log(m))), with probability at least 1 − O(n^{-1}), SGD is able to achieve ε expected training loss within

    T = O(n^5 L^2 log(m) log^2(1/ε)/(Bφ^2))

iterations.

²It is worth noting that ‖X‖_2^2 = O(1) if d ≲ n, ‖X‖_2^2 = O(n/d) if X is randomly generated, and ‖X‖_2^2 = O(n) in the worst case.
³Here we set k = 1 in order to match the problem setting in Du et al. [11], Oymak and Soltanolkotabi [17].

Remark 3.9. We first compare our result with the state-of-the-art result proved in Allen-Zhu et al. [2], where it is shown that SGD can find a point with ε training loss within Õ(n^7 L^2 log(1/ε)/(Bφ^2)) iterations if m = Ω̃(n^{24} L^{12} Bk/φ^8). In stark contrast, our result on the over-parameterization condition is strictly better by a factor of Ω̃(n^7 B^5), and our result on the iteration complexity is faster by a factor of O(n^2).

Moreover, we also characterize the convergence rate and over-parameterization condition of SGD for training two-layer networks. Unlike gradient descent, which has the same convergence rate and over-parameterization condition for training both deep and two-layer networks in terms of the training data size n, we find that the over-parameterization condition of SGD can be further improved for training two-layer neural networks. We state this improved result in the following theorem.

Theorem 3.10. Under the same assumptions as in Theorem 3.8, for two-layer ReLU networks, if the number of hidden nodes and the step size are set to

    m = Ω(k^{5/2} n^{11} log^3(m)/(φ^5 B)),    η = O(kBφ/(n^3 m log(m))),

then with probability at least 1 − O(n^{-1}), stochastic gradient descent is able to achieve ε training loss within T = O(n^5 log(m) log(1/ε)/(Bφ^2)) iterations.

Remark 3.11. From Theorem 3.8, we can also obtain the convergence result of SGD for two-layer ReLU networks by choosing L = 1. However, the resulting over-parameterization condition is m = Ω(kn^{17} log^3(m) B^{-4} φ^{-8}), which is much worse than that in Theorem 3.10. This is because for two-layer networks, the training loss enjoys nicer local properties around the initialization, which can be leveraged to improve the convergence of SGD. Due to the space limit, we defer more details to Appendix A.

4 Proof sketch of the main theory

In this section, we provide the proof sketch for Theorem 3.3, and highlight our technical contributions and innovative proof techniques.

4.1 Overview of the technical contributions

The improvements in our result are mainly attributed to the following two aspects: (1) a tighter gradient lower bound leading to faster convergence; and (2) a sharper characterization of the trajectory length of the algorithm.

We first define the following perturbation region based on the initialization,

    B(W^{(0)}, τ) = {W : ‖W_l − W_l^{(0)}‖_2 ≤ τ for all l ∈ [L]},

where τ > 0 is the preset perturbation radius for each weight matrix W_l.

Tighter gradient lower bound. By the definition of ∇L(W), we have ‖∇L(W)‖_F^2 = Σ_{l=1}^L ‖∇_{W_l} L(W)‖_F^2 ≥ ‖∇_{W_L} L(W)‖_F^2. Therefore, we can focus on the partial gradient of L(W) with respect to the weight matrix at the last hidden layer. Note that we further have ‖∇_{W_L} L(W)‖_F^2 = Σ_{j=1}^m ‖∇_{w_{L,j}} L(W)‖_2^2, where

    ∇_{w_{L,j}} L(W) = (1/n) Σ_{i=1}^n ⟨f_W(x_i) − y_i, v_j⟩ σ′(⟨w_{L,j}, x_{L−1,i}⟩) x_{L−1,i},

and x_{L−1,i} denotes the output of the (L−1)-th hidden layer for input x_i. In order to prove the gradient lower bound, for each x_{L−1,i} we introduce a region called the "gradient region", denoted by W_i, which is almost orthogonal to x_{L−1,i}. Then we prove two major properties of these n regions {W_1, . . .
, W_n}: (1) W_i ∩ W_j = ∅ if i ≠ j; and (2) if w_{L,j} ∈ W_i for any i, then with probability at least 1/2, ‖∇_{w_{L,j}} L(W)‖_2 is sufficiently large. We visualize these "gradient regions" in Figure 1(a). Since {w_{L,j}}_{j∈[m]} are randomly generated at initialization, in order to obtain a larger lower bound on ‖∇_{W_L} L(W)‖_F^2, we hope the size of these "gradient regions" to be as large as possible. We take the union of the "gradient regions" for all training data, i.e., ∪_{i=1}^n W_i, which is shown in Figure 1(a).

[Figure 1: (a) "gradient region" for all training data {x_{L−1,i}}_{i∈[n]}; (b) "gradient region" for one training example x_{L−1,1}.]

As a comparison, Allen-Zhu et al. [2], Zou et al. [24] only leveraged the "gradient region" for one training data point to establish the gradient lower bound, which is shown in Figure 1(b). Roughly speaking, the size of the "gradient regions" utilized in our proof is n times larger than those used in Allen-Zhu et al. [2], Zou et al. [24], which consequently leads to an O(n) improvement on the gradient lower bound. The improved gradient lower bound is formally stated in the following lemma.

Lemma 4.1 (Gradient lower bound). Let τ = O(φ^{3/2} n^{-3} L^{-6} log^{-3/2}(m)). Then for all W ∈ B(W^{(0)}, τ), with probability at least 1 − exp(−O(mφ/(dn))), it holds that

    ‖∇L(W)‖_F^2 ≥ O(mφ L(W)/(kn^2)).

Sharper characterization of the trajectory length. The improved analysis of the trajectory length is motivated by the following observation: at the t-th iteration, the decrease of the training loss after one step of gradient descent is proportional to the squared gradient norm, i.e., L(W^{(t)}) − L(W^{(t+1)}) ∝ ‖∇L(W^{(t)})‖_F^2. In addition, the gradient norm ‖∇L(W^{(t)})‖_F determines the trajectory length in the t-th iteration. Putting them together, we can obtain

    ‖W_l^{(t+1)} − W_l^{(t)}‖_2 = η‖∇_{W_l} L(W^{(t)})‖_2 ≤ √(Ckn^2/(mφ)) · (√(L(W^{(t)})) − √(L(W^{(t+1)}))),    (4.1)

where C is an absolute constant. (4.1) enables the use of a telescoping sum, which yields ‖W_l^{(t)} − W_l^{(0)}‖_2 ≤ √(Ckn^2 L(W^{(0)})/(mφ)). In stark contrast, Allen-Zhu et al. [2] bound the trajectory length as ‖W_l^{(t+1)} − W_l^{(t)}‖_2 = η‖∇_{W_l} L(W^{(t)})‖_2 ≤ η√(C′ m L(W^{(t)})/k), and further prove that ‖W_l^{(t)} − W_l^{(0)}‖_2 ≤ √(C′ kn^6 L^2(W^{(0)})/(mφ^2)) by taking a summation over t, where C′ is an absolute constant. Our sharper characterization of the trajectory length is formally summarized in the following lemma.

Lemma 4.2. Assume all iterates stay inside the region B(W^{(0)}, τ) with τ = O(φ^{3/2} n^{-3} L^{-6} log^{-3/2}(m)). If the step size is set to η = O(k/(L^2 m)), then with probability at least 1 − O(n^{-1}), the following holds for all t ≥ 0 and l ∈ [L]:

    ‖W_l^{(t)} − W_l^{(0)}‖_2 ≤ O(√(kn^2 log(n)/(mφ))).

4.2 Proof of Theorem 3.3

Our proof roadmap is organized in three steps: (i) prove that the training loss enjoys good curvature properties within the perturbation region B(W^{(0)}, τ); (ii) show that gradient descent is able to converge to a global minimum based on these curvature properties; and (iii) ensure that all iterates stay inside the perturbation region until convergence.

Step (i): Training loss properties. We first show some key properties of the training loss within B(W^{(0)}, τ), which are essential to establish the convergence guarantees of gradient descent.

Lemma 4.3. If m ≥ O(L log(nL)), then with probability at least 1 − O(n^{-1}) it holds that L(W^{(0)}) ≤
Let\n\n\u03c4 \u2208(cid:2)\u2126(cid:0)k3/2/(m3/2L3/2 log3/2(m))(cid:1), O(cid:0)1/(L4.5 log3/2(m))(cid:1)(cid:3).\n\nB(W(0), \u03c4 ), with probability at least 1 \u2212 exp(\u2212\u2126(\u2212m\u03c4 3/2L)), there exist two constants C(cid:48) and C(cid:48)(cid:48)\nsuch that\n\nThen for any two collections (cid:99)W = {(cid:99)Wl}l\u2208[L] and (cid:102)W = {(cid:102)Wl}l\u2208[L] satisfying (cid:99)W,(cid:102)W \u2208\nL((cid:102)W) \u2264 L((cid:99)W) + (cid:104)\u2207L((cid:99)W),(cid:102)W \u2212(cid:99)W(cid:105)\nL((cid:99)W) \u00b7 \u03c4 1/3L2(cid:112)m log(m)\n\n\u00b7 (cid:107)(cid:102)W \u2212(cid:99)W(cid:107)2 +\n\n(cid:107)(cid:102)W \u2212(cid:99)W(cid:107)2\n\n+ C(cid:48)(cid:113)\n\nC(cid:48)(cid:48)L2m\n\n(4.2)\n\n\u221a\n\n2.\n\nk\n\nk\n\n(cid:18)\n\nLemma 4.4 is a rescaled version of Theorem 4 in Allen-Zhu et al. [2], since the training loss L(W)\nin (2.1) is divided by the training sample size n, as opposed to the training loss in Allen-Zhu et al. [2].\nThis lemma suggests that if the perturbation region is small, i.e., \u03c4 (cid:28) 1, the non-smooth term (third\nterm on the R.H.S. of (4.2)) is small and dominated by the gradient term (the second term on the\nthe R.H.S. of (4.2)). Therefore, the training loss behaves like a smooth function in the perturbation\nregion and the linear rate of convergence can be proved.\nStep (ii) Convergence rate of GD. Now we are going to establish the convergence rate for gradient\ndescent under the assumption that all iterates stay inside the region B(W(0), \u03c4 ), where \u03c4 will be\nspeci\ufb01ed later.\nLemma 4.5. Assume all\n\nstay inside the region B(W(0), \u03c4 ), where \u03c4\n\nO(cid:0)\u03c63/2n\u22123L\u22126 log\nO(cid:0)k/(L2m)(cid:1), with probability least 1 \u2212 exp(cid:0) \u2212 O(m\u03c4 3/2L)(cid:1), it holds that\n\n\u22123/2(m)(cid:1). 
Then under Assumptions 3.1 and 3.2, if we set the step size $\eta = O\big(k/(L^2 m)\big)$, with probability at least $1 - \exp\big(-O(m\tau^{3/2}L)\big)$ it holds that

$$L(W^{(t)}) \le \bigg(1 - O\bigg(\frac{m\phi\eta}{kn^2}\bigg)\bigg)^t L(W^{(0)}).$$

Lemma 4.5 suggests that gradient descent is able to decrease the training loss to zero at a linear rate.

Step (iii) Verifying all iterates of GD stay inside the perturbation region. Next we ensure that all iterates of GD stay inside the required region $B(W^{(0)}, \tau)$. Note that we have bounded the distance $\|W_l^{(t)} - W_l^{(0)}\|_2$ in Lemma 4.2, so it suffices to verify that this distance is smaller than the preset value $\tau$. Thus, we can complete the proof of Theorem 3.3 by verifying the conditions based on our choice of $m$. Note that we have set the required size of $m$ in (3.1); plugging (3.1) into the result of Lemma 4.2, with probability at least $1 - O(n^{-1})$ the following holds for all $t \le T$ and $l \in [L]$,

$$\big\|W_l^{(t)} - W_l^{(0)}\big\|_2 \le O\big(\phi^{3/2} n^{-3} L^{-6} \log^{-3/2}(m)\big),$$

which is exactly of the same order as $\tau$ in Lemma 4.5. Therefore, our choice of $m$ guarantees that all iterates stay inside the required perturbation region.
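To make the linear rate of Lemma 4.5 concrete, the sketch below simulates the one-step contraction $L(W^{(t+1)}) \le (1-\rho)L(W^{(t)})$ with every $O(\cdot)$ constant set to $1$: with $\eta = O(k/(L^2m))$, the contraction factor $m\phi\eta/(kn^2)$ simplifies to $\rho = \phi/(L^2 n^2)$. The function name and the specific numbers are illustrative placeholders, not quantities computed in the paper.

```python
import math

def gd_iterations(n, L, phi, eps, loss0=1.0):
    """Iterate the one-step contraction of Lemma 4.5 (constants set to 1),
    L(W^(t+1)) <= (1 - rho) * L(W^(t)),
    until the loss falls below eps * loss0; return the step count and rho.
    With eta = k / (L^2 m), the factor m*phi*eta / (k n^2) simplifies to
    rho = phi / (L^2 n^2): the width m and output dimension k cancel."""
    rho = phi / (L ** 2 * n ** 2)  # per-step contraction rate
    loss, t = loss0, 0
    while loss > eps * loss0:
        loss *= 1.0 - rho
        t += 1
    return t, rho

# Illustrative parameters: n = 100 samples, depth L = 3, phi = 0.5, eps = 1e-3.
t, rho = gd_iterations(n=100, L=3, phi=0.5, eps=1e-3)
t_theory = math.log(1 / 1e-3) / rho  # closed form: n^2 L^2 log(1/eps) / phi
print(t, t_theory)  # simulated t matches log(1/eps)/rho up to a few steps
```

Reading off $t_{\text{theory}} = n^2 L^2 \log(1/\epsilon)/\phi$ recovers the $T$ obtained by substituting $\eta = O(k/(L^2m))$ into the bound on $T\eta$ below: the width $m$ and output dimension $k$ cancel, so only $n$, $L$ and $\phi$ control the iteration count.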
In addition, by Lemma 4.5, in order to achieve $\epsilon$ accuracy, we require

$$T\eta = O\big(kn^2 \log(1/\epsilon)\, m^{-1}\phi^{-1}\big). \qquad (4.3)$$

Then substituting our choice of step size $\eta = O\big(k/(L^2 m)\big)$ into (4.3) and applying Lemma 4.3, we can get the desired result for $T$.

4.3 Optimizing both top and hidden layers

Here we briefly discuss the extension to the case where the top layer is also optimized. The proof sketch is as follows: similar to our current proof, we can define a small perturbation region around the initialization, but the new definition involves a constraint on the top layer weights. Specifically, the new perturbation region can be defined as follows,

$$B(W^{(0)}, \tau) = \big\{W : \|W_l - W_l^{(0)}\|_2 \le \tau \text{ for all } l \in [L],\ \|V - V^{(0)}\|_2 \le \tau'\big\}.$$

Then, it can be proved that the neural network also enjoys good properties inside this region. Similar to the proof in this paper, based on these good properties, we can prove that until convergence the neural network weights, including the top layer weights, do not escape from this region. Note that optimizing more parameters leads to a larger gradient; thus we can prove a larger gradient lower bound during the training process, which can potentially speed up the convergence of the optimization algorithm (e.g., GD, SGD).

5 Conclusions and future work

In this paper, we studied the global convergence of (stochastic) gradient descent for training over-parameterized ReLU networks, and improved the state-of-the-art results. Our proof technique can also be applied to prove similar results for other loss functions such as the cross-entropy loss and other neural network architectures such as convolutional neural networks (CNNs) [2, 11] and ResNets [2, 11, 21].
One important direction for future work is to investigate whether the over-parameterization condition and the convergence rate can be further improved. A promising approach is to further improve the characterization of the "gradient region", as it may provide a tighter gradient lower bound and consequently sharpen the over-parameterization condition. Another interesting future direction is to explore the use of our proof technique to improve the generalization analysis of over-parameterized neural networks trained by gradient-based algorithms [1, 6, 4].

Acknowledgement

We thank the anonymous reviewers and area chair for their helpful comments. We also thank Jinshan Zeng for his helpful comment on the proof in an earlier version of our work. This research was sponsored in part by the National Science Foundation CAREER Award IIS-1906169, BIGDATA IIS-1855099, and a Salesforce Deep Learning Research Award. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.

References

[1] ALLEN-ZHU, Z., LI, Y. and LIANG, Y. (2018). Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918.

[2] ALLEN-ZHU, Z., LI, Y. and SONG, Z. (2018). A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962.

[3] ALLEN-ZHU, Z., LI, Y. and SONG, Z. (2018). On the convergence rate of training recurrent neural networks. arXiv preprint arXiv:1810.12065.

[4] ARORA, S., DU, S. S., HU, W., LI, Z. and WANG, R. (2019). Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584.

[5] BRUTZKUS, A. and GLOBERSON, A. (2017). Globally optimal gradient descent for a ConvNet with Gaussian inputs. In Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org.

[6] CAO, Y.
and GU, Q. (2019). A generalization theory of gradient descent for learning over-parameterized deep ReLU networks. arXiv preprint arXiv:1902.01384.

[7] CHIZAT, L. and BACH, F. (2018). A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956.

[8] DU, S. S. and LEE, J. D. (2018). On the power of over-parametrization in neural networks with quadratic activation. arXiv preprint arXiv:1803.01206.

[9] DU, S. S., LEE, J. D., LI, H., WANG, L. and ZHAI, X. (2018). Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804.

[10] DU, S. S., LEE, J. D. and TIAN, Y. (2017). When is a convolutional filter easy to learn? arXiv preprint arXiv:1709.06129.

[11] DU, S. S., ZHAI, X., POCZOS, B. and SINGH, A. (2018). Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054.

[12] GAO, W., MAKKUVA, A., OH, S. and VISWANATH, P. (2019). Learning one-hidden-layer neural networks under general input distributions. In The 22nd International Conference on Artificial Intelligence and Statistics.

[13] HE, K., ZHANG, X., REN, S. and SUN, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision.

[14] JACOT, A., GABRIEL, F. and HONGLER, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems.

[15] LI, Y. and LIANG, Y. (2018). Learning overparameterized neural networks via stochastic gradient descent on structured data. arXiv preprint arXiv:1808.01204.

[16] LI, Y. and YUAN, Y. (2017). Convergence analysis of two-layer neural networks with ReLU activation. arXiv preprint arXiv:1705.09886.

[17] OYMAK, S. and SOLTANOLKOTABI, M. (2019).
Towards moderate overparameterization: global convergence guarantees for training shallow neural networks. arXiv preprint arXiv:1902.04674.

[18] TIAN, Y. (2017). An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. arXiv preprint arXiv:1703.00560.

[19] WU, X., DU, S. S. and WARD, R. (2019). Global convergence of adaptive gradient methods for an over-parameterized neural network. arXiv preprint arXiv:1902.07111.

[20] ZHANG, C., BENGIO, S., HARDT, M., RECHT, B. and VINYALS, O. (2016). Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.

[21] ZHANG, H., YU, D., CHEN, W. and LIU, T.-Y. (2019). Training over-parameterized deep ResNet is almost as easy as training a two-layer network. arXiv preprint arXiv:1903.07120.

[22] ZHANG, X., YU, Y., WANG, L. and GU, Q. (2018). Learning one-hidden-layer ReLU networks via gradient descent. arXiv preprint arXiv:1806.07808.

[23] ZHONG, K., SONG, Z., JAIN, P., BARTLETT, P. L. and DHILLON, I. S. (2017). Recovery guarantees for one-hidden-layer neural networks. arXiv preprint arXiv:1706.03175.

[24] ZOU, D., CAO, Y., ZHOU, D. and GU, Q. (2018). Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888.