{"title": "Convergence Analysis of Two-layer Neural Networks with ReLU Activation", "book": "Advances in Neural Information Processing Systems", "page_first": 597, "page_last": 607, "abstract": "In recent years, stochastic gradient descent (SGD) based techniques have become the standard tools for training neural networks. However, a formal theoretical understanding of why SGD can train neural networks in practice is largely missing. In this paper, we make progress on understanding this mystery by providing a convergence analysis for SGD on a rich subset of two-layer feedforward networks with ReLU activations. This subset is characterized by a special structure called \"identity mapping\". We prove that, if the input follows a Gaussian distribution, with standard $O(1/\\sqrt{d})$ initialization of the weights, SGD converges to the global minimum in a polynomial number of steps. Unlike normal vanilla networks, the \"identity mapping\" makes our network asymmetric and thus the global minimum is unique. To complement our theory, we are also able to show experimentally that multi-layer networks with this mapping have better performance compared with normal vanilla networks. Our convergence theorem differs from traditional non-convex optimization techniques. We show that SGD converges to the optimum in \"two phases\": In phase I, the gradient points in the wrong direction; however, a potential function $g$ gradually decreases. Then in phase II, SGD enters a nice one point convex region and converges. We also show that the identity mapping is necessary for convergence, as it moves the initial point to a better place for optimization. 
Experiments verify our claims.", "full_text": "Convergence Analysis of Two-layer Neural Networks with ReLU Activation\n\nYuanzhi Li\n\nComputer Science Department\n\nPrinceton University\n\nyuanzhil@cs.princeton.edu\n\nYang Yuan\n\nComputer Science Department\n\nCornell University\n\nyangyuan@cs.cornell.edu\n\nAbstract\n\nIn recent years, stochastic gradient descent (SGD) based techniques have become\nthe standard tools for training neural networks. However, a formal theoretical understanding\nof why SGD can train neural networks in practice is largely missing.\nIn this paper, we make progress on understanding this mystery by providing a\nconvergence analysis for SGD on a rich subset of two-layer feedforward networks\nwith ReLU activations. This subset is characterized by a special structure called\n\u201cidentity mapping\u201d. We prove that, if the input follows a Gaussian distribution,\nwith standard $O(1/\\sqrt{d})$ initialization of the weights, SGD converges to the global\nminimum in a polynomial number of steps. Unlike normal vanilla networks, the\n\u201cidentity mapping\u201d makes our network asymmetric and thus the global minimum is\nunique. To complement our theory, we are also able to show experimentally that\nmulti-layer networks with this mapping have better performance compared with\nnormal vanilla networks.\nOur convergence theorem differs from traditional non-convex optimization techniques.\nWe show that SGD converges to the optimum in \u201ctwo phases\u201d: In phase I, the\ngradient points in the wrong direction; however, a potential function g gradually\ndecreases. Then in phase II, SGD enters a nice one point convex region and converges.\nWe also show that the identity mapping is necessary for convergence, as it\nmoves the initial point to a better place for optimization. 
Experiments verify our\nclaims.\n\n1 Introduction\n\nDeep learning is the mainstream technique for many machine learning tasks, including image\nrecognition, machine translation, speech recognition, etc. [17]. Despite its success, the theoretical\nunderstanding of how it works remains poor. It is well known that neural networks have great\nexpressive power [22, 7, 3, 8, 31]. That is, for every function there exists a set of weights on the\nneural network such that it approximates the function everywhere. However, it is unclear how to\nobtain the desired weights. In practice, the most commonly used methods are stochastic gradient\ndescent based methods (e.g., SGD, Momentum [40], Adagrad [10], Adam [25]), but to the best of\nour knowledge, there were no theoretical guarantees that such methods will find good weights.\nIn this paper, we give the first convergence analysis of SGD for two-layer feedforward networks with\nReLU activations. For this basic network, it is known that even in the simplified setting where the\nweights are initialized symmetrically and the ground truth forms an orthonormal basis, gradient descent\nmight get stuck at saddle points [41].\nInspired by the structure of the residual network (ResNet) [21], we add an extra identity mapping for\nthe hidden layer (see Figure 1). Surprisingly, we show that simply by adding this mapping, with the\nstandard initialization scheme and small step size, SGD always converges to the ground truth. 
In other words, the optimization becomes significantly easier after adding the identity mapping. See Figure\n2: based on our analysis, the region near the identity matrix I contains only one global minimum\nwithout any saddle points or local minima, and thus is easy for SGD to optimize. The role of the identity\nmapping here is to move the initial point to this easier region (better initialization).\n\nFigure 1: Vanilla network (left), with identity mapping (right). [Diagram: input x \u2192 W\u22a4x \u2192 ReLU(W\u22a4x) \u2192 take sum \u2192 output; with the identity link (+x), the hidden layer computes ReLU((I + W)\u22a4x).]\n\nFigure 2: Illustration for our result. [Diagram labels: O, I, I + W, I + W\u2217; \u201cIdentity mapping\u201d, \u201cEasy for SGD\u201d, \u201cSeems hard\u201d, \u201cUnknown\u201d.]\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nOther than being feedforward and shallow, our network is different from ResNet in the sense that\nour identity mapping skips one layer instead of two. However, as we will show in Section 5.1, the\nskip-one-layer identity mapping already brings significant improvement to vanilla networks.\nFormally, we consider the following function.\n\n$$f(x, W) = \\|\\mathrm{ReLU}((I + W)^\\top x)\\|_1 \\qquad (1)$$\n\nwhere $\\mathrm{ReLU}(v) = \\max(v, 0)$ is the ReLU activation function, $x \\in \\mathbb{R}^d$ is the input vector sampled\nfrom a Gaussian distribution, and $W \\in \\mathbb{R}^{d \\times d}$ is the weight matrix, where d is the number of input\nunits. Notice that I adds $e_i$ to column i of W, which makes f asymmetric in the sense that by\nswitching any two columns in W, we get different functions.\nFollowing the standard setting [34, 41], we assume that there exists a two-layer teacher network with\nweight W\u2217. 
We train the student network using the $\\ell_2$ loss:\n\n$$L(W) = \\mathbb{E}_x[(f(x, W) - f(x, W^*))^2] \\qquad (2)$$\n\nWe will define a potential function g, and show that if g is small, the gradient points in a partially\ncorrect direction and we get closer to W\u2217 after every SGD step. However, g could be large and thus the\ngradient might point in the reverse direction. Fortunately, we also show that if g is large, by doing\nSGD, it will keep decreasing until it is small enough while maintaining the weight W in a nice region.\nWe call the process of decreasing g Phase I, and the process of approaching W\u2217 Phase II. See\nFigure 3 and simulations in Section 5.3.\nOur two-phase framework is fundamentally different from any type of local convergence, as in Phase\nI, the gradient points in the wrong direction relative to W\u2217, so the path from W to W\u2217 is non-convex,\nand SGD takes a long detour to arrive at W\u2217. This framework could be potentially useful for analyzing\nother non-convex problems.\nTo support our theory, we have done a few other experiments and made some interesting observations.\nFor example, as predicted by our theorem, we found that for multilayer feedforward networks with\nidentity mappings, zero initialization performs as well as random initialization. At first glance, it\ncontradicts the common belief \u201crandom initialization is necessary to break symmetry\u201d, but actually\nthe identity mapping itself serves as the asymmetric component. See Section 5.4.\nAnother common belief is that neural networks have lots of local minima and saddle points [9], so\neven if there exists a global minimum, we may not be able to arrive there. As a result, even when\nthe teacher network is shallow, the student network usually needs to be deeper, otherwise it will\nunderfit. 
However, both our theorem and our experiment show that if the shallow teacher network\nis in a pretty large region near identity (Figure 2), SGD always converges to the global minimum\nby initializing the weights I + W in this region, with an equally shallow student network. By contrast,\na wrong initialization gets stuck at a local minimum and underfits. See Section 5.2.\n\nRelated Work\n\nExpressivity. Even a two-layer network has great expressive power. For example, a two-layer network\nwith sigmoid activations could approximate any continuous function [22, 7, 3]. ReLU is the state-of-the-art\nactivation function [30, 13], and has great expressive power as well [29, 32, 31, 4, 26].\nLearning. Most previous results on learning neural networks are negative [39, 28, 38], or positive but\nwith algorithms other than SGD [23, 43, 37, 14, 15, 16], or with strong assumptions on the model\n[1, 2]. [35] proved that with high probability, there exists a continuous decreasing path from a random\ninitial point to the global minimum, but SGD may not follow this path. Recently, Zhong et al. showed\nthat with an initialization point found using tensor decomposition, gradient descent could find the ground\ntruth for a one hidden layer network [44].\nLinear network and independent activation. Some previous works simplified the model by ignoring\nthe activation functions and considering deep linear networks [36, 24] or deep linear residual\nnetworks [19], which can only learn linear functions. Some previous results are based on the independent\nactivation assumption that the activations of ReLU and the input are independent [5, 24].\nSaddle points. It is observed that saddle points are not a big problem for neural networks [9, 18]. In\ngeneral, if the objective is strict-saddle [11], SGD could escape all saddle points.\n\n2 Preliminaries\n\nDenote x as the input vector in $\\mathbb{R}^d$. 
For now, we first consider x sampled from the normal distribution $N(0, I)$. Denote $W^* = (w_1^*, \\cdots, w_n^*) \\in \\mathbb{R}^{d \\times d}$ as the weights for the teacher network, $W = (w_1, \\cdots, w_n) \\in \\mathbb{R}^{d \\times d}$ as the weights for the student network, where $w_i^*, w_i \\in \\mathbb{R}^d$ are column\nvectors. $f(x, W^*)$, $f(x, W)$ are defined in (1), representing the teacher and student network.\nWe want to know whether a randomly initialized W will converge to W\u2217, if we run SGD with the $\\ell_2$\nloss defined in (2). Alternatively, we can write the loss L(W) as\n\n$$L(W) = \\mathbb{E}_x\\Big[\\Big(\\sum_i \\mathrm{ReLU}(\\langle e_i + w_i, x\\rangle) - \\sum_i \\mathrm{ReLU}(\\langle e_i + w_i^*, x\\rangle)\\Big)^2\\Big]$$\n\nTaking derivative with respect to $w_j$, we get\n\n$$\\nabla L(W)_j = 2\\,\\mathbb{E}_x\\Big[\\Big(\\sum_i \\mathrm{ReLU}(\\langle e_i + w_i, x\\rangle) - \\sum_i \\mathrm{ReLU}(\\langle e_i + w_i^*, x\\rangle)\\Big)\\, x\\, \\mathbf{1}_{\\langle e_j + w_j, x\\rangle \\geq 0}\\Big]$$\n\nwhere $\\mathbf{1}_e$ is the indicator function that equals 1 if the event e is true, and 0 otherwise. Here\n$\\nabla L(W) \\in \\mathbb{R}^{d \\times d}$, and $\\nabla L(W)_j$ is its j-th column.\nDenote $\\theta_{i,j}$ as the angle between $e_i + w_i$ and $e_j + w_j$, $\\theta_{i^*,j}$ as the angle between $e_i + w_i^*$ and $e_j + w_j$.\nDenote $\\bar{v} = \\frac{v}{\\|v\\|_2}$. Denote $\\overline{I + W^*}$ and $\\overline{I + W}$ as the column-normalized version of $I + W^*$ and\n$I + W$ such that every column has unit norm. Since the input is from a normal distribution, one can\ncompute the expectation inside the gradient as follows.\n\nLemma 2.1 (Eqn (13) from [41]). If $x \\sim N(0, I)$, then\n\n$$-\\nabla L(W)_j = \\sum_{i=1}^d \\Big( \\frac{\\pi}{2}(w_i^* - w_i) + \\Big(\\frac{\\pi}{2} - \\theta_{i^*,j}\\Big)(e_i + w_i^*) - \\Big(\\frac{\\pi}{2} - \\theta_{i,j}\\Big)(e_i + w_i) + \\big(\\|e_i + w_i^*\\|_2 \\sin\\theta_{i^*,j} - \\|e_i + w_i\\|_2 \\sin\\theta_{i,j}\\big)\\,\\overline{e_j + w_j} \\Big)$$\n\nRemark. Although the gradient of ReLU is not well defined at the point of zero, if we assume input x\nis from the Gaussian distribution, the loss function becomes smooth, and the gradient is well defined\neverywhere.\nDenote $u \\in \\mathbb{R}^d$ as the all one vector. Denote Diag(W) as the diagonal matrix of matrix W,\nDiag(v) as a diagonal matrix whose main diagonal equals to the vector v. Denote $\\mathrm{Off\\text{-}Diag}(W) \\triangleq W - \\mathrm{Diag}(W)$. Denote [d] as the set $\\{1, \\cdots, d\\}$. Throughout the paper, we abuse the notation of\ninner product between matrices $W, W^*, \\nabla L(W)$, such that $\\langle \\nabla L(W), W\\rangle$ means the summation of\nthe entrywise products. $\\|W\\|_2$ is the spectral norm of W, and $\\|W\\|_F$ is the Frobenius norm of W.\nWe define the potential function g and variables $g_j$, $A_j$, A below, which will be useful in the proof.\n\nDefinition 2.2. We define the potential function $g \\triangleq \\sum_{i=1}^d (\\|e_i + w_i^*\\|_2 - \\|e_i + w_i\\|_2)$, and variable $g_j \\triangleq \\sum_{i \\neq j} (\\|e_i + w_i^*\\|_2 - \\|e_i + w_i\\|_2)$.\n\nFigure 3: Phase I: W1 \u2192 W6, W may go to\nthe wrong direction but the potential is shrinking.\nPhase II: W6 \u2192 W10, W gets closer to W\u2217 in\nevery step by one point convexity. [Diagram labels: W1, W6, W10, W\u2217.]\n\nDefinition 2.3. 
Denote $A_j \\triangleq \\sum_{i \\neq j} \\big((e_i + w_i^*)\\,\\overline{e_i + w_i^*}^\\top - (e_i + w_i)\\,\\overline{e_i + w_i}^\\top\\big)$, $A \\triangleq \\sum_{i=1}^d \\big((e_i + w_i^*)\\,\\overline{e_i + w_i^*}^\\top - (e_i + w_i)\\,\\overline{e_i + w_i}^\\top\\big) = (I + W^*)\\,\\overline{I + W^*}^\\top - (I + W)\\,\\overline{I + W}^\\top$.\n\nFigure 4: The function is one point strongly convex\nas every point\u2019s negative gradient points to\nthe center, but not convex as any line between\nthe center and the red region is below surface.\n\nIn this paper, we consider the standard SGD with mini batch method for training the neural network.\nAssume $W_0$ is the initial point, and in step t > 0, we have the following updating rule:\n\n$$W_{t+1} = W_t - \\eta_t G_t$$\n\nwhere the stochastic gradient $G_t = \\nabla L(W_t) + E_t$ with $\\mathbb{E}[E_t] = 0$ and $\\|E_t\\|_F \\leq \\varepsilon$. Let $G_2 \\triangleq 6d\\gamma + \\varepsilon$, $G_F \\triangleq 6d^{1.5}\\gamma + \\varepsilon$, where $\\gamma$ is the upper bound of $\\|W^*\\|_2$ and $\\|W_0\\|_2$ (defined later). As\nwe will see in Lemma C.2, they are the upper bounds of $\\|G_t\\|_2$ and $\\|G_t\\|_F$ respectively.\nIt\u2019s clear that L is not convex. In order to get convergence guarantees, we need a weaker condition\ncalled one point convexity.\nDefinition 2.4 (One point strong convexity). A function f(x) is called $\\delta$-one point strongly convex\nin domain D with respect to point $x^*$, if $\\forall x \\in D$, $\\langle -\\nabla f(x), x^* - x\\rangle > \\delta\\|x^* - x\\|_2^2$.\n\nBy definition, if a function f is strongly convex, it is also one point strongly convex in the entire space\nwith respect to the global minimum. However, the reverse is not necessarily true, e.g., see Figure\n4. If a function is one point strongly convex, then in every step a positive fraction of the negative\ngradient is pointing to the optimal point. 
As long as the step size is small enough, we will finally\narrive at the optimal point, possibly by a winding path. See Figure 3 for illustration, where starting from\nW6 (Phase II), we get closer to W\u2217 in every step. Formally, we have the following lemma.\nLemma 2.5. For function f(W), consider the SGD update $W_{t+1} = W_t - \\eta G_t$, where $\\mathbb{E}[G_t] = \\nabla f(W_t)$, $\\mathbb{E}[\\|G_t\\|_F^2] \\leq G^2$. Suppose for all t, $W_t$ is always inside the $\\delta$-one point strongly convex\nregion with diameter D, i.e., $\\|W_t - W^*\\|_F \\leq D$. Then for any $\\alpha > 0$ and any T such that $T^\\alpha \\log T \\geq \\frac{D^2\\delta^2}{(1+\\alpha)G^2}$, we have $\\mathbb{E}\\|W_T - W^*\\|_F^2 \\leq \\frac{(1+\\alpha)\\log T\\, G^2}{\\delta^2 T}$, if $\\eta = \\frac{(1+\\alpha)\\log T}{\\delta T}$.\n\nThe proof can be found in Appendix J. Lemma 2.5 uses a fixed step size, so it easily fits the standard\npractical scheme that shrinks $\\eta$ by a factor of 10 after every few epochs. For example, we may\napply Lemma 2.5 every time $\\eta$ gets changed. Notice that our lemma does not imply that $W_T$ will\nconverge to W\u2217. Instead, it only says $W_T$ will be sufficiently close to W\u2217 with small step size $\\eta$.\n\n3 Main Theorem\n\nTheorem 3.1 (Main Theorem). There exist constants $\\gamma > \\gamma_0 > 0$ such that if $x \\sim N(0, I)$, $\\|W_0\\|_2, \\|W^*\\|_2 \\leq \\gamma_0$, $d \\geq 100$, $\\varepsilon \\leq \\gamma^2$, then SGD for L(W) will find the ground truth W\u2217 by\ntwo phases. In Phase I, by setting $\\eta \\leq \\frac{\\gamma^2}{G_2^2}$, the potential function will keep decreasing until it is\nsmaller than $197\\gamma^2$, which takes at most $\\frac{1}{16\\eta}$ steps. In Phase II, for any $\\alpha > 0$ and any T such that $T^\\alpha \\log T \\geq \\frac{36d}{100^4(1+\\alpha)G_F^2}$, if we set $\\eta = \\frac{(1+\\alpha)\\log T}{\\delta T}$, we have $\\mathbb{E}\\|W_T - W^*\\|_F^2 \\leq \\frac{100^2(1+\\alpha)\\log T\\, G_F^2}{9T}$.\n\nRemarks. Randomly initializing the weights with $O(1/\\sqrt{d})$ is standard in deep learning, see\n[27, 12, 20]. It is also well known that if the entries are initialized with $O(1/\\sqrt{d})$, the spectral norm\nof the random matrix is O(1) [33]. So our result matches the common practice. Moreover, as we\nwill show in Section 5.5, networks with small average spectral norm already have good performance.\nThus, our assumption $\\|W^*\\|_2 = O(1)$ is reasonable. Notice that here we assume the spectral norm\nof W\u2217 to be constant, which means the Frobenius norm $\\|W^*\\|_F$ could be as big as $O(\\sqrt{d})$.\nThe assumption that the input follows a Gaussian distribution is not necessarily true in practice\n(although this is a common assumption that appeared in the previous papers [5, 41, 42], and is also\nconsidered plausible in [6]). We could easily generalize the analysis to rotation invariant distributions,\nand potentially more general distributions (see Section 6). Moreover, previous analyses either ignore\nthe nonlinear activations and thus consider a linear model [36, 24, 19], or directly [5, 24] or indirectly\n[41] (see footnote 1) assume that the activations are independent. By contrast, in our model the ReLU activations\nare highly correlated (see footnote 2) as $\\|W\\|_2, \\|W^*\\|_2 = \\Omega(1)$. As pointed out by [6], eliminating the unrealistic\nassumptions on activation independence is the central problem of analyzing the loss surface of neural\nnetworks, which was not fully addressed by the previous analyses.\nTo prove the main theorem, we split the process and present the following two theorems, which will\nbe proved in Appendix C and D.\nTheorem 3.2 (Phase I). 
There exist constants $\\gamma > \\gamma_0 > 0$ such that if $\\|W_0\\|_2, \\|W^*\\|_2 \\leq \\gamma_0$, $d \\geq 100$, $\\eta \\leq \\frac{\\gamma^2}{G_2^2}$, $\\varepsilon \\leq \\gamma^2$, then $g_t$ will keep decreasing by a factor of $1 - 0.5\\eta d$ for every step,\nuntil $g_{t_1} \\leq 197\\gamma^2$ for step $t_1 \\leq \\frac{1}{16\\eta}$. After that, Phase II starts. That is, for every $T > t_1$, we have $\\|W_T\\|_2 \\leq \\frac{1}{100}$ and $g_T \\leq 0.1$.\nTheorem 3.3 (Phase II). There exists a constant $\\gamma$ such that if $\\|W\\|_2, \\|W^*\\|_2 \\leq \\gamma$, and $g \\leq 0.1$, then\n\n$$\\langle -\\nabla L(W), W^* - W\\rangle = \\sum_{j=1}^d \\langle -\\nabla L(W)_j, w_j^* - w_j\\rangle > 0.03\\|W^* - W\\|_F^2.$$\n\nWith these two theorems, we get the main theorem immediately.\n\nProof for Theorem 3.1. By Theorem 3.2, we know the statement for Phase I is true, and we will enter\nPhase II in $\\frac{1}{16\\eta}$ steps. After entering Phase II, based on Theorem 3.3, we simply use Lemma 2.5 by\nsetting $\\delta = 0.03$, $D = \\frac{\\sqrt{d}}{50}$, $G = G_F$ to get the convergence guarantee.\n\nFootnote 1: They assume input is Gaussian and the W\u2217 is orthonormal, which means the activations are independent in the\nteacher network.\n\nFootnote 2: Let $\\sigma_i$ be the output of the i-th ReLU unit, then in our setting, $\\sum_{i,j} \\mathrm{Cov}[\\sigma_i, \\sigma_j]$ can be as large as $\\Omega(d)$, which\nis far from being independent.\n\n4 Overview of the Proofs\n\nGeneral Picture. In many convergence analyses for non-convex functions, one would like to show\nthat L is one point strongly convex, and directly apply Lemma 2.5 to get the convergence result.\nHowever, this is not true for a 2-layer neural network, as the gradient may point in the wrong direction,\nsee Section 5.3.\nSo when is our L one point convex? Consider the following thought experiment: First, suppose\n$\\|W\\|_2, \\|W^*\\|_2 \\to 0$, we know $\\|w_i\\|_2, \\|w_i^*\\|_2$ also go to 0. Thus, $e_i + w_i$ and $e_i + w_i^*$ are close to $e_i$.\nAs a result, $\\theta_{i,j}, \\theta_{i^*,j} \\approx \\frac{\\pi}{2}$ for $i \\neq j$, and $\\theta_{i^*,i} \\approx 0$. Based on Lemma 2.1, this gives us a na\u00efve approximation\nof the negative gradient, i.e.,\n\n$$-\\nabla L(W)_j \\approx \\frac{\\pi}{2}(w_j^* - w_j) + \\frac{\\pi}{2}\\sum_{i=1}^d (w_i^* - w_i) + \\overline{e_j + w_j}\\sum_{i \\neq j}(\\|e_i + w_i^*\\|_2 - \\|e_i + w_i\\|_2).$$\n\nWhile the first two terms $\\frac{\\pi}{2}(w_j^* - w_j)$ and $\\frac{\\pi}{2}\\sum_{i=1}^d (w_i^* - w_i)$ have positive inner product with $W^* - W$,\nthe last term $\\overline{e_j + w_j}\\, g_j = \\overline{e_j + w_j}\\sum_{i \\neq j}(\\|e_i + w_i^*\\|_2 - \\|e_i + w_i\\|_2)$ can point in an arbitrary direction. If the last\nterm is small, it can be covered by the first two terms, and L becomes one point strongly convex. So\nwe define a potential function closely related to the last term: $g = \\sum_{i=1}^d (\\|e_i + w_i^*\\|_2 - \\|e_i + w_i\\|_2)$.\nWe show that if g is small enough, L is also one point strongly convex (Theorem 3.3).\nHowever, from random initialization, g can be as large as $\\Omega(\\sqrt{d})$, which is too big to be covered.\nFortunately, we show that if g is big, it will gradually decrease simply by doing SGD on L. More\nspecifically, we introduce a two-phase convergence analysis framework:\n\n1. In Phase I, the potential function g is decreasing to a small value.\n2. In Phase II, g remains small, so L is one point convex and thus W starts to converge to W\u2217.\n\nWe believe that this framework could be helpful for other non-convex problems.\nTechnical difficulty: Phase I. Our key technical challenge is to show that in Phase I, the potential\nfunction actually decreases to O(1) after a polynomial number of iterations. However, we cannot show\nthis by merely looking at g itself. Instead, we introduce an auxiliary variable $s = (W^* - W)u$,\nwhere u is the all one vector. By doing a careful calculation, we get their joint update rules (Lemma\nC.3 and Lemma C.4):\n\n$$s_{t+1} \\approx s_t - \\frac{\\pi\\eta d}{2} s_t + \\eta\\, O(\\sqrt{d}\\, g_t + \\sqrt{d}\\,\\gamma)$$\n$$g_{t+1} \\approx g_t - \\eta d\\, g_t + \\eta\\, O(\\gamma\\sqrt{d}\\,\\|s_t\\|_2 + d\\gamma^2)$$\n\nSolving this dynamics, we can show that $g_t$ will approach (and stay around) $O(\\gamma)$, thus we enter\nPhase II.\nTechnical difficulty: Phase II. Although the overall approximation in the thought experiment looks\nsimple, the argument is based on an oversimplified assumption that $\\theta_{i^*,j}, \\theta_{i,j} \\approx \\frac{\\pi}{2}$ for $i \\neq j$.\nHowever, when W\u2217 has constant spectral norm, even when W is very close to W\u2217, $\\theta_{i^*,j}$ could be\nconstantly far away from $\\frac{\\pi}{2}$, which prevents us from applying this approximation directly. To get a\nformal proof, we use the standard Taylor expansion and control the higher order terms. Specifically,\nwe write $\\theta_{i^*,j}$ as $\\theta_{i^*,j} = \\arccos\\langle \\overline{e_i + w_i^*}, \\overline{e_j + w_j}\\rangle$ and expand arccos at point 0, thus,\n\n$$\\theta_{i^*,j} = \\frac{\\pi}{2} - \\langle \\overline{e_i + w_i^*}, \\overline{e_j + w_j}\\rangle + O(\\langle \\overline{e_i + w_i^*}, \\overline{e_j + w_j}\\rangle^3)$$\n\nHowever, even when $W \\approx W^*$, the higher order term $O(\\langle \\overline{e_i + w_i^*}, \\overline{e_j + w_j}\\rangle^3)$ still can be as large\nas a constant, which is too big for us. Our trick here is to consider the \u201cjoint Taylor expansion\u201d:\n\n$$\\theta_{i^*,j} - \\theta_{i,j} = \\langle \\overline{e_i + w_i} - \\overline{e_i + w_i^*}, \\overline{e_j + w_j}\\rangle + O(|\\langle \\overline{e_i + w_i^*}, \\overline{e_j + w_j}\\rangle^3 - \\langle \\overline{e_i + w_i}, \\overline{e_j + w_j}\\rangle^3|)$$\n\nAs W approaches W\u2217, $|\\langle \\overline{e_i + w_i^*}, \\overline{e_j + w_j}\\rangle^3 - \\langle \\overline{e_i + w_i}, \\overline{e_j + w_j}\\rangle^3|$ also tends to zero, therefore\nour approximation has bounded error.\nIn the thought experiment, we already know that the constant part in the Taylor expansion of $\\nabla L(W)$\nis $\\frac{\\pi}{2} - O(g)$-one point convex. We show that after taking inner product with $W^* - W$, the first\norder terms are lower bounded by (roughly) $-1.3\\|W^* - W\\|_F^2$ and the higher order terms are lower\nbounded by $-0.085\\|W^* - W\\|_F^2$. Adding them together, we can see that L(W) is one point convex\nas long as g is small. See Figure 5.\n\nFigure 5: Lower bounds of inner product using Taylor expansion. [Diagram: the inner product of (constant part + first order + higher order) with $W^* - W$ is bounded term by term: constant part $\\geq [\\frac{\\pi}{2} - O(g)]\\|W^* - W\\|_F^2$, first order $\\geq -1.3\\|W^* - W\\|_F^2$, higher order $\\geq -0.085\\|W^* - W\\|_F^2$; the diagram cites Lemma D.1, Lemma D.2 + Lemma D.3, and Lemma B.2.]\n\nGeometric Lemma. In order to get through the whole analysis, we need tight bounds on a few\ncommon terms that appear everywhere. Instead of using na\u00efve algebraic techniques, we come up with\na nice geometric proof to get nearly optimal bounds. Due to space limit, we defer it to Appendix E.\n\n5 Experiments\n\nIn this section, we present several simulation results to support our theory. Our code can be found in\nthe supplementary materials.\n\n5.1 Importance of identity mapping\n\nIn this experiment, we compare the standard ResNet [21] and the single skip model where the identity\nmapping skips only one layer. See Figure 6 for the single skip model. We also ran the vanilla network,\nwhere the identity mappings are completely removed.\n\nTable 1: Test error of three 56-layer networks on Cifar-10\n\n           ResNet   Single skip   Vanilla\nTest Err   6.97%    9.01%         12.04%\n\nFigure 6: Illustration of one block in single skip\nmodel in Sec 5.1. [Diagram: input \u2192 Convolution \u2192 BatchNorm \u2192 \u2295 (with identity link from the input) \u2192 ReLU \u2192 output.]\n\nFigure 7: Verifying the global convergence. (a) Test Error, Train Error; (b) $\\|W^* - W\\|_F$, $\\|W\\|_F$.\n\nIn this experiment, we choose Cifar-10 as the dataset, and all the networks have 56 layers. Other than\nthe identity mappings, all other settings are identical and default. We run the experiments 5 times\nand report the average test error. As we can see in Table 1, compared with the vanilla network, by simply\nusing a single skip identity mapping, one can already improve the test error by 3.03%, and come within 2.04%\nof the ResNet. 
So single skip identity mapping brings signi\ufb01cant improvement on test accuracy.\n\n5.2 Global minimum convergence\n\nIn this experiment, we verify our main theorem that for two-layer teacher network and student network\nwith identity mappings, as long as (cid:107)W0(cid:107)2,(cid:107)W\u2217(cid:107)2 is small, SGD always converges to the global\nminimum W\u2217, thus gives almost 0 training error and test error. We consider three student networks.\nThe \ufb01rst one (ResLink) is de\ufb01ned using (2), the second one (Vanilla) is the same model without the\nidentity mapping. The last one (3-Block) is a three block network with each block containing a linear\nlayer (500 hidden nodes), a batch normalization and a ReLU layer. The teacher network always\nshares the same structure as the student network.\nThe input dimension is 100. We generated a \ufb01xed W\u2217 for all the trials with (cid:107)W\u2217(cid:107)2 \u2248 0.6,(cid:107)W\u2217(cid:107)F \u2248\n5.7. We generated a training set of size 100, 000, and test set of size 10, 000, sampled from a Gaussian\ndistribution. We use batch size 200, step size 0.001. We run ResLink for 5 times with random\ninitialization ((cid:107)W(cid:107)2 \u2248 0.6 and (cid:107)W(cid:107)F \u2248 5), and plot the curves by taking the average.\nFigure 7(a) shows test error and training error of the three networks. Comparing Vanilla with 3-Block,\nwe \ufb01nd that 3-Block is more expressive, so its training error is smaller compared with vanilla network;\nbut it suffers from over\ufb01tting and has bigger test error. This is the standard over\ufb01tting vs under\ufb01tting\ntradeoff. Surprisingly, with only one hidden layer, ResLink has both zero test error and training\nerror. If we look at Figure 7(b), we know the distance between W and W\u2217 converges to 0, meaning\nResLink indeed \ufb01nds the global optimal in all 5 trials. 
By contrast, for the vanilla network, which is\nessentially the same network with different initialization, $\\|W - W^*\\|_2$ does not converge to zero (see footnote 3).\nThis is exactly what our theory predicted.\n\nFootnote 3: To make the comparison meaningful, we set W \u2212 I to be the actual weight for Vanilla as its identity mapping\nis missing, which is why it has a much bigger initial norm.\n\n[Figure 7 plots: (a) loss vs. epochs for Test and Train (ResLink, Vanilla, 3-Block); (b) l2 norm vs. epochs for W and W-W* (ResLink, Vanilla).]\n\n5.3 Verify the dynamics\n\nIn this experiment, we verify our claims on the dynamics. Based on the analysis, we construct a\n1500 \u00d7 1500 matrix W s.t. $\\|W\\|_2 \\approx 0.15$, $\\|W\\|_F \\approx 5$, and set W\u2217 = 0. By plugging them into (2),\none can see that even in this simple case that W\u2217 = 0, initially the gradient is pointing in the wrong\ndirection, i.e., not one point convex. We then run SGD on W by using samples x from a Gaussian\ndistribution, with batch size 300, step size 0.0001.\n\nFigure 8: Verifying the dynamics. (a) First 100 iterations; (b) The entire process.\n\nFigure 8(a) shows the first 100 iterations. We can see that initially the inner product defined in\nDefinition 2.4 is negative, then after about 15 iterations, it turns positive, which means W is in the\none point strongly convex region. At the same time, the potential g keeps decreasing to a small value,\nwhile the distance to optimal (which also equals $\\|W\\|_F$ in this experiment) is not affected. They\nprecisely match our description of Phase I in Theorem 3.2.\nAfter that, we enter Phase II and slowly approach W\u2217, see Figure 8(b). Notice that the potential\ng is always very small, the inner product is always positive, and the distance to optimal is slowly\ndecreasing. 
Again, they precisely match our Theorem 3.3.\n\n5.4 Zero initialization works\n\nIn this experiment, we used a simple 5-block neural network on MNIST, where every block contains\na 784 \u00d7 784 feedforward layer, an identity mapping, and a ReLU layer. Cross entropy criterion is\nused. We compare zero initialization with standard $O(1/\\sqrt{d})$ random initialization. We found that\nfor zero initialization, we can get 1.28% test error, while for random initialization, we can get 1.27%\ntest error. Both results were obtained by taking the average over 5 runs, using step size 0.1 and batch\nsize 256. If the identity mapping is removed, zero initialization no longer works.\n\n5.5 Spectral norm of W\u2217\n\nWe also applied the exact model f defined in (1) to distinguish two classes in MNIST. For any input\nimage x, we say it\u2019s in class A if $f(x, W) < T_{A,B}$, and in class B otherwise. Here $T_{A,B}$ is the\noptimal threshold for the function f(x, 0) to distinguish A and B. If W = 0, we get 7% training\nerror for distinguishing class 0 and class 1. However, it can be improved to 1% with $\\|W\\|_2 = 0.6$.\nWe tried this experiment for all possible 45 pairs of classes in MNIST, and improved the average\ntraining error from 34% (using W = 0) to 14% (using $\\|W\\|_2 = 0.6$). Therefore our model with\n$\\|W\\|_2 = \\Omega(1)$ has reasonable expressive power, and is substantially different from just using the\nidentity mapping alone.\n\n6 Discussions\n\nThe assumption that the input is Gaussian can be relaxed in several ways. For example, when\nthe distribution is $N(0, \\Sigma)$ where $\\|\\Sigma - I\\|_2$ is bounded by a small constant, the same result holds\nwith slightly worse constants. Moreover, since the analysis relies on Lemma 2.1, which is proved by\nconverting the original input space into polar space, it is easy to generalize the calculation to rotation\ninvariant distributions. 
Finally, for more general distributions, as long as we can explicitly compute\nthe expectation, which is of the form O(W\u2217 \u2212 W) plus a certain potential function, our analysis\nframework may also be applied.\nThere are many exciting open problems. For example, our paper is the first to give a solid\nSGD analysis for neural networks with nonlinear activations without unrealistic assumptions such as\nthe independent activation assumption, and it would be great if one could further extend it to multiple layers,\nwhich would be a major breakthrough in understanding optimization for deep learning. Moreover,\nour two-phase framework could be applied to other non-convex problems as well.\n\n[Figure 8 plots: Distance to optimal, Inner product, Potential g, and Loss versus iterations, with Phase I (P-I) and Phase II (P-II) marked]\n\nAcknowledgement\n\nThe authors want to thank Robert Kleinberg, Kilian Weinberger, Gao Huang, Adam Klivans and\nSurbhi Goel for helpful discussions, and the anonymous reviewers for their comments.\n\nReferences\n[1] Alexandr Andoni, Rina Panigrahy, Gregory Valiant, and Li Zhang. Learning polynomials with neural networks. In ICML, pages 1908\u20131916, 2014.\n\n[2] Sanjeev Arora, Aditya Bhaskara, Rong Ge, and Tengyu Ma. Provable bounds for learning some deep representations. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 584\u2013592, 2014.\n\n[3] Andrew R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Information Theory, 39(3):930\u2013945, 1993.\n\n[4] Leo Breiman. Hinging hyperplanes for regression, classification, and function approximation. IEEE Trans. Information Theory, 39(3):999\u20131013, 1993.\n\n[5] Anna Choromanska, Mikael Henaff, Micha\u00ebl Mathieu, G\u00e9rard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. 
In AISTATS, 2015.\n\n[6] Anna Choromanska, Yann LeCun, and G\u00e9rard Ben Arous. Open problem: The landscape of the loss surfaces of multilayer networks. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3-6, 2015, pages 1756\u20131760, 2015.\n\n[7] George Cybenko. Approximation by superpositions of a sigmoidal function. MCSS, 5(4):455, 1992.\n\n[8] Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In NIPS, pages 2253\u20132261, 2016.\n\n[9] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In NIPS 2014, pages 2933\u20132941, 2014.\n\n[10] John C. Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121\u20132159, 2011.\n\n[11] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points - online stochastic gradient for tensor decomposition. In COLT 2015, volume 40, pages 797\u2013842, 2015.\n\n[12] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, pages 249\u2013256, 2010.\n\n[13] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In AISTATS, pages 315\u2013323, 2011.\n\n[14] Surbhi Goel, Varun Kanade, Adam R. Klivans, and Justin Thaler. Reliably learning the relu in polynomial time. CoRR, abs/1611.10258, 2016.\n\n[15] Surbhi Goel and Adam Klivans. Eigenvalue decay implies polynomial-time learnability for neural networks. In NIPS 2017, 2017.\n\n[16] Surbhi Goel and Adam Klivans. Learning Depth-Three Neural Networks in Polynomial Time. ArXiv e-prints, 2017.\n\n[17] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 
Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.\n\n[18] Ian J. Goodfellow and Oriol Vinyals. Qualitatively characterizing neural network optimization problems. CoRR, abs/1412.6544, 2014.\n\n[19] Moritz Hardt and Tengyu Ma. Identity matters in deep learning. CoRR, abs/1611.04231, 2016.\n\n[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, pages 1026\u20131034, 2015.\n\n[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770\u2013778, 2016.\n\n[22] Kurt Hornik, Maxwell B. Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359\u2013366, 1989.\n\n[23] Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473, 2015.\n\n[24] Kenji Kawaguchi. Deep learning without poor local minima. In NIPS, pages 586\u2013594, 2016.\n\n[25] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.\n\n[26] J. M. Klusowski and A. R. Barron. Risk Bounds for High-dimensional Ridge Function Combinations Including Neural Networks. ArXiv e-prints, July 2016.\n\n[27] Yann LeCun, Leon Bottou, Genevieve B. Orr, and Klaus Robert M\u00fcller. Efficient BackProp, pages 9\u201350. Springer Berlin Heidelberg, Berlin, Heidelberg, 1998.\n\n[28] Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training neural networks. In NIPS, pages 855\u2013863, 2014.\n\n[29] Guido F. Mont\u00fafar, Razvan Pascanu, KyungHyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In NIPS, pages 2924\u20132932, 2014.\n\n[30] Vinod Nair and Geoffrey E. Hinton. 
Rectified linear units improve restricted boltzmann machines. In ICML, pages 807\u2013814, 2010.\n\n[31] Xingyuan Pan and Vivek Srikumar. Expressiveness of rectifier networks. In ICML, pages 2427\u20132435, 2016.\n\n[32] Razvan Pascanu, Guido Mont\u00fafar, and Yoshua Bengio. On the number of inference regions of deep feed forward networks with piece-wise linear activations. CoRR, abs/1312.6098, 2013.\n\n[33] M. Rudelson and R. Vershynin. Non-asymptotic theory of random matrices: extreme singular values. ArXiv e-prints, 2010.\n\n[34] David Saad and Sara A. Solla. Dynamics of on-line gradient descent learning for multilayer neural networks. Advances in Neural Information Processing Systems, 8:302\u2013308, 1996.\n\n[35] Itay Safran and Ohad Shamir. On the quality of the initial basin in overspecified neural networks. In ICML, pages 774\u2013782, 2016.\n\n[36] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. CoRR, abs/1312.6120, 2013.\n\n[37] Hanie Sedghi and Anima Anandkumar. Provable methods for training neural networks with sparse connectivity. ICLR, 2015.\n\n[38] Ohad Shamir. Distribution-specific hardness of learning neural networks. CoRR, abs/1609.01037, 2016.\n\n[39] Jir\u00ed S\u00edma. Training a single sigmoidal neuron is hard. Neural Computation, 14(11):2709\u20132728, 2002.\n\n[40] Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey E. Hinton. On the importance of initialization and momentum in deep learning. In ICML, pages 1139\u20131147, 2013.\n\n[41] Yuandong Tian. Symmetry-breaking convergence analysis of certain two-layered neural networks with relu nonlinearity. In Submitted to ICLR 2017, 2016.\n\n[42] Bo Xie, Yingyu Liang, and Le Song. Diversity leads to generalization in neural networks. In AISTATS, 2017.\n\n[43] Yuchen Zhang, Jason D. Lee, Martin J. Wainwright, and Michael I. 
Jordan. Learning halfspaces and neural networks with random initialization. CoRR, abs/1511.07948, 2015.\n\n[44] Kai Zhong, Zhao Song, Prateek Jain, Peter L. Bartlett, and Inderjit S. Dhillon. Recovery guarantees for one-hidden-layer neural networks. In ICML 2017, 2017.", "award": [], "sourceid": 409, "authors": [{"given_name": "Yuanzhi", "family_name": "Li", "institution": "Princeton University"}, {"given_name": "Yang", "family_name": "Yuan", "institution": "Cornell University"}]}