{"title": "Global Convergence of Gradient Descent for Deep Linear Residual Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 13389, "page_last": 13398, "abstract": "We analyze the global convergence of gradient descent for deep linear residual\n networks by proposing a new initialization: zero-asymmetric (ZAS)\n initialization. It is motivated by avoiding stable manifolds of saddle points.\n We prove that under the ZAS initialization, for an arbitrary target matrix,\n gradient descent converges to an $\\varepsilon$-optimal point in $O\\left( L^3\n \\log(1/\\varepsilon) \\right)$ iterations, which scales polynomially with the\n network depth $L$. Our result and the $\\exp(\\Omega(L))$ convergence time for the \n standard initialization (Xavier or near-identity)\n \\cite{shamir2018exponential} together demonstrate the importance of the\n residual structure and the initialization in the optimization for deep linear\n neural networks, especially when $L$ is large.", "full_text": "Global Convergence of Gradient Descent\n\nfor Deep Linear Residual Networks\n\nLei Wu\u2217 Qingcan Wang\u2217 Chao Ma\n\nProgram in Applied and Computational Mathematics\n\nPrinceton University\n\nPrinceton, NJ 08544, USA\n\n{leiwu,qingcanw,chaom}@princeton.edu\n\nAbstract\n\nWe analyze the global convergence of gradient descent for deep linear residual\nnetworks by proposing a new initialization: zero-asymmetric (ZAS) initialization.\nIt is motivated by avoiding stable manifolds of saddle points. We prove that under\nthe ZAS initialization, for an arbitrary target matrix, gradient descent converges\n\nto an \u03b5-optimal point in O(cid:0)L3 log(1/\u03b5)(cid:1) iterations, which scales polynomially\n\nwith the network depth L. 
Our result and the exp(Ω(L)) convergence time for the standard initialization (Xavier or near-identity) [18] together demonstrate the importance of the residual structure and the initialization in the optimization of deep linear neural networks, especially when L is large.

1 Introduction

It is widely observed that simple gradient-based optimization algorithms are efficient for training deep neural networks [21], whose landscape is highly non-convex. Traditional optimization theories cannot directly explain this efficiency; the special structures of neural networks must be taken into consideration. Recently, much research has been devoted to this topic [13, 21, 4, 7, 6, 1, 23, 15, 16], but the theoretical understanding is still far from sufficient.
In this paper, we focus on a simplified case: the deep linear neural network

f(x; W_1, …, W_L) = W_L W_{L−1} ··· W_1 x,    (1.1)

where W_1, …, W_L are the weight matrices and L is the depth. Linear networks are simple since they can only represent linear transformations, but they preserve one of the most important aspects of deep neural networks: the layered structure. Therefore, the analysis of linear networks is helpful for understanding nonlinear cases. For example, the random orthogonal initialization proposed in [17], which analyzes the gradient descent dynamics of deep linear networks, was later shown to be useful for training recurrent networks with long-term dependencies [19].
Despite this simplicity, the optimization of deep linear neural networks is still far from well understood, especially the global convergence. [18] proves that the number of iterations required for convergence can scale exponentially with the depth L. This result requires two conditions: (1) the width of each layer is 1; (2) gradient descent starts from the standard Xavier [9] or near-identity [11] initialization. 
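The exponential slowdown and its cure can be seen already in a few lines of code. The following is a minimal sketch (our own illustration, not code from the paper; the depth, learning rate, and iteration budget are ad hoc choices) of gradient descent on the one-dimensional hard instance R(w) = (w_L ··· w_1 + 1)²/2 from [18], comparing the all-identity initialization with a ZAS-style one that zeros the last weight:

```python
import numpy as np

def loss(w):
    # R(w) = (w_L ... w_1 + 1)^2 / 2, the 1-D instance with target Phi = -1
    return 0.5 * (np.prod(w) + 1.0) ** 2

def grad(w):
    r = np.prod(w) + 1.0
    # dR/dw_l = r * prod_{k != l} w_k
    return np.array([r * np.prod(np.delete(w, l)) for l in range(len(w))])

def run_gd(w0, lr=0.05, steps=2000):
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(steps):
        w -= lr * grad(w)
    return loss(w)

L = 10
identity_loss = run_gd(np.ones(L))  # all-ones init stays on the stable manifold of the saddle
zas = np.ones(L)
zas[-1] = 0.0
zas_loss = run_gd(zas)              # zeroing the last weight avoids the saddle entirely
```

With these (arbitrary) settings the all-identity run stalls near the saddle value 1/2, while the ZAS-style run drives the loss to essentially zero.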
If either of these conditions is violated, the negative result does not imply that gradient descent cannot efficiently learn deep linear networks in general. [5] shows that if the width of every layer increases with the network depth, gradient descent with the Gaussian random initialization does find a global minimum, while the convergence time scales only polynomially with the depth. Here we attempt to circumvent the negative result of [18] by using a better initialization strategy instead of increasing the width.
∗Equal contribution

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Our Contributions We propose the zero-asymmetric (ZAS) initialization, which initializes the output layer WL to zero and all the other layers Wl, l = 1, …, L − 1 to the identity. Equivalently, it is a linear residual network with all the residual blocks and the output layer initialized to zero. We then analyze how this initialization affects the gradient descent dynamics.

• We prove that starting from the ZAS initialization, the number of iterations required for gradient descent to find an ε-optimal point is O(L³ log(1/ε)). The only requirement on the network is that the width of each layer is not less than the input dimension, and the result applies to arbitrary target matrices.

• We numerically compare the gradient descent dynamics of the ZAS and the near-identity initialization for multi-dimensional deep linear networks. The comparison clearly shows that the convergence of gradient descent with the near-identity initialization involves a saddle-point escape process, while the ZAS initialization never encounters any saddle point during the whole optimization process.

• We provide an extension of the ZAS initialization to the nonlinear case. 
Moreover, the numerical experiments justify its superiority compared to the standard initializations.

1.1 Related work

Linear networks The first line of work analyzes the whole landscape. The early work [3] proves that for two-layer linear networks, all local minima are also global minima, and this result is extended to deep linear networks in [13, 14]. [10] provides a simpler proof of this result for deep residual networks, and shows that the Polyak-Łojasiewicz condition is satisfied in a neighborhood of a global minimum. However, these results do not imply that gradient descent can find global minima, and also cannot tell us the number of iterations required for convergence.
The second line of work deals directly with the trajectory of the gradient descent dynamics, and our work lies in this line. [17] provides an analytic analysis of the gradient descent dynamics of linear networks, which nevertheless does not show that gradient descent can find global minima. [12] studies the properties of the solutions that gradient descent converges to, without providing any convergence rate. [4, 2] consider the following simplified objective function for whitened data,

R(W_1, …, W_L) = (1/2) ‖W_L ··· W_1 − Φ‖_F².

Specifically, [4] analyzes the convergence of gradient descent with the identity initialization W_L = ··· = W_1 = I, and proves that if the target matrix Φ is positive semi-definite or the initial loss is small enough, polynomial-time convergence can be guaranteed. [2] extends the analysis to more general target matrices by imposing further conditions on the initialization: (1) an approximate balance condition, ‖W_{l+1}^T W_{l+1} − W_l W_l^T‖_F ≤ δ; (2) a rank-deficient condition, ‖W_L ··· W_1 − Φ‖_F ≤ σ_min(Φ) − c for a constant c > 0. 
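These balancedness quantities are easy to probe numerically. Below is a small sketch (our own illustration with ad hoc dimensions and learning rate, not code from any of the cited papers): under gradient descent on R(W_1, …, W_L) = (1/2)‖W_L ··· W_1 − Φ‖_F², the matrices W_{l+1}^T W_{l+1} − W_l W_l^T change only at order η² per step, so they drift very little over a run:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L, lr = 3, 4, 1e-3
Phi = rng.standard_normal((d, d))
# identity hidden layers, zero output layer (a ZAS-style starting point)
W = [np.eye(d) for _ in range(L - 1)] + [np.zeros((d, d))]

def loss_and_grads(W):
    prod = np.eye(d)
    for Wl in W:
        prod = Wl @ prod                 # W_L ... W_1
    E = prod - Phi
    grads = []
    for l in range(L):
        pre = np.eye(d)                  # W_{l-1} ... W_1
        for Wk in W[:l]:
            pre = Wk @ pre
        suf = np.eye(d)                  # W_L ... W_{l+1}
        for Wk in W[l + 1:]:
            suf = Wk @ suf
        grads.append(suf.T @ E @ pre.T)  # gradient of R w.r.t. W_l
    return 0.5 * np.sum(E * E), grads

def balance(W):
    return [W[l + 1].T @ W[l + 1] - W[l] @ W[l].T for l in range(len(W) - 1)]

R0, _ = loss_and_grads(W)
D0 = balance(W)
for _ in range(100):
    _, grads = loss_and_grads(W)
    W = [Wl - lr * g for Wl, g in zip(W, grads)]
R1, _ = loss_and_grads(W)
drift = max(np.linalg.norm(Dt - D0l) for Dt, D0l in zip(balance(W), D0))
```

After 100 steps the loss has decreased while the largest drift of any balancedness matrix stays tiny (it would vanish exactly under gradient flow).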
The condition (2) still requires a small initial loss, thus the convergence is local in nature. In comparison, we impose no assumption on the target matrix or the initial loss.
As mentioned above, our work is closely related to [18], which proves that for one-dimensional deep linear networks, gradient descent with the standard Xavier or near-identity initialization requires at least exp(Ω(L)) iterations to fit the target matrix Φ = −I. Our result shows that this difficulty can be overcome by adopting a better initialization. [5] shows that if the width of each layer is larger than Ω(L log(L)), then gradient descent converges to global minima at a rate O(log(1/ε)). In comparison, our result only requires that the width of each layer is not less than the input dimension.

Nonlinear networks [6, 1, 23] establish the global convergence for deep networks with width m ≥ poly(n, L), where n denotes the number of training examples. [8] proves a similar result for specific neural networks with long-distance skip connections, which only requires depth L ≥ poly(n) and width m ≥ d + 1, where d is the input dimension.
The ZAS initialization we propose also closely resembles the "fixup initialization" recently proposed in [22]. Therefore, our result partially provides a theoretical explanation for the efficiency of the fixup initialization in training deep residual networks.

2 Preliminaries

Given training data {(x_i, y_i)}_{i=1}^n where x_i ∈ R^{d_x} and y_i ∈ R^{d_y}, a linear neural network with L layers is defined as

f(x; W_1, …, W_L) = W_L W_{L−1} ··· W_1 x,    (2.1)

where W_l ∈ R^{d_l × d_{l−1}}, l = 1, …, L are parameter matrices, and d_0 = d_x, d_L = d_y. The least-squares loss is then

R̃(W_1, …, W_L) := (1/2) ‖W_L W_{L−1} ··· W_2 W_1 X − Y‖_F²,    (2.2)

where X = (x_1, x_2, …, x_n) ∈ R^{d_x × n} and Y = (y_1, y_2, …, y_n) ∈ R^{d_y × n}.
Following [4, 2], in this paper we focus on the following simplified objective function

R(W_1, …, W_L) := (1/2) ‖W_L W_{L−1} ··· W_2 W_1 − Φ‖_F²,    (2.3)

where W_l ∈ R^{d×d}, l = 1, …, L and Φ ∈ R^{d×d} is the target matrix. Here we assume d_l = d, l = 1, …, L for simplicity.
Gradient descent is given by

W_l(t + 1) = W_l(t) − η ∇_l R(t),    l = 1, …, L,  t = 0, 1, 2, …    (2.4)

In the following, we always use the index t to denote the value of a variable after the t-th iteration. ∇_l R is the gradient of R with respect to the weight matrix W_l:

∇_l R := ∂R/∂W_l = W_{L:l+1}^T (W_{L:1} − Φ) W_{l−1:1}^T,

where W_{l2:l1} := W_{l2} W_{l2−1} ··· W_{l1+1} W_{l1}. Moreover, we keep the learning rate η > 0 fixed for all iterations.

Notations In matrix equations, let I and 0 be the d-dimensional identity matrix and zero matrix respectively. Let λ_min(S) be the minimal eigenvalue of a symmetric matrix S and σ_min(A) the minimal singular value of a square matrix A. Let ‖A‖_F and ‖A‖_2 be the Frobenius norm and ℓ2 norm of the matrix A respectively. Recall that A(t) denotes the value of a variable A after the t-th iteration, and ∇_l R is the gradient of R with respect to W_l. We use the standard notation O(·) and Ω(·) to hide constants independent of the network depth L.

3 Zero-asymmetric initialization

In this section, we first describe the zero-asymmetric initialization, and then illustrate by a simple example why this special initialization is helpful for optimization.
Definition. For the deep linear neural network (2.3), define the zero-asymmetric (ZAS) initialization as

(3.1)    W_l(0) = I, l = 1, . . . 
, L − 1, and W_L(0) = 0.

Under the ZAS initialization, the function represented by the network is the zero matrix. While commonly used initializations such as the Xavier and the near-identity initialization treat all layers equally, our initialization treats the output layer differently. In this sense, we call the initialization asymmetric.
Let W_l = I + U_l, l = 1, …, L − 1; then the linear network has the residual form

R = (1/2) ‖W_L (I + U_{L−1}) ··· (I + U_1) − Φ‖_F².

Since ∂R/∂U_l = ∂R/∂W_l, the dynamics are the same as under ZAS if we initialize U_l(0) = W_L(0) = 0. Therefore, ZAS is equivalent to initializing all the residual blocks and the output layer of a linear residual network with zero. From this perspective, the ZAS initialization closely resembles the "fixup initialization" [22] for nonlinear ResNets.

Figure 1: (Left) The landscape of the toy model R(w1, w2) and the two gradient descent trajectories. (Right) The dynamics of the loss for the two gradient descent trajectories. The blue curve is the gradient descent trajectory initialized from (1 − 0.001, 1 + 0.001) (near-identity), and the red curve corresponds to the ZAS initialization (1, 0). We observe that the blue curve spends a long time in the neighborhood of the saddle point (0, 0), while the red curve does not.

Understanding the role of initialization Following [18], consider the following optimization problem for a one-dimensional linear network with target Φ = −1:

R(w_1, w_2, …, w_L) = (w_L w_{L−1} ··· w_1 + 1)² / 2.    (3.2)

The origin O(0, …, 0) is a saddle point of R, so gradient descent with a small initialization, e.g., the Xavier initialization, will spend a long time escaping the neighborhood of O. In addition,

M = {(w_1, …, w_L) : w_1 = w_2 = ··· = w_L ≥ 0}

is a stable manifold of O, i.e., gradient flow starting from any point in M will converge to O. The near-identity initialization introduces a perturbation to leave M: w_l(0) ∼ N(1, σ²), l = 1, …, L for some small σ. However, [18] proves that the iterate will still be attracted to the neighborhood of O, so polynomial-time convergence cannot be guaranteed. In comparison, the ZAS initialization breaks the symmetry by initializing the output layer to 0.
Figure 1 provides a numerical result for depth L = 2. The near-identity initialization (blue curve) spends a long time escaping the saddle region, while the ZAS initialization (red curve) converges to the global minimum without being attracted by the saddle point.

4 Main results

We first state and prove the continuous version of our main convergence result, i.e., the limit dynamics when η → 0. Then we give the result for discrete gradient descent, whose detailed proof is left to the appendix.

4.1 Continuous-time gradient descent

The continuous-time gradient descent dynamics is given by

Ẇ_l(t) = −∇_l R(t),    l = 1, …, L,  t ≥ 0.    (4.1)

In this section, we always write Ȧ(t) = dA(t)/dt for any variable A depending on t. For the continuous dynamics, we have the following convergence result.
Theorem 4.1 (Continuous-time gradient descent). For the deep linear network (2.3), the continuous-time gradient descent (4.1) with the zero-asymmetric initialization (3.1) satisfies

R(t) ≤ e^{−2t} R(0),    t ≥ 0,    (4.2)

for any Φ ∈ R^{d×d} and L ≥ 1.
The theorem holds for arbitrary Φ, and does not require the depth or width to be large. To prove it, we first define a group of invariant matrices as follows. Note that they also play a key role in the analysis of [2].

Definition. 
For the deep linear network (2.3), define the invariant matrices

D_l = W_{l+1}^T W_{l+1} − W_l W_l^T,    l = 1, 2, …, L − 1.    (4.3)

Lemma 4.2. The invariant matrices (4.3) are indeed invariant under continuous-time gradient descent (4.1), i.e., D_l(t) = D_l(0) for l = 1, …, L − 1 and t ≥ 0.

Proof. Recall that

Ẇ_l = −∇_l R = −W_{L:l+1}^T (W_{L:1} − Φ) W_{l−1:1}^T,

so

Ẇ_l W_l^T = −W_{L:l+1}^T (W_{L:1} − Φ) W_{l:1}^T = W_{l+1}^T Ẇ_{l+1}.

Then

Ḋ_l = d/dt [W_{l+1}^T W_{l+1} − W_l W_l^T] = [W_{l+1}^T Ẇ_{l+1} − Ẇ_l W_l^T] + [W_{l+1}^T Ẇ_{l+1} − Ẇ_l W_l^T]^T = 0.

Therefore, D_l(t) = D_l(0).

Proof of Theorem 4.1. From the ZAS initialization, D_l(t) = D_l(0) = 0 for l = 1, …, L − 2 and D_{L−1}(t) = D_{L−1}(0) = −I, i.e.,

W_l W_l^T = W_{l+1}^T W_{l+1},    l = 1, …, L − 2,
W_{L−1} W_{L−1}^T = I + W_L^T W_L.

So we have

W_{L−1:1} W_{L−1:1}^T = W_{L−1:2} (W_1 W_1^T) W_{L−1:2}^T = W_{L−1:3} (W_2 W_2^T)² W_{L−1:3}^T = ··· = (W_{L−1} W_{L−1}^T)^{L−1} = (I + W_L^T W_L)^{L−1},

and

‖∇_L R‖_F² = ‖(W_{L:1} − Φ) W_{L−1:1}^T‖_F² ≥ σ_min²(W_{L−1:1}) ‖W_{L:1} − Φ‖_F² = λ_min((I + W_L^T W_L)^{L−1}) · 2R ≥ 2R.

Then

Ṙ(t) = Σ_{l=1}^L tr(∇_l R(t)^T Ẇ_l(t)) = −Σ_{l=1}^L ‖∇_l R‖_F² ≤ −‖∇_L R‖_F² ≤ −2R.    (4.4)

Therefore, R(t) ≤ e^{−2t} R(0).

Remark. (1) For rectangular weight matrices W_l ∈ R^{d_l × d_{l−1}}, if d_l ≥ d_0 = d_x for l = 1, …, L − 1, we can always ignore the redundant nodes by initializing W_L = 0 and

W_l = [ I_{d_0}, 0 ; 0, 0 ],

and the proof of Theorem 4.1 still holds. (2) For the general square loss R̃ in (2.2) with un-whitened data X, if λ_X := λ_min(X^T X) > 0, then following a similar proof we have ‖∇_L R̃‖_F² ≥ 2 λ_X R̃, and R̃(t) ≤ e^{−2 λ_X t} R̃(0).

4.2 Discrete-time gradient descent

Now we consider the discrete-time gradient descent (2.4). The main theorem is stated below.
Theorem 4.3 (Discrete gradient descent). 
For the deep linear network (2.3) with the zero-asymmetric initialization (3.1) and discrete-time gradient descent (2.4), if the learning rate satisfies

η ≤ min{ (4L³φ⁶)^{−1}, (144L²φ⁴)^{−1} },

where φ = max{ 2‖Φ‖_F, 3L^{−1/2}, 1 }, then we have the linear convergence

R(t) ≤ (1 − η/2)^t R(0),    t = 0, 1, 2, …    (4.5)

Since the learning rate η = O(L^{−3}), the theorem indicates that gradient descent achieves R(t) ≤ ε in O(L³ log(1/ε)) iterations.

4.2.1 Overview of the proof

The following is the proof sketch; the detailed proof is deferred to the appendix.
The approach to the discrete-time result is similar to the continuous-time case. However, the matrices defined in (4.3) are not exactly invariant; they change slowly during the training process, and this change needs to be controlled carefully.
First, we propose the following three conditions, and prove that the first condition implies the other two.

Approximate invariances For the invariant matrices defined in (4.3),

‖D_l‖_2 = O(L^{−3}),  l = 1, …, L − 2,    and    ‖I + D_{L−1}‖_2 = O(L^{−2}).    (4.6)

Weight bounds For the weight matrices W_l,

‖W_l‖_2 = 1 + O(log L / L),  l = 1, …, L − 1,    and    ‖W_L‖_2 = O(L^{−1/2}).    (4.7)

Gradient bound The gradient with respect to the last layer satisfies

‖∇_L R‖_F² ≥ R.    (4.8)

Lemma 4.4. 
The approximate invariances condition (4.6) implies the weight bounds (4.7) and the gradient bound (4.8).

Second, to show that (4.6)–(4.8) always hold during the training process, we need to estimate the change of the invariant matrices D_l(t + 1) − D_l(t) and the decrease of the loss R(t + 1) − R(t) in one step.
Lemma 4.5. If the weight bounds (4.7) hold at iteration t, then the change of the invariant matrices after a one-step update with learning rate η satisfies

‖D_l(t + 1) − D_l(t)‖_2 = O(η²) R(t),  l = 1, …, L − 2,
‖D_{L−1}(t + 1) − D_{L−1}(t)‖_2 = O(η² L) R(t).    (4.9)

Lemma 4.6. If the weight bounds (4.7) and the gradient bound (4.8) hold, and the learning rate η = O(L^{−2}), then the loss function satisfies

R(t + 1) ≤ (1 − η/2) R(t).    (4.10)

With the three lemmas above, we are now ready to prove Theorem 4.3.

Proof of Theorem 4.3 (informal). We do induction on (4.5) and (4.6). Assume that they hold for 0, 1, …, t. From the three lemmas above, (4.7)–(4.10) also hold for 0, 1, …, t. So the loss function satisfies

R(t + 1) ≤ (1 − η/2) R(t) ≤ (1 − η/2)^{t+1} R(0),

i.e., (4.5) holds for t + 1. Now we have

Σ_{s=0}^t R(s) ≤ R(0) Σ_{s=0}^t (1 − η/2)^s ≤ (2/η) R(0) = O(η^{−1}).

Recall that the invariant matrices satisfy D_l(0) = 0, l = 1, …, L − 2 and I + D_{L−1}(0) = 0 at the initialization, and η = O(L^{−3}). From (4.9),

‖D_l(t + 1)‖_2 ≤ Σ_{s=0}^t ‖D_l(s + 1) − D_l(s)‖_2 = O(η²) Σ_{s=0}^t R(s) = O(η) = O(L^{−3})

for l = 1, …, L − 2. Similarly, ‖I + D_{L−1}(t + 1)‖_2 ≤ O(ηL) = O(L^{−2}), i.e., (4.6) holds for t + 1. This completes the induction.
Remark. Following the proof sketch, we can actually prove Theorem 4.3 under a "near-ZAS" initialization with perturbation: W_l(0) ∼ N(I, σ²), l = 1, . . . 
, L − 1, and W_L(0) ∼ N(0, σ²), where σ is sufficiently small such that the approximate invariances condition (4.6) holds at the initialization. Note that the constants hidden in O(·) may depend on the target matrix Φ.

5 Numerical experiments

5.1 Dependence on the depth

Theorem 4.3 theoretically shows that the number of iterations required for convergence is at most O(L³), which holds for any target matrix in R^{d×d}. The first experiment examines how this depth dependence behaves in practice.
In the experiments, we generate target matrices in two ways:

• Gaussian random matrix: Φ = (φ_ij) ∈ R^{d×d} with φ_ij independently drawn from N(0, 1). Both d = 2 and d = 100 are considered.
• Negative identity matrix: Φ = −I ∈ R^{d×d}. This target is adopted from [18], which proves that in the case d = 1, the number of iterations required for convergence under the Xavier and the near-identity initialization scales exponentially with the depth L. Both d = 1 and d = 100 are considered.

The ZAS initialization (3.1) is applied to linear neural networks of different depths L, and we manually tune the optimal learning rate for each L. As suggested by Theorem 4.3, we numerically find that the optimal learning rate decreases with L.
Figure 2 shows the number of iterations required to make the objective R ≤ ε = 10⁻¹⁰. 
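At toy scale, this experiment is easy to reproduce in spirit. The sketch below is our own illustration (d = 2 instead of the dimensions above, with the lr = 0.01 of Section 5.2); it runs gradient descent from the ZAS initialization (3.1) on the target Φ = −I and counts the iterations needed to reach R ≤ 10⁻¹⁰:

```python
import numpy as np

d, L, lr, eps = 2, 6, 0.01, 1e-10
Phi = -np.eye(d)                                             # hard target from [18]
W = [np.eye(d) for _ in range(L - 1)] + [np.zeros((d, d))]   # ZAS initialization (3.1)

def loss_and_grads(W):
    prod = np.eye(d)
    for Wl in W:
        prod = Wl @ prod                 # W_L ... W_1
    E = prod - Phi
    grads = []
    for l in range(L):
        pre = np.eye(d)                  # W_{l-1} ... W_1
        for Wk in W[:l]:
            pre = Wk @ pre
        suf = np.eye(d)                  # W_L ... W_{l+1}
        for Wk in W[l + 1:]:
            suf = Wk @ suf
        grads.append(suf.T @ E @ pre.T)  # gradient of R w.r.t. W_l
    return 0.5 * np.sum(E * E), grads

R = np.inf
for iters in range(50000):
    R, grads = loss_and_grads(W)
    if R <= eps:
        break
    W = [Wl - lr * g for Wl, g in zip(W, grads)]
```

On this instance the loop exits well before the iteration cap, consistent with the mild depth dependence reported here.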
It is clear that the number of iterations required roughly scales as O(L^γ), where γ ≈ 1/2 for the negative identity matrix and γ ≈ 1 for the Gaussian random matrices. These scalings are better than the theoretical γ = 3 in Theorem 4.3, which is a worst-case result.

5.2 Comparison with near-identity initialization in multi-dimensional cases

The near-identity initialization initializes each layer by

W_l = I + U_l,    (U_l)_{ij} ∼ N(0, 1/(dL)) i.i.d.,    l = 1, …, L,    (5.1)

where I is the identity matrix. Numerically, it was observed in [18] that for multi-dimensional networks (d = 25 in those experiments), gradient descent with the initialization (5.1) requires a number of iterations that scales only polynomially with the depth, instead of exponentially. Here we compare it with the ZAS initialization by fitting the negative identity matrix with 6-layer linear networks. The learning rate is η = 0.01 for both initializations.
Figure 3 shows the training trajectories for both initializations. It strongly suggests that the ZAS initialization is more efficient than the near-identity initialization (5.1). Gradient descent with the near-identity initialization is attracted to a saddle region, spends a long time escaping that region, and then converges fast to a global minimum. In comparison, gradient descent with the ZAS initialization does not encounter any saddle region during the whole optimization process.

Figure 2: Number of iterations required for the ZAS initialization to reach an ε-optimal solution with ε = 10⁻¹⁰. Two types of target matrices, the negative identity and Gaussian random matrices, are considered. It is shown that the number of iterations required scales polynomially with the network depth.

Figure 3: Comparison between the ZAS and the near-identity initialization. 
The 5 dashed lines correspond to multiple runs of gradient descent with the near-identity initialization. It is shown that GD with the near-identity initialization successfully escapes the saddle region in only 2 of the 5 runs within the given number of iterations, while ZAS does not suffer from the attraction of the saddle point at all.

6 An extension to nonlinear residual networks

Consider the following residual network f : R^d → R^{d′}:

z_0 = V_0 x,
z_l = z_{l−1} + U_l σ(V_l z_{l−1}),    l = 1, …, L,
f(x) = U_{L+1} z_L,    (6.1)

where V_0 ∈ R^{D×d}, U_l ∈ R^{D×m}, V_l ∈ R^{m×D} and U_{L+1} ∈ R^{d′×D}; d is the input dimension, d′ is the output dimension, m is the width of the residual blocks, and D is the width of the skip connections.
For the nonlinear residual network (6.1), we propose the following modified ZAS (mZAS) initialization:

U_l = 0,  l = 1, 2, …, L + 1,    (V_l)_{ij} ∼ N(0, 1/D) i.i.d.,  l = 0, 1, …, L.    (6.2)

We test two types of initialization: (1) the standard Xavier initialization; (2) the mZAS initialization (6.2). The experiments are conducted on Fashion-MNIST [20], where we select 1000 training samples to form a new training set in order to speed up the computation. Depths L = 100, 200, 2000, 10000 are tested, and the learning rate for each depth is tuned to achieve the fastest convergence. The results are displayed in Figure 4.
It is shown that the mZAS initialization always outperforms the Xavier initialization. 
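A defining property of (6.2), and a quick sanity check on any implementation, is that the network function is identically zero at initialization, exactly as in the linear ZAS case: every residual branch and the output layer are zeroed, so z_l = V_0 x for all l and f(x) = 0. A minimal NumPy sketch (our own, with arbitrary widths and ReLU as σ):

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_out, D, m, L = 8, 3, 16, 4, 50   # arbitrary sizes for illustration

# mZAS initialization (6.2)
V0 = rng.standard_normal((D, d)) * np.sqrt(1.0 / D)
V = [rng.standard_normal((m, D)) * np.sqrt(1.0 / D) for _ in range(L)]
U = [np.zeros((D, m)) for _ in range(L)]   # residual branches zeroed
U_out = np.zeros((d_out, D))               # output layer zeroed

def f(x):
    z = V0 @ x
    for Ul, Vl in zip(U, V):
        z = z + Ul @ np.maximum(Vl @ z, 0.0)   # z_l = z_{l-1} + U_l sigma(V_l z_{l-1})
    return U_out @ z

x = rng.standard_normal(d)
out = f(x)   # exactly the zero vector, for any input and any depth
```

Training then grows each residual branch away from zero, in the same spirit as the fixup initialization [22].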
Moreover, gradient descent with the mZAS initialization is even able to successfully optimize a 10000-layer residual network. This clearly demonstrates that ZAS-type initializations can be helpful for optimizing deep nonlinear residual networks.

Figure 4: For the nonlinear residual network and the Fashion-MNIST dataset, the mZAS initialization outperforms the Xavier initialization; the latter blows up for depths L = 2000, 10000. The learning rates are tuned to achieve the fastest convergence.

7 Conclusion

In this paper we propose the ZAS initialization for deep linear residual networks, under which gradient descent converges to global minima at a linear rate for arbitrary target matrices. Moreover, the rate scales only polynomially with the network depth. Numerical experiments show that the ZAS initialization indeed avoids the attraction of saddle points, compared to the near-identity initialization. This type of initialization may be extended to the analysis of deep nonlinear residual networks, which we leave as future work.

Acknowledgments

We are grateful to Prof. Weinan E for helpful discussions, and to the anonymous reviewers for valuable comments and suggestions. This work is supported in part by a gift to Princeton University from iFlytek and by ONR grant N00014-13-1-0338.

References
[1] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pages 242–252, 2019.

[2] Sanjeev Arora, Nadav Cohen, Noah Golowich, and Wei Hu. A convergence analysis of gradient descent for deep linear neural networks. 
In International Conference on Learning\nRepresentations, 2019.\n\n[3] Pierre Baldi and Kurt Hornik. Neural networks and principal component analysis: Learning\n\nfrom examples without local minima. Neural networks, 2(1):53\u201358, 1989.\n\n[4] Peter Bartlett, Dave Helmbold, and Phil Long. Gradient descent with identity initialization\nIn International Conference on\n\nef\ufb01ciently learns positive de\ufb01nite linear transformations.\nMachine Learning, pages 520\u2013529, 2018.\n\n[5] Simon S. Du and Wei Hu. Width provably matters in optimization for deep linear neural\n\nnetworks. arXiv preprint arXiv:1901.08572, 2019.\n\n[6] Simon S. Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent \ufb01nds\n\nglobal minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018.\n\n[7] Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably\nIn International Conference on Learning\n\noptimizes over-parameterized neural networks.\nRepresentations, 2019.\n\n[8] Weinan E, Chao Ma, Qingcan Wang, and Lei Wu. Analysis of the gradient descent algorithm for\na deep neural network model with skip-connections. arXiv preprint arXiv:1904.05263, 2019.\n\n9\n\n020406080100Numberofiterations(\u00d7100)20406080100TrainingAccuracy(%)L=100,lr=1e-1,mZASL=200,lr=1e-1,mZASL=2000,lr=2e-2,mZASL=10000,lr=2e-3,mZASL=100,lr=1e-3,XavierL=200,lr=1e-6,Xavier\f[9] Xavier Glorot and Yoshua Bengio. Understanding the dif\ufb01culty of training deep feedforward\n\nneural networks. In Aistats, volume 9, pages 249\u2013256, 2010.\n\n[10] Moritz Hardt and Tengyu Ma. Identity matters in deep learning. In International Conference on\n\nLearning Representations, 2017.\n\n[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for im-\nage recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern\nRecognition, pages 770\u2013778, 2016.\n\n[12] Ziwei Ji and Matus Telgarsky. 
Gradient descent aligns the layers of deep linear networks. arXiv\n\npreprint arXiv:1810.02032, 2018.\n\n[13] Kenji Kawaguchi. Deep learning without poor local minima. In Advances In Neural Information\n\nProcessing Systems, pages 586\u2013594, 2016.\n\n[14] Thomas Laurent and James Brecht. Deep linear networks with arbitrary loss: All local minima\n\nare global. In International Conference on Machine Learning, pages 2908\u20132913, 2018.\n\n[15] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean \ufb01eld view of the landscape of\ntwo-layers neural networks. In Proceedings of the National Academy of Sciences, volume 115,\npages E7665\u2013E7671, 2018.\n\n[16] Grant M. Rotskoff and Eric Vanden-Eijnden. Neural networks as interacting particle systems:\nAsymptotic convexity of the loss landscape and universal scaling of the approximation error.\narXiv preprint arXiv:1805.00915, 2018.\n\n[17] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear\n\ndynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.\n\n[18] Ohad Shamir. Exponential convergence time of gradient descent for one-dimensional deep\n\nlinear neural networks. arXiv preprint arXiv:1809.08587, 2018.\n\n[19] Eugene Vorontsov, Chiheb Trabelsi, Samuel Kadoury, and Chris Pal. On orthogonality and\nlearning recurrent networks with long term dependencies. In International Conference on\nMachine Learning, pages 3570\u20133578, 2017.\n\n[20] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: A novel image dataset for\n\nbenchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.\n\n[21] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding\ndeep learning requires rethinking generalization. In International Conference on Learning\nRepresentations, 2017.\n\n[22] Hongyi Zhang, Yann N. Dauphin, and Tengyu Ma. Fixup initialization: Residual learning\n\nwithout normalization. 
In International Conference on Learning Representations, 2019.\n\n[23] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes\n\nover-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888, 2018.\n\n10\n\n\f", "award": [], "sourceid": 7344, "authors": [{"given_name": "Lei", "family_name": "Wu", "institution": "Princeton University"}, {"given_name": "Qingcan", "family_name": "Wang", "institution": "Program in Applied and Computational Mathematics, Princeton University"}, {"given_name": "Chao", "family_name": "Ma", "institution": "Princeton University"}]}