{"title": "Step Size Matters in Deep Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3436, "page_last": 3444, "abstract": "Training a neural network with the gradient descent algorithm gives rise to a discrete-time nonlinear dynamical system. Consequently, behaviors that are typically observed in these systems emerge during training, such as convergence to an orbit but not to a fixed point or dependence of convergence on the initialization. Step size of the algorithm plays a critical role in these behaviors: it determines the subset of the local optima that the algorithm can converge to, and it specifies the magnitude of the oscillations if the algorithm converges to an orbit. To elucidate the effects of the step size on training of neural networks, we study the gradient descent algorithm as a discrete-time dynamical system, and by analyzing the Lyapunov stability of different solutions, we show the relationship between the step size of the algorithm and the solutions that can be obtained with this algorithm. The results provide an explanation for several phenomena observed in practice, including the deterioration in the training error with increased depth, the hardness of estimating linear mappings with large singular values, and the distinct performance of deep residual networks.", "full_text": "Step Size Matters in Deep Learning\n\nKamil Nar\n\nS. Shankar Sastry\n\nElectrical Engineering and Computer Sciences\n\nUniversity of California, Berkeley\n\nAbstract\n\nTraining a neural network with the gradient descent algorithm gives rise to a\ndiscrete-time nonlinear dynamical system. 
Consequently, behaviors that are typically observed in these systems emerge during training, such as convergence to an orbit but not to a fixed point or dependence of convergence on the initialization. The step size of the algorithm plays a critical role in these behaviors: it determines the subset of the local optima that the algorithm can converge to, and it specifies the magnitude of the oscillations if the algorithm converges to an orbit. To elucidate the effects of the step size on training of neural networks, we study the gradient descent algorithm as a discrete-time dynamical system, and by analyzing the Lyapunov stability of different solutions, we show the relationship between the step size of the algorithm and the solutions that can be obtained with this algorithm. The results provide an explanation for several phenomena observed in practice, including the deterioration in the training error with increased depth, the hardness of estimating linear mappings with large singular values, and the distinct performance of deep residual networks.\n\n1 Introduction\n\nWhen the gradient descent algorithm is used to minimize a function, say f : ℝ^n → ℝ, it leads to a discrete-time dynamical system:\n\nx[k + 1] = x[k] − γ∇f(x[k]),    (1)\n\nwhere x[k] is the state of the system, which consists of the parameters updated by the algorithm, and γ is the step size, or the learning rate, of the algorithm. Every fixed point of the system (1) is called an equilibrium of the system, and the equilibria correspond to the critical points of the function f.\nUnless f is a quadratic function of the parameters, the system described by (1) is either a nonlinear system or a hybrid system that switches from one dynamics to another over time. 
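Even the simplest objective makes the dynamical-system view of (1) concrete. For a quadratic f(x) = (λ/2)x², the update is the linear map x[k + 1] = (1 − γλ)x[k], which contracts to the origin exactly when γ < 2/λ. A minimal sketch of this threshold; the quadratic and the step sizes are illustrative choices, not from the paper:

```python
# Gradient descent (1) on f(x) = (lam/2) * x**2. The update is linear,
# x[k+1] = (1 - gamma * lam) * x[k], so the origin is asymptotically stable
# iff |1 - gamma * lam| < 1, i.e. iff gamma < 2 / lam.

def gd_quadratic(x0, gamma, lam, steps):
    x = x0
    for _ in range(steps):
        x = x - gamma * lam * x  # f'(x) = lam * x
    return x

lam = 1.0
print(abs(gd_quadratic(1.0, 0.5, lam, 100)))  # gamma < 2/lam: contracts toward 0
print(abs(gd_quadratic(1.0, 2.5, lam, 100)))  # gamma > 2/lam: |x[k]| grows without bound
```

For a non-quadratic f the same map becomes nonlinear, which is what produces the orbits and initialization dependence discussed next.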
Consequently, the system (1) can exhibit behaviors that are typically observed in nonlinear and hybrid systems, such as convergence to an orbit but not to a fixed point, or dependence of convergence on the equilibria and the initialization. The step size of the algorithm has a critical effect on these behaviors, as shown in the following examples.\nExample 1. Convergence to a periodic orbit: Consider the continuously differentiable and convex function f_1(x) = (2/3)|x|^{3/2}, which has a unique local minimum at the origin. The gradient descent algorithm on this function yields\n\nx[k + 1] = x[k] − γ√(x[k]) if x[k] ≥ 0, and x[k + 1] = x[k] + γ√(−x[k]) if x[k] < 0.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nAs expected, the origin is the only equilibrium of this system. Interestingly, however, x[k] converges to the origin only when the initial state x[0] belongs to a countable set S:\n\nS = {0, γ², −γ², ((3 + √5)/2)γ², −((3 + √5)/2)γ², . . .}.\n\nFor all other initializations, x[k] converges to an oscillation between γ²/4 and −γ²/4. This implies that, if the initial state x[0] is randomly drawn from a continuous distribution, then almost surely x[k] does not converge to the origin, yet |x[k]| converges to γ²/4. In other words, with probability 1, the state x[k] does not converge to a fixed point, such as a local optimum or a saddle point, even though the estimation error converges to a finite non-optimal value.\nExample 2. Dependence of convergence on the equilibrium: Consider the nonconvex function f_2(x) = (x² + 1)(x − 1)²(x − 2)², which has two local minima at x = 1 and x = 2 as shown in Figure 1. Note that these local minima are also two of the isolated equilibria of the dynamical system created by the gradient descent algorithm. The stability of these equilibria in the sense of Lyapunov is determined by the step size of the algorithm. 
In particular, since the smoothness parameter of f_2 around these equilibria is 4 and 10, they are stable only if the step size is smaller than 0.5 and 0.2, respectively, and the gradient descent algorithm can converge to them only when these conditions are satisfied. Due to the difference in the largest step size allowed for different equilibria, the step size conveys information about the solution that can be obtained by the gradient descent algorithm. For example, if the algorithm converges to an equilibrium with step size 0.3 from a point drawn randomly from a continuous distribution, then this equilibrium is almost surely x = 1.\n\nFigure 1: The function f_2(x) = (x² + 1)(x − 1)²(x − 2)² of Example 2. Since the smoothness parameter of f_2 at x = 1 is smaller than that at x = 2, the gradient descent algorithm cannot converge to x = 2 but can converge to x = 1 for some values of the step size. If, for example, the algorithm converges to an equilibrium from a randomly chosen initial point with step size 0.3, then this equilibrium is almost surely x = 1.\n\nExample 3. Dependence of convergence on the initialization: Consider the function f_3(x) = x^L, where L ∈ ℕ is an even number larger than 2. The gradient descent results in the system\n\nx[k + 1] = x[k] − γLx[k]^{L−1}.\n\nThe state x[k] converges to the origin if the initial state satisfies x[0]^{L−2} < 2/(γL), and x[k] diverges if x[0]^{L−2} > 2/(γL).\nThese three examples demonstrate:\n\n1. the convergence of the training error does not imply the convergence of the algorithm to a local optimum or a saddle point,\n\n2. the step size determines the magnitude of the oscillations if the algorithm converges to an orbit but not to a fixed point,\n\n3. the step size restricts the set of local optima that the algorithm can converge to,\n4. 
the step size influences the convergence of the algorithm differently for each initialization.\n\nNote that these are direct consequences of the nonlinear dynamics of the gradient descent algorithm and not of the (non)convexity of the function to be minimized. While both of the functions in Example 1 and Example 3 are convex, identical behaviors are observed during the minimization of the nonconvex training cost functions of neural networks as well.\n\n1.1 Our contributions\n\nIn this paper, we study the gradient descent algorithm as a discrete-time dynamical system during the training of deep neural networks, and we show the relationship between the step size of the algorithm and the solutions that can be obtained with this algorithm. In particular, we achieve the following:\n\n1. We analyze the Lyapunov stability of the gradient descent algorithm on deep linear networks and find different upper bounds on the step size that enable convergence to each solution. We show that for every step size, the algorithm can converge to only a subset of the local optima, and there are always some local optima that the algorithm cannot converge to, independent of the initialization.\n\n2. We establish that for deep linear networks, there is a direct connection between the smoothness parameter of the training loss function and the largest singular value of the estimated linear function. In particular, we show that if the gradient descent algorithm can converge to a solution with a large step size, the function estimated by the network must have small singular values, and hence, the estimated function must have a small Lipschitz constant.\n\n3. We show that symmetric positive definite matrices can be estimated with a deep linear network by initializing the weight matrices as the identity, and this initialization allows the use of the largest step size. 
Conversely, the algorithm is most likely to converge for an arbitrarily chosen step size if the weight matrices are initialized as the identity.\n\n4. We show that symmetric matrices with negative eigenvalues, on the other hand, cannot be estimated with the identity initialization, and the gradient descent algorithm converges to the closest positive semidefinite matrix in Frobenius norm.\n\n5. For 2-layer neural networks with ReLU activations, we obtain an explicit relationship between the step size of the gradient descent algorithm and the output of the solution that the algorithm can converge to.\n\n1.2 Related work\n\nIt is a well-known problem that the gradient of the training cost function can become disproportionate for different parameters when training a neural network. Several works in the literature have tried to address this problem. For example, changing the geometry of optimization was proposed in (Neyshabur et al., 2017), and a regularized descent algorithm was proposed to prevent the gradients from exploding and vanishing during training.\nDeep residual networks, which are a specific class of neural networks, yielded exceptional results in practice with their peculiar structure (He et al., 2016). By keeping each layer of the network close to the identity function, these networks were able to attain lower training and test errors as the depth of the network was increased. To explain their distinct behavior, the training cost function of their linear versions was shown to possess some crucial properties (Hardt & Ma, 2016). Later, equivalent results were also derived for nonlinear residual networks under certain conditions (Bartlett et al., 2018a).\nThe effect of the step size on training neural networks was empirically investigated in (Daniel et al., 2016). A step size adaptation scheme was proposed in (Rolinek & Martius, 2018) for the stochastic gradient method and shown to outperform training with a constant step size. 
Similarly, some heuristic methods with variable step size were introduced and tested empirically in (Magoulas et al., 1997) and (Jacobs, 1988).\nTwo-layer linear networks were first studied in (Baldi & Hornik, 1989). The analysis was extended to deep linear networks in (Kawaguchi, 2016), and it was shown that all local optima of these networks are also global optima. It was discovered in (Hardt & Ma, 2016) that the only critical points of these networks are actually the global optima as long as all layers remain close to the identity function during training. The dynamics of training these networks were also analyzed in (Saxe et al., 2013) and (Gunasekar et al., 2017) by assuming an infinitesimal step size and using a continuous-time approximation to the dynamics.\nLyapunov analysis from dynamical system theory (Khalil, 2002; Sastry, 1999), which is the main tool for our results in this work, was used in the past to understand and improve the training of neural networks – especially that of recurrent neural networks (Michel et al., 1988; Matsuoka, 1992; Barabanov & Prokhorov, 2002). State-of-the-art feedforward networks, however, have not been analyzed from this perspective.\nWe summarize the major differences between our contributions and the previous works as follows:\n\n1. We relate the vanishing and exploding gradients that arise during training of feedforward networks to the Lyapunov stability of the gradient descent algorithm.\n\n2. Unlike the continuous-time analyses given in (Saxe et al., 2013) and (Gunasekar et al., 2017), we study the discrete-time dynamics of the gradient descent with an emphasis on the step size. By doing so, we obtain upper bounds on the step size to be used, and we show that the step size restricts the set of local optima that the algorithm can converge to. Note that these results cannot be obtained with a continuous-time approximation.\n\n3. 
For deep linear networks with residual structure, (Hardt & Ma, 2016) shows that the gradient of the cost function cannot vanish away from a global optimum. This is not enough, however, to suggest fast convergence of the algorithm. Given a fixed step size, the algorithm may also converge to an oscillation around a local optimum, as in the case of Example 1. We rule out this possibility and provide a step size so that the algorithm converges to a global optimum with a linear rate.\n\n4. We recently found out that the convergence of the gradient descent algorithm was also studied in (Bartlett et al., 2018b) for symmetric positive definite matrices, independently of and concurrently with our preliminary work (Nar & Sastry, 2018). However, unlike (Bartlett et al., 2018b), we give an explicit step size value for the algorithm to converge with a linear rate, and we emphasize the fact that the identity initialization allows convergence with the largest step size.\n\n2 Upper bounds on the step size for training deep linear networks\n\nDeep linear networks are a special class of neural networks that do not contain nonlinear activations. They represent a linear mapping and can be described by a multiplication of a set of matrices, namely W_L ⋯ W_1, where W_i ∈ ℝ^{n_i × n_{i−1}} for each i ∈ [L] := {1, 2, . . . , L}. Due to the multiplication of different parameters, their training cost is never a quadratic function of the parameters, and therefore, the dynamics of the gradient descent algorithm are always nonlinear during training of these networks. For this reason, they provide a simple model to study some of the nonlinear behaviors observed during training of neural networks.\nGiven a cost function ℓ(W_L ⋯ W_1), if the point {Ŵ_i}_{i∈[L]} is a local minimum, then {α_i Ŵ_i}_{i∈[L]} is also a local minimum for every set of scalars {α_i}_{i∈[L]} that satisfies α_1 α_2 ⋯ α_L = 1. 
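This rescaling invariance is easy to check numerically: multiplying the layers by scalars whose product is 1 leaves W_L ⋯ W_1, and hence the cost, unchanged, while the gradients (and thus the local smoothness seen by gradient descent) change. A small sketch with numpy for L = 2; the matrices and the scalar α are arbitrary illustrative choices, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.standard_normal((3, 3))                  # target linear mapping
W1 = rng.standard_normal((3, 3))
W2 = rng.standard_normal((3, 3))

def loss(W2, W1):
    # with whitened inputs the squared error reduces to a Frobenius norm
    return 0.5 * np.linalg.norm(W2 @ W1 - R, "fro") ** 2

def grad_W1(W2, W1):
    return W2.T @ (W2 @ W1 - R)                  # d loss / d W1

alpha = 10.0                                      # rescaling with alpha * (1/alpha) = 1
same_cost = np.isclose(loss(W2, W1), loss(W2 / alpha, alpha * W1))
g_orig = np.linalg.norm(grad_W1(W2, W1))
g_scaled = np.linalg.norm(grad_W1(W2 / alpha, alpha * W1))
print(same_cost)          # True: the two parameter points attain the same cost
print(g_orig, g_scaled)   # gradient w.r.t. W1 shrinks exactly by the factor alpha
```

The two points are equivalent minima of the cost, yet the curvature around them differs, which is why the largest usable step size depends on which of them the algorithm approaches.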
Consequently, independent of the specific choice of ℓ, the training cost function has infinitely many local optima, none of these local optima is isolated in the parameter space, and the cost function is not strongly convex at any point in the parameter space.\nAlthough multiple local optima attain the same training cost for deep linear networks, the dynamics of the gradient descent algorithm exhibit distinct behaviors around these points. In particular, the step size required to render each of these local optima stable in the sense of Lyapunov is very different. Since the Lyapunov stability of a point is a necessary condition for the convergence of the algorithm to that point, the step size that allows convergence to each solution is also different, which is formalized in Theorem 1.\nTheorem 1. Given a nonzero matrix R ∈ ℝ^{n_L × n_0} and a set of points {x_i}_{i∈[N]} in ℝ^{n_0} that satisfy (1/N) Σ_{i=1}^N x_i x_i^T = I, assume that R is estimated as a multiplication of the matrices {W_j}_{j∈[L]} by minimizing the squared error loss\n\n(1/(2N)) Σ_{i=1}^N ‖Rx_i − W_L W_{L−1} . . . W_2 W_1 x_i‖₂²,    (2)\n\nwhere W_j ∈ ℝ^{n_j × n_{j−1}} for all j ∈ [L]. Then the gradient descent algorithm with random¹ initialization can converge to a solution {Ŵ_j}_{j∈[L]} only if the step size γ satisfies\n\nγ ≤ 2 / (Σ_{j=1}^L ‖p_{j−1}‖₂² ‖q_{j+1}‖₂²),    (3)\n\nwhere\n\np_j = Ŵ_j ⋯ Ŵ_2 Ŵ_1 v and q_j^T = u^T Ŵ_L Ŵ_{L−1} ⋯ Ŵ_j for all j ∈ [L], with p_0 = v and q_{L+1} = u,\n\nand u and v are the left and right singular vectors of R̂ = Ŵ_L ⋯ Ŵ_1 corresponding to its largest singular value.\nConsidering all the solutions {α_i Ŵ_i}_{i∈[L]} that satisfy α_1 α_2 ⋯ α_L = 1, the bound in (3) can be arbitrarily small for some of the local optima. 
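For a scalar two-layer network the necessary condition is transparent: with L = 2, unit singular vectors, and the conventions p_0 = v and q_{L+1} = u, the bound (3) reduces to γ ≤ 2/(ŵ₁² + ŵ₂²), so among the minima with ŵ₂ŵ₁ = r the balanced factorizations tolerate the largest step size. A sketch of this effect (the target r = 1, the initializations, and the step size are illustrative choices, not from the paper):

```python
# Gradient descent on l(w1, w2) = 0.5 * (w2 * w1 - 1)**2, a scalar deep
# linear network with L = 2. Every pair with w2 * w1 = 1 is a global
# minimum, but it is Lyapunov stable only if gamma <= 2 / (w1**2 + w2**2).

def run(w1, w2, gamma, steps=500):
    traj = [(w1, w2)]
    for _ in range(steps):
        e = w2 * w1 - 1.0
        w1, w2 = w1 - gamma * e * w2, w2 - gamma * e * w1
        if abs(w1) > 1e6 or abs(w2) > 1e6:   # stop if the iterates blow up
            break
        traj.append((w1, w2))
    return traj

gamma = 0.3
# Balanced minimum (1, 1): stable, since 2 / (1 + 1) = 1 > gamma.
w1, w2 = run(1.01, 1.0, gamma)[-1]
converged = abs(w1 * w2 - 1.0) < 1e-8
# Unbalanced minimum (3, 1/3): needs gamma <= 2 / (9 + 1/9) ~ 0.22 < 0.3.
traj = run(3.01, 1.0 / 3.0, gamma)
escaped = any(abs(a - 3.0) > 0.5 or abs(b - 1.0 / 3.0) > 0.5 for a, b in traj)
print(converged, escaped)   # same step size: one minimum attracts, the other repels
```

With γ = 0.3 the iterate started next to the balanced minimum settles on the solution set, while the iterate started next to the unbalanced minimum is pushed away from it.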
Therefore, given a fixed step size γ, the gradient descent algorithm can converge to only a subset of the local optima, and there are always some solutions that the algorithm cannot converge to, independent of the initialization.\nRemark 1. Theorem 1 provides a necessary condition for convergence to a specific solution. It rules out the possibility of converging to a large subset of the local optima; however, it does not state that, given a step size γ, the algorithm converges to a solution which satisfies (3). It might be the case, for example, that the algorithm converges to an oscillation around a local optimum which violates (3) even though there are some other local optima which satisfy (3).\nAs a necessary condition for convergence to a global optimum, we can also find an upper bound on the step size independent of the weight matrices of the solution, which is given next.\nCorollary 1. For the minimization problem in Theorem 1, the gradient descent algorithm with random initialization can converge to a global optimum only if the step size satisfies\n\nγ ≤ 2 / (L ρ(R)^{2(L−1)/L}),    (4)\n\nwhere ρ(R) is the largest singular value of R.\nRemark 2. Corollary 1 shows that, unlike in the optimization of the ordinary least squares problem, the step size required for the convergence of the algorithm depends on the parameter to be estimated, R. Consequently, estimating linear mappings with larger singular values requires the use of a smaller step size. Conversely, the step size conveys information about the solution obtained if the algorithm converges. That is, if the algorithm has converged with a large step size, then the Lipschitz constant of the function estimated must be small.\nCorollary 2. Assume that the gradient descent algorithm with random initialization has converged to a local optimum R̂ = Ŵ_L . . . Ŵ_1 for the minimization problem in Theorem 1. 
Then the largest singular value of R̂ almost surely satisfies\n\nρ(R̂) ≤ (2/(γL))^{L/(2L−2)}.\n\nThe smoothness parameter of the training cost function is directly related to the largest step size that can be used, and consequently, to the Lyapunov stability of the gradient descent algorithm. The denominators of the upper bounds (3) and (4) in Theorem 1 and Corollary 1 necessarily provide a lower bound for the smoothness parameter of the training cost function around the corresponding local optima. As a result, Theorem 1 implies that there is no finite Lipschitz constant for the gradient of the training cost function over the whole parameter space.\n\n3 Identity initialization allows the largest step size for estimating symmetric positive definite matrices\n\nCorollary 1 provides only a necessary condition for the convergence of the gradient descent algorithm, and the bound (4) is not tight for every estimation problem. However, if the matrix to be estimated is symmetric and positive definite, the algorithm can converge to a solution with step sizes close to (4), which requires a specific initialization of the weight parameters.\n\n¹The random distribution must be continuous and assign zero probability to every set with measure zero.\n\nTheorem 2. Assume that R ∈ ℝ^{n × n} is a symmetric positive semidefinite matrix, and given a set of points {x_i}_{i∈[N]} which satisfy (1/N) Σ_{i=1}^N x_i x_i^T = I, the matrix R is estimated as a multiplication of the square matrices {W_j}_{j∈[L]} by minimizing\n\n(1/(2N)) Σ_{i=1}^N ‖Rx_i − W_L . . . W_1 x_i‖₂².\n\nIf the weight parameters are initialized as W_i[0] = I for all i ∈ [L] and the step size satisfies\n\nγ ≤ min{ 1/L, 1/(L ρ(R)^{2(L−1)/L}) },\n\nthen each W_i converges to R^{1/L} with a linear rate.\nRemark 3. 
Theorem 2 shows that the algorithm converges to a global optimum despite the nonconvexity of the optimization, and it provides a case where the bound (4) is almost tight. The tightness of the bound implies that for the same step size, most of the other global optima are unstable in the sense of Lyapunov, and therefore, the algorithm cannot converge to them independent of the initialization. Consequently, using identity initialization allows convergence to a solution which is most likely to be stable for an arbitrarily chosen step size.\nRemark 4. Given that the identity initialization on deep linear networks is equivalent to the zero initialization of linear residual networks (Hardt & Ma, 2016), Theorem 2 provides an alternative explanation for the exceptional performance of deep residual networks as well (He et al., 2016).\nWhen the matrix to be estimated is symmetric but not positive semidefinite, the bound (4) is still tight for some of the global optima. In this case, however, the eigenvalues of the estimate cannot attain negative values if the weight matrices are initialized with the identity.\nTheorem 3. Let R ∈ ℝ^{n × n} in Theorem 2 be a symmetric matrix such that the minimum eigenvalue of R, λ_min(R), is negative. If the weight parameters are initialized as W_i[0] = I for all i ∈ [L] and the step size satisfies\n\nγ ≤ min{ 1/(1 − λ_min(R)), 1/L, 1/(L ρ(R)^{2(L−1)/L}) },\n\nthen the estimate R̂ = Ŵ_L ⋯ Ŵ_1 converges to the closest positive semidefinite matrix to R in Frobenius norm.\nFrom the analysis of symmetric matrices, we observe that the step size required for convergence to a global optimum is largest when the singular vector of R corresponding to its largest singular value is amplified or attenuated equally at each layer of the network. 
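Theorem 2 can be checked directly on a small instance. With whitened data the loss reduces to a Frobenius-norm objective in W_L ⋯ W_1, and with identity initialization and a diagonal positive definite target every layer stays diagonal throughout training. The target, depth, and step size below are illustrative choices (γ = 0.05 lies within the stated bound for ρ(R) = 2 and L = 3), not values from the paper:

```python
import numpy as np

R = np.diag([2.0, 0.5])             # symmetric positive definite target
L = 3                               # number of layers
gamma = 0.05                        # within the step-size bound of Theorem 2
W = [np.eye(2) for _ in range(L)]   # identity initialization

def prod(mats):                     # W_L ... W_1 for mats = [W_1, ..., W_L]
    P = np.eye(2)
    for M in mats:
        P = M @ P
    return P

for _ in range(3000):
    E = prod(W) - R                 # error in the current estimate
    # gradient of 0.5 * ||W_L...W_1 - R||_F^2 with respect to W_i is
    # (W_L...W_{i+1})^T E (W_{i-1}...W_1)^T, using the pre-update weights
    lefts = [prod([np.eye(2)] + W[i + 1:]) for i in range(L)]
    rights = [prod(W[:i]) for i in range(L)]
    W = [Wi - gamma * lefts[i].T @ E @ rights[i].T for i, Wi in enumerate(W)]

root = np.diag([2.0 ** (1 / 3), 0.5 ** (1 / 3)])  # R^(1/3)
print(np.allclose(prod(W), R, atol=1e-8))         # the product reaches R
print(np.allclose(W[0], root, atol=1e-6))         # each layer reaches R^(1/3)
```

Each layer, not only the product, converges to the matrix L-th root of R, i.e. to the balanced solution that tolerates the largest step size.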
If the initial weight matrices affect this vector in opposite ways, i.e., if some of the layers attenuate this vector and the others amplify it, then the required step size for convergence could be very small.\n\n4 Effect of step size on training two-layer networks with ReLU activations\n\nIn Section 2, we analyzed the relationship between the step size of the gradient descent algorithm and the solutions that can be obtained by training deep linear networks. A similar relationship exists for nonlinear networks as well. The following theorem, for example, provides an upper bound on the step size for the convergence of the algorithm when the network has two layers and ReLU activations.\nTheorem 4. Given a set of points {x_i}_{i∈[N]} in ℝ^n, let a function f : ℝ^n → ℝ^m be estimated by a two-layer neural network with ReLU activations by minimizing the squared error loss\n\nmin_{W,V} (1/2) Σ_{i=1}^N ‖W g(V x_i − b) − f(x_i)‖₂²,\n\nwhere g(·) is the ReLU function, b ∈ ℝ^r is the fixed bias vector, and the optimization is only over the weight parameters W ∈ ℝ^{m × r} and V ∈ ℝ^{r × n}. If the gradient descent algorithm with random initialization converges to a solution (Ŵ, V̂), then the estimate f̂(x) = Ŵ g(V̂ x − b) almost surely satisfies\n\nmax_{i∈[N]} ‖x_i‖₂ ‖f̂(x_i)‖₂ ≤ 1/γ.\n\nTheorem 4 shows that if the algorithm is able to converge with a large step size, then the estimate f̂(x) must have a small magnitude for large values of ‖x‖.\nSimilar to Corollary 1, the bound given by Theorem 4 is not necessarily tight. Nevertheless, it highlights the effect of the step size on the convergence of the algorithm. To demonstrate that small changes in the step size could lead to significantly different solutions, we generated a piecewise continuous function f : [0, 1] → 
ℝ and estimated it with a two-layer network by minimizing\n\nΣ_{i=1}^N |W g(V x_i − b) − f(x_i)|²\n\nwith two different step sizes γ ∈ {2 · 10⁻⁴, 3 · 10⁻⁴}, where W ∈ ℝ^{1 × 20}, V ∈ ℝ^{20}, b ∈ ℝ^{20}, N = 1000 and x_i = i/N for all i ∈ [N]. The initial values of W, V and the constant vector b were all drawn from independent standard normal distributions, and the vector b was kept the same for both of the step sizes used. As shown in Figure 2, training with γ = 2 · 10⁻⁴ converged to a fixed solution, which provided an estimate f̂ close to the original function f. In contrast, training with γ = 3 · 10⁻⁴ converged to an oscillation and not to a fixed point. That is, after sufficient training, the estimate kept switching between f̂_odd and f̂_even at each iteration of the gradient descent algorithm.²\n\nFigure 2: Estimates of the function f obtained by training a two-layer neural network with two different step sizes. [Left] When the step size of the gradient descent algorithm is γ = 2 · 10⁻⁴, the algorithm converges to a fixed point, which provides an estimate f̂ close to f. [Right] When the step size is γ = 3 · 10⁻⁴, the algorithm converges to an oscillation and not to a fixed solution. That is, after sufficient training, the estimate keeps switching between f̂_odd and f̂_even at each iteration.\n\n5 Discussion\n\nWhen the gradient descent algorithm is used to minimize a function, typically only three possibilities are considered: convergence to a local optimum, to a global optimum, or to a saddle point. In this work, we considered the fourth possibility: the algorithm may not converge at all – even in the deterministic setting. The training error may not reflect the oscillations in the dynamics, or when a stochastic optimization method is used, the oscillations in the training error might be wrongly attributed to the stochasticity of the algorithm. 
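Example 1 already produces exactly this picture: on the period-two orbit the loss f_1 takes the same value at γ²/4 and −γ²/4, so the recorded training error settles at a constant non-optimal value while the iterate itself keeps oscillating. A minimal simulation (the step size and the initial point are arbitrary choices):

```python
# Gradient descent on f1(x) = (2/3) * |x|**1.5 from Example 1.
# For almost every initialization x[k] ends on the period-2 orbit
# {gamma**2 / 4, -gamma**2 / 4}, where the loss is constant even though
# the iterate keeps flipping sign.
gamma, x = 0.2, 0.7
xs, losses = [], []
for _ in range(400):
    grad = abs(x) ** 0.5 if x >= 0 else -(abs(x) ** 0.5)  # f1'(x) = sign(x) * sqrt(|x|)
    x = x - gamma * grad
    xs.append(x)
    losses.append((2.0 / 3.0) * abs(x) ** 1.5)

print(abs(abs(xs[-1]) - gamma ** 2 / 4) < 1e-9)   # True: |x| converged to gamma^2/4
print(xs[-1] * xs[-2] < 0)                        # True: x still alternates in sign
print(abs(losses[-1] - losses[-2]) < 1e-12)       # True: the loss looks converged...
print(losses[-1] > 1e-4)                          # True: ...at a non-optimal value
```

A loss curve alone would suggest the run has converged; only the parameter trajectory reveals the orbit.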
We underlined that, if the training error of an algorithm converges to a non-optimal value, that does not imply the algorithm is stuck near a bad local optimum or a saddle point; it might simply be the case that the algorithm has not converged at all.\nWe showed that the step size of the gradient descent algorithm influences the dynamics of the algorithm substantially. It renders some of the local optima unstable in the sense of Lyapunov, and the algorithm cannot converge to these points independent of the initialization. It also determines the magnitude of the oscillations if the algorithm converges to an orbit around an equilibrium point in the parameter space.\n\n²The code for the experiment is available at https://github.com/nar-k/NeurIPS-2018.\n\nIn Corollary 2 and Theorem 4, we showed that the step size required for convergence to a specific solution depends on the solution itself. In particular, we showed that there is a direct connection between the smoothness parameter of the training loss function and the Lipschitz constant of the function estimated by the network. This reveals that some solutions, such as linear functions with large singular values, are harder to converge to. Given that there exists a relationship between the Lipschitz constants of the estimated functions and their generalization error (Bartlett et al., 2017), this result could provide a better understanding of the generalization of deep neural networks.\nThe analysis in this paper was limited to the full-batch gradient descent algorithm. It remains an open problem to investigate whether there are analogous results for stochastic gradient methods.\n\nA Proof of Theorem 1\n\nLemma 1. Let f : ℝ^{m × n} → ℝ^{m × n} be a linear map defined as f(X) = Σ_{i=1}^L A_i X B_i, where A_i ∈ ℝ^{m × m} and B_i ∈ ℝ^{n × n} are symmetric positive semidefinite matrices for all i ∈ [L]. 
Then, for every nonzero u ∈ ℝ^m and v ∈ ℝ^n, the largest eigenvalue of f satisfies\n\nλ_max(f) ≥ (1/(‖u‖₂² ‖v‖₂²)) Σ_{i=1}^L (u^T A_i u)(v^T B_i v).\n\nProof of Theorem 1. The cost function (2) can be written as\n\n(1/2) trace((W_L ⋯ W_1 − R)^T (W_L ⋯ W_1 − R)).\n\nLet E denote the error in the estimate, i.e., E = W_L ⋯ W_1 − R. The gradient descent yields\n\nW_i[k + 1] = W_i[k] − γ W_{i+1}^T[k] ⋯ W_L^T[k] E[k] W_1^T[k] ⋯ W_{i−1}^T[k] for all i ∈ [L].    (5)\n\nBy multiplying the update equations of W_i[k] and subtracting R, we can obtain the dynamics of E as\n\nE[k + 1] = E[k] − γ Σ_{i=1}^L A_i[k] E[k] B_i[k] + o(E[k]),    (6)\n\nwhere o(·) denotes the higher order terms, and\n\nA_i = W_L W_{L−1} ⋯ W_{i+1} W_{i+1}^T ⋯ W_{L−1}^T W_L^T for all i ∈ [L],\nB_i = W_1^T W_2^T ⋯ W_{i−1}^T W_{i−1} ⋯ W_2 W_1 for all i ∈ [L].\n\nLyapunov's indirect method of stability (Khalil, 2002; Sastry, 1999) states that given a dynamical system x[k + 1] = F(x[k]), its equilibrium x* is stable in the sense of Lyapunov only if the linearization of the system around x*,\n\n(x[k + 1] − x*) = (∂F/∂x)|_{x=x*} (x[k] − x*),\n\ndoes not have any eigenvalue larger than 1 in magnitude. By using this fact for the system defined by (5)–(6), we can observe that an equilibrium {W_j*}_{j∈[L]} with W_L* ⋯ W_1* = R̂ is stable in the sense of Lyapunov only if the system\n\n(E[k + 1] − R̂ + R) = (E[k] − R̂ + R) − γ Σ_{i=1}^L A_i{W_j*} (E[k] − R̂ + R) B_i{W_j*}\n\ndoes not have any eigenvalue larger than 1 in magnitude, which requires that the mapping\n\nf(Ẽ) = Σ_{i=1}^L A_i{W_j*} Ẽ B_i{W_j*}    (7)\n\ndoes not have any real eigenvalue larger than (2/γ). Let u and v be the left and right singular vectors of R̂ corresponding to its largest singular value, and let p_j and q_j be defined as in the statement of Theorem 1. Then, by Lemma 1, the mapping f in (7) does not have an eigenvalue larger than (2/γ) only if\n\nΣ_{i=1}^L ‖p_{i−1}‖₂² ‖q_{i+1}‖₂² ≤ 2/γ,\n\nwhich completes the proof. ∎\n\nAcknowledgement\n\nThis research was supported by the U.S. Office of Naval Research (ONR) MURI grant N00014-16-1-2710.\n\nReferences\n[1] P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, Vol. 2, pp. 53–58, 1989.\n\n[2] N. E. Barabanov and D. V. Prokhorov. Stability analysis of discrete-time recurrent neural networks. IEEE Transactions on Neural Networks, Vol. 13, No. 2, pp. 292–303, 2002.\n\n[3] P. L. Bartlett, S. Evans, and P. Long. Representing smooth functions as compositions of near-identity functions with implications for deep network optimization. arXiv:1804.05012 [cs.LG], 2018a.\n\n[4] P. L. Bartlett, D. J. Foster, and M. Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, 2017.\n\n[5] P. L. Bartlett, D. P. Helmbold, and P. Long. Gradient descent with identity initialization efficiently learns positive definite linear transformations by deep residual networks. arXiv:1802.06093 [cs.LG], 2018b.\n\n[6] C. Daniel, J. Taylor, and S. Nowozin. Learning step size controllers for robust neural network training. In AAAI Conference on Artificial Intelligence, 2016.\n\n[7] S. Gunasekar, B. E. Woodworth, S. Bhojanapalli, B. Neyshabur, and N. Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems, pp. 6152–6160, 2017.\n\n[8] M. Hardt and T. Ma. Identity matters in deep learning. arXiv:1611.04231 [cs.LG], 2016.\n\n[9] K. He, X. Zhang, S. Ren, and J. Sun. 
Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.\n\n[10] R. A. Jacobs. Increased rates of convergence through learning rate adaptation. Neural Networks, Vol. 1, pp. 295–307, 1988.\n\n[11] K. Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pp. 586–594, 2016.\n\n[12] H. K. Khalil. Nonlinear Systems, 3rd Edition. Prentice Hall, 2002.\n\n[13] G. D. Magoulas, M. N. Vrahatis and G. S. Androulakis. Effective backpropagation training with variable stepsize. Neural Networks, Vol. 10, No. 1, pp. 69–82, 1997.\n\n[14] K. Matsuoka. Stability conditions for nonlinear continuous neural networks with asymmetric connection weights. Neural Networks, Vol. 5, No. 3, pp. 495–500, 1992.\n\n[15] A. N. Michel, J. A. Farrell, and W. Porod. Stability results for neural networks. In Neural Information Processing Systems, pp. 554–563, 1988.\n\n[16] K. Nar and S. S. Sastry. Residual networks: Lyapunov stability and convex decomposition. arXiv:1803.08203 [cs.LG], 2018.\n\n[17] B. Neyshabur, R. Tomioka, R. Salakhutdinov, and N. Srebro. Geometry of optimization and implicit regularization in deep learning. arXiv:1705.03071 [cs.LG], 2017.\n\n[18] S. Sastry. Nonlinear Systems: Analysis, Stability, and Control. Springer: New York, NY, 1999.\n\n[19] A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120 [cs.NE], 2013.\n\n[20] M. Rolinek and G. Martius. L4: Practical loss-based stepsize adaptation for deep learning. arXiv:1802.05074 [cs.LG], 2018.\n", "award": [], "sourceid": 1767, "authors": [{"given_name": "Kamil", "family_name": "Nar", "institution": "University of California, Berkeley"}, {"given_name": "Shankar", "family_name": "Sastry", "institution": "Department of EECS, UC Berkeley"}]}