{"title": "Fast Convergence of Natural Gradient Descent for Over-Parameterized Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 8082, "page_last": 8093, "abstract": "Natural gradient descent has proven very effective at mitigating the catastrophic effects of pathological curvature in the objective function, but little is known theoretically about its convergence properties, especially for \\emph{non-linear} networks. In this work, we analyze for the first time the speed of convergence to global optimum for natural gradient descent on non-linear neural networks with the squared error loss. We identify two conditions which guarantee the global convergence: (1) the Jacobian matrix (of network's output for all training cases w.r.t the parameters) is full row rank and (2) the Jacobian matrix is stable for small perturbations around the initialization. For two-layer ReLU neural networks (i.e. with one hidden layer), we prove that these two conditions do hold throughout the training under the assumptions that the inputs do not degenerate and the network is over-parameterized. We further extend our analysis to more general loss function with similar convergence property. Lastly, we show that K-FAC, an approximate natural gradient descent method, also converges to global minima under the same assumptions.", "full_text": "Fast Convergence of Natural Gradient Descent\n\nfor Overparameterized Neural Networks\n\nGuodong Zhang1,2, James Martens3, Roger Grosse1,2\nUniversity of Toronto1, Vector Institute2, DeepMind3\n\n{gdzhang, rgrosse}@cs.toronto.edu, jamesmartens@google.com\n\nAbstract\n\nNatural gradient descent has proven effective at mitigating the effects of patho-\nlogical curvature in neural network optimization, but little is known theoretically\nabout its convergence properties, especially for nonlinear networks. 
In this work,\nwe analyze for the \ufb01rst time the speed of convergence of natural gradient descent\non nonlinear neural networks with squared-error loss. We identify two conditions\nwhich guarantee ef\ufb01cient convergence from random initializations: (1) the Jacobian\nmatrix (of network\u2019s output for all training cases with respect to the parameters)\nhas full row rank, and (2) the Jacobian matrix is stable for small perturbations\naround the initialization. For two-layer ReLU neural networks, we prove that these\ntwo conditions do in fact hold throughout the training, under the assumptions of\nnondegenerate inputs and overparameterization. We further extend our analysis\nto more general loss functions. Lastly, we show that K-FAC, an approximate\nnatural gradient descent method, also converges to global minima under the same\nassumptions, and we give a bound on the rate of this convergence.\n\n1\n\nIntroduction\n\nBecause training large neural networks is costly, there has been much interest in using second-\norder optimization to speed up training [Becker and LeCun, 1989, Martens, 2010, Martens and\nGrosse, 2015], and in particlar natural gradient descent [Amari, 1998, 1997]. Recently, scalable\napproximations to natural gradient descent have shown practical success in a variety of tasks and\narchitectures [Martens and Grosse, 2015, Grosse and Martens, 2016, Wu et al., 2017, Zhang et al.,\n2018a, Martens et al., 2018]. Natural gradient descent has an appealing interpretation as optimizing\nover a Riemannian manifold using an intrinsic distance metric; this implies the updates are invariant\nto transformations such as whitening [Ollivier, 2015, Luk and Grosse, 2018]. It is also closely\nconnected to Gauss-Newton optimization, suggesting it should achieve fast convergence in certain\nsettings [Pascanu and Bengio, 2013, Martens, 2014, Botev et al., 2017].\nDoes this intuition translate into faster convergence? 
Amari [1998] provided arguments in the\naf\ufb01rmative, as long as the cost function is well approximated by a convex quadratic. However, it\nremains unknown whether natural gradient descent can optimize neural networks faster than gradient\ndescent \u2014 a major gap in our understanding. The problem is that the optimization of neural networks\nis both nonconvex and non-smooth, making it dif\ufb01cult to prove nontrivial convergence bounds. In\ngeneral, \ufb01nding a global minimum of a general non-convex function is an NP-complete problem, and\nneural network training in particular is NP-complete [Blum and Rivest, 1992].\nHowever, in the past two years, researchers have \ufb01nally gained substantial traction in understanding\nthe dynamics of gradient-based optimization of neural networks. Theoretically, it has been shown\nthat gradient descent starting from a random initialization is able to \ufb01nd a global minimum if the\nnetwork is wide enough [Li and Liang, 2018, Du et al., 2018b,a, Zou et al., 2018, Allen-Zhu et al.,\n2018, Oymak and Soltanolkotabi, 2019]. The key technique of those works is to show that neural\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fnetworks become well-behaved if they are largely overparameterized in the sense that the number of\nhidden units is polynomially large in the size of the training data. However, most of these works have\nfocused on standard gradient descent, leaving open the question of whether similar statements can be\nmade about other optimizers.\nMost convergence analysis of natural gradient descent has focused on simple convex quadratic\nobjectives (e.g. [Martens, 2014]). Very recently, the convergence properties of NGD were studied in\nthe context of linear networks [Bernacchia et al., 2018]. 
While the linearity assumption simpli\ufb01es the\nanalysis of training dynamics [Saxe et al., 2013], linear networks are severely limited in terms of their\nexpressivity, and it\u2019s not clear which conclusions will generalize from linear to nonlinear networks.\nIn this work, we analyze natural gradient descent for nonlinear networks. We give two simple\nand generic conditions on the Jacobian matrix which guarantee ef\ufb01cient convergence to a global\nminimum. We then apply this analysis to a particular distribution over two-layer ReLU networks\nwhich has recently been used to analyze the convergence of gradient descent [Li and Liang, 2018, Du\net al., 2018a, Oymak and Soltanolkotabi, 2019]. We show that for suf\ufb01ciently high network width,\nNGD will converge to the global minimum. We give bounds on the convergence rate of two-layer\nReLU networks that are much better than the analogous bounds that have been proven for gradient\ndescent [Du et al., 2018b, Wu et al., 2019, Oymak and Soltanolkotabi, 2019], while allowing for\nmuch higher learning rates. Moreover, in the limit of in\ufb01nite width, and assuming a squared error\nloss, we show that NGD converges in just one iteration. The main contributions of our work are\nsummarized as follows:\n\n\u2022 We provide the \ufb01rst convergence result for natural gradient descent in training randomly-\ninitialized overparameterized neural networks where the number of hidden units is polyno-\nmially larger than the number of training samples. We show that natural gradient descent\ngives an O(min(G1)) improvement in convergence rate given the same learning rate as\ngradient descent, where G1 is a Gram matrix that depends on the data.\n\u2022 We show that natural gradient enables us to use a much larger step size, resulting in an even\nfaster convergence rate. 
Speci\ufb01cally, the maximal step size of natural gradient descent is\nO (1) for (polynomially) wide networks.\n\u2022 We show that K-FAC [Martens and Grosse, 2015], an approximate natural gradient descent\nmethod, also converges to global minima with linear rate, although this result requires a\nhigher level of overparameterization compared to GD and exact NGD.\n\n\u2022 We analyze the generalization properties of NGD, showing that the improved convergence\n\nrates provably don\u2019t come at the expense of worse generalization.\n\n2 Related Works\n\nRecently, there have been many works studying the optimization problem in deep learning, i.e., why\nin practice many neural network architectures reliably converge to global minima (zero training error).\nOne popular way to attack this problem is to analyze the underlying loss surface [Hardt and Ma,\n2016, Kawaguchi, 2016, Kawaguchi and Bengio, 2018, Nguyen and Hein, 2017, Soudry and Carmon,\n2016]. The main argument of those works is that there are no bad local minima. It has been proven\nthat gradient descent can \ufb01nd global minima [Ge et al., 2015, Lee et al., 2016] if the loss surface\nsatis\ufb01es: (1) all local minima are global and (2) all saddle points are strict in the sense that there\nexists at least one negative curvature direction. Unfortunately, most of those works rely on unrealistic\nassumptions (e.g., linear activations [Hardt and Ma, 2016, Kawaguchi, 2016]) and cannot generalize\nto practical neural networks. Moreover, Yun et al. [2018] shows that small nonlinearity in shallow\nnetworks can create bad local minima.\nAnother way to understand the optimization of neural networks is to directly analyze the optimization\ndynamics. Our work also falls within this category. However, most work in this direction focuses\non gradient descent. Bartlett et al., Arora et al. 
[2019a] studied the optimization trajectory of deep\nlinear networks and showed that gradient descent can \ufb01nd global minima under some assumptions.\nPreviously, the dynamics of linear networks have also been studied by Saxe et al. [2013], Advani\nand Saxe [2017]. For nonlinear neural networks, a series of papers [Tian, 2017, Brutzkus and\nGloberson, 2017, Du et al., 2017, Li and Yuan, 2017, Zhang et al., 2018b] studied a speci\ufb01c class of\nshallow two-layer neural networks together with strong assumptions on input distribution as well\nas realizability of labels, proving global convergence of gradient descent. Very recently, there are\n\n2\n\n\fsome works proving global convergence of gradient descent [Li and Liang, 2018, Du et al., 2018b,a,\nAllen-Zhu et al., 2018, Zou et al., 2018, Gao et al., 2019] or adaptive gradient methods [Wu et al.,\n2019] on overparameterized neural networks. More speci\ufb01cally, Li and Liang [2018], Allen-Zhu et al.\n[2018], Zou et al. [2018] analyzed the dynamics of weights and showed that the gradient cannot be\nsmall if the objective value is large. On the other hand, Du et al. [2018b,a], Wu et al. [2019] studied\nthe dynamics of the outputs of neural networks, where the convergence properties are captured by a\nGram matrix. Our work is very similar to Du et al. [2018b], Wu et al. [2019]. We note that these\npapers all require the step size to be suf\ufb01ciently small to guarantee the global convergence, leading to\nslow convergence.\nTo our knowledge, there is only one paper [Bernacchia et al., 2018] studying the global convergence\nof natural gradient for neural networks. However, Bernacchia et al. [2018] only studied deep linear\nnetworks with in\ufb01nitesimal step size and squared error loss functions. 
In this sense, our work is the\n\ufb01rst one proving global convergence of natural gradient descent on nonlinear networks.\nThere have been many attempts to understand the generalization properties of neural networks\nsince Zhang et al. [2016]\u2019s seminal paper. Researchers have proposed norm-based generalization\nbounds [Neyshabur et al., 2015, 2017, Bartlett and Mendelson, 2002, Bartlett et al., 2017, Golowich\net al., 2017], compression bounds [Arora et al., 2018] and PAC-Bayes bounds [Dziugaite and Roy,\n2017, 2018, Zou et al., 2018]. Recently, overparameterization of neural networks together with\ngood initialization has been believed to be one key factor of good generalization. Neyshabur et al.\n[2019] empirically showed that wide neural networks stay close to the initialization, thus leading to\ngood generalization. Theoretically, researchers did prove that overparameterization as well as linear\nconvergence jointly restrict the weights to be close to the initialization [Du et al., 2018b,a, Allen-Zhu\net al., 2018, Zou et al., 2018, Arora et al., 2019b]. The most closely related paper is Arora et al.\n[2019b], which shows that the optimization and generalization phenomenon can be explained by a\nGram matrix. The main difference is that our analysis is based on natural gradient descent, which\nconverges faster and provably generalizes as well as gradient descent.\nConcurrently and independently, Cai et al. [2019] showed that natural gradient descent (they call it\nGram-Gauss-Newton) enjoys quadratic convergence rate guarantee for overparameterized networks\non regression problems. 
Additionally, they showed that it is much cheaper to precondition the gradient in the output space when the number of data points is much smaller than the number of parameters.

3 Convergence Analysis of Natural Gradient Descent

We begin our convergence analysis of natural gradient descent – under appropriate conditions – for the neural network optimization problem. Formally, we consider a generic neural network f(θ, x) with a single output and squared error loss ℓ(u, y) = ½(u − y)² for simplicity¹, where θ ∈ ℝᵐ denotes all parameters of the network (i.e. weights and biases). Given a training dataset {(x_i, y_i)}_{i=1}^n, we want to minimize the following loss function:

L(θ) = (1/n) Σ_{i=1}^n ℓ(f(θ, x_i), y_i) = (1/(2n)) Σ_{i=1}^n (f(θ, x_i) − y_i)².   (1)

One main focus of this paper is to analyze the following procedure:

θ(k + 1) = θ(k) − η F(θ(k))⁻¹ ∂L(θ(k))/∂θ(k),   (2)

where η > 0 is the step size, and F is the Fisher information matrix associated with the network's predictive distribution over y (which is implied by its loss function and is N(f(θ, x), 1) for the squared error loss) and the dataset's distribution over x.

As shown by Martens [2014], the Fisher F is equivalent to the generalized Gauss-Newton matrix, defined as E_{x_i}[J_iᵀ H_ℓ J_i], if the predictive distribution is in the exponential family, such as the categorical distribution (for classification) or the Gaussian distribution (for regression). Here J_i is the Jacobian matrix of u_i with respect to the parameters θ, and H_ℓ is the Hessian of the loss ℓ(u, y) with respect to the network prediction u (which is I in our setting).
Therefore, with the squared error loss, the Fisher matrix can be compactly written as F = E[J_iᵀ J_i] = (1/n) JᵀJ (which coincides with the classical Gauss-Newton matrix), where J = [J_1ᵀ, ..., J_nᵀ]ᵀ is the Jacobian matrix for the whole dataset. In practice, when the number of parameters m is larger than the number of samples n, the Fisher information matrix F = (1/n) JᵀJ is necessarily singular. In that case, we take the generalized inverse [Bernacchia et al., 2018] F† = n Jᵀ G⁻¹ G⁻¹ J with G = JJᵀ, which gives the following update rule:

θ(k + 1) = θ(k) − η Jᵀ (JJᵀ)⁻¹ (u − y),

where u = [u_1, ..., u_n]ᵀ = [f(θ, x_1), ..., f(θ, x_n)]ᵀ and y = [y_1, ..., y_n]ᵀ.

¹It is easy to extend to multi-output networks and other loss functions; here we focus on the single-output, quadratic case just for notational simplicity.

We now introduce two conditions on the network f_θ that suffice for proving the global convergence of NGD to a minimizer which achieves zero training loss (and is therefore a global minimizer). To motivate these two conditions we make the following observations. First, the global minimizer is characterized by the condition that the gradient in the output space is zero for each case (i.e. ∇_u L(θ) = 0). Meanwhile, local minima are characterized by the condition that the gradient with respect to the parameters, ∇_θ L(θ), is zero. Thus, one way to avoid finding local minima that aren't global is to ensure that the parameter gradient is zero if and only if the output space gradient (for each case) is zero. It's not hard to see that this property holds as long as G remains non-singular throughout optimization (or equivalently that J always has full row rank).
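The update rule above can be made concrete with a short NumPy sketch (our own illustration; the `ngd_step` helper is hypothetical, and we assume the n × m Jacobian J is available). On a linear model, whose Jacobian is constant, a single NGD step with η = 1 fits the targets exactly, mirroring the one-iteration behavior of NGD on linearized networks discussed later.

```python
import numpy as np

def ngd_step(theta, J, u, y, lr=1.0):
    """One natural gradient descent step in generalized-inverse form:
        theta <- theta - lr * J^T (J J^T)^{-1} (u - y).
    This avoids forming the (singular, m x m) Fisher matrix when m > n.
    J is the n x m Jacobian of the network outputs u w.r.t. theta.
    """
    residual = u - y                       # output-space gradient
    # Solve the small n x n system (J J^T) z = residual instead of inverting.
    z = np.linalg.solve(J @ J.T, residual)
    return theta - lr * J.T @ z

# Sanity check on a linear "network" f(theta, x) = x^T theta, whose
# Jacobian is the constant matrix X: one NGD step with lr = 1 reaches
# zero residual, since X theta_new = X theta - (X theta - y) = y.
rng = np.random.default_rng(0)
n, m = 5, 20                               # overparameterized: m > n
X = rng.standard_normal((n, m))
theta = rng.standard_normal(m)
y = rng.standard_normal(n)

theta_new = ngd_step(theta, X, X @ theta, y, lr=1.0)
print(np.max(np.abs(X @ theta_new - y)))   # ~0 (machine precision)
```

For a nonlinear network the Jacobian changes between steps, which is exactly what Conditions 1 and 2 below control.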
The following two conditions ensure that this happens, by first requiring that this property holds at initialization time, and second that J changes slowly enough that it remains true in a big enough neighborhood around θ(0).

Condition 1 (Full row rank of Jacobian matrix). The Jacobian matrix J(0) at initialization has full row rank, or equivalently, the Gram matrix G(0) = J(0)J(0)ᵀ is positive definite.

Remark 1. Condition 1 implies that m ≥ n, which means the Fisher information matrix is singular and we have to use the generalized inverse except in the case where m = n.

Condition 2 (Stable Jacobian). There exists 0 ≤ C < 1/2 such that for all parameters θ that satisfy

‖θ − θ(0)‖₂ ≤ 3‖y − u(0)‖₂ / √λ_min(G(0)),   (3)

we have

‖J(θ) − J(0)‖₂ ≤ (C/3) √λ_min(G(0)).   (4)

This condition shares the same spirit with the Lipschitz smoothness assumption in classical optimization theory. It implies (for small C) that the network is close to a linearized network [Lee et al., 2019] around the initialization, and therefore the natural gradient descent update is close to the gradient descent update in the output space. Along with Condition 1, we have the following theorem.

Theorem 1 (Natural gradient descent). Let Conditions 1 and 2 hold. Suppose we optimize with NGD using a step size η ≤ (1 − 2C)/(1 + C)². Then for k = 0, 1, 2, ... we have

‖u(k) − y‖₂² ≤ (1 − η)ᵏ ‖u(0) − y‖₂².   (5)

To be noted, ‖u(k) − y‖₂² is the squared error loss up to a constant. Due to space constraints we only give a short sketch of the proof here. The full proof is given in Appendix B.

Proof Sketch. Our proof relies on the following insights. First, if the Jacobian matrix has full row rank, this guarantees linear convergence for infinitesimal step size.
The linear convergence property restricts the parameters to be close to the initialization, which implies the Jacobian matrix keeps full row rank throughout training, and therefore natural gradient descent with infinitesimal step size converges to global minima. Furthermore, given that the network is close to a linearized network (since the Jacobian matrix is stable with respect to small perturbations around the initialization), we are able to extend the proof to discrete time with a large step size.

In summary, we prove that NGD exhibits linear convergence to the global minimizer of the neural network training problem, under Conditions 1 and 2. We believe our arguments in this section are general (i.e., architecture-agnostic), and can serve as a recipe for proving global convergence of natural gradient descent in other settings.

3.1 Other Loss Functions

We note that our analysis can be easily extended to a more general class of loss functions. Here, we take the class of functions that are μ-strongly convex with L-Lipschitz gradients as an example. Note that strong convexity is a very mild assumption, since we can always add L2 regularization to make a convex loss strongly convex. Therefore, this function class includes the regularized cross-entropy loss (which is typically used in classification) and the squared error (for regression). For this type of loss, we need a stronger version of Condition 2.

Condition 3 (Stable Jacobian). There exists 0 ≤ C < 1/(1 + κ), where κ = L/μ, such that for all parameters θ that satisfy

‖θ − θ(0)‖₂ ≤ 3(1 + κ)‖y − u(0)‖₂ / (2√λ_min(G(0))),

we have

‖J(θ) − J(0)‖₂ ≤ (C/3) √λ_min(G(0)).   (6)

Theorem 2. Let Conditions 1 and 3 hold, with a μ-strongly convex loss function ℓ(·,·) with L-Lipschitz gradient (κ = L/μ). If we set the step size η ≤ (2/(μ + L)) · (1 − (1 + κ)C)/(1 + C)², then we have for k = 0, 1, 2, ...

‖u(k) − y‖₂² ≤ (1 − 2ημL/(μ + L))ᵏ ‖u(0) − y‖₂².   (7)

The key step in proving Theorem 2 is to show that if m is large enough, then natural gradient descent is approximately gradient descent in the output space. The result can then be derived from standard bounds in convex optimization. Due to the page limit, we defer the proof to Appendix C.

Remark 2. In Theorem 2, the convergence rate depends on the condition number κ = L/μ, which can be removed if we take into account the curvature information of the loss function. In other words, we expect that the bound has no dependency on κ if we use the Fisher matrix rather than the classical Gauss-Newton matrix (which assumes a Euclidean metric in the output space [Luk and Grosse, 2018]) in Theorem 2.

4 Optimizing Overparameterized Neural Networks

In Section 3, we analyzed the convergence properties of natural gradient descent under the abstract Conditions 1 and 2. In this section, we make our analysis concrete by applying it to a specific type of overparameterized network (with a certain random initialization). We show that Conditions 1 and 2 hold with high probability. We therefore establish that NGD exhibits linear convergence to a global minimizer for such networks.

4.1 Notation

We let [m] = {1, 2, ..., m}. We use ⊗ and ⊙ to denote the Kronecker and Hadamard products, and we use ∗ and ⋆ to denote row-wise and column-wise Khatri-Rao products, respectively. For a matrix A, we use A_ij to denote its (i, j)-th entry.
We use ‖·‖₂ to denote the Euclidean norm of a vector or the spectral norm of a matrix, and ‖·‖_F to denote the Frobenius norm of a matrix. We use λ_max(A) and λ_min(A) to denote the largest and smallest eigenvalues of a square matrix, and σ_max(A) and σ_min(A) to denote the largest and smallest singular values of a (possibly non-square) matrix. For a positive definite matrix A, we use κ_A to denote its condition number, i.e., λ_max(A)/λ_min(A). We also use ⟨·,·⟩ to denote the standard inner product between two vectors. Given an event E, we use I{E} to denote the indicator function for E.

4.2 Problem Setup

Formally, we consider a neural network of the following form:

f(w, a, x) = (1/√m) Σ_{r=1}^m a_r σ(w_rᵀ x),   (8)

where x ∈ ℝᵈ is the input, w = [w_1ᵀ, ..., w_mᵀ]ᵀ ∈ ℝ^{md} is the weight matrix of the first layer (formed into a vector), a_r ∈ ℝ is the output weight of hidden unit r, and σ(·) is the ReLU activation function (acting entry-wise for vector arguments). For r ∈ [m], we initialize the weights of the first layer as w_r ∼ N(0, ν²I) and the output weights as a_r ∼ unif{−1, +1}.

Following Du et al. [2018b], Wu et al. [2019], we make the following assumption on the data.

Figure 1: Visualization of the natural gradient update and the gradient descent update in the output space (for a randomly initialized network). We take two classes (4 and 9) from MNIST [LeCun et al., 1998] and generate the targets (denoted as stars in the figure) by f(x) = 0.5·x + 0.3·N(0, I), where x ∈ ℝ² is the one-hot target. We get the natural gradient update by running 100 iterations of conjugate gradient [Martens, 2010]. The first row: an MLP with two hidden layers and 100 hidden units in each layer. The second row: an MLP with two hidden layers and 6000 hidden units in each layer. In both cases, the ReLU activation function was used. We interpolate the step size from 0 to 1.
For the over-parameterized network (in the second row), natural gradient descent (implemented by conjugate gradient) matches the output space gradient well.

Assumption 1. For all i, ‖x_i‖₂ = 1 and |y_i| = O(1). For any i ≠ j, x_i ∦ x_j.

This very mild condition simply requires the inputs and outputs have standardized norms, and that different input vectors are distinguishable from each other. Datasets that do not satisfy this condition can be made to do so via simple pre-processing.

Following Du et al. [2018b], Oymak and Soltanolkotabi [2019], Wu et al. [2019], we only optimize the weights of the first layer², i.e., θ = w. Therefore, natural gradient descent can be simplified to

w(k + 1) = w(k) − η Jᵀ (JJᵀ)⁻¹ (u − y).   (9)

Though this is only a shallow fully connected neural network, the objective is still non-smooth and non-convex [Du et al., 2018b] due to the use of the ReLU activation function. We further note that this two-layer network model has been useful in understanding the optimization and generalization of deep neural networks [Xie et al., 2016, Li and Liang, 2018, Du et al., 2018b, Arora et al., 2019b, Wu et al., 2019], and some results have been extended to multi-layer networks [Du et al., 2018a].

Following Du et al. [2018b], Wu et al. [2019], we define the limiting Gram matrix as follows:

Definition 1 (Limiting Gram Matrix). The limiting Gram matrix G∞ ∈ ℝ^{n×n} is defined as follows. For its (i, j)-th entry, we have

G∞_ij = E_{w∼N(0,ν²I)}[x_iᵀx_j · I{wᵀx_i ≥ 0, wᵀx_j ≥ 0}] = x_iᵀx_j (π − arccos(x_iᵀx_j)) / (2π).   (10)

This matrix coincides with the neural tangent kernel [Jacot et al., 2018] for the ReLU activation function. As shown by Du et al. [2018b], this matrix is positive definite, and we define its smallest eigenvalue λ₀ ≜ λ_min(G∞) > 0.
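Since Definition 1 gives a closed form, G∞ and its smallest eigenvalue λ₀ can be computed directly for a concrete dataset. The following NumPy sketch is our own illustration (the `limiting_gram` helper is hypothetical) of the formula and of the positive-definiteness claim for random unit-norm inputs:

```python
import numpy as np

def limiting_gram(X):
    """Limiting Gram matrix (the NTK for ReLU) from Definition 1:
        G_ij = x_i^T x_j * (pi - arccos(x_i^T x_j)) / (2*pi),
    for rows of X normalized to unit length."""
    inner = np.clip(X @ X.T, -1.0, 1.0)     # guard arccos against round-off
    return inner * (np.pi - np.arccos(inner)) / (2 * np.pi)

rng = np.random.default_rng(0)
n, d = 8, 10
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # Assumption 1: unit norms

G = limiting_gram(X)
print(np.diag(G))                    # all 1/2, since arccos(1) = 0
print(np.linalg.eigvalsh(G).min())   # lambda_0 > 0 for distinct inputs
```

The diagonal entries equal 1/2 exactly, and the smallest eigenvalue λ₀ is strictly positive for distinct, non-parallel inputs, as stated above.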
In the same way, we can de\ufb01ne its \ufb01nite version G(t) = J(t)J(t)> with\n(i, j)-entry Gij(t) = 1\n\nm x>i xjPr2[m] Iwr(t)>xi 0, wr(t)>xj 0 .\n\n4.3 Exact Natural Gradient Descent\n\nIn this subsection, we present our result for this setting. The main dif\ufb01culty is to show that Conditions 1\nand 2 hold. Here we state our main result.\nTheorem 3 (Natural Gradient Descent for overparameterized Networks). Under Assumption 1, if\nwe i.i.d initialize wr \u21e0N (0,\u232b 2I), ar \u21e0 unif[{1, +1}] for r 2 [m], we set the number of hidden\n2We \ufb01x the second layer just for simplicity. Based on the same analysis, one can also prove global convergence\n\nfor jointly training both layers.\n\n6\n\n\u22122\u22121012\u22122\u22121012\u22122\u22121012\u22122\u22121012\u22122\u22121012\u22122\u22121012\u22122\u22121012\fnodes m =\u2326 \u21e3 n4\n\n\u232b24\n\n03\u2318, and the step size \u2318 = O(1), then with probability at least 1 over the\n\nrandom initialization we have for k = 0, 1, 2, ...\n\nku(k) yk2\n\n2 \uf8ff (1 \u2318)k ku(0) yk2\n2.\n\n(11)\n\nEven though the objective is non-convex and non-smooth, natural gradient descent with a constant\nstep size enjoys a linear convergence rate. For large enough m, we show that the learning rate can be\nchosen up to 1, so NGD can provably converge within O (1) steps. Compared to analogous bounds\nfor gradient descent [Du et al., 2018a, Oymak and Soltanolkotabi, 2019, Wu et al., 2019], we improve\nthe maximum allowable learning rate from O(1/n) to O(1) and also get rid of the dependency on 0.\nOverall, NGD (Theorem 3) gives an O(0/n) improvement over gradient descent.\nOur strategy to prove this result will be to show that for the given choice of random initialization,\nCondition 1 and 2 hold with high probability. For proving Condition 1 hold, we used matrix\nconcentration inequalities. For Condition 2, we show that kJ J(0)k2 = Om1/6, which implies\n\nthe Jacobian is stable for wide networks. 
For detailed proof, we refer the reader to the Appendix D.1.\n\n4.4 Approximate Natural Gradient Descent with K-FAC\nExact natural gradient descent is quite expensive in terms of computation or memory. In training deep\nneural networks, K-FAC [Martens and Grosse, 2015] has been a powerful optimizer for leveraging\ncurvature information while retaining tractable computation. The K-FAC update rule for the two-layer\nReLU network is given by\n\nw(k + 1) = w(k) \u2318\u21e5(X>X)1 \u2326 (S(k)>S(k))1\u21e4\n}\n\n{z\n\nKFAC\n\n|\n\nF1\n\nJ(k)>(u(k) y).\n\n(12)\n\nwhere X 2 Rn\u21e5d denotes the matrix formed from the n input vectors (i.e. X = [x1, ..., xn]>), and\nS = [0(Xw1), ..., 0(Xwm)] 2 Rn\u21e5m is the matrix of pre-activation derivatives. Under the same\nargument as the Gram matrix G1, we get that S1S1> is strictly positive de\ufb01nite with smallest\neigenvalue S (see Appendix D.3 for detailed proof).\nWe show that for suf\ufb01ciently wide networks, K-FAC does converge linearly to a global minimizer. We\nfurther show, with a particular transformation on the input data, K-FAC does match the optimization\nperformance of exact natural gradient for two-layer ReLU networks. Here we state the main result.\nTheorem 4 (K-FAC). Under the same assumptions as in Theorem 3, plus the additional assumption\n\nthat rank(X) = d, if we set the number of hidden units m = O\u2713\n3\u25c6 and step size\n\u2318 = OminX>X, then with probability at least 1 over the random initialization, we have\n\nfor k = 0, 1, 2, ...\n\nn4\nS\uf8ff4\nX>X\n\n\u232b24\n\nku(k) yk2\n\n2 \uf8ff\u27131 \n\n\u2318\n\nmax(X>X)\u25c6k\n\nku(0) yk2\n2.\n\n(13)\n\nThe key step in proving Theorem 4 is to show\n\nu(k + 1) u(k) \u21e1\u21e5X(X>X)1X> I\u21e4 (y u(k)) .\n\n(14)\nRemark 3. 
The convergence rate of K-FAC is captured by the condition number of the matrix XᵀX, as opposed to gradient descent [Du et al., 2018b, Oymak and Soltanolkotabi, 2019], for which the convergence rate is determined by the condition number of the Gram matrix G.

Remark 4. The dependence of the convergence rate on κ_{XᵀX} in Theorem 4 may seem paradoxical, as K-FAC is invariant to invertible linear transformations of the data (including those that would change κ_{XᵀX}). But we note that said transformations would also make the norms of the input vectors non-uniform, thus violating Assumption 1 in a way that isn't repairable. Interestingly, there exists an invertible linear transformation which, if applied to the input vectors and followed by normalization, produces vectors that simultaneously satisfy Assumption 1 and the condition κ_{XᵀX} = 1 (thus improving the bound in Theorem 4 substantially). See Appendix A for details. Notably, K-FAC is not invariant to such pre-processing, as the normalization step is a nonlinear operation.

To quantify the degree of overparameterization (which is a function of the network width m) required to achieve global convergence under our analysis, we must estimate λ_S. To this end, we observe that G∞ = XXᵀ ⊙ S∞S∞ᵀ, and then apply the following lemma:

Lemma 1 (Schur [1911]). For two positive definite matrices A and B, we have

λ_max(A ⊙ B) ≤ (max_i A_ii) λ_max(B),
λ_min(A ⊙ B) ≥ (min_i A_ii) λ_min(B).   (15)

The diagonal entries of XXᵀ are all 1 since the inputs are normalized. Therefore, we have λ₀ ≥ λ_S according to Lemma 1, and hence K-FAC requires a slightly higher degree of overparameterization than exact NGD under our analysis.

4.5 Bounding λ₀

As pointed out by Allen-Zhu et al. [2018], it is unclear whether 1/λ₀ is small or even polynomial. Here, we bound λ₀ using matrix concentration inequalities and harmonic analysis. To leverage harmonic analysis, we have to assume the data x_i are drawn i.i.d.
from the unit sphere3.\nTheorem 5. Under this assumption on the training data, with probability 1 n exp(n/4),\n\n0 , min(G1) n/2, where 2 (0, 0.5)\n\n(16)\n\nBasically, Theorem 5 says that the Gram matrix G1 should have high chance of having large smallest\neigenvalue if the training data are uniformly distributed. Intuitively, we would expect the smallest\neigenvalue to be very small if all xi are similar to each other. Therefore, some notion of diversity of\nthe training inputs is needed. We conjecture that the smallest eigenvalue would still be large if the\ndata are -separable (i.e., kxi xjk2 for any pair i, j 2 [n]), an assumption adopted by Li and\nLiang [2018], Allen-Zhu et al. [2018], Zou et al. [2018].\n\n5 Generalization analysis\n\nIt is often speculated that NGD or other preconditioned gradient descent methods (e.g., Adam)\nperform worse than gradient descent in terms of generalization [Wilson et al., 2017]. In this section,\nwe show that NGD achieves the same generalization bounds which have been proved for GD, at least\nfor two-layer ReLU networks.\nConsider a loss function ` : R \u21e5 R ! R. The expected risk over the data distribution D and the\nempirical risk over a training set S = {(xi, yi)}n\n\nLD(f ) = E(x,y)\u21e0D [`(f (x), y)] and LS(f ) =\n\n`(f (xi), yi)\n\n(17)\n\ni=1 are de\ufb01ned as\n1\nn\n\nnXi=1\n\nIt has been shown [Neyshabur et al., 2019] that the Redemacher complexity [Bartlett and Mendel-\nson, 2002] for two-layer ReLU networks depends on kw w(0)k2. By the standard Rademacher\ncomplexity generalization bound, we have the following bound (see Appendix E.1 for proof):\nTheorem 6. Given a target error parameter \u270f> 0 and failure probability 2 (0, 1). Suppose\n0 , 1,\u270f 1. 
For any 1-Lipschitz loss function, with $\nu = O(\epsilon\sqrt{\lambda_0})$ and $m \geq \nu^{-2}\,\mathrm{poly}(n, \delta^{-1})$, with probability at least $1 - \delta$ over the random initialization and the training samples, the two-layer neural network $f(w, a)$ trained by NGD for $k \geq \Omega\big(\tfrac{1}{\eta}\log\tfrac{1}{\epsilon}\big)$ iterations has expected loss $L_{\mathcal{D}}(f(w, a)) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\ell(f(w, a, x), y)\right]$ bounded as:
$$L_{\mathcal{D}}(f(w, a)) \leq \sqrt{\frac{2\,y^\top (G^\infty)^{-1} y}{n}} + 3\sqrt{\frac{\log(6/\delta)}{2n}} + \epsilon \tag{18}$$

which matches the bound for gradient descent in Arora et al. [2019b]. For the detailed proof, we refer the reader to Appendix E.1.

³This assumption is not too stringent since the inputs are already normalized. Moreover, we can relax the unit-sphere assumption to a separability assumption on the inputs, which is used in Li and Liang [2018], Allen-Zhu et al. [2018], Zou et al. [2018]. See Oymak and Soltanolkotabi [2019] (Theorem I.1) for more details.

6 Conclusion

We've analyzed for the first time the rate of convergence to a global optimum for (both exact and approximate) natural gradient descent on nonlinear neural networks. In particular, we identified two conditions which guarantee global convergence: the Jacobian matrix with respect to the parameters has full row rank and is stable under perturbations around the initialization. Based on these insights, we improved the convergence rate of gradient descent by a factor of $O(\lambda_0/n)$ on two-layer ReLU networks by using natural gradient descent. Beyond that, we also showed that the improved convergence rates don't come at the expense of worse generalization.

Acknowledgements

We thank Jeffrey Z. HaoChen, Shengyang Sun and Mufan Li for helpful discussion. RG acknowledges support from the CIFAR Canadian AI Chairs program and the Ontario MRIS Early Researcher Award.

References

Madhu S Advani and Andrew M Saxe. High-dimensional dynamics of generalization error in neural networks.
arXiv preprint arXiv:1710.03667, 2017.\n\nZeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-\n\nparameterization. arXiv preprint arXiv:1811.03962, 2018.\n\nShun-ichi Amari. Neural learning in structured parameter spaces-natural riemannian gradient. In\n\nAdvances in neural information processing systems, pages 127\u2013133, 1997.\n\nShun-Ichi Amari. Natural gradient works ef\ufb01ciently in learning. Neural computation, 10(2):251\u2013276,\n\n1998.\n\nSanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for\n\ndeep nets via a compression approach. arXiv preprint arXiv:1802.05296, 2018.\n\nSanjeev Arora, Nadav Cohen, Noah Golowich, and Wei Hu. A convergence analysis of gradient\ndescent for deep linear neural networks. In International Conference on Learning Representations,\n2019a. URL https://openreview.net/forum?id=SkMQg3C5K7.\n\nSanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of\noptimization and generalization for overparameterized two-layer neural networks. arXiv preprint\narXiv:1901.08584, 2019b.\n\nPeter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and\n\nstructural results. Journal of Machine Learning Research, 3(Nov):463\u2013482, 2002.\n\nPeter L Bartlett, David P Helmbold, and Philip M Long. Gradient descent with identity initialization\nef\ufb01ciently learns positive-de\ufb01nite linear transformations by deep residual networks. Neural\ncomputation, pages 1\u201326.\n\nPeter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for\nneural networks. In Advances in Neural Information Processing Systems, pages 6240\u20136249, 2017.\nS. Becker and Y. LeCun. Improving the convergence of backpropagation learning with second order\n\nmethods. In Proceedings of the 1988 Connectionist Models Summer School, 1989.\n\nAlberto Bernacchia, Mate Lengyel, and Guillaume Hennequin. 
Exact natural gradient in deep linear networks and its application to the nonlinear case. In Advances in Neural Information Processing Systems, pages 5945–5954, 2018.

Avrim Blum and Ronald L. Rivest. Training a 3-node neural network is NP-complete. Neural Networks, 5:117–127, 1992.

Aleksandar Botev, Hippolyt Ritter, and David Barber. Practical Gauss-Newton optimisation for deep learning. In International Conference on Machine Learning, 2017.

Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.

Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with gaussian inputs. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 605–614. JMLR.org, 2017.

Tianle Cai, Ruiqi Gao, Jikai Hou, Siyu Chen, Dong Wang, Di He, Zhihua Zhang, and Liwei Wang. A gram-gauss-newton method learning overparameterized deep neural networks for regression problems. arXiv preprint arXiv:1905.11675, 2019.

Simon S Du, Jason D Lee, and Yuandong Tian. When is a convolutional filter easy to learn? arXiv preprint arXiv:1709.06129, 2017.

Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018a.

Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018b.

Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.

Gintare Karolina Dziugaite and Daniel M Roy. Data-dependent pac-bayes priors via differential privacy. In Advances in Neural Information Processing Systems, pages 8440–8450, 2018.

Jürgen Forster.
A linear lower bound on the unbounded error probabilistic communication complexity. Journal of Computer and System Sciences, 65(4):612–625, 2002.

Weihao Gao, Ashok Makkuva, Sewoong Oh, and Pramod Viswanath. Learning one-hidden-layer neural networks under general input distributions. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1950–1959, 2019.

Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797–842, 2015.

Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. arXiv preprint arXiv:1712.06541, 2017.

Roger Grosse and James Martens. A Kronecker-factored approximate Fisher matrix for convolution layers. In International Conference on Machine Learning, 2016.

Moritz Hardt and Tengyu Ma. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016.

Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8580–8589, 2018.

Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.

Kenji Kawaguchi and Yoshua Bengio. Depth with nonlinearity creates no bad local minima in resnets. arXiv preprint arXiv:1810.09038, 2018.

Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Jaehoon Lee, Lechao Xiao, Samuel S Schoenholz, Yasaman Bahri, Jascha Sohl-Dickstein, and Jeffrey Pennington.
Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720, 2019.

Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent converges to minimizers. arXiv preprint arXiv:1602.04915, 2016.

Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8168–8177, 2018.

Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with relu activation. In Advances in Neural Information Processing Systems, pages 597–607, 2017.

Kevin Luk and Roger Grosse. A coordinate-free construction of scalable natural gradient. arXiv preprint arXiv:1808.10340, 2018.

James Martens. Deep learning via hessian-free optimization. In ICML, volume 27, pages 735–742, 2010.

James Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.

James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In International Conference on Machine Learning, pages 2408–2417, 2015.

James Martens, Jimmy Ba, and Matt Johnson. Kronecker-factored curvature approximations for recurrent neural networks. In International Conference on Learning Representations, 2018.

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT Press, 2018.

Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376–1401, 2015.

Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A pac-bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564, 2017.

Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro.
The role of over-parametrization in generalization of neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=BygfghAcYX.

Quynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2603–2612. JMLR.org, 2017.

Yann Ollivier. Riemannian metrics for neural networks I: Feedforward networks. Information and Inference, 4:108–153, 2015.

Samet Oymak and Mahdi Soltanolkotabi. Towards moderate overparameterization: global convergence guarantees for training shallow neural networks. arXiv preprint arXiv:1902.04674, 2019.

Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013.

Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.

Issai Schur. Bemerkungen zur Theorie der beschränkten Bilinearformen mit unendlich vielen Veränderlichen. Journal für die reine und angewandte Mathematik, 140:1–28, 1911.

Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.

Yuandong Tian. An analytical formula of population gradient for two-layered relu network and its applications in convergence and critical point analysis. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3404–3413. JMLR.org, 2017.

Joel A Tropp et al. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230, 2015.

Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning.
In Advances in Neural Information Processing Systems, pages 4148–4158, 2017.

Xiaoxia Wu, Simon S Du, and Rachel Ward. Global convergence of adaptive gradient methods for an over-parameterized neural network. arXiv preprint arXiv:1902.07111, 2019.

Yuhuai Wu, Elman Mansimov, Shun Liao, Roger Grosse, and Jimmy Ba. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Neural Information Processing Systems, 2017.

Bo Xie, Yingyu Liang, and Le Song. Diverse neural network learns true target functions. arXiv preprint arXiv:1611.03131, 2016.

Chulhee Yun, Suvrit Sra, and Ali Jadbabaie. Small nonlinearities in activation functions create bad local minima in neural networks. arXiv preprint arXiv:1802.03487, 2018.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

Guodong Zhang, Shengyang Sun, David Duvenaud, and Roger Grosse. Noisy natural gradient as variational inference. In International Conference on Machine Learning, pages 5847–5856, 2018a.

Xiao Zhang, Yaodong Yu, Lingxiao Wang, and Quanquan Gu. Learning one-hidden-layer relu networks via gradient descent. arXiv preprint arXiv:1806.07808, 2018b.

Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep relu networks. arXiv preprint arXiv:1811.08888, 2018.