{"title": "Exact natural gradient in deep linear networks and its application to the nonlinear case", "book": "Advances in Neural Information Processing Systems", "page_first": 5941, "page_last": 5950, "abstract": "Stochastic gradient descent (SGD) remains the method of choice for deep learning, despite the limitations arising for ill-behaved objective functions. In cases where it could be estimated, the natural gradient has proven very effective at mitigating the catastrophic effects of pathological curvature in the objective function, but little is known theoretically about its convergence properties, and it has yet to find a practical implementation that would scale to very deep and large networks. Here, we derive an exact expression for the natural gradient in deep linear networks, which exhibit pathological curvature similar to the nonlinear case. We provide for the first time an analytical solution for its convergence rate, showing that the loss decreases exponentially to the global minimum in parameter space. Our expression for the natural gradient is surprisingly simple, computationally tractable, and explains why some approximations proposed previously work well in practice. 
This opens new avenues for approximating the natural gradient in the nonlinear case, and we show in preliminary experiments that our online natural gradient descent outperforms SGD on MNIST autoencoding while sharing its computational simplicity.", "full_text": "Exact natural gradient in deep linear networks and\n\napplication to the nonlinear case\n\nAlberto Bernacchia\n\nDepartment of Engineering\nUniversity of Cambridge\nCambridge, UK, CB2 1PZ\n\nab2347@cam.ac.uk\n\nM\u00e1t\u00e9 Lengyel\n\nDepartment of Engineering\nUniversity of Cambridge\nCambridge CB2 1PZ, UK\n\nDepartment of Cognitive Science\n\nCentral European University\nBudapest H-1051, Hungary\n\nm.lengyel@eng.cam.ac.uk\n\nGuillaume Hennequin\n\nDepartment of Engineering\nUniversity of Cambridge\nCambridge, UK, CB2 1PZ\n\ng.hennequin@eng.cam.ac.uk\n\nAbstract\n\nStochastic gradient descent (SGD) remains the method of choice for deep learning,\ndespite the limitations arising for ill-behaved objective functions. In cases where it\ncould be estimated, the natural gradient has proven very effective at mitigating the\ncatastrophic effects of pathological curvature in the objective function, but little\nis known theoretically about its convergence properties, and it has yet to \ufb01nd a\npractical implementation that would scale to very deep and large networks. Here,\nwe derive an exact expression for the natural gradient in deep linear networks, which\nexhibit pathological curvature similar to the nonlinear case. We provide for the \ufb01rst\ntime an analytical solution for its convergence rate, showing that the loss decreases\nexponentially to the global minimum in parameter space. Our expression for the\nnatural gradient is surprisingly simple, computationally tractable, and explains why\nsome approximations proposed previously work well in practice. 
This opens new\navenues for approximating the natural gradient in the nonlinear case, and we show\nin preliminary experiments that our online natural gradient descent outperforms\nSGD on MNIST autoencoding while sharing its computational simplicity.\n\n1\n\nIntroduction\n\nStochastic gradient descent (SGD) is used ubiquitously to train deep neural networks, due to its\nlow computational cost and ease of implementation. However, long narrow valleys, saddle points\nand plateaus in the objective function dramatically slow down learning and often give the illusory\nimpression of having reached a local minimum [Martens, 2010; Dauphin et al., 2014]. The natural\ngradient is an appealing alternative to the standard gradient: it accelerates convergence by using\ncurvature information, it represents the steepest descent direction in the space of distributions, and is\ninvariant to reparametrization of the network [Amari, 1998; Le Roux et al., 2008]. However, besides\nsome numerical evidence, the exact convergence rate of natural gradient remains unknown, and its\nimplementation remains prohibitive due to its very expensive numerical computation [Pascanu and\nBengio, 2013; Martens, 2014; Ollivier, 2015].\nIn order to gain theoretical insight into the convergence rate of natural gradient descent, we analyze a\ndeep (multilayer) linear network. While deep linear networks have obviously no practical relevance\n(they can only perform linear regression and are grossly over-parameterized, see below), their\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\foptimization is non-convex and is plagued with similar pathological curvature effects as their nonlinear\ncounterparts. Critically, the dynamics of learning in linear networks are exactly solvable, making\nthem an ideal case study to understand the essence of the deep learning problem and \ufb01nd ef\ufb01cient\nsolutions [Saxe et al., 2013]. 
Here, we derive an exact expression for the natural gradient in deep\nlinear networks, from which we garner two major insights. First, we prove that the exact natural\ngradient leads to exponentially fast convergence towards the minimum achievable loss. This, to our\nknowledge, is the \ufb01rst case where a functional form for the natural gradient\u2019s convergence rate has\nbeen obtained for an arbitrarily deep multilayer network, and it con\ufb01rms the long-standing conjecture\nthat the natural gradient mitigates the problem of pathological curvature [Pascanu and Bengio, 2013;\nMartens, 2014] (and indeed, annihilates it completely in the linear case). Second, our exact solution\nreveals that the natural gradient can be computed much more ef\ufb01ciently than previously thought. By\nde\ufb01nition, the natural gradient is the product of the inverse of the P \u00d7 P Fisher information matrix F\nwith the P -dimensional gradient vector, where P is the number of network parameters (often in the\nmillions) [Yang and Amari, 1998; Amari et al., 2000; Park et al., 2000]. In contrast, our expression\nexploits the structure of degeneracies in F and requires computing a similar matrix-vector product\nbut in dimension N, the number of neurons in each layer (in the tens/hundreds). Although this simple\nexpression does not formally apply to the nonlinear case, we adapt it to nonlinear deep networks and\nshow that it outperforms SGD on the MNIST autoencoder problem.\nOur exact expression for the natural gradient suggests retrospective theoretical justi\ufb01cations for\nseveral previously proposed modi\ufb01cations of standard gradient descent that empirically improved its\nconvergence. 
In particular, we revisit previous approximations of the Fisher matrix (in the nonlinear\ncase) based on block-diagonal truncations, and provide a possible explanation for their performance\n(K-FAC, [Martens and Grosse, 2015; Grosse and Martens, 2016; Ba et al., 2016], see also [Heskes,\n2000; Povey et al., 2014; Desjardins et al., 2015]). We show that, even in the simple linear case, the\nexact inverse Fisher matrix is not block-diagonal and the contributions of the off-diagonal blocks to\nthe natural gradient have the same order of magnitude as the on-diagonal blocks. Therefore, contrary\nto what has been proposed previously, the off-diagonal blocks cannot in principle be neglected.\nInstead, our analysis reveals that, when taking the inverse and multiplying by the gradient, the\noff-diagonal blocks of F contribute the exact same terms as the diagonal blocks. This observation\nis at the core of the surprisingly ef\ufb01cient yet exact way of computing the natural gradient that we\npropose here.\nFinally, our algebraic expression for the natural gradient exhibits similarities with recent, biologically-\ninspired backpropagation algorithms. To obtain the natural gradient, we show that the error must\nback-propagate through the (pseudo-)inverses of the weight matrices, rather than their transposes.\nMultiplication by the matrix pseudo-inverse emerges automatically in algorithms where both forward\nand backward weights are free parameters [Lillicrap et al., 2016; Luo et al., 2017].\n\n2 Natural gradient in deep networks\n\nWe consider the problem of learning an input-output relationship on the basis of observed data\nsamples {(xi, yi)} (input-output pairs) drawn from an underlying, unknown distribution p(cid:63)(x, y).\nThis is achieved by a deep discriminative model, which, given an input x, speci\ufb01es a conditional\ndensity q\u03b8(y|x) over possible outputs y, parameterized by the output layer of a deep network with a\nset of parameters \u03b8. 
Specifically, the input vector x ∈ ℝ^{n_0} propagates through a network of L layers according to:

x_i = φ_i(W_i x_{i-1} + b_i),   i = 1, …, L   (1)

where x_i ∈ ℝ^{n_i} is the output of layer i (which then serves as an input to layer i + 1), W_i ∈ ℝ^{n_i × n_{i-1}} is a weight matrix into layer i, b_i ∈ ℝ^{n_i} is a vector of bias parameters, φ_i is a function applied element-wise to its vector argument, and x_0 is defined as equal to the input x. The set of parameters θ includes all the elements of the weight matrices and bias vectors of all layers, for a total of P parameters. For ease of notation, in the following we fold the bias vector b_i of each layer into W_i as an additional column, and augment the activation vector x_{i-1} accordingly with one constant component equal to one. The output of the last layer is x_L: it depends on all parameters θ and determines the conditional density q_θ(y|x), which we assume here is a Gaussian with a mean determined by x_L and a constant covariance matrix, Σ̃. Our theoretical results will be obtained for linear networks (φ_i(x) = x and b_i = 0), but we later return to nonlinear networks in numerical simulations.
The above model specifies a joint distribution over input/output pairs, p_θ(x, y) = q_θ(y|x) q⋆(x), where q⋆(x) = ∫ dy p⋆(x, y) is the marginal distribution of the input and does not depend on the parameters θ. The network is trained via maximum likelihood, i.e. by minimizing the following loss function:

L(θ) = ⟨− log p_θ(x, y)⟩_{p⋆} = ⟨− log q_θ(y|x)⟩_{p⋆} + const   (2)

where the average is over the true distribution p⋆(x, y). In the following, we will use the shorthand notation ℓ(θ|x, y) = log p_θ(x, y). 
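As a concrete illustration of the forward pass of Eq. 1, with the bias folded into the weight matrix acting on an activation vector augmented with a constant 1, a minimal sketch follows; the layer sizes, the tanh nonlinearities, and all variable names are illustrative assumptions, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, weights, phis):
    """Propagate x through the network of Eq. 1.

    Each W in `weights` is n_i x (n_{i-1} + 1): the extra column holds the
    bias b_i, and the activation is augmented with a constant component 1.
    """
    activations = [x]
    for W, phi in zip(weights, phis):
        x_aug = np.append(x, 1.0)      # augment with the constant component
        x = phi(W @ x_aug)             # x_i = phi_i(W_i x_{i-1} + b_i)
        activations.append(x)
    return activations

# a small 3-layer network with illustrative sizes n = (4, 5, 3, 2)
sizes = [4, 5, 3, 2]
weights = [rng.standard_normal((n_out, n_in + 1))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
phis = [np.tanh, np.tanh, lambda v: v]  # linear read-out layer

acts = forward(rng.standard_normal(sizes[0]), weights, phis)
```

The list `acts` holds x_0, …, x_L, which is also what the gradient computations later in the paper reuse.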
Note that, in this setting, maximum likelihood is equivalent to minimizing the KL divergence D_KL(p⋆ ‖ p_θ) between the true distribution p⋆(x, y) and the model distribution p_θ(x, y).
A common way of minimizing L(θ) is gradient descent, i.e. parameter updates of the form:

dθ/dt = ⟨∂ℓ(θ|x, y)/∂θ⟩_{p⋆} ∝ −∂L/∂θ   (3)

where t denotes time elapsed in the optimization process. Although the theory of natural gradient we develop below applies to this continuous-time formulation [Mandt et al., 2017], numerical experiments are performed by discretizing Eq. 3 and setting a finite learning rate parameter. The dynamics of Eq. 3 are guaranteed to decrease the loss function in continuous time when the expectation over p⋆ can be evaluated exactly; in practice, these dynamics are approximated by sampling from p⋆ using a batch of training data points, and using a small but finite time step (learning rate) – this is SGD.
The natural gradient corresponds to a modification of Eq. 3, which consists of multiplying the (negative) gradient by the inverse of the Fisher information matrix F:

dθ/dt ∝ −F(θ)^{-1} ∂L/∂θ   (4)

where the Fisher information matrix F ∈ ℝ^{P×P} is defined as

F(θ) = ⟨ (∂ℓ/∂θ) (∂ℓ/∂θ)^T ⟩_{p_θ}   (5)

Note that the average is taken over the model distribution p_θ(x, y) = q_θ(y|x) q⋆(x), rather than the true distribution p⋆(x, y). Since the Fisher matrix is positive definite, the natural gradient also guarantees decreasing loss in continuous time. 
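For intuition, the Fisher matrix of Eq. 5 can be estimated by Monte Carlo for a toy one-layer linear-Gaussian model, taking care to sample outputs from the model distribution p_θ rather than from the data. The model, sizes, and sample count below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_out, S = 3, 2, 20000
W = rng.standard_normal((n_out, n_in))

# inputs from q*(x); outputs sampled from the MODEL q_theta(y|x) = N(Wx, I)
X = rng.standard_normal((S, n_in))
Y = X @ W.T + rng.standard_normal((S, n_out))

# per-sample gradient of l(theta|x,y) = -0.5 ||y - Wx||^2 w.r.t. vec(W):
# the outer product (y - Wx) x^T, flattened row-major
G = np.einsum('si,sj->sij', Y - X @ W.T, X).reshape(S, -1)

# Eq. 5: F = < (dl/dtheta)(dl/dtheta)^T >_{p_theta}
F = G.T @ G / S

# for this model, F factorizes (up to vec ordering) as I kron <x x^T>,
# and is in particular symmetric positive semi-definite
eigvals = np.linalg.eigvalsh(F)
```

Sampling y from p_θ (not from the training targets) is exactly the distinction the paper draws between averages over p_θ and over p⋆.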
The Fisher information matrix quantifies the accuracy with which a set of parameters can be estimated from observed data, and the natural gradient thus rescales the standard gradient accordingly. The natural gradient has a number of desirable properties: it corresponds to steepest gradient descent in the space of distributions p_θ(x, y), it is parameterization-invariant, and affords good generalization performance [Amari, 1998; Le Roux et al., 2008]. Moreover, natural gradient descent can be regarded as a second-order method in the space of parameters (e.g. it reduces to the Gauss-Newton method in some cases [Pascanu and Bengio, 2013; Martens, 2014]).

3 Exact natural gradient for deep linear networks and quadratic loss

In this paper, we focus on regression problems where the conditional model distribution q_θ(y|x) is Gaussian, with a mean equal to the output x_L of the last layer of a deep network and some covariance Σ̃. Note that other types of distributions can also be used, e.g. a categorical distribution parameterized by the output of a final softmax layer to address classification problems. Using Eq. 2, the loss function for a Gaussian distribution is equal to the mean squared error weighted by the inverse covariance:

L = (1/2) ⟨(y − x_L)^T Σ̃^{-1} (y − x_L)⟩_{p⋆} + const   (6)

where the loss depends on the parameters of the deep network through the conditional mean x_L, and the constant includes all the terms that do not depend on x_L and thus on the network parameters. Using the expression for the loss, we can compute the gradient with respect to the weight matrix into layer i as

∂L/∂W_i = −⟨e_i x_{i-1}^T⟩_{p⋆}   (7)

where e_i ∈ ℝ^{n_i} is the error propagated backward to layer i (see below, Eq. 8), and x_{i-1} is the activation of layer i − 1 propagated forward (Eq. 1). 
Note that this expression for the gradient is a matrix of the size of W_i. The expression for the backpropagated error is given by

e_L = φ′_L ∘ [Σ̃^{-1} (y − x_L)]
e_i = φ′_i ∘ [W_{i+1}^T e_{i+1}],   i = 1, …, L − 1   (8)

where the symbol ∘ denotes the element-wise (Hadamard) vector product, and φ′_i denotes the scalar derivative of φ_i, evaluated at its argument defined in Eq. 1. The gradient is computationally cheap to evaluate, since a single forward pass is used to compute the activations x_i of all layers, and a single backward pass is used to compute the corresponding errors e_i.
It is currently unknown if the natural gradient affords an expression as simple and computationally cheap as those used to evaluate the standard gradient (Eqs. 7-8). Here, we derive such an expression in the case of a deep linear network. We thus take φ_i(x) = x (∀i), and set the bias vectors to zero, without loss of generality if the input has zero mean, ⟨x⟩_{q⋆} = 0. Using Eq. 1, the activation of the last layer is therefore equal to

x_L = (W_L · W_{L-1} ··· W_2 · W_1) x = W x   (9)

where we defined the total weight matrix product W in the last expression, equal to the chain of matrix multiplications along all layers 1, 2, …, L. This expression makes obvious the uselessness of having multiple, successive linear layers, as their combined effect reduces to a single one. However, the dynamics of learning (e.g. 
by gradient descent) in each layer are highly nonlinear, while being amenable to analytical solutions [Saxe et al., 2013].
In the Supplementary Material, we calculate the Fisher information matrix F for a deep linear network. As expected, the Fisher matrix is singular, due to the aforementioned parameter redundancies, and therefore the model cannot be identified in certain directions in parameter space. In particular, the total number of parameters is P = Σ_{i=1}^{L} n_i n_{i-1}, where n_i × n_{i-1} are the dimensions of the weight matrix W_i of layer i. However, there are only n_L n_0 independent parameters, which are the dimensions of the total product of weight matrices, W, in Eq. 9. Thus, the Fisher matrix is of rank n_L n_0 at most, and is therefore necessarily singular.
Due to the above singularity, the matrix inversion prescribed by Eq. 4 to obtain the natural gradient must be replaced by a generalized inverse (indeed, this is the appropriate way of dealing with this singularity, and it follows from the interpretation of the Fisher matrix as a metric in the space of distributions [Pascanu and Bengio, 2013]). Note that there exist infinitely many generalized inverses. Our main result is proving that under the natural gradient, the dynamics of p_θ, and therefore also the dynamics of the loss function, are identical for all possible generalized inverses of the Fisher matrix (Supplementary Material). Moreover, any choice thereof leads to exponentially fast convergence towards the minimum loss. Critically though, those possible generalized inverses might differ greatly in the simplicity and associated computational cost of the resulting parameter updates. 
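The standard backpropagation machinery of Eqs. 7-8 (with Σ̃ = I and biases omitted, as in the linear analysis) can be sketched for a small tanh network and checked against finite differences; all sizes and names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
sizes = [3, 4, 4, 2]                       # illustrative layer sizes
Ws = [rng.standard_normal((o, i)) * 0.5 for i, o in zip(sizes[:-1], sizes[1:])]
x0 = rng.standard_normal(sizes[0])
y = rng.standard_normal(sizes[-1])

def forward(Ws, x0):
    xs = [x0]
    for W in Ws:
        xs.append(np.tanh(W @ xs[-1]))
    return xs

def loss(Ws):
    return 0.5 * np.sum((y - forward(Ws, x0)[-1]) ** 2)

# backward pass, Eq. 8 (Sigma~ = I): e_L = phi' o (y - x_L),
# e_i = phi' o (W_{i+1}^T e_{i+1}); gradient, Eq. 7: dL/dW_i = -e_i x_{i-1}^T
xs = forward(Ws, x0)
e = (1 - xs[-1] ** 2) * (y - xs[-1])       # tanh' = 1 - tanh^2
grads = []
for i in range(len(Ws) - 1, -1, -1):
    grads.insert(0, -np.outer(e, xs[i]))
    if i > 0:
        e = (1 - xs[i] ** 2) * (Ws[i].T @ e)

# finite-difference check on one weight entry
eps = 1e-6
Wp = [W.copy() for W in Ws]; Wp[1][0, 0] += eps
Wm = [W.copy() for W in Ws]; Wm[1][0, 0] -= eps
fd = (loss(Wp) - loss(Wm)) / (2 * eps)
```

A single forward and a single backward pass produce all layer gradients, which is the cheapness the text refers to.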
We find that one particular generalized inverse leads to the following, remarkably simple expression:

dW_i/dt ∝ (1/L) ⟨e_i e_i^T⟩_{p_θ}^{-1} ⟨e_i x_{i-1}^T⟩_{p⋆} ⟨x_{i-1} x_{i-1}^T⟩_{p_θ}^{-1}   (10)

This expression is equal to the standard gradient (middle term, cf. Eq. 7), multiplied by the inverse covariance of both the backward error e_i (left) and the forward activation x_{i-1} (right). Note that these covariances correspond to averages over the model distribution p_θ(x, y), and not the true distribution p⋆(x, y). When the inverses of those covariances do not exist, it is their Moore-Penrose pseudoinverse that must be used instead (Supplementary Material). As expected for the natural gradient, Eq. 10 is dimensionally consistent (weight updates have the same "units" as the weight matrices themselves), and is covariant for linear transformations.
At first glance, Eq. 10 requires two matrix multiplications and inversions per layer, which makes it more costly than standard gradient descent. However, if the expectation over p⋆ is approximated by sampling as in SGD, then one only needs to perform two matrix-vector products, and make rank-1 updates of W_i, which brings the computational cost down to that of SGD. Finally, one can either (pseudo-)invert the two covariance matrices in Eq. 10, e.g. using an SVD (scales poorly with layer size, but otherwise cache efficient), or directly estimate their inverses using Sherman-Morrison updates (in which case the complexity scales with both layer size and network depth in the same way as for SGD). We discuss these practical issues further below.

4 Analytic expression for convergence rate

In this section, we provide a simplified derivation of the exponential decrease of the loss function under the natural gradient updates given by Eq. 
10, which are based on a particular form of the generalized inverse of F. The equation for the natural gradient is given by Eq. 16 below, which corresponds to Eq. 34 in the Supplementary Material. A more general derivation of the exponential decrease of the loss function is given in the Supplementary Material, where we show that the same exponential decay of the loss holds for all possible generalized inverses.
Using Eqs. 1 and 8, the forward activation and backward error in a linear network are given by

x_{i-1} = (W_{i-1} ··· W_1) x   (11)

e_i = (W_L ··· W_{i+1})^T Σ̃^{-1} (y − x_L)   (12)

Using Eq. 7, the gradient of the loss function is equal to the averaged outer product of the backward error and the forward activity, namely

∂L/∂W_i = −⟨e_i x_{i-1}^T⟩_{p⋆} = −(W_L ··· W_{i+1})^T Σ̃^{-1} ⟨(y − x_L) x^T⟩_{p⋆} (W_{i-1} ··· W_1)^T   (13)

In order to derive the natural gradient update, we calculate the covariance matrices in Eq. 10. The covariance of the backward error is equal to

⟨e_i e_i^T⟩_{p_θ} = (W_L ··· W_{i+1})^T Σ̃^{-1} ⟨(y − x_L)(y − x_L)^T⟩_{p_θ} Σ̃^{-1} (W_L ··· W_{i+1})
              = (W_L ··· W_{i+1})^T Σ̃^{-1} (W_L ··· W_{i+1})   (14)

The second line results from averaging over the model distribution p_θ(x, y) = q_θ(y|x) q⋆(x): the first average, over the conditional distribution q_θ(y|x) = N(y; x_L, Σ̃), yields the covariance Σ̃ itself, and the latter does not depend on the input (making the average over q⋆(x) unnecessary). 
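Eq. 14 can be checked numerically for a small linear network by sampling y from the model distribution; the two-layer architecture, the sizes, and the choice Σ̃ = I below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))          # linear network, Sigma~ = I
X = rng.standard_normal((50000, 3))
XL = X @ (W2 @ W1).T                      # x_L = W2 W1 x

# sample y from the model q_theta(y|x) = N(x_L, I), then Eq. 12:
# e_1 = W2^T Sigma~^{-1} (y - x_L); rows of e1 are the samples e_1^T
Y = XL + rng.standard_normal(XL.shape)
e1 = (Y - XL) @ W2

cov_mc = e1.T @ e1 / len(e1)              # Monte Carlo <e_1 e_1^T>_{p_theta}
cov_exact = W2.T @ W2                     # Eq. 14 with Sigma~ = I
```

Note that `cov_exact` is independent of the input distribution, as the derivation above states.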
Similar arguments lead to the covariance of the forward activity:

⟨x_{i-1} x_{i-1}^T⟩_{p_θ} = (W_{i-1} ··· W_1) ⟨x x^T⟩ (W_{i-1} ··· W_1)^T = (W_{i-1} ··· W_1) Σ (W_{i-1} ··· W_1)^T   (15)

where Σ = ⟨x x^T⟩_{q⋆} is the covariance of the input (the average is taken over the model distribution p_θ, but x x^T depends on the input distribution q⋆ only).
In order to compute the natural gradient of Eq. 10, we need to invert the covariances in Eqs. 14 and 15. However, they may not be invertible, except in special cases, such as when all weight matrices are square and invertible, and when both Σ and Σ̃ are full rank. We consider this simple case first, and then address the general case of non-square matrices. If we can explicitly invert the relevant covariance matrices, substituting them into Eq. 10, along with Eq. 13, yields updates of the form

dW_i/dt ∝ (1/L) (W_L ··· W_{i+1})^{-1} ⟨(y − x_L) x_0^T⟩_{p⋆} Σ^{-1} (W_{i-1} ··· W_1)^{-1}   (16)

This equation does not immediately suggest any advantage with respect to standard gradient descent. However, it is revealing to derive the dynamics of the total weight matrix product, W = W_L ··· W_1, which represents the net input-output mapping performed by the network. Using the product rule of differentiation:

dW/dt = Σ_{i=1}^{L} (W_L ··· W_{i+1}) (dW_i/dt) (W_{i-1} ··· W_1)   (17)

Substituting the expression for the update, Eq. 
16, and using x_L = W x_0, we obtain

dW/dt ∝ −W + ⟨y x^T⟩_{p⋆} Σ^{-1}   (18)

Thus, under natural gradient descent in continuous time, the total weight matrix obeys first-order linear dynamics, and therefore converges exponentially fast towards ⟨y x^T⟩ Σ^{-1}, which is indeed the least-squares solution to the linear regression problem [Bishop, 2016]. Since the loss is a quadratic function of W (cf. Eq. 6), Eq. 18 also proves that the loss decays exponentially towards its minimum under natural gradient descent. This result holds provided that the network parameters are not initialized at a saddle point (for example, weights should not be initialized at zero).
When the covariances in Eqs. 14 and 15 cannot be inverted, e.g. when the weight matrices are not square (the network is contracting, expanding, or contains a bottleneck), we show in the Supplementary Material (Eq. 45) that the Moore-Penrose pseudo-inverse must be used instead, inducing similar dynamics for W:

dW/dt ∝ −W + (1/L) Σ_{i=1}^{L} P_i^a ⟨y x^T⟩_{p⋆} Σ^{-1} P_i^b   (19)

Here, P_i^a and P_i^b are projection matrices that express the way in which the network architecture constrains the space of solutions that the network is allowed to reach. For example, if the network has a bottleneck, the total matrix W will only be able to attain a low-rank approximation of the optimal solution to the regression problem, ⟨y x^T⟩ Σ^{-1}. 
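The first-order dynamics of Eq. 18, dW/dt ∝ −W + ⟨y x^T⟩ Σ^{-1}, can be simulated directly by Euler integration, confirming exponential convergence of W to the least-squares solution; the toy dimensions, step size, and covariances below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
n_in, n_out = 3, 2
Sigma = np.diag([1.0, 0.5, 0.25])          # input covariance (illustrative)
Cyx = rng.standard_normal((n_out, n_in))   # <y x^T> under p*
W_star = Cyx @ np.linalg.inv(Sigma)        # least-squares solution

W = rng.standard_normal((n_out, n_in))     # non-zero initialization
dt = 0.1
errs = []
for _ in range(200):
    W = W + dt * (W_star - W)              # Euler step on Eq. 18
    errs.append(np.linalg.norm(W - W_star))
```

In these linear dynamics the error contracts by a constant factor (1 − dt) per step, i.e. the loss decays exponentially regardless of the conditioning of Σ.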
Note, for example, that P_i^a = I (the identity matrix) for a non-expanding network, while P_i^b = I for a non-contracting network.

5 Implementation of natural gradient descent and experiments

Similar to SGD, we approximate the average over p⋆ in Eq. 7 by using mini-batches of size M. For each input mini-batch x, we use the forward activations (already calculated in the forward pass to get the gradient information) to estimate Λ_i = ⟨x_{i-1} x_{i-1}^T⟩_{p_θ}. Then, for the same input mini-batch, we also sample K times from the model predictive distribution q_θ(y|x), use these outputs as targets, and perform the corresponding K backward passes to obtain KM backpropagated error samples used to estimate Λ̃_i = ⟨e_i e_i^T⟩_{p_θ}. Note that the true outputs of the training set are only used to compute (a stochastic estimate of) the gradient of the loss function, but never to estimate Λ_i or Λ̃_i (indeed, these are averages over p_θ, not p⋆). In practice, we find that K = 1 suffices.
Weights are updated according to Eq. 10, discretized using a small time step (learning rate α). Inspired by the interpretation of NGD as a second-order method [Martens, 2014], we also incorporate a Levenberg-Marquardt-type damping scheme: at each iteration k, we add √λ_k I to both covariance matrices Λ_i and Λ̃_i prior to inverting them, where λ_k is an adaptive damping factor. Note that this is not equivalent to adding λ_k to the Fisher matrix. Nevertheless, it does become equivalent to a small SGD step in the limit of a large damping factor λ_k. 
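The estimation procedure above (Λ_i from the forward activations, Λ̃_i from errors backpropagated from K = 1 model-sampled targets, and the true targets used only in the gradient term) might be sketched for a single linear layer as follows; all names, sizes, and the damping value are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(5)
M, n0, n1 = 1000, 6, 4                     # minibatch size and layer sizes
W1 = rng.standard_normal((n1, n0))
X = rng.standard_normal((M, n0))           # input minibatch from q*(x)

# forward pass (one linear layer for brevity); Lambda from activations
H = X @ W1.T
Lam = X.T @ X / M                          # Lambda_1 = <x_0 x_0^T>

# K = 1: sample targets from the model predictive q_theta(y|x) = N(h, I)
# and backpropagate; for this one-layer net, e_1 = y_model - h
Y_model = H + rng.standard_normal(H.shape)
E = Y_model - H
Lam_tilde = E.T @ E / M                    # Lambda~_1 = <e_1 e_1^T>

# damped update in the spirit of Eqs. 10 and 20, with TRUE targets entering
# only the gradient term (illustrative damping lam = 1e-3)
Y_true = X @ rng.standard_normal((n1, n0)).T
G = (Y_true - H).T @ X / M                 # <e_1 x_0^T>_{p*}
lam = 1e-3
dW = np.linalg.solve(Lam_tilde + np.sqrt(lam) * np.eye(n1),
                     G @ np.linalg.inv(Lam + np.sqrt(lam) * np.eye(n0)))
```

Because Λ̃ is built from model-sampled errors (here standard normal by construction), it concentrates around the identity; the true targets never enter the covariance estimates.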
Therefore, at iteration k we update the synaptic weights in layer i according to

ΔW_i = (α/L) (Λ̃_i + √λ_k I)^{-1} ⟨e_i x_{i-1}^T⟩_{p⋆} (Λ_i + √λ_k I)^{-1}   (20)

We update λ_k in each iteration to reflect the ratio ρ_k between i) the actual decrease in the loss resulting from the latest damped NG parameter update, and ii) the decrease predicted by a quadratic approximation to the loss¹. The damping factor is updated as follows:

λ_{k+1} = (3/2) λ_k  if ρ_k < 0.25;   λ_{k+1} = (2/3) λ_k  if ρ_k > 0.75   (21)

¹Here, the quadratic approximation is implicitly defined as the quadratic function whose minimization by the Newton method would require a step in the direction of ΔW_i, the momentary update taken by our damped NGD step. The predicted decrease in loss under such a quadratic approximation is cheap to compute: if ΔW_i is the NG update for layer i, then the predicted decrease in the loss is given by (−α + α²/2) Σ_i tr(ΔW_i^T ⟨e_i x_{i-1}^T⟩_{p⋆}).

Figure 1: Natural gradient in deep networks. (A–C) Dynamics of the loss function under optimization with SGD (black) and NGD (red), for three deep linear networks with different architectures (see main text for details). Training time is reported both as the number of training examples seen so far (top) and wall-clock time (bottom). Both optimization algorithms start from the same initial network parameters. Dashed gray lines denote the smallest possible loss, determined by the variance of the true underlying conditional density of y|x. (D) Test error for MNIST autoencoding in a deep nonlinear network (see main text for details); colors are the same as in (A–C). SGD parameters: M = 20, learning rate α optimized by grid search (A and B: α = 0.08; C: α = 0.02; D: α = 0.04). NGD parameters: α = 1, M = 1000.

We experimented with deep networks (linear and nonlinear) trained on regression problems (Fig. 1). First, we trained three linear networks to recover the mappings defined by random networks in their model class. The first network (Fig. 1A) had L = 16 layers of the same size n_i = 20. The second (Fig. 1B) had L = 16 layers, of size 20 (input), 30, 40, …, 100, …, 30, 20. While these two networks were over-parameterized, our third network (Fig. 1C) was an under-parameterized bottleneck with steep fan-in and fan-out, with L = 12 layers of size 200 (input), 80, 34, 20, 10, 5, 2, 5, …, 80, 200. For each architecture, we generated a network with random parameters θ⋆ and used it as the reference mapping to be learned. We generated a training set of 10^4 examples, and a test set of 10^3 examples, by propagating inputs drawn from a correlated Gaussian distribution q⋆(x) = N(x; 0, Σ) through the network, and sampling outputs from a Gaussian conditional distribution q_{θ⋆}(y|x) with covariance Σ̃ = 10^{-6} I. We generated Σ to have random (orthogonal) eigenvectors and eigenvalues that decayed exponentially as e^{-5i/n_0}.
We compared SGD (with minibatch size M = 20, and learning rate optimized via grid search) and online natural gradient (with minibatch size M = 1000). For all tasks, SGD made fast initial progress, but slowed down dramatically very soon. In contrast, as predicted by our theory, natural gradient descent caused the test error to decrease exponentially and reach the minimum achievable loss (limited by Σ̃) after only a few passes through the training set (Fig. 
1A-C, top).
As a preliminary extension to the nonlinear case, we also trained a nonlinear network with eight layers of size 784 (input), 400, 200, 100, 50, 100, …, 784, to perform autoencoding of the MNIST dataset (Fig. 1D). All layers had φ_i(x) = tanh(x), except for the final linear layer. We compared standard SGD (with M = 20 and α optimized by grid search) to our proposed natural gradient method (Eq. 10, with adaptive damping and no further modification). We set α = 1, M = 1000 and K = 1. Despite our NGD steps only approximating the true natural gradient, NGD outperformed SGD in terms of data efficiency (Fig. 1D, top). Owing to the size of the input layer, our implementation of NGD via direct inversion of the relevant covariance matrices outperformed SGD only modestly in wall-clock time (Fig. 1D, bottom). We discuss alternative implementations below.

6 Related work

Diagonal approximations   As reviewed in Martens [2014], some recent popular methods can be interpreted as diagonal approximations to the Fisher matrix F, such as AdaGrad [Duchi et al., 2011], AdaDelta [Zeiler, 2012], and Adam [Kingma and Ba, 2014]. These methods are computationally cheap, but do not capture pairwise dependencies between parameters. In theory, faster learning could be obtained by leveraging full curvature information, which requires moving away from a purely diagonal approximation of F. 
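To see why purely diagonal approximations can fall short, consider a toy quadratic loss with strongly correlated curvature: a diagonal preconditioner only rescales the gradient componentwise and cannot rotate it toward the Newton/natural direction, while the full inverse does. The 2×2 numbers below are purely illustrative:

```python
import numpy as np

F = np.array([[1.0, 0.95],
              [0.95, 1.0]])                # strong pairwise dependence
g = np.array([1.0, 0.0])                   # gradient at the current point

full = np.linalg.solve(F, g)               # full preconditioned direction
diag = g / np.diag(F)                      # diagonal (AdaGrad-like) rescaling

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# for the loss 0.5 d^T F d with gradient g, the ideal step is F^{-1} g;
# the diagonal step stays aligned with g and misses the rotation entirely
align_full = cos(full, np.linalg.solve(F, g))
align_diag = cos(diag, np.linalg.solve(F, g))
```

Here the diagonal of F is the identity, so the diagonal method is blind to the off-diagonal curvature that dominates the problem.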
However, this is computationally intensive for at least two reasons: i) the Fisher matrix is large, often impossible to store, let alone to invert; and ii) even if one could compute F^{-1}, computing the natural gradient would still require O(P^2) operations (where P is the number of parameters). Much of the recent literature has focused on ways of mitigating this complexity. For example, in cases where it can be stored, F^{-1} can be estimated directly using the Sherman-Morrison lemma [Amari et al., 2000]. When it cannot be stored, one can approximate the natural gradient directly via conjugate gradients, exploiting fast methods for computing Fv products (as in Hessian-free and Gauss-Newton optimization [Martens, 2010; Pascanu and Bengio, 2013; Martens and Grosse, 2015; Vinyals and Povey, 2012]). Often, however, many steps of conjugate gradients must be performed at each training iteration to make good progress on the loss. Here, we have obtained the surprising result that F^{-1}v products can in fact be computed directly (in linear networks), at almost the same cost as Fv products.

Block-diagonal approximations In order to obtain an expression for the natural gradient that is computationally cheap and feasible for practical applications, previous studies suggested a block-diagonal approximation to the inverse Fisher information matrix in the nonlinear case (K-FAC, [Martens and Grosse, 2015; Grosse and Martens, 2016; Ba et al., 2016]; see also [Heskes, 2000; Povey et al., 2014; Desjardins et al., 2015]). In general, there is no formal justification for assuming that the Fisher information matrix (or its inverse) is block diagonal. In our deep linear network model, we show in the Supplementary Material (cf. Eq.
19) that the (i, j)-block of the exact Fisher information matrix (corresponding to the weight matrices of layers i and j) is equal to

F_{ij} = \langle x_{i-1} x_{j-1}^T \rangle_{p_\theta} \otimes \langle e_i e_j^T \rangle_{p_\theta}    (22)

There is no reason to expect this expression to be zero for i ≠ j, unless the outputs x_i or the errors e_i are uncorrelated across all pairs of layers, and indeed Eq. 19 in the Supplementary Material shows that it is not zero. Nevertheless, if we choose to ignore this fact and set F_{ij} = 0 for i ≠ j, then inverting the Fisher matrix (by inverting each diagonal block F_{ii} separately) generates an expression proportional to the exact natural gradient of Eq. 10.
In order to understand this puzzling observation, we recall that the exact Fisher is singular, and that we chose a specific form for the generalized inverse F^g in order to derive Eq. 10 (while noting that the dynamics of the loss is the same for all possible inverses). In the Supplementary Material (cf. Eq. 36), we note that the (i, j)-block of this specific generalized inverse is equal to

(F^g)_{ij} = \frac{1}{L^2} \langle x_{j-1} x_{i-1}^T \rangle_{p_\theta}^{-1} \otimes \langle e_j e_i^T \rangle_{p_\theta}^{-1}    (23)

Thus each block of the inverse Fisher is equal to the inverse of the corresponding block of the (transposed) Fisher matrix (note that we assumed square and invertible blocks). However, the inverse Fisher is not block-diagonal either, so it remains unclear why the approximation works. The solution to this puzzle is the following. In deriving the natural gradient update for layer i, we must multiply an entire row of blocks of the inverse Fisher by the gradient across all layers. Surprisingly, each of these blocks makes exactly the same contribution to the natural gradient (Eq. 37 in the Supplementary Material).
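The Kronecker structure in Eqs. 22 and 23 is also what makes applying such blocks cheap: inverting a Kronecker product only requires inverting its two (much smaller) factors, since (A ⊗ B)^{-1} vec(G) = vec(B^{-1} G A^{-T}). A minimal numpy check of this identity (toy sizes and variable names are our own choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3  # toy layer widths

# Symmetric positive-definite stand-ins for <x_{i-1} x_{i-1}^T> and <e_i e_i^T>
A = rng.standard_normal((n, n)); A = A @ A.T + n * np.eye(n)
B = rng.standard_normal((m, m)); B = B @ B.T + m * np.eye(m)
G = rng.standard_normal((m, n))  # gradient with respect to one weight matrix

# Expensive route: form and invert the full (n m) x (n m) Kronecker block.
# Column-major ('F') flattening matches the column-stacking vec() convention
# in (A kron B) vec(X) = vec(B X A^T).
slow = np.linalg.solve(np.kron(A, B), G.flatten(order="F"))

# Cheap route: two small inverses, never forming the Kronecker product
# (A^{-T} = A^{-1} here because A is symmetric).
fast = (np.linalg.solve(B, G) @ np.linalg.inv(A)).flatten(order="F")

assert np.allclose(slow, fast)
```

This is essentially the per-layer operation in Eq. 10: the gradient block is rescaled on both sides by small inverse covariances, at the cost of one n × n and one m × m inverse instead of an (nm) × (nm) one.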
The upshot is that we can get away with computing the single contribution of the diagonal block for each row, and simply multiplying it by the number of blocks in the row. This is of course equivalent, though only fortuitously so, to making a block-diagonal approximation of F in the first place. Therefore, somewhat incidentally, a block-diagonal approximation is expected to perform just as well as the full matrix inversion.

Whitening and biological algorithms Our expression for the natural gradient offers post-hoc justification for some recently proposed modifications of the standard gradient, whereby the forward activations and backward errors are whitened prior to being multiplied to obtain the gradient at each layer [Desjardins et al., 2015; Fujimoto and Ohira, 2018]. In our method, these vectors are also rescaled, albeit with their inverse covariances instead of the square roots thereof (Eq. 10; see also Heskes [2000]; Martens and Grosse [2015]). Notably, this form of rescaling is equivalent to backpropagating the error through the (pseudo-)inverses of the weight matrices, rather than their transposes (Eq. 16); interestingly, this strategy also tends to emerge in more biologically plausible algorithms in which both forward and backward weights are free parameters [Lillicrap et al., 2016; Luo et al., 2017].

7 Conclusions

We computed the natural gradient exactly for a deep linear network with a quadratic loss function. We showed that the natural gradient is not unique in this case, because the Fisher information is singular due to over-parameterization. Surprisingly, we found that the loss function has the same convergence properties for all possible natural gradients, i.e. for those obtained with any generalized inverse of the Fisher matrix.
Indeed, one of our main results is the first exact solution for the convergence rate of the loss function under natural gradient descent in a linear multilayer network: an exponential decrease towards the minimum loss. This result backs up empirical claims that the natural gradient efficiently optimizes deep networks; in the deep linear case, we find that it solves the problem of pathological curvature entirely [Pascanu and Bengio, 2013; Martens, 2014]. Our results also consolidate deep linear networks as a useful case study for advancing the theory of deep learning. While Saxe et al. [2013] used linear theory to propose new ways of initializing neural networks, we have used it to propose a new, efficient optimization algorithm. We found that natural gradient updates take an unexpectedly cheap form, with computational requirements similar to those of plain SGD.
Compared with the size of deep neural networks currently in use, our application concerned relatively small networks of at most a few hundred neurons per layer. Our current implementation, based on direct inversion of Λ and Λ̃ in Eq. 20, may scale poorly (in wall-clock time) as the layer sizes increase. In that case, the matrix pseudo-inversion in Eq. 10 could be performed using randomized SVD algorithms [Halko et al., 2011]. Alternatively, direct estimation of those matrix inverses via the Sherman-Morrison (SM) lemma should scale better [Amari et al., 2000], which we have confirmed in preliminary simulations. As SM updates tend to be less cache-efficient than direct inversion (they require many matrix-vector products instead of fewer matrix-matrix products), they may only benefit performance for very large layers. Moreover, more work is needed to incorporate adaptive damping into SM estimation of inverse covariances.
Our analytical results were derived for continuous-time optimization dynamics.
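For concreteness, the SM route mentioned above amounts to maintaining a running inverse covariance with rank-1 updates; the following is a minimal numpy sketch (the decay rate and all names are our own choices, not the implementation used in our simulations):

```python
import numpy as np

def sm_update(C_inv, x, gamma=0.99):
    """One rank-1 Sherman-Morrison update of a running inverse covariance.

    Tracks the inverse of C_t = gamma * C_{t-1} + (1 - gamma) * x x^T
    in O(n^2) per sample, with no explicit matrix inversion.
    """
    a = (1.0 - gamma) / gamma
    v = C_inv @ x  # O(n^2) matrix-vector product
    return (C_inv - (a / (1.0 + a * (x @ v))) * np.outer(v, v)) / gamma

# Check against direct inversion on a stream of samples.
rng = np.random.default_rng(1)
n, gamma = 5, 0.95
C = np.eye(n)      # running covariance (kept here only for the check)
C_inv = np.eye(n)  # its SM-maintained inverse
for _ in range(200):
    x = rng.standard_normal(n)
    C = gamma * C + (1 - gamma) * np.outer(x, x)
    C_inv = sm_update(C_inv, x, gamma)
assert np.allclose(C_inv, np.linalg.inv(C))
```

Each update costs O(n^2) instead of the O(n^3) of a fresh inversion, which is the scaling advantage referred to above; the trade-off is that the work is dominated by matrix-vector products, hence the cache-efficiency caveat.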
While we presented numerical evidence that a discrete-time implementation of NGD performs well, and indeed exhibits the exponential decrease of the loss function predicted by our theory, further work is needed to derive principled methods for discretizing the parameter updates [Martens, 2014].
Our core results relied exclusively on linear activation functions. While we have had some success in training nonlinear networks using Eq. 10 as a drop-in replacement for SGD (Fig. 1D), much remains to be done to make our algorithm effective in general deep learning settings. Improvements could be made to our adaptive damping scheme, for example through asymmetric damping of the covariance matrices Λ_i and Λ̃_i in Eq. 20, as proposed by Martens and Grosse [2015]. More generally, deeper links need to be established between our linear NGD theory and systematic methods based on Kronecker factorizations (K-FAC [Martens and Grosse, 2015; Grosse and Martens, 2016; Ba et al., 2016]). A key insight from our analysis is that there exist infinitely many ways of computing the natural gradient in deep linear networks (and probably also in nonlinear networks, in which the Fisher matrix has been found to be near-degenerate [Le Roux et al., 2008]). While all of these methods result in fast learning with identical dynamics for the loss function, their computational complexity may differ greatly. Moreover, there may be more than one computationally tractable method (such as the one we have used here), and in turn, some of these may be more suitable than others as a drop-in replacement for SGD in nonlinear networks. We suggest that further analysis of deep linear networks will prove invaluable for deriving efficient new training algorithms.

Acknowledgments

We thank Richard Turner and James Martens for discussions. This work was supported by Wellcome Trust Seed Award 202111/Z/16/Z (G.H.)
and Wellcome Trust Investigator Award 095621/Z/11/Z (A.B., M.L.).

References

Amari, S.-I. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276.

Amari, S.-I., Park, H., and Fukumizu, K. (2000). Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Computation, 12(6):1399–1409.

Ba, J., Grosse, R., and Martens, J. (2016). Distributed second-order optimization using Kronecker-factored approximations. In Proc. 5th Int. Conf. Learn. Representations.

Bishop, C. M. (2016). Pattern recognition and machine learning. Springer-Verlag New York.

Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941.

Desjardins, G., Simonyan, K., Pascanu, R., et al. (2015). Natural neural networks. In Advances in Neural Information Processing Systems, pages 2071–2079.

Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.

Fujimoto, Y. and Ohira, T. (2018). A neural network model with bidirectional whitening. In International Conference on Artificial Intelligence and Soft Computing, pages 47–57. Springer.

Grosse, R. and Martens, J. (2016). A Kronecker-factored approximate Fisher matrix for convolution layers. In International Conference on Machine Learning, pages 573–582.

Halko, N., Martinsson, P.-G., and Tropp, J. A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288.

Heskes, T. (2000). On natural learning and pruning in multilayered perceptrons.
Neural Computation, 12(4):881–901.

Kingma, D. P. and Ba, J. L. (2014). Adam: A method for stochastic optimization. In Proc. 3rd Int. Conf. Learn. Representations.

Le Roux, N., Manzagol, P.-A., and Bengio, Y. (2008). Topmoumoute online natural gradient algorithm. In Advances in Neural Information Processing Systems, pages 849–856.

Lillicrap, T. P., Cownden, D., Tweed, D. B., and Akerman, C. J. (2016). Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications, 7:13276.

Luo, H., Fu, J., and Glass, J. (2017). Bidirectional backpropagation: Towards biologically plausible error signal transmission in neural networks. arXiv preprint arXiv:1702.07097.

Mandt, S., Hoffman, M. D., and Blei, D. M. (2017). Stochastic gradient descent as approximate Bayesian inference. The Journal of Machine Learning Research, 18(1):4873–4907.

Martens, J. (2010). Deep learning via Hessian-free optimization. In International Conference on Machine Learning, volume 27, pages 735–742.

Martens, J. (2014). New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193.

Martens, J. and Grosse, R. (2015). Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pages 2408–2417.

Ollivier, Y. (2015). Riemannian metrics for neural networks I: feedforward networks. Information and Inference: A Journal of the IMA, 4(2):108–153.

Park, H., Amari, S.-I., and Fukumizu, K. (2000). Adaptive natural gradient learning algorithms for various stochastic models. Neural Networks, 13(7):755–764.

Pascanu, R. and Bengio, Y. (2013). Revisiting natural gradient for deep networks. In Proc. 2nd Int. Conf. Learn. Representations.

Povey, D., Zhang, X., and Khudanpur, S. (2014). Parallel training of DNNs with natural gradient and parameter averaging.
arXiv preprint arXiv:1410.7455.

Saxe, A. M., McClelland, J. L., and Ganguli, S. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In Proc. 2nd Int. Conf. Learn. Representations.

Vinyals, O. and Povey, D. (2012). Krylov subspace descent for deep learning. In Artificial Intelligence and Statistics, pages 1261–1268.

Yang, H. H. and Amari, S.-I. (1998). Complexity issues in natural gradient descent method for training multilayer perceptrons. Neural Computation, 10(8):2137–2157.

Zeiler, M. D. (2012). ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.