{"title": "Fast Approximate Natural Gradient Descent in a Kronecker Factored Eigenbasis", "book": "Advances in Neural Information Processing Systems", "page_first": 9550, "page_last": 9560, "abstract": "Optimization algorithms that leverage gradient covariance information, such as variants of natural gradient descent (Amari, 1998), offer the prospect of yielding more effective descent directions. For models with many parameters, the covariance matrix they are based on becomes gigantic, making them inapplicable in their original form. This has motivated research into both simple diagonal approximations and more sophisticated factored approximations such as KFAC (Heskes, 2000; Martens & Grosse, 2015; Grosse & Martens, 2016). In the present work we draw inspiration from both to propose a novel approximation that is provably better than KFAC and amenable to cheap partial updates. It consists in tracking a diagonal variance, not in parameter coordinates, but in a Kronecker-factored eigenbasis, in which the diagonal approximation is likely to be more effective. 
Experiments show improvements over KFAC in optimization speed for several deep network architectures.", "full_text": "Fast Approximate Natural Gradient Descent in a Kronecker-factored Eigenbasis

Thomas George∗1, César Laurent∗1, Xavier Bouthillier1, Nicolas Ballas2, Pascal Vincent1,2,3
1 Mila - Université de Montréal; 2 Facebook AI Research; 3 CIFAR; ∗ equal contribution

Abstract

Optimization algorithms that leverage gradient covariance information, such as variants of natural gradient descent (Amari, 1998), offer the prospect of yielding more effective descent directions. For models with many parameters, the covariance matrix they are based on becomes gigantic, making them inapplicable in their original form. This has motivated research into both simple diagonal approximations and more sophisticated factored approximations such as KFAC (Heskes, 2000; Martens & Grosse, 2015; Grosse & Martens, 2016). In the present work we draw inspiration from both to propose a novel approximation that is provably better than KFAC and amenable to cheap partial updates. It consists in tracking a diagonal variance, not in parameter coordinates, but in a Kronecker-factored eigenbasis, in which the diagonal approximation is likely to be more effective. 
Experiments\nshow improvements over KFAC in optimization speed for several deep network\narchitectures.\n\n1\n\nIntroduction\n\nDeep networks have exhibited state-of-the-art performance in many application areas, including\nimage recognition (He et al., 2016) and natural language processing (Gehring et al., 2017). However\ntop-performing systems often require days of training time and a large amount of computational\npower, so there is a need for ef\ufb01cient training methods.\nStochastic Gradient Descent (SGD) and its variants are the current workhorse for training neural\nnetworks. Training consists in optimizing the network parameters \u03b8 (of size n\u03b8) to minimize a\nregularized empirical risk R (\u03b8), through gradient descent. The negative loss gradient is approximated\nbased on a small subset of training examples (a mini-batch). The loss functions of neural networks\nare highly non-convex functions of the parameters, and the loss surface is known to have highly\nimbalanced curvature which limits the ef\ufb01ciency of 1st order optimization methods such as SGD.\nMethods that employ 2nd order information have the potential to speed up 1st order gradient descent\nby correcting for imbalanced curvature. The parameters are then updated as: \u03b8 \u2190 \u03b8 \u2212 \u03b7G\u22121\u2207\u03b8R (\u03b8),\nwhere \u03b7 is a positive learning-rate and G is a preconditioning matrix capturing the local curvature or\nrelated information such as the Hessian matrix in Newton\u2019s method or the Fisher Information Matrix\nin Natural Gradient (Amari, 1998). Matrix G has a gigantic size n\u03b8 \u00d7 n\u03b8 which makes it too large to\ncompute and invert in the context of modern deep neural networks with millions of parameters. 
For practical applications, it is necessary to trade off quality of curvature information for efficiency.

A long family of algorithms used for optimizing neural networks can be viewed as approximating the diagonal of a large preconditioning matrix. Diagonal approximations of the Hessian (Becker et al., 1988) have been proven to be efficient, and algorithms that use the diagonal of the covariance matrix of the gradients are widely used among neural network practitioners, such as Adagrad (Duchi et al., 2011), Adadelta (Zeiler, 2012), RMSProp (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2015).

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

We refer the reader to Bottou et al. (2016) for an informative review of optimization methods for deep networks, including diagonal rescalings, and connections with the Batch Normalization (BN) (Ioffe & Szegedy, 2015) technique.

More elaborate algorithms do not restrict themselves to diagonal approximations, but instead aim at accounting for some correlations between different parameters (as encoded by non-diagonal elements of the preconditioning matrix). 
These methods range from Ollivier (2015), who introduces a rank-1 update that accounts for the cross-correlations between the biases and the weight matrices, to quasi-Newton methods (Liu & Nocedal, 1989) that build a running estimate of the exact non-diagonal preconditioning matrix, and also include block-diagonal approaches with blocks corresponding to entire layers (Heskes, 2000; Desjardins et al., 2015; Martens & Grosse, 2015; Fujimoto & Ohira, 2018). Factored approximations such as KFAC (Martens & Grosse, 2015; Ba et al., 2017) approximate each block as a Kronecker product of two much smaller matrices, both of which can be estimated and inverted more efficiently than the full block matrix, since the inverse of a Kronecker product of two matrices is the Kronecker product of their inverses.

In the present work, we draw inspiration from both diagonal and factored approximations. We introduce an Eigenvalue-corrected Kronecker Factorization (EKFAC) that consists in tracking a diagonal variance, not in parameter coordinates, but in a Kronecker-factored eigenbasis. We show that EKFAC is a provably better approximation of the Fisher Information Matrix than KFAC. In addition, while computing the Kronecker-factored eigenbasis is a computationally expensive operation that needs to be amortized, tracking the diagonal variance is a cheap operation. EKFAC therefore allows performing partial updates of our curvature estimate G at the iteration level. We conduct an empirical evaluation of EKFAC on the deep auto-encoder optimization task using fully-connected networks, and on CIFAR-10 using deep convolutional neural networks, where EKFAC shows improvements over KFAC in optimization.

2 Background and notations

We are given a dataset Dtrain containing (input, target) examples (x, y), and a neural network fθ(x) with parameter vector θ of size nθ. We want to find a value of θ that minimizes an empirical risk R(θ) expressed as an average of a loss ℓ incurred by fθ over the training set: R(θ) = E_(x,y)∈Dtrain[ℓ(fθ(x), y)]. We will use E to denote both expectations w.r.t. a distribution or, as here, averages over finite sets, as made clear by the subscript and context. The algorithms considered for optimizing R(θ) use stochastic gradients ∇θ = ∇θ(x, y) = ∂ℓ(fθ(x), y)/∂θ, or their average over a mini-batch of examples Dmini sampled from Dtrain. Stochastic gradient descent (SGD) does a 1st order update: θ ← θ − η∇θ, where η is a positive learning rate. 2nd order methods first multiply ∇θ by a preconditioning matrix G^−1, yielding the update: θ ← θ − ηG^−1∇θ. Preconditioning matrices for Natural Gradient (Amari, 1998) / Generalized Gauss-Newton (Schraudolph, 2001) / TONGA (Le Roux et al., 2008) can all be expressed as either the (centered) covariance or the (uncentered) second moment of ∇θ, computed over slightly different distributions of (x, y). The natural gradient uses the Fisher Information Matrix, which for a probabilistic classifier can be expressed as G = E_{x∈Dtrain, y∼pθ(y|x)}[∇θ ∇θ^⊤], where the expectation is taken over targets sampled from the model pθ = fθ. By contrast, the empirical Fisher approximation or generalized Gauss-Newton uses G = E_(x,y)∈Dtrain[∇θ ∇θ^⊤]. Our discussion and development applies regardless of the precise distribution over (x, y) used to estimate G, so we will from here on use E without a subscript.

Matrix G has a gigantic size nθ × nθ, which makes it too big to compute and invert. In order to get a practical algorithm, we must find approximations of G that keep some of the relevant 2nd order information while removing the unnecessary and computationally costly parts. A first simplification, adopted by nearly all prior approaches, consists in treating each layer of the neural network separately, ignoring cross-layer terms. This amounts to a first block-diagonal approximation of G: each block G(l) caters for the parameters of a single layer l. Now G(l) can typically still be extremely large. A cheap but very crude approximation consists in using a diagonal G(l), i.e. taking into account the variance in each parameter dimension, but ignoring all covariance structure. A less stringent approximation was proposed by Heskes (2000) and later Martens & Grosse (2015). They propose to approximate G(l) as a Kronecker product G(l) ≈ A ⊗ B, which involves two smaller matrices, making it much cheaper to store, compute and invert¹. Specifically, for a layer l that receives an input h of size din and computes linear pre-activations a = W^⊤h of size dout (biases omitted for simplicity) followed by some non-linear activation function, let the backpropagated gradient on a be δ = ∂ℓ/∂a. The gradients on parameters θ(l) = W will be ∇W = ∂ℓ/∂W = vec(hδ^⊤). The Kronecker-factored approximation of the corresponding G(l) = E[∇W ∇W^⊤] will use A = E[hh^⊤] and B = E[δδ^⊤], i.e. matrices of size din × din and dout × dout, whereas the full G(l) would be of size dindout × dindout. Using this Kronecker approximation (known as KFAC) corresponds to approximating entries of G(l) as follows: G(l)_{ij,i′j′} = E[∇W_{ij} ∇W_{i′j′}] = E[(h_i δ_j)(h_{i′} δ_{j′})] ≈ E[h_i h_{i′}] E[δ_j δ_{j′}].

A similar principle can be applied to obtain a Kronecker-factored expression for the covariance of the gradients of the parameters of a convolutional layer (Grosse & Martens, 2016). To obtain matrices A and B one then needs to also sum over spatial locations and corresponding receptive fields, as illustrated in Figure 1.

Figure 1: KFAC for a convolutional layer with noutninkwkh parameters. G(l), of size (ninnoutkwkh)², is approximated as A ⊗ B with A = E[Σ_(s,s′) h_s h_{s′}^⊤] of size (ninkwkh)² and B = E[Σ_(s,s′) δ_s δ_{s′}^⊤] of size (nout)², where s, s′ ∈ S are the spatial positions iterated over by the filter, h_s is the flattened input subtensor (receptive field) at position s, and δ_s = ∂ℓ/∂a_s is the gradient of ℓ w.r.t. the output of the filter at position s.

3 Proposed method

3.1 Motivation: reflection on diagonal rescaling in different coordinate bases

It is instructive to contrast the type of "exact" natural gradient preconditioning that using the full Fisher Information Matrix would yield with what we do when approximating it by a diagonal matrix only. Using the full matrix G = E[∇θ ∇θ^⊤] yields the natural gradient update: θ ← θ − ηG^−1∇θ. When resorting to a diagonal approximation we instead use Gdiag = diag(σ²_1, ..., σ²_nθ), where σ²_i = G_{i,i} = E[(∇θ)²_i], so that the update θ ← θ − ηGdiag^−1∇θ amounts to preconditioning the gradient vector ∇θ by dividing each of its coordinates by an estimated second moment σ²_i. This diagonal rescaling happens in the initial basis of parameters θ. By contrast, a full natural gradient update can be seen to do a similar diagonal rescaling, not along the initial parameter basis axes, but along the axes of the eigenbasis of the matrix G. Let G = USU^⊤ be the eigendecomposition of G. The operations that yield the full natural gradient update G^−1∇θ = US^−1U^⊤∇θ correspond to the sequence of: a) multiplying the gradient vector ∇θ by U^⊤, which corresponds to switching to the eigenbasis: U^⊤∇θ yields the coordinates of the gradient vector expressed in that basis; b) multiplying by the diagonal matrix S^−1, which rescales each coordinate i (in that eigenbasis) by S^−1_{ii}; c) multiplying by U, which switches the rescaled vector back to the initial basis of parameters. It is easy to show that S_{ii} = E[(U^⊤∇θ)²_i] (proof is given in Appendix A.2). So similarly to what we do when using a diagonal approximation, we are rescaling by the second moment of gradient vector components, but rather than doing this in the initial parameter basis, we do this in the eigenbasis of G. Note that the variance measured along the leading eigenvector can be much larger than the variance along the axes of the initial parameter basis, so the effects of rescaling using either the full G or its diagonal approximation can be very different.

Now what happens when we use the less crude KFAC approximation instead? 
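The eigenbasis view described above can be checked numerically. The following sketch (not part of the original paper; it uses synthetic correlated "gradients" standing in for per-example ∇θ) verifies that the eigenvalues of G are exactly the second moments of the gradient coordinates expressed in G's eigenbasis, i.e. S_{ii} = E[(U^⊤∇θ)²_i]:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 6, 10000

# Toy correlated samples standing in for per-example gradients ∇θ.
C = rng.standard_normal((n, n))
grads = rng.standard_normal((N, n)) @ C   # each row is one sample of ∇θ

G = grads.T @ grads / N                   # second-moment matrix E[∇θ∇θᵀ]
S, U = np.linalg.eigh(G)                  # G = U diag(S) Uᵀ

# Second moments of the gradient coordinates expressed in the eigenbasis of G:
second_moments = np.mean((grads @ U) ** 2, axis=0)

print(np.allclose(second_moments, S))     # eigenvalues = variances along eigenvectors
```

The identity holds exactly here because G is built from the same samples: diag(U^⊤GU) = S.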
We approximate² G ≈ A ⊗ B, yielding the update θ ← θ − η(A ⊗ B)^−1∇θ. Let us similarly look at it through its eigendecomposition. The eigendecomposition of the Kronecker product A ⊗ B of two real symmetric positive semi-definite matrices can be expressed using their own eigendecompositions A = U_A S_A U_A^⊤ and B = U_B S_B U_B^⊤, yielding A ⊗ B = (U_A S_A U_A^⊤) ⊗ (U_B S_B U_B^⊤) = (U_A ⊗ U_B)(S_A ⊗ S_B)(U_A ⊗ U_B)^⊤. U_A ⊗ U_B gives the orthogonal eigenbasis of the Kronecker product; we call it the Kronecker-Factored Eigenbasis (KFE). S_A ⊗ S_B is the diagonal matrix containing the associated eigenvalues. Note that each such eigenvalue will be a product of an eigenvalue of A stored in S_A and an eigenvalue of B stored in S_B. We can view the action of the resulting Kronecker-factored preconditioning in the same way as we viewed the preconditioning by the full matrix: it consists in a) expressing the gradient vector ∇θ in a different basis U_A ⊗ U_B, which can be thought of as approximating the directions of U; b) doing a diagonal rescaling by S_A ⊗ S_B in that basis; c) switching back to the initial parameter space. Here however the rescaling factor (S_A ⊗ S_B)_{ii} is not guaranteed to match the second moment along the associated eigenvector E[((U_A ⊗ U_B)^⊤∇θ)²_i].

¹Since (A ⊗ B)^−1 = A^−1 ⊗ B^−1.
²This approximation is done separately for each block G(l); we dropped the superscript to simplify notations.

In summary (see Figure 2):

• Full matrix G preconditioning will scale by variances estimated along the eigenbasis of G.
• Diagonal preconditioning will scale by variances properly estimated, but along the initial parameter basis, which can be very far from the eigenbasis of G.
• KFAC preconditioning will scale the gradient along the KFE basis, which will likely be closer to the eigenbasis of G, but doesn't use properly estimated variances along these axes for this scaling (the scales being themselves constrained to be a Kronecker product S_A ⊗ S_B).

Figure 2: Cartoon illustration of the rescaling achieved by different preconditioning strategies (eigenspectra of the Fisher, Diagonal, K-FAC and EKFAC approximations). Rescaling of the gradient is done along a specific basis; the lengths of the vectors indicate the (square root of the) amount of downscaling. The exact Fisher Information Matrix rescales according to the eigenvectors/eigenvalues of the exact covariance structure (green ellipse). The diagonal approximation uses the parameter coordinate basis, scaling by the actual variance measured along these axes (indicated by horizontal and vertical orange arrows touching exactly the ellipse). KFAC uses directions that approximate the Fisher Information Matrix eigenvectors, but uses approximate scaling (blue arrows not touching the ellipse). EKFAC corrects this.

3.2 Eigenvalue-corrected Kronecker Factorization (EKFAC)

To correct for the potentially inexact rescaling of KFAC, and obtain a better but still computationally efficient approximation, instead of G_KFAC = A ⊗ B = (U_A ⊗ U_B)(S_A ⊗ S_B)(U_A ⊗ U_B)^⊤ we propose to use an Eigenvalue-corrected Kronecker Factorization: G_EKFAC = (U_A ⊗ U_B) S* (U_A ⊗ U_B)^⊤, where S* is the diagonal matrix defined by S*_{ii} = s*_i = E[((U_A ⊗ U_B)^⊤∇θ)²_i]. Vector s* is the vector of second moments of the gradient vector coordinates expressed in the Kronecker-factored Eigenbasis (KFE) U_A ⊗ U_B, and can be efficiently estimated and stored.

In Appendix A.1 we prove that this S* is the optimal diagonal rescaling in that basis, in the sense that S* = arg min_S ‖G − (U_A ⊗ U_B) S (U_A ⊗ U_B)^⊤‖_F s.t. S is diagonal: it minimizes the approximation error to G as measured by the Frobenius norm (denoted ‖·‖_F), which KFAC's corresponding S = S_A ⊗ S_B cannot generally achieve. A corollary of this is that we will always have ‖G − G_EKFAC‖_F ≤ ‖G − G_KFAC‖_F, i.e. EKFAC yields a better approximation of G than KFAC (Theorem 2, proven in the Appendix). Figure 2 illustrates the different rescaling strategies, including EKFAC.

Potential benefits: Since EKFAC is a better approximation of G than KFAC (in the limited sense of the Frobenius norm of the residual), it may yield a better preconditioning of the gradient for optimizing neural networks³. 
Another benefit is linked to computational efficiency: even if KFAC yields a reasonably good approximation in practice, it is costly to re-estimate and invert matrices A and B, so this has to be amortized over many updates. Re-estimation of the preconditioning is thus typically done at a much lower frequency than the parameter updates, and may lag behind, no longer accurately reflecting the local 2nd order information. Re-estimating the Kronecker-factored Eigenbasis (KFE) for EKFAC is similarly costly and must be similarly amortized. But re-estimating the diagonal scaling s* in that basis is cheap, doable with every mini-batch, so we can hope to reactively track and leverage the changes in 2nd order information along these directions. Figure 3 (right) provides an empirical confirmation that tracking s* indeed allows keeping the approximation error of G_EKFAC small, compared to G_KFAC, between recomputations of the basis or inverse.

³Although there is no guarantee. In particular, G_EKFAC being a better approximation of G does not guarantee that G_EKFAC^−1∇θ will be closer to the natural gradient update direction G^−1∇θ.

Figure 3: Left: Gradient correlation matrices measured in the initial parameter basis and in the Kronecker-factored Eigenbasis (KFE), computed from a small 4-sigmoid-layer MLP classifier trained on MNIST. The block corresponds to 250 parameters in the 2nd layer. Components are largely decorrelated in the KFE, justifying the use of a diagonal rescaling method in that basis. Right: Approximation error ‖G − Ĝ‖_F / ‖G‖_F, where Ĝ is either G_KFAC or G_EKFAC, for the small MNIST classifier. The KFE basis and KFAC inverse are recomputed every 100 iterations. EKFAC's cheap tracking of s* allows it to drift far less quickly than amortized KFAC from the exact empirical Fisher.

Dual view by working in the KFE: Instead of thinking of this new method as an improved factorized approximation of G that we use as a preconditioning matrix, we can alternatively view it as applying a diagonal method, but in a different basis where the diagonal approximation is more accurate (an assumption we empirically confirm in Figure 3, left). This can be seen by interpreting the update given by EKFAC as a 3-step process: project the gradient into the KFE, apply diagonal natural gradient descent in this basis, then project back to the parameter space:

G_EKFAC^−1 ∇θ = (U_A ⊗ U_B) S*^−1 (U_A ⊗ U_B)^⊤ ∇θ

Note that by writing ∇̃θ = (U_A ⊗ U_B)^⊤ ∇θ for the projected gradient in the KFE, the computation of the coefficients s*_i simplifies to s*_i = E[(∇̃θ)²_i]. Figure 3 shows gradient correlation matrices in both the initial parameter basis and in the KFE. Gradient components appear far less correlated when expressed in the KFE, which justifies using a diagonal rescaling method in that basis.

This viewpoint brings us close to network reparametrization approaches such as Fujimoto & Ohira (2018), whose proposal – already hinted at by Desjardins et al. (2015) – amounts to a reparametrization equivalent of KFAC. More precisely, while Desjardins et al. (2015) empirically explored a reparametrization that uses only the input covariance A (and thus corresponds to only "half of" KFAC), Fujimoto & Ohira (2018) extend this to also use the backpropagated gradient covariance B, making it essentially equivalent to KFAC (with a few extra twists). 
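The projection into and out of the KFE never requires materializing U_A ⊗ U_B: a Kronecker-times-vector product reduces to two small matrix products. The sketch below (illustrative; with numpy's row-major flattening the identity reads (P ⊗ Q)vec(X) = vec(P X Q^⊤), a convention-adjusted form of the vec identity quoted in Algorithm 1) checks the cheap preconditioning path against the naive one:

```python
import numpy as np

rng = np.random.default_rng(2)
din, dout = 5, 3
UA, _ = np.linalg.qr(rng.standard_normal((din, din)))    # stand-ins for the
UB, _ = np.linalg.qr(rng.standard_normal((dout, dout)))  # KFE factors
gW = rng.standard_normal((din, dout))                    # gradient as a matrix
s_star = rng.uniform(0.5, 2.0, size=din * dout)          # tracked second moments
eps = 1e-4                                               # damping

# Naive: materialize UA ⊗ UB (size (din·dout)²) and precondition vec(∇W).
U = np.kron(UA, UB)
naive = U @ ((U.T @ gW.ravel()) / (s_star + eps))

# Cheap: (P ⊗ Q) vec(X) = vec(P X Qᵀ) for row-major vec, so each projection
# is two small matrix products -- the Kronecker product is never formed.
tilde = (UA.T @ gW @ UB).ravel()                # project into the KFE
tilde /= s_star + eps                           # element-wise rescaling
cheap = (UA @ tilde.reshape(din, dout) @ UB.T).ravel()  # project back

print(np.allclose(naive, cheap))
```

The cheap path costs O(din²dout + dindout²) per layer instead of O(din²dout²), which is what makes the three-step EKFAC update practical.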
Our approach differs in that moving to the KFE corresponds to a change of orthonormal basis, and more importantly in that we cheaply track and perform a more optimal full diagonal rescaling in that basis, rather than the constrained factored S_A ⊗ S_B diagonal that these other approaches are implicitly using.

Algorithm: Using the Eigenvalue-corrected Kronecker Factorization (EKFAC) for neural network optimization involves: a) periodically (every n mini-batches) computing the Kronecker-factored Eigenbasis by doing an eigendecomposition of the same A and B matrices as KFAC; b) estimating the scaling vector s* as second moments of the gradient coordinates in that implied basis; c) preconditioning gradients accordingly prior to updating the model parameters. Algorithm 1 provides high-level pseudo-code of EKFAC for the case of fully-connected layers⁴, when using it to approximate the empirical Fisher. In this version, we re-estimate s* from scratch on each mini-batch. An alternative (EKFAC-ra) is to update s* as a running average of the component-wise second moment of mini-batch averaged gradients.

⁴EKFAC for convolutional layers follows the same structure, but requires a more convoluted notation.

Algorithm 1 EKFAC for fully connected networks

Require: n: recompute eigenbasis every n minibatches
Require: η: learning rate
Require: ε: damping parameter

procedure EKFAC(Dtrain)
  while convergence is not reached, iteration i do
    sample a minibatch D from Dtrain
    do forward and backprop pass as needed to obtain h and δ
    for all layers l do
      if i % n = 0 then
        COMPUTEEIGENBASIS(D, l)                        # Amortize eigendecomposition
      end if
      COMPUTESCALINGS(D, l)
      ∇mini ← E_(x,y)∈D[∇θ^(l)(x, y)]
      UPDATEPARAMETERS(∇mini, l)
    end for
  end while
end procedure

procedure COMPUTEEIGENBASIS(D, l)
  U_A^(l), S_A^(l) ← eigendecomposition(E_D[h^(l) h^(l)⊤])
  U_B^(l), S_B^(l) ← eigendecomposition(E_D[δ^(l) δ^(l)⊤])
end procedure

procedure COMPUTESCALINGS(D, l)
  s*^(l) ← E_D[((U_A^(l) ⊗ U_B^(l))^⊤ ∇θ^(l))²]       # Project gradients in eigenbasis¹
end procedure

procedure UPDATEPARAMETERS(∇mini, l)
  ∇̃ ← (U_A^(l) ⊗ U_B^(l))^⊤ ∇mini                     # Project gradient in eigenbasis¹
  ∇̃ ← ∇̃ / (s*^(l) + ε)                                # Element-wise scaling
  ∇precond ← (U_A^(l) ⊗ U_B^(l)) ∇̃                    # Project back in parameter basis¹
  θ^(l) ← θ^(l) − η ∇precond                           # Update parameters
end procedure

¹Can be efficiently computed using the following identity: (A ⊗ B)vec(C) = B^⊤CA

4 Experiments

This section presents an empirical evaluation of our proposed Eigenvalue-corrected KFAC (EKFAC) algorithm in two variants: EKFAC estimates the scalings s* as second moments of intra-batch gradients (in KFE coordinates) as in Algorithm 1, whereas EKFAC-ra estimates s* as a running average of squared minibatch gradients (in KFE coordinates). We compare them with KFAC and other baselines, primarily SGD with momentum, with and without batch normalization (BN). For all our experiments, KFAC and EKFAC approximate the empirical Fisher G. This research focuses on improving optimization techniques, so except when specified otherwise, we performed model and hyperparameter selection based on the performance of the optimization objective, i.e. 
on training loss.

4.1 Deep auto-encoder

We consider the task of minimizing the reconstruction error of an 8-layer auto-encoder on the MNIST dataset, a standard task used to benchmark optimization algorithms (Hinton & Salakhutdinov, 2006; Martens & Grosse, 2015; Desjardins et al., 2015). The model consists of an encoder composed of 4 fully-connected sigmoid layers, with a number of hidden units per layer of respectively 1000, 500, 250, 30, and a symmetric decoder (with untied weights).

We compare EKFAC, computing the second moment statistics through its mini-batch, and EKFAC-ra, its running average variant, with different baselines (KFAC, SGD with momentum and BN, ADAM with BN). For each algorithm, the best hyperparameters were selected using a mix of grid and random search based on training error. Grid values for the hyperparameters are: learning rate η and damping ε in {10^−1, 10^−2, 10^−3, 10^−4}, mini-batch size in {200, 500}. In addition, we explored 20 values for (η, ε) by random search around each grid point. We found that extra care must be taken when choosing the values of the learning rate and the damping parameter ε in order to get good performance, as is often observed when working with algorithms that leverage curvature information (see Figure 4 (d)). The learning rate and the damping parameter are kept constant during training.

Figure 4: MNIST Deep Auto-Encoder task; panels (a) Training loss, (b) Wall-clock time, (c) Validation loss, (d) Hyperparameters. Models are selected based on the best loss achieved during training. SGD and Adam are with batch-norm. A "freq" of 50 means the eigendecomposition or inverse is recomputed every 50 updates. (a) Training loss vs epochs. Both EKFAC and EKFAC-ra show an optimization benefit compared to amortized KFAC and the other baselines. (b) Training loss vs wall-clock time. Optimization benefits transfer to faster training for EKFAC-ra. (c) Validation performance. KFAC and EKFAC achieve a similar validation performance. (d) Sensitivity to hyperparameter values. Color corresponds to the final loss reached after 20 epochs for batch size 200.

Figure 4 (a) reports the training loss throughout training and shows that EKFAC and EKFAC-ra both minimize the training loss faster per epoch than KFAC and the other baselines. In addition, Figure 4 (b) shows that the efficient tracking of the diagonal scaling vector s* in EKFAC-ra, despite its slightly increased computational burden per update, allows achieving faster training in wall-clock time. Finally, on this task, EKFAC and EKFAC-ra achieve better optimization while also maintaining very good generalization performance (Figure 4 (c)).

Next we investigate how the frequency of the inverse/eigendecomposition recomputation affects optimization. In Figure 5, we compare KFAC/EKFAC with different recomputation frequencies to a strong KFAC baseline where we re-estimate and compute the inverse at each update. This baseline outperforms the amortized versions (as a function of the number of epochs), and is likely to leverage a better approximation of G as it recomputes the approximated eigenbasis at each update. However, it comes at a high computational cost, as seen in Figure 5 (b). Amortizing the eigendecomposition allows strongly decreasing the computational cost while only slightly degrading the optimization performance. As can be seen in Figure 5 (a), amortized EKFAC preserves the optimization performance better than amortized KFAC. EKFAC re-estimates at each update the diagonal second moments s* in the KFE basis, which correspond to the eigenvalues of the EKFAC approximation of G. Thus it should better track the true curvature of the loss function. To verify this, we investigate how the eigenspectrum of the true empirical Fisher G changes compared to the eigenspectrum of its approximations G_KFAC (or G_EKFAC). 
In Figure 5 (c), we track their eigenspectra and report the l2 distance between them during training. We compute the KFE once at the beginning and then keep it fixed during training. We focus on the 4th layer of the auto-encoder: its small size allows estimating the corresponding G and computing its eigenspectrum at a reasonable cost. We see that the spectrum of G_KFAC quickly diverges from the spectrum of G, whereas the cheap, frequent re-estimation of the diagonal scaling for G_EKFAC and G_EKFAC-ra allows their spectrum to stay much closer to that of G. This is consistent with the evolution of the approximation error shown earlier in Figure 3 for the small MNIST classifier.

Figure 5: Impact of the frequency of inverse/eigendecomposition recomputation for KFAC/EKFAC; panels (a) Training loss, (b) Wall-clock time, (c) l2 distance to the spectrum of G. A "freq" of 50 indicates a recomputation every 50 updates. (a)(b) Training loss vs. epochs and wall-clock time. We see that EKFAC better preserves its optimization performance when the eigendecomposition is performed less frequently. (c) Evolution of the l2 distance between the eigenspectrum of the empirical Fisher G and the eigenspectra of the approximations G_KFAC and G_EKFAC. We see that the spectrum of G_KFAC quickly diverges from the spectrum of G, whereas the EKFAC variants, thanks to their frequent diagonal re-estimation, manage to track G much better.

4.2 CIFAR-10

Figure 6: VGG11 on CIFAR-10; panels (a) Training loss, (b) Wall-clock time, (c) Accuracy (solid is train, dashed is validation). 
\"freq\" corresponds to the eigendecomposition (inverse) frequency.\nIn (a) and (b), we report the performance of the hyper-parameters reaching the lowest training loss\nfor each epoch (to highlight which optimizers perform best given a \ufb01xed epoch budget). In (c) models\nare selected according to the best overall validation error. When the inverse/eigendecomposition\nis amortized on 500 iterations, EKFAC-ra shows an optimization bene\ufb01t while maintaining its\ngeneralization capability.\n\nIn this section, we evaluate our proposed algorithm on the CIFAR-10 dataset using a VGG11 convolu-\ntional neural network (Simonyan & Zisserman, 2015) and a Resnet34 (He et al., 2016). To implement\nKFAC/EKFAC in a convolutional neural network, we rely on the SUA approximation (Grosse &\nMartens, 2016) which has been shown to be competitive in practice (Laurent et al., 2018). We\nhighlight that we do not use BN in our model when they are trained using KFAC/EKFAC.\nAs in the previous experiments, a grid search is performed to select the hyperparameters. Around\neach grid point, learning rate and damping values are further explored through random search.\nWe experiment with constant learning rate in this section, but explore learning rate schedule with\nKFAC/EKFAC in Appendix D.2. the damping parameter is initialized according to Appendix C. In\nthe \ufb01gures that show the model training loss per epoch or wall-clock time, we report the performance\nof the hyper-parameters attaining the lowest training loss for each epoch. This per-epoch model\nselection allows to show which model type reaches the lowest cost during training and also which\none optimizes best given any \u201cepoch budget\u201d. We did not \ufb01nd one single set of hyperparameter for\nwhich the EKFAC optimization curve is below KFAC for all the epochs (and vice-versa). 
However, per-epoch model selection shows that the best EKFAC configuration usually outperforms the best KFAC configuration for any chosen target epoch. The same holds for any chosen compute-time budget.
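The amortization scheme behind the "freq" curves can be sketched as follows: the expensive eigendecomposition of the Kronecker factors (which defines the KFE) is recomputed only every `freq` iterations, while the cheap diagonal second-moment estimate in the KFE is refreshed at every step. The NumPy sketch below is our illustration for a single fully-connected layer, not the paper's released implementation; for simplicity it uses the mini-batch gradient where EKFAC uses per-example statistics, and the batch size, decay rate, learning rate and damping are placeholder values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 8, 4
W = rng.standard_normal((n_out, n_in)) * 0.1
eps = 1e-3  # damping (illustrative value)
freq = 50   # recompute the KFE every `freq` steps

for step in range(200):
    # Toy per-batch statistics; in a real network these would be the
    # layer's input activations `a` and backpropagated gradients `d`.
    a = rng.standard_normal((32, n_in))
    d = rng.standard_normal((32, n_out))
    grad = d.T @ a / 32  # mini-batch weight gradient

    if step % freq == 0:
        # Expensive, amortized part: eigendecompose the Kronecker factors.
        _, UA = np.linalg.eigh(a.T @ a / 32)
        _, UB = np.linalg.eigh(d.T @ d / 32)
        s = np.ones((n_out, n_in))  # reset the diagonal scaling

    # Cheap part, done every step: re-estimate the diagonal second moment
    # of the gradient in the KFE (running average, in the spirit of EKFAC-ra).
    g_kfe = UB.T @ grad @ UA
    s = 0.95 * s + 0.05 * g_kfe ** 2

    # Precondition: divide by the 2nd moment in the KFE, project back.
    update = UB @ (g_kfe / (s + eps)) @ UA.T
    W -= 0.01 * update
```

Dividing by `s + eps` gives the natural-gradient-style scaling; replacing it with `np.sqrt(s) + eps` would instead give an RMSProp-style step in the KFE.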
In Figure 6, we compare EKFAC/EKFAC-ra to KFAC and to SGD with momentum, with and without BN, when training a VGG-11 network. We use a batch size of 500 for the KFAC-based approaches and 200 for the SGD baselines. Figure 6 (a) shows that EKFAC yields better optimization than the SGD baselines and KFAC in training loss per epoch when the computation of the KFE is amortized. Figure 6 (c) also shows that models trained with EKFAC maintain good generalization. EKFAC-ra shows some wall-clock time improvements over the baselines in that setting (Figure 6 (b)). However, we observe that KFAC with a batch size of 200 can catch up with EKFAC (but not EKFAC-ra) in wall-clock time, despite being outperformed in terms of optimization per iteration (see Figure D.2 in the Appendix). VGG11 is a relatively small network by modern standards, and KFAC (with the SUA approximation) remains computationally bearable on this model. We hypothesize that with smaller batches, KFAC can be updated often enough per epoch to keep a reasonable estimation error while not paying too high a computational price.
In Figure 7, we report similar results when training a Resnet34. We compare EKFAC-ra with KFAC, and with SGD with momentum and BN. To train the Resnet34 without BN, we need to rely on a careful initialization scheme (detailed in Appendix B) to ensure good signal propagation during the forward and backward passes. EKFAC-ra outperforms both KFAC (when amortized) and SGD with momentum and BN in terms of optimization per epoch and compute time. This gain appears robust across different batch sizes (see Figure D.3 in the Appendix).

(a) Training loss  (b) Wall-clock time  (c) Accuracy (solid is train, dashed is validation)

Figure 7: Training a Resnet network with 34 layers on CIFAR-10. "freq" corresponds to the eigendecomposition (inverse) frequency.
In (a) and (b), we report the performance of the hyperparameters reaching the lowest training loss at each epoch (to highlight which optimizers perform best given a fixed epoch budget). In (c), models are selected according to the best overall validation error. When the inverse/eigendecomposition is amortized over 500 iterations, EKFAC-ra shows optimization and compute-time benefits while maintaining good generalization capability.

5 Conclusion and future work

In this work, we introduced the Eigenvalue-corrected Kronecker Factorization (EKFAC), an approximate factorization of the (empirical) Fisher information matrix that is computationally manageable while still being accurate. We formally proved (in the Appendix) that EKFAC yields a more accurate approximation than its closest parent and competitor KFAC, in the sense of the Frobenius norm. Of more practical importance, we showed that our algorithm allows us to cheaply perform partial updates of the curvature estimate, by maintaining an up-to-date estimate of its eigenvalues while keeping the estimate of its eigenbasis fixed. This partial updating proves competitive when applied to optimizing deep networks, both with respect to the number of iterations and to wall-clock time.
Our approach amounts to normalizing the gradient by its second moment component-wise in a Kronecker-factored eigenbasis (KFE). But one could apply other component-wise (diagonal) adaptive algorithms, such as Adagrad (Duchi et al., 2011), RMSProp (Tieleman & Hinton, 2012) or Adam (Kingma & Ba, 2015), in the KFE, where the diagonal approximation is much more accurate. This is left for future work. We also intend to explore alternative strategies for obtaining the approximate eigenbasis, and to investigate how to increase the robustness of the algorithm with respect to the damping hyperparameter.
We also want to explore novel regularization strategies, so that the advantage of efficient optimization algorithms can more reliably be translated into improved generalization error.
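The Frobenius-norm result stated in the conclusion can be checked numerically on a toy problem. The sketch below is our illustration (not code from the paper): it draws per-example gradients for a small linear layer, forms the exact empirical Fisher G, its KFAC approximation, and the EKFAC diagonal rescaling in the KFE, then compares approximation errors. Dimensions and distributions are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, n = 4, 3, 5000

# Per-example weight gradients of a toy layer: g = vec(d a^T), with
# independently drawn activations `a` and backprop signals `d`.
a = rng.standard_normal((n, n_in))
d = rng.standard_normal((n, n_out)) * rng.uniform(0.5, 2.0, n_out)
g = np.einsum('ni,nj->nij', d, a).reshape(n, -1)

G = g.T @ g / n                   # exact empirical Fisher (small layer)
A = a.T @ a / n                   # KFAC input factor
B = d.T @ d / n                   # KFAC backprop factor
G_kfac = np.kron(B, A)

# KFE: eigenbasis of B (x) A. EKFAC keeps this basis but re-estimates the
# diagonal as second moments of the gradients projected into the KFE.
_, UA = np.linalg.eigh(A)
_, UB = np.linalg.eigh(B)
U = np.kron(UB, UA)
s = ((g @ U) ** 2).mean(axis=0)   # equals diag(U^T G U)
G_ekfac = U @ np.diag(s) @ U.T

err_kfac = np.linalg.norm(G - G_kfac)    # Frobenius norms
err_ekfac = np.linalg.norm(G - G_ekfac)
print(err_ekfac <= err_kfac)             # -> True
```

The inequality holds by construction: both approximations are diagonal in the KFE, and the EKFAC diagonal is the one minimizing the Frobenius error in that basis.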
Acknowledgments
The experiments were conducted using PyTorch (Paszke et al., 2017). The authors would like to thank Facebook, CIFAR, Calcul Quebec and Compute Canada for research funding and computational resources.

References

Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 1998.

Jimmy Ba, Roger Grosse, and James Martens. Distributed second-order optimization using Kronecker-factored approximations. In ICLR, 2017.

Sue Becker, Yann Le Cun, et al. Improving the convergence of back-propagation learning with second order methods. In Proceedings of the 1988 Connectionist Models Summer School. San Mateo, CA: Morgan Kaufmann, 1988.

Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint, 2016.

Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, et al. Natural neural networks. In NIPS, 2015.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

Yuki Fujimoto and Toru Ohira. A neural network model with bidirectional whitening. In International Conference on Artificial Intelligence and Soft Computing, pp. 47–57. Springer, 2018.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. In ICLR, 2017.

Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

Roger Grosse and James Martens. A Kronecker-factored approximate Fisher matrix for convolution layers. In ICML, 2016.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

Tom Heskes. On "natural" learning and pruning in multilayered perceptrons. Neural Computation, 12(4):881–901, 2000.

Geoffrey E. Hinton and Ruslan R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

César Laurent, Thomas George, Xavier Bouthillier, Nicolas Ballas, and Pascal Vincent. An evaluation of Fisher approximations beyond Kronecker factorization. ICLR Workshop, 2018.

Nicolas Le Roux, Pierre-Antoine Manzagol, and Yoshua Bengio. Topmoumoute online natural gradient algorithm. In NIPS, 2008.

Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 1989.

James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In ICML, 2015.

Yann Ollivier. Riemannian metrics for neural networks I: feedforward networks. Information and Inference: A Journal of the IMA, 2015.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.

Nicol N. Schraudolph. Fast curvature matrix-vector products. In International Conference on Artificial Neural Networks. Springer, 2001.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

Matthew D. Zeiler. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.