{"title": "Limitations of the empirical Fisher approximation for natural gradient descent", "book": "Advances in Neural Information Processing Systems", "page_first": 4156, "page_last": 4167, "abstract": "Natural gradient descent, which preconditions a gradient descent update\nwith the Fisher information matrix of the underlying statistical model,\nis a way to capture partial second-order information. \nSeveral highly visible works have advocated an approximation known as the empirical Fisher,\ndrawing connections between approximate second-order methods and heuristics like Adam.\nWe dispute this argument by showing that the empirical Fisher---unlike the Fisher---does not generally capture second-order information.\nWe further argue that the conditions under which the empirical Fisher approaches the Fisher (and the Hessian)\nare unlikely to be met in practice, and that, even on simple optimization problems,\nthe pathologies of the empirical Fisher can have undesirable effects.", "full_text": "Limitations of the Empirical Fisher Approximation\n\nfor Natural Gradient Descent\n\nFrederik Kunstner1,2,3\n\nkunstner@cs.ubc.ca\n\nLukas Balles2,3\n\nlballes@tue.mpg.de\n\nPhilipp Hennig2,3\nph@tue.mpg.de\n\n\u00c9cole Polytechnique F\u00e9d\u00e9rale de Lausanne (EPFL), Switzerland1\n\nUniversity of T\u00fcbingen, Germany2\n\nMax Planck Institute for Intelligent Systems, T\u00fcbingen, Germany3\n\nAbstract\n\nNatural gradient descent, which preconditions a gradient descent update with the\nFisher information matrix of the underlying statistical model, is a way to capture\npartial second-order information. Several highly visible works have advocated\nan approximation known as the empirical Fisher, drawing connections between\napproximate second-order methods and heuristics like Adam. We dispute this\nargument by showing that the empirical Fisher\u2014unlike the Fisher\u2014does not\ngenerally capture second-order information. 
We further argue that the conditions under which the empirical Fisher approaches the Fisher (and the Hessian) are unlikely to be met in practice, and that, even on simple optimization problems, the pathologies of the empirical Fisher can have undesirable effects.

1 Introduction

Consider a supervised machine learning problem of predicting outputs y ∈ Y from inputs x ∈ X. We assume a probabilistic model for the conditional distribution of the form p_θ(y|x) = p(y|f(x, θ)), where p(y|·) is an exponential family with natural parameters in F and f : X × R^D → F is a prediction function parameterized by θ ∈ R^D. Given N iid training samples (x_n, y_n)_{n=1}^N, we want to minimize

$$L(\theta) := -\sum_n \log p_\theta(y_n|x_n) = -\sum_n \log p(y_n|f(x_n, \theta)). \quad (1)$$

This framework covers common scenarios such as least-squares regression (Y = F = R and p(y|f) = N(y; f, σ²) with fixed σ²) or C-class classification with cross-entropy loss (Y = {1, …, C}, F = R^C and p(y = c|f) = exp(f_c)/∑_i exp(f_i)) with an arbitrary prediction function f. Eq. (1) can be minimized by gradient descent, which updates θ_{t+1} = θ_t − γ_t ∇L(θ_t) with step size γ_t ∈ R. This update can be preconditioned with a matrix B_t that incorporates additional information, such as local curvature, θ_{t+1} = θ_t − γ_t B_t⁻¹ ∇L(θ_t). Choosing B_t to be the Hessian yields Newton's method, but its computation is often burdensome and might not be desirable for non-convex problems. A prominent variant in machine learning is natural gradient descent [NGD; Amari, 1998]. It adapts to the information geometry of the problem by measuring the distance between parameters with the Kullback–Leibler divergence between the resulting distributions rather than their Euclidean distance, using the Fisher information matrix (or simply "Fisher") of the model as a preconditioner,

$$F(\theta) := \sum_n \mathbb{E}_{p_\theta(y|x_n)}\!\left[\nabla_\theta \log p_\theta(y|x_n)\, \nabla_\theta \log p_\theta(y|x_n)^\top\right]. \quad (2)$$

While this motivation is conceptually distinct from approximating the Hessian, the Fisher coincides with a generalized Gauss-Newton [Schraudolph, 2002] approximation of the Hessian for the problems presented here. This gives NGD theoretical grounding as an approximate second-order method.

A number of recent works in machine learning have relied on a certain approximation of the Fisher, which is often called the empirical Fisher (EF) and is defined as

$$\widetilde{F}(\theta) := \sum_n \nabla_\theta \log p_\theta(y_n|x_n)\, \nabla_\theta \log p_\theta(y_n|x_n)^\top. \quad (3)$$

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Fisher vs. empirical Fisher as preconditioners for linear least-squares regression on the data shown in the left-most panel. The second plot shows the gradient vector field of the (quadratic) loss function and sample trajectories for gradient descent. The remaining plots depict the vector fields of the natural gradient and the "EF-preconditioned" gradient, respectively. NGD successfully adapts to the curvature whereas preconditioning with the empirical Fisher results in a distorted gradient field.

At first glance, this approximation is merely replacing the expectation over y in Eq. (2) with a sample y_n.
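To make the distinction between Eqs. (2) and (3) concrete, the following NumPy sketch (our own illustration, not from the paper) computes the Fisher, the empirical Fisher, and the resulting preconditioned update directions for linear least-squares regression with unit noise variance. The function names and the small damping constant added to the EF are our illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 2
X = rng.normal(size=(N, D))
theta_true = np.array([2.0, -1.0])
y = X @ theta_true + rng.normal(size=N)

def fisher(theta):
    # For p(y|f) = N(y; f, 1) and f = x^T theta, the expectation over y in
    # Eq. (2) yields sum_n x_n x_n^T, independent of theta (= the Hessian here).
    return X.T @ X

def empirical_fisher(theta):
    # Eq. (3): outer products of per-sample gradients at the observed labels y_n.
    g = (y - X @ theta)[:, None] * X   # rows: grad_theta log p_theta(y_n | x_n)
    return g.T @ g

theta0 = np.zeros(D)
grad = X.T @ (X @ theta0 - y)                      # gradient of the loss, Eq. (1)
ngd_dir = np.linalg.solve(fisher(theta0), grad)    # natural-gradient direction
ef_dir = np.linalg.solve(empirical_fisher(theta0) + 1e-8 * np.eye(D), grad)

# One full natural-gradient step solves this quadratic problem exactly.
theta_ngd = theta0 - ngd_dir
```

Because the Fisher equals the Hessian for this model, a single natural-gradient step lands on the least-squares solution, while the EF-preconditioned direction is generally different, previewing the distortion shown in Fig. 1.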
However, yn is a training label and not a sample from the model\u2019s predictive distribution p\u03b8(y|xn).\nTherefore, and contrary to what its name suggests, the empirical Fisher is not an empirical (i.e. Monte\nCarlo) estimate of the Fisher. Due to the unclear relationship between the model distribution and the\ndata distribution, the theoretical grounding of the empirical Fisher approximation is dubious.\nAdding to the confusion, the term \u201cempirical Fisher\u201d is used by different communities to refer to\ndifferent quantities. Authors closer to statistics tend to use \u201cempirical Fisher\u201d for Eq. (2), while many\nworks in machine learning, some listed in Section 2, use \u201cempirical Fisher\u201d for Eq. (3). While the\nstatistical terminology is more accurate, we adopt the term \u201cFisher\u201d for Eq. (2) and \u201cempirical Fisher\u201d\nfor Eq. (3), which is the subject of this work, to be accessible to readers more familiar with this\nconvention. We elaborate on the different uses of the terminology in Section 3.1.\nThe main purpose of this work is to provide a detailed critical discussion of the empirical Fisher\napproximation. While the discrepancy between the empirical Fisher and the Fisher has been\nmentioned in the literature before [Pascanu and Bengio, 2014, Martens, 2014], we see the need for\na detailed elaboration of the subtleties of this important issue. The intricacies of the relationship\nbetween the empirical Fisher and the Fisher remain opaque from the current literature. Not all authors\nusing the EF seem to be fully aware of the heuristic nature of this approximation and overlook\nits shortcomings, which can be seen clearly even on simple linear regression problems, see Fig. 
1. Natural gradients adapt to the curvature of the function using the Fisher, while the empirical Fisher distorts the gradient field in a way that leads to worse updates than gradient descent.

The empirical Fisher approximation is so ubiquitous that it is sometimes just called the Fisher [e.g., Chaudhari et al., 2017, Wen et al., 2019]. Possibly as a result of this, there are examples of algorithms involving the Fisher, such as Elastic Weight Consolidation [Kirkpatrick et al., 2017] and KFAC [Martens and Grosse, 2015], which have been re-implemented by third parties using the empirical Fisher. Interestingly, there is also at least one example of an algorithm that was originally developed using the empirical Fisher and later found to work better with the Fisher [Wierstra et al., 2008, Sun et al., 2009]. As the empirical Fisher is now used beyond optimization, for example as an approximation of the Hessian in empirical works studying properties of neural networks [Chaudhari et al., 2017, Jastrzębski et al., 2018], the pathologies of the EF approximation may lead the community to erroneous conclusions, an arguably more worrisome outcome than a suboptimal preconditioner.

The poor theoretical grounding stands in stark contrast to the practical success that empirical Fisher-based methods have seen. This paper is in no way meant to negate these practical advances but rather points out that the existing justifications for the approximation are insufficient and do not stand the test of simple examples. This indicates that there are effects at play that currently elude our understanding, which is not only unsatisfying, but might also prevent advancement of these methods. We hope that this paper helps spark interest in understanding these effects; our final section explores a possible direction.

1.1 Overview and contributions

We first provide a short but complete overview of natural gradient and the closely related generalized Gauss-Newton method. Our main contribution is a critical discussion of the specific arguments used to advocate the empirical Fisher approximation. A principal conclusion is that, while the empirical Fisher follows the formal definition of a generalized Gauss-Newton matrix, it is not guaranteed to capture any useful second-order information. We propose a clarifying amendment to the definition of a generalized Gauss-Newton to ensure that all matrices satisfying it have useful approximation properties. Furthermore, while there are conditions under which the empirical Fisher approaches the true Fisher, we argue that these are unlikely to be met in practice. We illustrate that using the empirical Fisher can lead to highly undesirable effects; Fig. 1 shows a first example.

This raises the question: Why are methods based on the empirical Fisher practically successful? We point to an alternative explanation, as an adaptation to gradient noise in stochastic optimization instead of an adaptation to curvature.

2 Related work

The generalized Gauss-Newton [Schraudolph, 2002] and natural gradient descent [Amari, 1998] methods have inspired a line of work on approximate second-order optimization [Martens, 2010, Botev et al., 2017, Park et al., 2000, Pascanu and Bengio, 2014, Ollivier, 2015].
A successful\nexample in modern deep learning is the KFAC algorithm [Martens and Grosse, 2015], which uses a\ncomputationally ef\ufb01cient structural approximation to the Fisher.\nNumerous papers have relied on the empirical Fisher approximation for preconditioning and other\npurposes. Our critical discussion is in no way intended as an invalidation of these works. All of them\nprovide important insights and the use of the empirical Fisher is usually not essential to the main\ncontribution. However, there is a certain degree of vagueness regarding the relationship between the\nFisher, the EF, Gauss-Newton matrices and the Hessian. Oftentimes, only limited attention is devoted\nto possible implications of the empirical Fisher approximation.\nThe most prominent example of preconditioning with the EF is Adam, which uses a moving average\nof squared gradients as \u201can approximation to the diagonal of the Fisher information matrix\u201d [Kingma\nand Ba, 2015]. The EF has been used in the context of variational inference by various authors\n[Graves, 2011, Zhang et al., 2018, Salas et al., 2018, Khan et al., 2018, Mishkin et al., 2018], some\nof which have drawn further connections between NGD and Adam. There are also several works\nbuilding upon KFAC which substitute the EF for the Fisher [George et al., 2018, Osawa et al., 2019].\nThe empirical Fisher has also been used as an approximation of the Hessian for purposes other than\npreconditioning. Chaudhari et al. [2017] use it to investigate curvature properties of deep learning\ntraining objectives. It has also been employed to explain certain characteristics of SGD [Zhu et al.,\n2019, Jastrz\u02dbebski et al., 2018] or as a diagnostic tool during training [Liao et al., 2020].\nLe Roux et al. [2007] and Le Roux and Fitzgibbon [2010] have considered the empirical Fisher in its\ninterpretation as the (non-central) covariance matrix of stochastic gradients. 
While they refer to their method as "Online Natural Gradient", their goal is explicitly to adapt the update to the stochasticity of the gradient estimate, not to curvature. We will return to this perspective in Section 5.

Before moving on, we want to re-emphasize that other authors have previously raised concerns about the empirical Fisher approximation [e.g., Pascanu and Bengio, 2014, Martens, 2014]. This paper is meant as a detailed elaboration of this known but subtle issue, with novel results and insights. Concurrent to our work, Thomas et al. [2019] investigated similar issues in the context of estimating the generalization gap using information criteria.

3 Generalized Gauss-Newton and natural gradient descent

This section briefly introduces natural gradient descent, addresses the difference in terminology for the quantities of interest across fields, introduces the generalized Gauss-Newton (GGN) and reviews the connections between the Fisher, the GGN, and the Hessian.

Quantity                          |  in statistics     |  in machine learning
F_{∏_n p_θ(x,y)}     (Eq. 5)      |  Fisher            |  —
F_{∏_n p_θ(y|x_n)}   (Eq. 6)      |  empirical Fisher  |  Fisher
F̃                    (Eq. 7)      |  —                 |  empirical Fisher

Table 1: Common terminology for the Fisher information and related matrices by authors closely aligned with statistics, such as Amari [1998], Park et al. [2000], and Karakida et al. [2019], or machine learning, such as Martens [2010], Schaul et al. [2013], and Pascanu and Bengio [2014].

3.1 Natural gradient descent

Gradient descent follows the direction of "steepest descent", the negative gradient. But the definition of steepest depends on a notion of distance, and the gradient is defined with respect to the Euclidean distance. The natural gradient is a concept from information geometry [Amari, 1998] and applies when the gradient is taken w.r.t. the parameters θ of a probability distribution p_θ.
Instead of measuring the distance between parameters θ and θ′ with the Euclidean distance, we use the Kullback–Leibler (KL) divergence between the distributions p_θ and p_θ′. The resulting steepest descent direction is the negative gradient preconditioned with the Hessian of the KL divergence, which is exactly the Fisher information matrix of p_θ,

$$F(\theta) := \mathbb{E}_{p_\theta(z)}\!\left[\nabla_\theta \log p_\theta(z)\, \nabla_\theta \log p_\theta(z)^\top\right] = \mathbb{E}_{p_\theta(z)}\!\left[-\nabla^2_\theta \log p_\theta(z)\right]. \quad (4)$$

The second equality may seem counterintuitive; the difference between the outer product of gradients and the Hessian cancels out in expectation with respect to the model distribution at θ, see Appendix A. This equivalence highlights the relationship of the Fisher to the Hessian.

3.2 Difference in terminology across fields

In our setting, we only model the conditional distribution p_θ(y|x) of the joint distribution p_θ(x, y) = p(x) p_θ(y|x). The Fisher information of θ for N samples from the joint distribution p_θ(x, y) is

$$F_{\prod_n p_\theta(x,y)}(\theta) = N\, \mathbb{E}_{x,y \sim p(x)p_\theta(y|x)}\!\left[\nabla_\theta \log p_\theta(y|x)\, \nabla_\theta \log p_\theta(y|x)^\top\right]. \quad (5)$$

This is what statisticians would call the "Fisher information" of the model p_θ(x, y). However, we typically do not know the distribution over inputs p(x), so we use the empirical distribution over x instead and compute the Fisher information of the conditional distribution ∏_n p_θ(y|x_n),

$$F_{\prod_n p_\theta(y|x_n)}(\theta) = \sum_n \mathbb{E}_{y \sim p_\theta(y|x_n)}\!\left[\nabla_\theta \log p_\theta(y|x_n)\, \nabla_\theta \log p_\theta(y|x_n)^\top\right]. \quad (6)$$

This is Eq. (2), which we call the "Fisher". This is the terminology used by work on the application of natural gradient methods in machine learning, such as Martens [2014] and Pascanu and Bengio [2014], as it is the Fisher information for the distribution we are optimizing, ∏_n p_θ(y|x_n). Work closer to the statistics literature, following the seminal paper of Amari [1998], such as Park et al. [2000] and Karakida et al. [2019], call this quantity the "empirical Fisher" due to the usage of the empirical samples for the inputs. In contrast, we call Eq. (3) the "empirical Fisher", restated here,

$$\widetilde{F}(\theta) = \sum_n \nabla_\theta \log p_\theta(y_n|x_n)\, \nabla_\theta \log p_\theta(y_n|x_n)^\top, \quad (7)$$

where "empirical" describes the use of samples for both the inputs and the outputs. This expression, however, does not have a direct interpretation as a Fisher information as it does not sample the output according to the distribution defined by the model. Neither is it a Monte-Carlo approximation of Eq. (6), as the samples y_n do not come from p_θ(y|x_n) but from the data distribution p(y|x_n). How close the empirical Fisher (Eq. 7) is to the Fisher (Eq. 6) depends on how close the model p_θ(y|x_n) is to the true data-generating distribution p(y|x_n).

3.3 Generalized Gauss-Newton

One line of argument justifying the use of the empirical Fisher approximation uses the connection between the Hessian and the Fisher through the generalized Gauss-Newton (GGN) matrix [Schraudolph, 2002]. We give here a condensed overview of the definition and properties of the GGN.

The original Gauss-Newton algorithm is an approximation to Newton's method for nonlinear least-squares problems, L(θ) = ½ ∑_n (f(x_n, θ) − y_n)². By the chain rule, the Hessian can be written as

$$\nabla^2 L(\theta) = \underbrace{\sum_n \nabla_\theta f(x_n,\theta)\, \nabla_\theta f(x_n,\theta)^\top}_{=:G(\theta)} + \underbrace{\sum_n r_n \nabla^2_\theta f(x_n,\theta)}_{=:R(\theta)}, \quad (8)$$

where r_n = f(x_n, θ) − y_n are the residuals. The first part, G(θ), is the Gauss-Newton matrix. For small residuals, R(θ) will be small and G(θ) will approximate the Hessian. In particular, when the model perfectly fits the data, the Gauss-Newton is equal to the Hessian.

Schraudolph [2002] generalized this idea to objectives of the form L(θ) = ∑_n a_n(b_n(θ)), with b_n : R^D → R^M and a_n : R^M → R, for which the Hessian can be written as¹

$$\nabla^2 L(\theta) = \sum_n [J_\theta b_n(\theta)]^\top \nabla^2_b a_n(b_n(\theta))\, [J_\theta b_n(\theta)] + \sum_{n,m} [\nabla_b a_n(b_n(\theta))]_m \nabla^2_\theta b_n^{(m)}(\theta). \quad (9)$$

The generalized Gauss-Newton matrix (GGN) is defined as the part of the Hessian that ignores the second-order information of b_n,

$$G(\theta) := \sum_n [J_\theta b_n(\theta)]^\top \nabla^2_b a_n(b_n(\theta))\, [J_\theta b_n(\theta)]. \quad (10)$$

If a_n is convex, as is customary, the GGN is positive (semi-)definite even if the Hessian itself is not, making it a popular curvature matrix in non-convex problems such as neural network training. The GGN is ambiguous as it crucially depends on the "split" given by a_n and b_n.
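As a small numerical illustration of the decomposition in Eq. (8) (our own sketch, with a made-up one-parameter model f(x, θ) = exp(θx)), the Gauss-Newton term plus the residual term matches a finite-difference second derivative of the loss:

```python
import numpy as np

x = np.array([0.1, 0.5, -0.3, 0.8])
y = np.array([1.0, 1.5, 0.9, 2.0])

def f(theta):       # scalar-parameter model, f_n = exp(theta * x_n)
    return np.exp(theta * x)

def loss(theta):    # nonlinear least squares, L = 1/2 sum_n r_n^2
    return 0.5 * np.sum((f(theta) - y) ** 2)

theta = 0.7
r = f(theta) - y                  # residuals r_n
J = x * f(theta)                  # df_n / dtheta
H_f = x ** 2 * f(theta)           # d^2 f_n / dtheta^2

G = np.sum(J * J)                 # Gauss-Newton term of Eq. (8)
R = np.sum(r * H_f)               # residual term of Eq. (8)

# second derivative of the loss by central finite differences
eps = 1e-5
H_num = (loss(theta + eps) - 2 * loss(theta) + loss(theta - eps)) / eps ** 2
```

Here the residuals are not small, so R(θ) is a non-negligible part of the Hessian; only G(θ) + R(θ) recovers it.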
As an example, consider the two following possible splits for the least-squares problem from above:

$$a_n(b) = \tfrac{1}{2}(b - y_n)^2,\; b_n(\theta) = f(x_n, \theta), \quad\text{or}\quad a_n(b) = \tfrac{1}{2}(f(x_n, b) - y_n)^2,\; b_n(\theta) = \theta. \quad (11)$$

The first recovers the classical Gauss-Newton, while in the second case, the GGN equals the Hessian. While this is an extreme example, the split will be important for our discussion.

3.4 Connections between the Fisher, the GGN and the Hessian

While NGD is not explicitly motivated as an approximate second-order method, the following result, noted by several authors,² shows that the Fisher captures partial curvature information about the problem defined in Eq. (1).

Proposition 1 (Martens [2014], §9.2). If p(y|f) is an exponential family distribution with natural parameters f, then the Fisher information matrix coincides with the GGN of Eq. (1) using the split

$$a_n(b) = -\log p(y_n|b), \qquad b_n(\theta) = f(x_n, \theta), \quad (12)$$

and reads F(θ) = G(θ) = −∑_n [J_θ f(x_n, θ)]^⊤ ∇²_f log p(y_n|f(x_n, θ)) [J_θ f(x_n, θ)].

For completeness, a proof can be found in Appendix A. The key insight is that ∇²_f log p(y|f) does not depend on y for exponential families. One can see Eq. (12) as the "canonical" split, since it matches the classical Gauss-Newton for the probabilistic interpretation of least-squares. From now on, when referencing "the GGN" without further specification, we mean this particular split.

The GGN, and under the assumptions of Proposition 1 also the Fisher, are well-justified approximations of the Hessian, and we can bound their approximation error in terms of the (generalized) residuals, mirroring the motivation behind the classical Gauss-Newton (Proof in Appendix C.2).

Proposition 2. Let L(θ) be defined as in Eq. (1) with F = R^M. Denote by f_n^{(m)} the m-th component of f(x_n, ·) : R^D → R^M and assume each f_n^{(m)} is β-smooth. Let G(θ) be the GGN (Eq. 10). Then,

$$\|\nabla^2 L(\theta) - G(\theta)\|_2 \le r(\theta)\,\beta, \quad (13)$$

where r(θ) = ∑_{n=1}^N ‖∇_f log p(y_n|f(x_n, θ))‖₁ and ‖·‖₂ denotes the spectral norm.

The approximation improves as the residuals in r(θ) diminish, and is exact if the data is perfectly fit.

¹ J_θ b_n(θ) ∈ R^{M×D} is the Jacobian of b_n; we use the shortened notation ∇²_b a_n(b_n(θ)) := ∇²_b a_n(b)|_{b=b_n(θ)}; [·]_m selects the m-th component of a vector; and b_n^{(m)} denotes the m-th component function of b_n.

² Heskes [2000] showed this for regression with squared loss, Pascanu and Bengio [2014] for classification with cross-entropy loss, and Martens [2014] for general exponential families. However, this has been known earlier in the statistics literature in the context of "Fisher Scoring" (see Wang [2010] for a review).

4 Critical discussion of the empirical Fisher

Two arguments have been put forward to advocate the empirical Fisher approximation. Firstly, it has been argued that it follows the definition of a generalized Gauss-Newton matrix, making it an approximate curvature matrix in its own right. We examine this relation in §4.1 and show that, while technically correct, it does not entail the approximation guarantee usually associated with the GGN. Secondly, a popular argument is that the empirical Fisher approaches the Fisher at a minimum if the model "is a good fit for the data". We discuss this argument in §4.2 and point out that it requires strong additional assumptions, which are unlikely to be met in practical scenarios.
In addition, this argument only applies close to a minimum, which calls into question the usefulness of the empirical Fisher in optimization. We discuss this in §4.3, showing that preconditioning with the empirical Fisher leads to adverse effects on the scaling and the direction of the updates far from an optimum.

We use simple examples to illustrate our arguments. We want to emphasize that, as these are counter-examples to arguments found in the existing literature, they are designed to be as simple as possible, and deliberately do not involve intricate state-of-the-art models that would complicate analysis. On a related note, while contemporary machine learning often relies on stochastic optimization, we restrict our considerations to the deterministic (full-batch) setting to focus on the adaptation to curvature.

4.1 The empirical Fisher as a generalized Gauss-Newton matrix

The first justification for the empirical Fisher is that it matches the construction of a generalized Gauss-Newton (Eq. 10) using the split [Bottou et al., 2018]

$$a_n(b) = -\log b, \qquad b_n(\theta) = p(y_n|f(x_n, \theta)). \quad (14)$$

Although technically correct,³ we argue that this split does not provide a reasonable approximation. For example, consider a least-squares problem, which corresponds to the log-likelihood log p(y|f) = log exp[−½(y − f)²]. In this case, Eq. (14) splits the identity function, log exp(·), and takes into account the curvature from the log while ignoring that of exp. This questionable split runs counter to the basic motivation behind the classical Gauss-Newton matrix, that small residuals lead to a good approximation of the Hessian: The empirical Fisher

$$\widetilde{F}(\theta) = \sum_n \nabla_\theta \log p_\theta(y_n|x_n)\, \nabla_\theta \log p_\theta(y_n|x_n)^\top = \sum_n r_n^2\, \nabla_\theta f(x_n,\theta)\, \nabla_\theta f(x_n,\theta)^\top \quad (15)$$

approaches zero as the residuals r_n = f(x_n, θ) − y_n become small. In that same limit, the Fisher F(θ) = ∑_n ∇f(x_n, θ) ∇f(x_n, θ)^⊤ does approach the Hessian, which we recall from Eq. (8) to be given by ∇²L(θ) = F(θ) + ∑_n r_n ∇²_θ f(x_n, θ). This argument generally applies for problems where we can fit all training samples such that ∇_θ log p_θ(y_n|x_n) = 0 for all n. In such cases, the EF goes to zero while the Fisher (and the corresponding GGN) approaches the Hessian (Prop. 2).

For the generalized Gauss-Newton, the role of the "residual" is played by the gradient ∇_b a_n(b); compare Equations (8) and (9). To retain the motivation behind the classical Gauss-Newton, the split should be chosen such that this gradient can in principle attain zero, in which case the residual curvature not captured by the GGN in (9) vanishes. The EF split (Eq. 14) does not satisfy this property, as ∇_b log b can never go to zero for a probability b ∈ [0, 1]. It might be desirable to amend the definition of a generalized Gauss-Newton to enforce this property (addition in bold):

Definition 1 (Generalized Gauss-Newton). A split L(θ) = ∑_n a_n(b_n(θ)) with convex a_n leads to a generalized Gauss-Newton matrix of L, defined as

$$G(\theta) := \sum_n G_n(\theta), \qquad G_n(\theta) := [J_\theta b_n(\theta)]^\top \nabla^2_b a_n(b_n(\theta))\, [J_\theta b_n(\theta)], \quad (16)$$

**if the split a_n, b_n is such that there is b_n^* ∈ Im(b_n) such that ∇_b a_n(b)|_{b=b_n^*} = 0.**

Under suitable smoothness conditions, a split satisfying this condition will have a meaningful error bound akin to Proposition 2. To avoid confusion, we want to note that this condition does not assume the existence of θ* such that b_n(θ*) = b_n^* for all n; only that the residual gradient for each data point can, in principle, go to zero.

³ The equality can easily be verified by plugging the split (14) into the definition of the GGN (Eq. 10) and observing that ∇²_b a_n(b) = ∇_b a_n(b) ∇_b a_n(b)^⊤ as a special property of the choice a_n(b) = −log(b).

Figure 2: Quadratic approximations of the loss function using the Fisher and the empirical Fisher on a logistic regression problem. Logistic regression implicitly assumes identical class-conditional covariances [Hastie et al., 2009, §4.4.5]. The EF is a good approximation of the Fisher at the minimum if this assumption is fulfilled (left panel), but can be arbitrarily wrong if the assumption is violated, even at the minimum and with large N.
Note: we achieve classification accuracies of ≥ 85% in the misspecified cases compared to 73% in the well-specified case, which shows that a well-performing model is not necessarily a well-specified one.

4.2 The empirical Fisher near a minimum

An often repeated argument is that the empirical Fisher converges to the true Fisher when the model is a good fit for the data [e.g., Jastrzębski et al., 2018, Zhu et al., 2019]. Unfortunately, this is often misunderstood to simply mean "near the minimum". The above statement has to be carefully formalized and requires additional assumptions, which we detail in the following.

Assume that the training data consists of iid samples from some data-generating distribution p_true(x, y) = p_true(y|x) p_true(x). If the model is realizable, i.e., there exists a parameter setting θ_T such that p_{θ_T}(y|x) = p_true(y|x), then clearly by a Monte Carlo sampling argument, as the number of data points N goes to infinity, F̃(θ_T)/N → F(θ_T)/N. Additionally, if the maximum likelihood estimate for N samples θ*_N is consistent in the sense that p_{θ*_N}(y|x) converges to p_true(y|x) as N → ∞,

$$\tfrac{1}{N}\,\widetilde{F}(\theta^\star_N) \;\xrightarrow{\;N\to\infty\;}\; \tfrac{1}{N}\,F(\theta^\star_N). \quad (17)$$

That is, the empirical Fisher converges to the Fisher at the minimum as the number of data points grows. (Both approach the Hessian, as can be seen from the second equality in Eq. 4 and detailed in Appendix C.2.) For the EF to be a useful approximation, we thus need (i) a "correctly-specified" model in the sense of the realizability condition, and (ii) enough data to recover the true parameters.

Even under the assumption that N is sufficiently large, the model needs to be able to realize the true data distribution.
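The role of realizability can be illustrated with a minimal Monte Carlo check (our own construction, not an experiment from the paper): for a well-specified linear-Gaussian model evaluated at the true parameters, the per-sample EF approaches the per-sample Fisher as N grows, while a misspecified noise scale breaks the match.

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 200_000, 2
X = rng.normal(size=(N, D))
theta_true = np.array([0.5, -1.0])

# Well-specified case: labels are drawn from the model itself (unit noise).
y = X @ theta_true + rng.normal(size=N)
r = X @ theta_true - y
g = -r[:, None] * X
EF_per_sample = g.T @ g / N          # empirical Fisher / N at the true parameters
F_per_sample = X.T @ X / N           # Fisher / N (sigma^2 = 1)

# Misspecified case: true noise std is 2 while the model assumes 1.
y_bad = X @ theta_true + 2.0 * rng.normal(size=N)
r_bad = X @ theta_true - y_bad
g_bad = -r_bad[:, None] * X
EF_bad = g_bad.T @ g_bad / N         # approx. 4x the Fisher: no longer a match
```

Even at the optimum and with abundant data, the mismatch in the misspecified case does not vanish, in line with Fig. 2.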
This requires that the likelihood p(y|f ) is well-speci\ufb01ed and that the prediction\nfunction f (x, \u03b8) captures all relevant information. This is possible in classical statistical modeling of,\nsay, scienti\ufb01c phenomena where the effect of x on y is modeled based on domain knowledge. But it\nis unlikely to hold when the model is only approximate, as is most often the case in machine learning.\nFigure 2 shows examples of model misspeci\ufb01cation and the effect on the empirical and true Fisher.\nIt is possible to satisfy the realizability condition by using a very \ufb02exible prediction function f (x, \u03b8),\nsuch as a deep network. However, \u201cenough\u201d data has to be seen relative to the model capacity. The\nmassively overparameterized models typically used in deep learning are able to \ufb01t the training data\nalmost perfectly, even when regularized [Zhang et al., 2017]. In such settings, the individual gradients,\nand thus the EF, will be close to zero at a minimum, whereas the Hessian will generally be nonzero.\n\n4.3 Preconditioning with the empirical Fisher far from an optimum\nThe relationship discussed in \u00a74.2 only holds close to the minimum. Any similarity between p\u03b8(y|x)\nand ptrue(y|x) is very unlikely when \u03b8 has not been adapted to the data, for example, at the beginning\nof an optimization procedure. This makes the empirical Fisher a questionable preconditioner.\n\n7\n\nDatasetCorrectMisspeci\ufb01ed(A)Misspeci\ufb01ed(B)QuadraticapproximationLosscontourFisheremp.FisherMinimum\fFigure 3: Fisher (NGD) vs. empirical Fisher (EFGD) as preconditioners (with damping) on linear\nclassi\ufb01cation (BreastCancer, a1a) and regression (Boston). While the EF can be a good approximation\nfor preconditioning on some problems (e.g., a1a), it is not guaranteed to be. 
The second row shows the cosine similarity between the EF direction and the natural gradient, over the path taken by EFGD, showing that the EF can lead to update directions that are opposite to the natural gradient (see Boston). Even when the direction is correct, the magnitude of the steps can lead to poor performance (see BreastCancer). See Appendix D for details and additional experiments.

In fact, the empirical Fisher can cause severe, adverse distortions of the gradient field far from the optimum, as evident even on the elementary linear regression problem of Fig. 1. As a consequence, EF-preconditioned gradient descent compares unfavorably to NGD even on simple linear regression and classification tasks, as shown in Fig. 3. The cosine similarity plotted in Fig. 3 shows that the empirical Fisher can be arbitrarily far from the Fisher in that the two preconditioned updates point in almost opposite directions.

One particular issue is the scaling of EF-preconditioned updates. As the empirical Fisher is the sum of "squared" gradients (Eq. 3), multiplying the gradient by the inverse of the EF leads to updates of magnitude almost inversely proportional to that of the gradient, at least far from the optimum. This effect has to be counteracted by adapting the step size, which requires manual tuning and makes the selected step size dependent on the starting point; we explore this aspect further in Appendix E.

5 Variance adaptation

The previous sections have shown that, interpreted as a curvature matrix, the empirical Fisher is a questionable choice at best.
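The inverse scaling discussed in §4.3 can be made concrete on a toy problem. The following is a minimal sketch (ours, not from the paper); the problem sizes, damping constant, and starting points are arbitrary choices for illustration:

```python
# Minimal sketch (ours, not from the paper) of the "inverse gradient" scaling
# of EF preconditioning on a toy least-squares problem. Far from the optimum,
# all per-example gradients align with the mean gradient g, so the EF (a sum
# of "squared" gradients) grows quadratically in ||g|| and the preconditioned
# step (EF + damping*I)^{-1} g shrinks as ||g|| grows.
import numpy as np

rng = np.random.default_rng(1)
D, N = 2, 500
X = rng.standard_normal((N, D))
y = X @ np.ones(D)                    # optimum at theta = (1, 1)
damping = 1e-8

step_norms = []
for scale in [1.0, 10.0, 100.0]:      # move progressively further away
    theta = scale * np.array([3.0, -2.0])
    G = (X @ theta - y)[:, None] * X  # per-example gradients g_n as rows
    g = G.sum(axis=0)                 # full gradient
    EF = G.T @ G                      # empirical Fisher (unnormalized)
    step = np.linalg.solve(EF + damping * np.eye(D), g)
    step_norms.append(np.linalg.norm(step))
    print(f"||grad|| = {np.linalg.norm(g):10.1f}   "
          f"||EF^-1 grad|| = {np.linalg.norm(step):.5f}")
```

As the iterate moves away from the optimum, the gradient norm grows while the EF-preconditioned step shrinks, which is the behavior a fixed step size cannot accommodate.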
Another perspective on the empirical Fisher is that, in contrast to the Fisher, it contains useful information to adapt to the gradient noise in stochastic optimization.

In stochastic gradient descent [SGD; Robbins and Monro, 1951], we sample n ∈ [N] uniformly at random and use a stochastic gradient g(θ) = −N ∇θ log pθ(yn|xn) as an inexpensive but noisy estimate of ∇L(θ). The empirical Fisher, as a sum of outer products of individual gradients, coincides with the non-central second moment of this estimate and can be written as

    N F̃(θ) = Σ(θ) + ∇L(θ) ∇L(θ)ᵀ,   Σ(θ) := cov[g(θ)].    (18)

Gradient noise is a major hindrance to SGD and the covariance information encoded in the EF may be used to attenuate its harmful effects, e.g., by scaling back the update in high-noise directions.

A small number of works have explored this idea before. Le Roux et al. [2007] showed that the update direction Σ(θ)⁻¹ g(θ) maximizes the probability of decreasing in function value, while Schaul et al. [2013] proposed a diagonal rescaling based on the signal-to-noise ratio of each coordinate, Dii := [∇L(θ)]i² / ([∇L(θ)]i² + Σ(θ)ii). Balles and Hennig [2018] identified these factors as optimal in that they minimize the expected error E[‖D g(θ) − ∇L(θ)‖₂²] for a diagonal matrix D. A straightforward extension of this argument to full matrices yields the variance adaptation matrix

    M = (Σ(θ) + ∇L(θ) ∇L(θ)ᵀ)⁻¹ ∇L(θ) ∇L(θ)ᵀ = (N F̃(θ))⁻¹ ∇L(θ) ∇L(θ)ᵀ.    (19)

In that sense, preconditioning with the empirical Fisher can be understood as an adaptation to gradient noise instead of an adaptation to curvature. The multiplication with ∇L(θ) ∇L(θ)ᵀ in Eq. (19) will counteract the poor scaling discussed in §4.3.

This perspective on the empirical Fisher is currently not well studied.
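For intuition, the behavior of the matrix in Eq. (19) can be examined on synthetic gradient statistics. The following minimal sketch (ours; the isotropic noise model Σ = c·I is chosen purely for illustration) shows that the variance-adapted step scales with the gradient rather than inversely to it:

```python
# Minimal sketch (ours) of the variance adaptation matrix of Eq. (19):
#   M = (Sigma + mu mu^T)^{-1} mu mu^T,   mu = grad L(theta).
# With isotropic noise Sigma = c*I, the Sherman-Morrison formula gives
#   M mu = mu * ||mu||^2 / (c + ||mu||^2):
# a (nearly) full gradient step when noise is negligible, and a strongly
# damped step when noise dominates.
import numpy as np

def variance_adapted_step(mu, Sigma):
    M = np.linalg.solve(Sigma + np.outer(mu, mu), np.outer(mu, mu))
    return M @ mu

mu = np.array([1.0, 1.0])
for noise in [1e-6, 1.0, 100.0]:
    step = variance_adapted_step(mu, noise * np.eye(2))
    print(f"noise = {noise:7g}   step = {step}")
```

In contrast to EF preconditioning of the gradient alone, the multiplication with ∇L(θ) ∇L(θ)ᵀ keeps the update aligned with ∇L(θ) and merely shrinks it where the noise covariance dominates.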
Of course, there are obvious difficulties ahead: computing the matrix in Eq. (19) requires the evaluation of all gradients, which defeats its purpose. It is not obvious how to obtain meaningful estimates of this matrix from, say, a mini-batch of gradients, that would provably attenuate the effects of gradient noise. Nevertheless, we believe that variance adaptation is a possible explanation for the practical success of existing methods using the EF and an interesting avenue for future research. To put it bluntly: it may just be that the name "empirical Fisher" is a fateful historical misnomer, and the quantity should instead just be described as the gradient's non-central second moment.

As a final comment, it is worth pointing out that some methods precondition with the square-root of the EF, the prime example being Adam. While this avoids the "inverse gradient" scaling discussed in §4.3, it further widens the conceptual gap between those methods and natural gradient. In fact, such a preconditioning effectively cancels out the gradient magnitude, which has recently been examined more closely as "sign gradient descent" [Balles and Hennig, 2018, Bernstein et al., 2018].

6 Conclusions

We offered a critical discussion of the empirical Fisher approximation, summarized as follows:

• While the EF follows the formal definition of a generalized Gauss-Newton matrix, the underlying split does not retain useful second-order information.
We proposed a clarifying amendment to the definition of the GGN.

• A clear relationship between the empirical Fisher and the Fisher only exists at a minimum under strong additional assumptions: (i) a correct model and (ii) enough data relative to model capacity. These conditions are unlikely to be met in practice, especially when using overparameterized general function approximators and settling for approximate minima.

• Far from an optimum, EF preconditioning leads to update magnitudes that are inversely proportional to the gradient magnitude, complicating step size tuning and often leading to poor performance even for linear models.

• As a possible alternative explanation of the practical success of EF preconditioning, and an interesting avenue for future research, we have pointed to the concept of variance adaptation.

The existing arguments do not justify the empirical Fisher as a reasonable approximation to the Fisher or the Hessian. Of course, this does not rule out the existence of certain model classes for which the EF might give reasonable approximations. However, as long as we have not clearly identified and understood these cases, the true Fisher is the "safer" choice as a curvature matrix and should be preferred in virtually all cases.

Contrary to conventional wisdom, the Fisher is not inherently harder to compute than the EF. As shown by Martens and Grosse [2015], an unbiased estimate of the true Fisher can be obtained at the same computational cost as the empirical Fisher by replacing the expectation in Eq. (2) with a single sample ỹn from the model's predictive distribution pθ(y|xn). Even exact computation of the Fisher is feasible in many cases. We discuss computational aspects further in Appendix B. The apparent reluctance to compute the Fisher might have more to do with the current lack of convenient implementations in deep learning libraries.
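The one-sample estimator can be sketched for softmax classification. The following minimal illustration (ours, not from the paper) uses a linear model with arbitrary sizes and computes the exact Fisher by brute force for comparison:

```python
# Minimal sketch (ours, not from the paper) of the one-sample Monte Carlo
# Fisher estimate: compute per-example gradient outer products using labels
# *sampled from the model*, y_tilde ~ p_theta(y|x_n), instead of the observed
# labels. Model: linear softmax classifier with cross-entropy loss.
import numpy as np

rng = np.random.default_rng(2)
D, C, N = 4, 3, 5000
W = 0.1 * rng.standard_normal((D, C))
X = rng.standard_normal((N, D))
y = np.zeros(N, dtype=int)                 # observed labels (all class 0 here,
                                           # deliberately not model-distributed)
P = np.exp(X @ W)
P /= P.sum(axis=1, keepdims=True)          # model probabilities p_theta(.|x_n)

def sum_outer(labels):
    # per-example gradients of -log p(label|x) w.r.t. W, flattened
    R = P.copy()
    R[np.arange(N), labels] -= 1.0         # dloss/dlogits = p - onehot(label)
    G = np.einsum('ni,nj->nij', X, R).reshape(N, -1)
    return G.T @ G                         # sum of gradient outer products

EF = sum_outer(y)                                    # empirical Fisher
y_tilde = np.array([rng.choice(C, p=p) for p in P])  # sample from the model
F_mc = sum_outer(y_tilde)                            # one-sample MC estimate

# exact Fisher for reference: sum_n E_{y ~ p_theta}[grad grad^T]
F = np.zeros((D * C, D * C))
for n in range(N):
    for c in range(C):
        r = P[n].copy()
        r[c] -= 1.0
        g = np.outer(X[n], r).reshape(-1)
        F += P[n, c] * np.outer(g, g)

print(np.linalg.norm(F_mc - F) / np.linalg.norm(F))  # MC estimate close to F
print(np.linalg.norm(EF - F) / np.linalg.norm(F))    # EF systematically off
```

The sampled-label estimate is unbiased and concentrates around the exact Fisher as N grows, at exactly the cost of the EF computation; the EF itself retains a systematic error whenever the observed labels are not distributed according to the model.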
We believe that it is misguided—and potentially dangerous—to accept the poor theoretical grounding of the EF approximation purely for implementational convenience.

Acknowledgements

We thank Matthias Bauer, Felix Dangel, Filip de Roos, Diego Fioravanti, Jason Hartford, Si Kai Lee, and Frank Schneider for their helpful comments on the manuscript. We thank Emtiyaz Khan, Aaron Mishkin, and Didrik Nielsen for many insightful conversations that led to this work, and the anonymous reviewers for their constructive feedback.

Lukas Balles kindly acknowledges the support of the International Max Planck Research School for Intelligent Systems (IMPRS-IS). The authors gratefully acknowledge financial support by the European Research Council through ERC StG Action 757275 / PANAMA and the DFG Cluster of Excellence "Machine Learning - New Perspectives for Science", EXC 2064/1, project number 390727645, the German Federal Ministry of Education and Research (BMBF) through the Tübingen AI Center (FKZ: 01IS18039A) and funds from the Ministry of Science, Research and Arts of the State of Baden-Württemberg.

References

Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.

Lukas Balles and Philipp Hennig. Dissecting Adam: The sign, magnitude and variance of stochastic gradients. In Jennifer G. Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 413–422. PMLR, 2018.

Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Anima Anandkumar. signSGD: compressed optimisation for non-convex problems. In Jennifer G.
Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 559–568. PMLR, 2018.

Aleksandar Botev, Hippolyt Ritter, and David Barber. Practical Gauss-Newton optimisation for deep learning. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 557–565. PMLR, 2017.

Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.

Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.

Thomas George, César Laurent, Xavier Bouthillier, Nicolas Ballas, and Pascal Vincent. Fast approximate natural gradient descent in a Kronecker-factored eigenbasis. In Samy Bengio, Hanna Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 9573–9583, 2018.

Alex Graves. Practical variational inference for neural networks. In John Shawe-Taylor, Richard S. Zemel, Peter L. Bartlett, Fernando C. N. Pereira, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011.
Proceedings of a meeting held 12-14 December 2011, Granada, Spain, pages 2348–2356, 2011.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning: data mining, inference, and prediction. Springer Verlag, 2009.

Tom Heskes. On "natural" learning and pruning in multilayered perceptrons. Neural Computation, 12(4):881–901, 2000.

Stanisław Jastrzębski, Zac Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Amos Storkey, and Yoshua Bengio. Three factors influencing minima in SGD. In International Conference on Artificial Neural Networks, 2018.

Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python, 2001. URL http://www.scipy.org/.

Ryo Karakida, Shotaro Akaho, and Shun-ichi Amari. Universal statistics of Fisher information in deep neural networks: Mean field approach. In Kamalika Chaudhuri and Masashi Sugiyama, editors, The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, volume 89 of Proceedings of Machine Learning Research, pages 1032–1041. PMLR, 2019.

Mohammad Emtiyaz Khan, Didrik Nielsen, Voot Tangkaratt, Wu Lin, Yarin Gal, and Akash Srivastava. Fast and scalable Bayesian deep learning by weight-perturbation in Adam. In Jennifer G. Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 2616–2625. PMLR, 2018.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization.
In Yoshua Bengio\nand Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015,\nSan Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.\n\nJames Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A.\nRusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis,\nClaudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in\nneural networks. Proceedings of the National Academy of Sciences, 114(13):3521\u20133526, 2017.\n\nNicolas Le Roux and Andrew Fitzgibbon. A fast natural Newton method. In Johannes F\u00fcrnkranz\nand Thorsten Joachims, editors, Proceedings of the 27th International Conference on Machine\nLearning (ICML-10), June 21-24, 2010, Haifa, Israel, pages 623\u2013630. Omnipress, 2010.\n\nNicolas Le Roux, Pierre-Antoine Manzagol, and Yoshua Bengio. Topmoumoute online natural\ngradient algorithm. In John C. Platt, Daphne Koller, Yoram Singer, and Sam T. Roweis, editors,\nAdvances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual\nConference on Neural Information Processing Systems, Vancouver, British Columbia, Canada,\nDecember 3-6, 2007, pages 849\u2013856. Curran Associates, Inc., 2007.\n\nZhibin Liao, Tom Drummond, Ian Reid, and Gustavo Carneiro. Approximate Fisher information\nmatrix to characterise the training of deep neural networks. IEEE Transactions on Pattern Analysis\nand Machine Intelligence, 42(1):15\u201326, 2020.\n\nJames Martens. Deep learning via Hessian-free optimization. In Johannes F\u00fcrnkranz and Thorsten\nJoachims, editors, Proceedings of the 27th International Conference on Machine Learning (ICML-\n10), June 21-24, 2010, Haifa, Israel, pages 735\u2013742. Omnipress, 2010.\n\nJames Martens. New insights and perspectives on the natural gradient method. CoRR, abs/1412.1193,\n\n2014.\n\nJames Martens and Roger Grosse. 
Optimizing neural networks with Kronecker-factored approximate curvature. In Francis Bach and David M. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, volume 37 of JMLR Workshop and Conference Proceedings, pages 2408–2417, 2015.

Aaron Mishkin, Frederik Kunstner, Didrik Nielsen, Mark Schmidt, and Mohammad Emtiyaz Khan. SLANG: Fast structured covariance approximations for Bayesian deep learning with natural gradient. In Samy Bengio, Hanna Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 6248–6258, 2018.

Yann Ollivier. Riemannian metrics for neural networks I: feedforward networks. Information and Inference: A Journal of the IMA, 4(2):108–153, 2015.

Kazuki Osawa, Yohei Tsuji, Yuichiro Ueno, Akira Naruse, Rio Yokota, and Satoshi Matsuoka. Large-scale distributed second-order optimization using Kronecker-factored approximate curvature for deep convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 12359–12367. Computer Vision Foundation / IEEE, June 2019.

Hyeyoung Park, Shun-ichi Amari, and Kenji Fukumizu. Adaptive natural gradient learning algorithms for various stochastic models. Neural Networks, 13(7):755–764, 2000.

Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. In Yoshua Bengio and Yann LeCun, editors, 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.

Herbert Robbins and Sutton Monro. A stochastic approximation method.
The Annals of Mathematical\n\nStatistics, pages 400\u2013407, 1951.\n\nArnold Salas, Stefan Zohren, and Stephen Roberts. Practical Bayesian learning of neural networks\n\nvia adaptive subgradient methods. CoRR, abs/1811.03679, 2018.\n\nTom Schaul, Sixin Zhang, and Yann LeCun. No more pesky learning rates. In Proceedings of the\n30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June\n2013, volume 28 of JMLR Workshop and Conference Proceedings, pages 343\u2013351, 2013.\n\nNicol N. Schraudolph. Fast curvature matrix-vector products for second-order gradient descent.\n\nNeural Computation, 14(7):1723\u20131738, 2002.\n\nYi Sun, Daan Wierstra, Tom Schaul, and J\u00fcrgen Schmidhuber. Ef\ufb01cient natural evolution strategies.\nIn Proceedings of the 11th Annual conference on Genetic and evolutionary computation, pages\n539\u2013546, 2009.\n\nValentin Thomas, Fabian Pedregosa, Bart van Merri\u00ebnboer, Pierre-Antoine Manzagol, Yoshua Bengio,\n\nand Nicolas Le Roux. Information matrices and generalization. CoRR, abs/1906.07774, 2019.\n\nYong Wang. Fisher scoring: An interpolation family and its Monte Carlo implementations. Computa-\n\ntional Statistics & Data Analysis, 54(7):1744\u20131755, 2010.\n\nYeming Wen, Kevin Luk, Maxime Gazeau, Guodong Zhang, Harris Chan, and Jimmy Ba. Interplay\nbetween optimization and generalization of stochastic gradient descent with covariance noise.\nCoRR, abs/1902.08234, 2019. To appear in the 23rd International Conference on Arti\ufb01cial\nIntelligence and Statistics (AISTATS), 2020.\n\nDaan Wierstra, Tom Schaul, Jan Peters, and J\u00fcrgen Schmidhuber. Natural evolution strategies. In\n2008 IEEE Congress on Evolutionary Computation (IEEE World Congress on Computational\nIntelligence), pages 3381\u20133387, 2008.\n\nChiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding\ndeep learning requires rethinking generalization. 
In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.

Guodong Zhang, Shengyang Sun, David Duvenaud, and Roger Grosse. Noisy natural gradient as variational inference. In Jennifer G. Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 5847–5856. PMLR, 2018.

Zhanxing Zhu, Jingfeng Wu, Bing Yu, Lei Wu, and Jinwen Ma. The anisotropic noise in stochastic gradient descent: Its behavior of escaping from sharp minima and regularization effects. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 7654–7663. PMLR, 2019.