{"title": "Topmoumoute Online Natural Gradient Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 849, "page_last": 856, "abstract": "Guided by the goal of obtaining an optimization algorithm that is both fast and yielding good generalization, we study the descent direction maximizing the decrease in generalization error or the probability of not increasing generalization error. The surprising result is that from both the Bayesian and frequentist perspectives this can yield the natural gradient direction. Although that direction can be very expensive to compute we develop an efficient, general, online approximation to the natural gradient descent which is suited to large scale problems. We report experimental results showing much faster convergence in computation time and in number of iterations with TONGA (Topmoumoute Online natural Gradient Algorithm) than with stochastic gradient descent, even on very large datasets.", "full_text": "Topmoumoute online natural gradient algorithm\n\nNicolas Le Roux\n\nUniversity of Montreal\n\nnicolas.le.roux@umontreal.ca\n\nPierre-Antoine Manzagol\n\nUniversity of Montreal\n\nmanzagop@iro.umontreal.ca\n\nYoshua Bengio\n\nUniversity of Montreal\n\nyoshua.bengio@umontreal.ca\n\nAbstract\n\nGuided by the goal of obtaining an optimization algorithm that is both fast and\nyields good generalization, we study the descent direction maximizing the de-\ncrease in generalization error or the probability of not increasing generalization\nerror. The surprising result is that from both the Bayesian and frequentist perspec-\ntives this can yield the natural gradient direction. Although that direction can be\nvery expensive to compute we develop an ef(cid:2)cient, general, online approximation\nto the natural gradient descent which is suited to large scale problems. 
We report experimental results showing much faster convergence in computation time and in number of iterations with TONGA (Topmoumoute Online Natural Gradient Algorithm) than with stochastic gradient descent, even on very large datasets.\n\nIntroduction\n\nAn efficient optimization algorithm is one that quickly finds a good minimum for a given cost function. An efficient learning algorithm must do the same, with the additional constraint that the function is only known through a proxy. This work aims to improve the ability to generalize through more efficient learning algorithms.\nConsider the optimization of a cost on a training set with access to a validation set. As the end objective is a good solution with respect to generalization, one often uses early stopping: optimizing the training error while monitoring the validation error to fight overfitting. This approach makes the underlying assumption that overfitting happens at the later stages. A better perspective is that overfitting happens all through the learning, but starts being detrimental only at the point where it overtakes the \"true\" learning. In terms of gradients, the gradient of the cost on the training set is never collinear with the true gradient, and the dot product between the two actually eventually becomes negative. Early stopping is designed to determine when that happens. One can thus wonder: can one limit overfitting before that point? Would this actually postpone that point?\nFrom this standpoint, we discover new justifications behind the natural gradient [1]. Depending on certain assumptions, it corresponds either to the direction minimizing the probability of increasing generalization error, or to the direction in which the generalization error is expected to decrease the fastest. 
Unfortunately, natural gradient algorithms suffer from poor scaling properties, both with respect to computation time and memory, when the number of parameters becomes large. To address this issue, we propose a generally applicable online approximation of natural gradient that scales linearly with the number of parameters (and requires computation time comparable to stochastic gradient descent). Experiments show that it can bring significantly faster convergence and improved generalization.\n\n1 Natural gradient\n\nLet L̃ be a cost defined as L̃(θ) = ∫ L(x, θ) p(x) dx, where L is a loss function over some parameters θ and over the random variable x with distribution p(x). The problem of minimizing L̃ over θ is often encountered and can be quite difficult. There exist various techniques to tackle it, their efficiency depending on L and p. In the case of non-convex optimization, gradient descent is a successful technique. The approach consists in progressively updating θ using the gradient g̃ = dL̃/dθ.\n[1] showed that the parameter space is a Riemannian space of metric C̃ (the covariance of the gradients), and introduced the natural gradient as the direction of steepest descent in this space. The natural gradient direction is therefore given by C̃⁻¹g̃. The Riemannian space is known to correspond to the space of functions represented by the parameters (instead of the space of the parameters themselves).\nThe natural gradient somewhat resembles the Newton method. [6] showed that, in the case of a mean squared cost function, the Hessian is equal to the sum of the covariance matrix of the gradients and of an additional term that vanishes to 0 as the training error goes down. Indeed, when the data are generated from the model, the Hessian and the covariance matrix are equal. There are two important differences: the covariance matrix C̃ is positive-definite, which makes the technique more stable, but contains no explicit second order information. The Hessian allows one to account for variations in the parameters, whereas the covariance matrix accounts for slight variations in the set of training samples. It also means that, if the gradients highly disagree in one direction, one should not go in that direction, even if the mean suggests otherwise. In that sense, it is a conservative gradient.\n\n2 A new justification for natural gradient\n\nUntil now, we supposed we had access to the true distribution p. However, this is usually not the case and, in general, the distribution p is only known through the samples of the training set. These samples define a cost L (resp. a gradient g) that, although close to the true cost (resp. gradient), is not equal to it. We shall refer to L as the training error and to L̃ as the generalization error. The danger is then to overfit the parameters θ to the training set, yielding parameters that are not optimal with respect to the generalization error.\nA simple way to fight overfitting consists in determining the point when the continuation of the optimization on L will be detrimental to L̃. This can be done by setting aside some samples to form a validation set that will provide an independent estimate of L̃. Once the error starts increasing on the validation set, the optimization should be stopped. We propose a different perspective on overfitting. Instead of only monitoring the validation error, we consider using as descent direction an estimate of the direction that maximizes the probability of reducing the generalization error. 
The goal is to limit overfitting at every stage, with the hope that the optimal point with respect to the validation set should have lower generalization error.\nConsider a descent direction v. We know that if vᵀg̃ is negative then the generalization error drops (for a reasonably small step) when stepping in the direction of v. Likewise, if vᵀg is negative then the training error drops. Since the learning objective is to minimize generalization error, we would like vᵀg̃ to be as small as possible, or at least always negative.\nBy definition, the gradient on the training set is g = (1/n) Σ_{i=1}^{n} g_i where g_i = ∂L(x_i, θ)/∂θ and n is the number of training samples. With a rough approximation, one can consider the g_i as draws from the true gradient distribution and assume all the gradients are independent and identically distributed. The central limit theorem then gives\n\ng ∼ N(g̃, C̃/n)   (1)\n\nwhere C̃ is the true covariance matrix of ∂L(x, θ)/∂θ with respect to p(x).\nWe will now show that, both in the Bayesian setting (with a Gaussian prior) and in the frequentist setting (with some restrictions over the type of gradient considered), the natural gradient is optimal in some sense.\n\n2.1 Bayesian setting\n\nIn the Bayesian setting, g̃ is a random variable. We would thus like to define a posterior over g̃ given the samples g_i in order to have a posterior distribution over vᵀg̃ for any given direction v. The prior over g̃ will be a Gaussian centered in 0 of variance σ²I. Thus, using eq. 1, the posterior over g̃ given the g_i (assuming the only information over g̃ given by the g_i is through g and C̃) is\n\ng̃ | g, C̃ ∼ N( (I + C̃/(nσ²))⁻¹ g , (I/σ² + nC̃⁻¹)⁻¹ )   (2)\n\nDenoting C̃_σ = I + C̃/(nσ²), we therefore have\n\nvᵀg̃ | g, C̃ ∼ N( vᵀC̃_σ⁻¹g , vᵀC̃_σ⁻¹C̃v / n )   (3)\n\nUsing this result, one can choose between several strategies, among which two are of particular interest:\n• choosing the direction v such that the expected value of vᵀg̃ is the lowest possible (to maximize the immediate gain). In this setting, the direction v to choose is\n\nv ∝ −C̃_σ⁻¹ g.   (4)\n\nIf σ < ∞, this is the regularized natural gradient. In the case of σ = ∞, C̃_σ = I and this is the batch gradient descent.\n• choosing the direction v to minimize the probability of vᵀg̃ being positive. This is equivalent to finding\n\nargmin_v  (vᵀC̃_σ⁻¹g) / √(vᵀC̃_σ⁻¹C̃v)\n\n(we dropped n for the sake of clarity, since it does not change the result). If we square this quantity and take the derivative with respect to v, we find 2C̃_σ⁻¹g (vᵀC̃_σ⁻¹g)(vᵀC̃_σ⁻¹C̃v) − 2C̃_σ⁻¹C̃v (vᵀC̃_σ⁻¹g)² at the numerator. The first term is in the span of C̃_σ⁻¹g and the second one is in the span of C̃_σ⁻¹C̃v. Hence, for the derivative to be zero, we must have g ∝ C̃v (since C̃ and C̃_σ are invertible), i.e.\n\nv ∝ −C̃⁻¹g.   (5)\n\nThis direction is the natural gradient and does not depend on the value of σ.\n\n2.2 Frequentist setting\n\nIn the frequentist setting, g̃ is a fixed unknown quantity. For the sake of simplicity, we will only consider (as all second-order methods do) the directions v of the form v = Mᵀg (i.e. we are only allowed to go in a direction which is a linear function of g).\nSince g ∼ N(g̃, C̃/n), we have\n\nvᵀg̃ = gᵀMg̃ ∼ N( g̃ᵀMg̃ , g̃ᵀMᵀC̃Mg̃ / n )   (6)\n\nThe matrix M* which minimizes the probability of vᵀg̃ being positive satisfies\n\nM* = argmin_M  (g̃ᵀMg̃) / √(g̃ᵀMᵀC̃Mg̃)   (7)\n\nThe numerator of the derivative of this quantity is (g̃ᵀMᵀC̃Mg̃) g̃g̃ᵀ − 2(g̃ᵀMg̃) C̃Mg̃g̃ᵀ. The first term is in the span of g̃ and the second one is in the span of C̃Mg̃. Thus, for this derivative to be 0 for all g̃, one must have M ∝ C̃⁻¹ and we obtain the same result as in the Bayesian case: the natural gradient represents the direction minimizing the probability of increasing the generalization error.\n\n3 Online natural gradient\nThe previous sections provided a number of justifications for using the natural gradient. However, the technique has a prohibitive computational cost, rendering it impractical for large scale problems. Indeed, considering p as the number of parameters and n as the number of examples, a direct batch implementation of the natural gradient is O(p²) in space and O(np² + p³) in time, associated respectively with the storage, computation and inversion of the gradients\u2019 covariance. This section reviews existing low complexity implementations of the natural gradient, before proposing TONGA, a new low complexity, online and generally applicable implementation suited to large scale problems. In the previous sections we assumed the true covariance matrix C̃ to be known. In a practical algorithm we of course use an empirical estimate, and here this estimate is furthermore based on a low-rank approximation denoted C (actually a sequence of estimates C_t).\n\n3.1 Low complexity natural gradient implementations\n[9] proposes a method specific to the case of multilayer perceptrons. 
By operating on blocks of the covariance matrix, this approach attains a lower computational complexity1. However, the technique is quite involved, specific to multilayer perceptrons and requires two assumptions: Gaussian distributed inputs and a number of hidden units much smaller than the number of input units. [2] offers a more general approach based on the Sherman-Morrison formula used in Kalman filters: the technique maintains an empirical estimate of the inverse covariance matrix that can be updated in O(p²). Yet the memory requirement remains O(p²). It is however not necessary to compute the inverse of the gradients\u2019 covariance, since one only needs its product with the gradient. [10] offers two approaches to exploit this. The first uses conjugate gradient descent to solve Cv = g. The second revisits [9], thereby achieving a lower complexity. [8] also proposes an iterative technique based on the minimization of a different cost. This technique is used in the minibatch setting, where Cv can be computed cheaply through two matrix-vector products. However, estimating the gradient covariance only from the small number of examples in one minibatch yields an unstable estimate.\n\n3.2 TONGA\nExisting techniques fail to provide an implementation of the natural gradient adequate for the large scale setting. Their main failings are with respect to computational complexity or stability. TONGA was designed to address these issues, which it does by maintaining a low rank approximation of the covariance and by casting both problems of finding the low rank approximation and of computing the natural gradient in a lower dimensional space, thereby attaining a much lower complexity. 
What we exploit here is that although a covariance matrix needs many gradients to be estimated, we can take advantage of an observed property: it generally varies smoothly as training proceeds and moves in parameter space.\n\n3.2.1 Computing the natural gradient direction between two eigendecompositions\nEven though our motivation for the use of natural gradient involved the covariance matrix of the empirical gradients, we will use the second moment (i.e. the uncentered covariance matrix) throughout the paper (and so did Amari in his work). The main reason is numerical stability. Indeed, in the batch setting, we have (assuming C is the centered covariance matrix and g the mean) v = C⁻¹g, thus Cv = g. But then, (C + ggᵀ)v = g + ggᵀv = g(1 + gᵀv) and\n\n(C + ggᵀ)⁻¹g = v / (1 + gᵀv) = v̄   (8)\n\nEven though the direction is the same, the scale changes and the norm of the direction is bounded by 1 / (‖g‖ cos(g, v)).\n\n1 Though the technique allows for a compact representation of the covariance matrix, the working memory requirement remains the same.\n\nSince TONGA operates using a low rank estimate of the gradients\u2019 uncentered covariance, we must be able to update it cheaply. When presented with a new gradient, we integrate its information using the following update formula2:\n\nC_t = γ Ĉ_{t−1} + g_t g_tᵀ   (9)\n\nwhere C_0 = 0 and Ĉ_{t−1} is the low rank approximation at time step t − 1. C_t is now likely of greater rank, and the problem resides in computing its low rank approximation Ĉ_t. Writing Ĉ_{t−1} = X_{t−1}X_{t−1}ᵀ, we have\n\nC_t = X_t X_tᵀ with X_t = [√γ X_{t−1}, g_t]   (10)\n\nWith such covariance matrices, computing the (regularized) natural direction v_t amounts to\n\nv_t = (C_t + λI)⁻¹ g_t = (X_t X_tᵀ + λI)⁻¹ X_t y_t with y_t = [0, ..., 0, 1]ᵀ.   (11)\n\nUsing the Woodbury identity with positive definite matrices [7], we have\n\nv_t = X_t (X_tᵀX_t + λI)⁻¹ y_t   (12)\n\nIf X_t is of size p × r (with r < p, thus yielding a covariance matrix of rank r), the cost of this computation is O(pr² + r³). However, since the Gram matrix G_t = X_tᵀX_t can be rewritten as\n\nG_t = [ γ G_{t−1} , √γ X_{t−1}ᵀg_t ; √γ g_tᵀX_{t−1} , g_tᵀg_t ]   (13)\n\nthe cost of computing G_t using G_{t−1} reduces to O(pr + r³). This stresses the need to keep r small.\n\n3.2.2 Updating the low-rank estimate of C_t\nTo keep a low-rank estimate of C_t = X_t X_tᵀ, we can compute its eigendecomposition and keep only the first k eigenvectors. This can be done at low cost using its relation to that of G_t:\n\nG_t = V D Vᵀ\nC_t = (X_t V D^{−1/2}) D (X_t V D^{−1/2})ᵀ   (14)\n\nThe cost of such an eigendecomposition is O(kr² + pkr) (for the computation of the eigendecomposition of the Gram matrix and the computation of the eigenvectors, respectively). 
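The identity in eq. 12 is what makes the per-step cost manageable: the p × p inverse of eq. 11 is replaced by an r × r one through the Gram matrix. A minimal NumPy sketch (our own illustration, not from the paper; all variable names are ours) checking that both routes give the same natural direction:

```python
import numpy as np

rng = np.random.default_rng(0)
p, r, lam = 200, 5, 0.1            # number of parameters, rank, regularizer lambda

X = rng.standard_normal((p, r))    # stand-in for X_t: scaled past gradients as columns
g = X[:, -1]                       # the current gradient g_t is the last column of X_t
y = np.zeros(r)
y[-1] = 1.0                        # y_t = [0, ..., 0, 1]^T, so that X_t @ y_t = g_t

# Direct route (eq. 11): solve a p x p system, O(p^3) if done naively
v_direct = np.linalg.solve(X @ X.T + lam * np.eye(p), g)

# Low-rank route (eq. 12): only an r x r system involving the Gram matrix G_t
G = X.T @ X
v_lowrank = X @ np.linalg.solve(G + lam * np.eye(r), y)

assert np.allclose(v_direct, v_lowrank)
```

The equality holds exactly (up to floating point) by the push-through form of the Woodbury identity, (XXᵀ + λI)⁻¹X = X(XᵀX + λI)⁻¹, which is why only the small system ever needs to be solved.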
Since the cost of computing the natural direction is O(pr + r³), it is computationally more efficient to let the rank of X_t grow for several steps (using formula 12 in between) and then compute the eigendecomposition using\n\nC_{t+b} = X_{t+b} X_{t+b}ᵀ with X_{t+b} = [γ^{b/2} U_t, γ^{(b−1)/2} g_{t+1}, ..., γ^{1/2} g_{t+b−1}, g_{t+b}]\n\nwith U_t the unnormalized eigenvectors computed during the previous eigendecomposition.\n\n3.2.3 Computational complexity\nThe computational complexity of TONGA depends on the complexity of updating the low rank approximation and on the complexity of computing the natural gradient. The cost of updating the approximation is in O(k(k + b)² + p(k + b)k) (as above, using r = k + b). The cost of computing the natural gradient v_t is in O(p(k + b) + (k + b)³) (again, as above, using r = k + b). Assuming k + b ≪ √p and k ≤ b, TONGA\u2019s total computational cost per natural gradient computation is then O(pb).\nFurthermore, by operating on minibatch gradients of size b′, we end up with a cost per example of O(pb/b′). Choosing b = b′ yields O(p) per example, the same as stochastic gradient descent. Empirical comparison also shows comparable CPU time per example, but faster convergence. In our experiments, p was in the tens of thousands, k was less than 5 and b was less than 50.\nThe result is an approximate natural gradient with low complexity, general applicability and flexibility over the trade-off between computation and the quality of the estimate.\n\n2 The second term is not weighted by 1 − γ so that the influence of g_t in C_t is the same for all t, even t = 0. To keep the magnitude of the matrix constant, one must use a normalization constant equal to 1 + γ + ... + γ^t.\n\n4 Block-diagonal online natural gradient for neural networks\n\nOne might wonder if there are better approximations of the covariance matrix C than computing its first k eigenvectors. One possibility is a block-diagonal approximation from which to retain only the first k eigenvectors of every block (the value of k can be different for each block). Indeed, [4] showed that the Hessian of a neural network with one hidden layer trained with the cross-entropy cost converges to a block diagonal matrix during optimization. These blocks are composed of the weights linking all the hidden units to one output unit and all the input units to one hidden unit. Given the close relationship between the Hessian and the covariance matrices, we can assume they have a similar shape during the optimization.\nFigure 1 shows the correlation between the standard stochastic gradients of the parameters of a 16-50-26 neural network. The first blocks represent the weights going from the input units to each hidden unit (thus 50 blocks of size 17, bias included) and the following represent the weights going from the hidden units to each output unit (26 blocks of size 51). One can see that the block-diagonal approximation is reasonable. 
Thus, instead of selecting only k eigenvectors to represent the full covariance matrix, we can select k eigenvectors for every block, yielding the same total cost. However, the rank of the approximation goes from k to k × (number of blocks). In the matrices shown in figure 1, which are of size 2176, a value of k = 5 yields an approximation of rank 380.\n\nFigure 1: Absolute correlation between the standard stochastic gradients after one epoch in a neural network with 16 input units, 50 hidden units and 26 output units when following stochastic gradient directions (left) and natural gradient directions (center and right). (a) Stochastic gradient. (b) TONGA. (c) TONGA - zoom.\n\nFigure 2 shows the ratio of Frobenius norms ‖C − C̄‖²_F / ‖C‖²_F for different types of approximations C̄ (full or block-diagonal). We can first notice that approximating only the blocks yields a ratio of .35 (in comparison, taking only the diagonal of C yields a ratio of .80), even though we considered only 82076 out of the 4734976 elements of the matrix (1.73% of the total). This ratio is almost obtained with k = 6. We can also notice that, for k < 30, the block-diagonal approximation is much better (in terms of the Frobenius norm) than the full approximation. 
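This Frobenius-ratio comparison is easy to reproduce in miniature. The sketch below is our own toy illustration (a synthetic block-diagonal matrix stands in for the gradient covariance of figure 1; all names are ours): it contrasts a rank-k approximation of the full matrix with a rank-k-per-block approximation, which stores the same number of values but has rank k × (number of blocks).

```python
import numpy as np

def low_rank(C, k):
    """Best rank-k approximation of a symmetric PSD matrix (its top-k eigenpairs)."""
    w, V = np.linalg.eigh(C)
    top = np.argsort(w)[::-1][:k]
    return (V[:, top] * w[top]) @ V[:, top].T

def frob_ratio(C, C_bar):
    """||C - C_bar||_F^2 / ||C||_F^2, the quality measure plotted in figure 2."""
    return np.linalg.norm(C - C_bar) ** 2 / np.linalg.norm(C) ** 2

rng = np.random.default_rng(0)
n_blocks, d = 4, 20                        # 4 blocks of size 20 -> an 80 x 80 matrix
C = np.zeros((n_blocks * d, n_blocks * d))
for i in range(n_blocks):                  # block-diagonal covariance, as in figure 1
    B = rng.standard_normal((d, 2 * d))
    C[d*i:d*(i+1), d*i:d*(i+1)] = B @ B.T / (2 * d)

k = 3
full_bar = low_rank(C, k)                  # k eigenvectors for the whole matrix
block_bar = np.zeros_like(C)               # k eigenvectors per block: same storage,
for i in range(n_blocks):                  # but rank k * n_blocks instead of k
    sl = slice(d * i, d * (i + 1))
    block_bar[sl, sl] = low_rank(C[sl, sl], k)

# On a matrix that really is block-diagonal, the per-block version wins
assert frob_ratio(C, block_bar) < frob_ratio(C, full_bar)
```

On real gradient covariances the off-block entries are small but nonzero, so the gap is less extreme than on this idealized example, yet (as reported above) still clearly favors the block-diagonal approximation for small k.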
The block diagonal approximation is therefore very cost effective.\n\nFigure 2: Quality of the approximation C̄ of the covariance C depending on the number k of eigenvectors kept, in terms of the ratio of Frobenius norms ‖C − C̄‖²_F / ‖C‖²_F, for different types of approximation C̄ (full matrix or block diagonal). (a) Full view. (b) Zoom.\n\nThis shows the block diagonal approximation constitutes a powerful and cheap approximation of the covariance matrix in the case of neural networks. Yet this approximation also readily applies to any mixture algorithm where we can assume independence between the components.\n\n5 Experiments\n\nWe performed a small number of experiments with TONGA approximating the full covariance matrix, keeping the overhead of the natural gradient small (i.e., limiting the rank of the approximation). Regrettably, TONGA performed only as well as stochastic gradient descent, while being rather sensitive to the hyperparameter values. The following experiments, on the other hand, use TONGA with the block diagonal approximation and yield impressive results. 
We believe this is a reflection of the phenomenon illustrated in figure 2: the block diagonal approximation makes for a very cost effective approximation of the covariance matrix. All the experiments have been made optimizing hyperparameters on a validation set (not shown here) and selecting the best set of hyperparameters for testing, trying to keep small the overhead due to natural gradient calculations.\nOne could worry about the number of hyperparameters of TONGA. However, default values of k = 5, b = 50 and γ = .995 yielded good results in every experiment. When λ goes to infinity, TONGA becomes the standard stochastic gradient algorithm. Therefore, a simple heuristic for λ is to progressively tune it down. In our experiments, we only tried powers of ten.\n\n5.1 MNIST dataset\n\nThe MNIST digits dataset consists of 50000 training samples, 10000 validation samples and 10000 test samples, each one composed of 784 pixels. There are 10 different classes (one for every digit).\n\nFigure 3: Comparison between stochastic gradient and TONGA on the MNIST dataset (50000 training examples), in terms of training and test classification error and Negative Log-Likelihood (NLL) as a function of CPU time (in seconds). Curves are shown for block diagonal TONGA and for stochastic gradient with batch sizes 1, 400, 1000 and 2000. (a) Train class error. (b) Test class error. (c) Train NLL. (d) Test NLL. The mean and standard error have been computed using 9 different initializations.\n\nFigure 3 shows that in terms of training CPU time (which includes the overhead due to TONGA), TONGA allows much faster convergence in training NLL, as well as in testing classification error and testing NLL, than ordinary stochastic and minibatch gradient descent on this task. One can also note that minibatch stochastic gradient is able to profit from matrix-matrix multiplications, but this advantage is mainly seen in training classification error.\n\n5.2 Rectangles problem\n\nThe Rectangles-images task has been proposed in [5] to compare deep belief networks and support vector machines. It is a two-class problem and the inputs are 28 × 28 grey-level images of rectangles located in varying locations and of different dimensions. The inside of the rectangle and the background are extracted from different real images. 
We used 900,000 training examples and 10,000 validation examples (no early stopping was performed; we show the whole training/validation curves). All the experiments are performed with a multi-layer network with a 784-200-200-100-2 architecture (previously found to work well on this dataset). Figure 4 shows that in terms of training CPU time, TONGA allows much faster convergence than ordinary stochastic gradient descent on this task, as well as lower classification error.\n\nFigure 4: Comparison between stochastic gradient descent and block diagonal TONGA w.r.t. NLL and classification error, on training and validation sets for the rectangles problem (900,000 training examples), as a function of CPU time (in seconds). (a) Train NLL error. (b) Test NLL error. (c) Train class error. (d) Test class error.\n\n6 Discussion\n[3] reviews the different gradient descent techniques in the online setting and discusses their respective properties. In particular, he states that a second order online algorithm (i.e., one with a search direction v = Mg, with g the gradient and M a positive semidefinite matrix) is optimal (in terms of convergence speed) when M converges to H⁻¹. Furthermore, the speed of convergence depends (amongst other things) on the rank of the matrix M. Given the aforementioned relationship between the covariance and the Hessian matrices, the natural gradient is close to optimal in the sense defined above, provided the model has enough capacity. On mixture models where the block-diagonal approximation is appropriate, it allows us to maintain an approximation of much higher rank than a standard low-rank approximation of the full covariance matrix.\n\nConclusion and future work\nWe bring two main contributions in this paper. First, by looking for the descent direction with either the greatest probability of not increasing generalization error or the largest expected decrease in generalization error, we obtain new justifications for the natural gradient descent direction. Second, we present an online low-rank approximation of natural gradient descent with computational complexity and CPU time similar to stochastic gradient descent. In a number of experimental comparisons we find this optimization technique to beat stochastic gradient in terms of speed and generalization (or in generalization for a given amount of training time). Even though default values for the hyperparameters yield good results, it would be interesting to have an automatic procedure to select the best set of hyperparameters.\n\nReferences\n[1] S. Amari. 
Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.\n[2] S. Amari, H. Park, and K. Fukumizu. Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Computation, 12(6):1399–1409, 2000.\n[3] L. Bottou. Stochastic learning. In O. Bousquet and U. von Luxburg, editors, Advanced Lectures on Machine Learning, number LNAI 3176 in Lecture Notes in Artificial Intelligence, pages 146–168. Springer Verlag, Berlin, 2004.\n[4] R. Collobert. Large Scale Machine Learning. PhD thesis, Université de Paris VI, LIP6, 2004.\n[5] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In Twenty-fourth International Conference on Machine Learning (ICML\u20192007), 2007.\n[6] Y. LeCun, L. Bottou, G. Orr, and K.-R. Müller. Efficient backprop. In G. Orr and K.-R. Müller, editors, Neural Networks: Tricks of the Trade, pages 9–50. Springer, 1998.\n[7] K. B. Petersen and M. S. Pedersen. The matrix cookbook, feb 2006. Version 20051003.\n[8] N. N. Schraudolph. Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7):1723–1738, 2002.\n[9] H. H. Yang and S. Amari. Natural gradient descent for training multi-layer perceptrons. Submitted to IEEE Tr. on Neural Networks, 1997.\n[10] H. H. Yang and S. Amari. Complexity issues in natural gradient descent method for training multi-layer perceptrons. Neural Computation, 10(8):2137–2157, 1998.\n", "award": [], "sourceid": 56, "authors": [{"given_name": "Nicolas", "family_name": "Roux", "institution": null}, {"given_name": "Pierre-antoine", "family_name": "Manzagol", "institution": null}, {"given_name": "Yoshua", "family_name": "Bengio", "institution": null}]}