{"title": "The Efficiency and the Robustness of Natural Gradient Descent Learning Rule", "book": "Advances in Neural Information Processing Systems", "page_first": 385, "page_last": 391, "abstract": null, "full_text": "The Efficiency and the Robustness of Natural Gradient Descent Learning Rule \n\nHoward Hua Yang \nDepartment of Computer Science \nOregon Graduate Institute \nPO Box 91000, Portland, OR 97291, USA \nhyang@cse.ogi.edu \n\nShun-ichi Amari \nLab. for Information Synthesis \nRIKEN Brain Science Institute \nWako-shi, Saitama 351-01, JAPAN \namari@zoo.brain.riken.go.jp \n\nAbstract \n\nThe inverse of the Fisher information matrix is used in the natural gradient descent algorithm to train single-layer and multi-layer perceptrons. We have discovered a new scheme to represent the Fisher information matrix of a stochastic multi-layer perceptron. Based on this scheme, we have designed an algorithm to compute the natural gradient. When the input dimension n is much larger than the number of hidden neurons, the complexity of this algorithm is of order O(n). It is confirmed by simulations that the natural gradient descent learning rule is not only efficient but also robust. \n\n1 INTRODUCTION \n\nThe inverse of the Fisher information matrix is required to find the Cramer-Rao lower bound when analyzing the performance of an unbiased estimator. It is also needed in the natural gradient learning framework (Amari, 1997) to design statistically efficient algorithms for estimating parameters in general and for training neural networks in particular. In this paper, we assume a stochastic model for multi-layer perceptrons. Considering a Riemannian parameter space in which the Fisher information matrix is a metric tensor, we apply the natural gradient learning rule to train single-layer and multi-layer perceptrons.
The main difficulty encountered is to compute the inverse of the Fisher information matrix, which has large dimensions when the input dimension is high. By exploiting the structure of the Fisher information matrix and its inverse, we design a fast algorithm with lower complexity to implement the natural gradient learning algorithm. \n\n2 A STOCHASTIC MULTI-LAYER PERCEPTRON \n\nAssume the following model of a stochastic multi-layer perceptron: \n\nz = Σ_{i=1}^{m} a_i φ(w_i^T x + b_i) + ξ    (1) \n\nwhere (·)^T denotes the transpose, ξ ~ N(0, σ²) is a Gaussian random variable, and φ(x) is a differentiable output function for the hidden neurons. Assume the multi-layer network has an n-dimensional input, m hidden neurons, a one-dimensional output, and m ≤ n. Denote by a = (a_1, ..., a_m)^T the weight vector of the output neuron, w_i = (w_{1i}, ..., w_{ni})^T the weight vector of the i-th hidden neuron, and b = (b_1, ..., b_m)^T the vector of thresholds for the hidden neurons. Let W = [w_1, ..., w_m] be the matrix formed by the column weight vectors w_i; then (1) can be rewritten as z = a^T φ(W^T x + b) + ξ. Here, the scalar function φ operates on each component of the vector W^T x + b. \n\nThe joint probability density function (pdf) of the input and the output is \n\np(x, z; W, a, b) = p(z|x; W, a, b) p(x). \n\nDefine a loss function: \n\nL(x, z; θ) = -log p(x, z; θ) = l(z|x; θ) - log p(x) \n\nwhere θ = (w_1^T, ..., w_m^T, a^T, b^T)^T includes all the parameters to be estimated and \n\nl(z|x; θ) = -log p(z|x; θ) = (1/(2σ²)) (z - a^T φ(W^T x + b))². \n\nSince ∂L/∂θ = ∂l/∂θ, the Fisher information matrix is defined by \n\nG(θ) = E[(∂L/∂θ)(∂L/∂θ)^T] = E[(∂l/∂θ)(∂l/∂θ)^T].    (2) \n\nThe inverse of G(θ) is often used in the Cramer-Rao inequality: \n\nE[||θ̂ - θ*||² | θ*] ≥ Tr(G⁻¹(θ*)) \n\nwhere θ̂ is an unbiased estimator of a true parameter θ*.
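As a concrete illustration of the model (1) and the loss l(z|x; θ), the following sketch samples from a small stochastic perceptron and evaluates the per-example negative log-likelihood (with the constant term dropped, as above). The dimensions, the seed, and the tanh output function are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, sigma = 7, 2, 0.2          # input dim, hidden neurons (m <= n), noise level

W = rng.standard_normal((n, m))  # column i is the hidden weight vector w_i
a = rng.standard_normal(m)       # output weights a = (a_1, ..., a_m)^T
b = rng.standard_normal(m)       # hidden thresholds

def phi(u):
    # any differentiable output function works; tanh is an illustrative choice
    return np.tanh(u)

def sample_z(x):
    """Draw z = a^T phi(W^T x + b) + xi with xi ~ N(0, sigma^2), as in (1)."""
    return a @ phi(W.T @ x + b) + sigma * rng.standard_normal()

def loss_l(x, z):
    """l(z|x; theta) = (1/(2 sigma^2)) (z - a^T phi(W^T x + b))^2."""
    r = z - a @ phi(W.T @ x + b)
    return r * r / (2.0 * sigma ** 2)

x = rng.standard_normal(n)
z = sample_z(x)
print(loss_l(x, z))
```

The loss is zero exactly when z equals the noiseless network output, and otherwise scales the squared residual by 1/(2σ²).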
For the on-line estimator θ̂_t based on the independent examples {(x_s, z_s), s = 1, ..., t} drawn from the probability law p(x, z; θ*), the Cramer-Rao inequality for the on-line estimator is \n\nE[||θ̂_t - θ*||² | θ*] ≥ (1/t) Tr(G⁻¹(θ*)).    (3) \n\n3 NATURAL GRADIENT LEARNING \n\nConsider a parameter space Θ = {θ} in which the divergence between two points θ₁ and θ₂ is given by the Kullback-Leibler divergence \n\nD(θ₁, θ₂) = KL[p(x, z; θ₁) || p(x, z; θ₂)]. \n\nWhen the two points are infinitesimally close, we have the quadratic form \n\nD(θ, θ + dθ) = (1/2) dθ^T G(θ) dθ.    (4) \n\nThis is regarded as the square of the length of dθ. Since G(θ) depends on θ, the parameter space is regarded as a Riemannian space in which the local distance is defined by (4). Here, the Fisher information matrix G(θ) plays the role of the Riemannian metric tensor. \n\nIt is shown by Amari (1997) that the steepest descent direction of a loss function C(θ) in this Riemannian space is \n\n-∇̃C(θ) = -G⁻¹(θ) ∇C(θ). \n\nThe natural gradient descent method is to decrease the loss function by updating the parameter vector along this direction. By multiplying by G⁻¹(θ), the covariant gradient ∇C(θ) is converted into its contravariant form G⁻¹(θ) ∇C(θ), which is consistent with the contravariant differential form dC(θ). \n\nInstead of using l(z|x; θ) we use the following loss function: \n\nl₁(z|x; θ) = (1/2)(z - a^T φ(W^T x + b))². \n\nWe have proved in [5] that G(θ) = (1/σ²) A(θ), where A(θ) does not depend on the unknown σ. So G⁻¹(θ) ∂l/∂θ = A⁻¹(θ) ∂l₁/∂θ. The on-line learning algorithms based on the gradient ∂l₁/∂θ and the natural gradient A⁻¹(θ) ∂l₁/∂θ are, respectively, \n\nθ_{t+1} = θ_t - μ_t (∂l₁/∂θ)(z_t|x_t; θ_t),    (5) \n\nθ_{t+1} = θ_t - μ'_t A⁻¹(θ_t) (∂l₁/∂θ)(z_t|x_t; θ_t)    (6) \n\nwhere μ_t and μ'_t are learning rates.
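The difference between the plain update (5) and the natural update (6) can be seen on a toy quadratic loss where the metric is known exactly. The sketch below is illustrative (the matrix A, the parameter values, and the step sizes are made-up examples, not from the paper); note that it solves the linear system A d = ∇l rather than forming A⁻¹ explicitly.

```python
import numpy as np

def gd_step(theta, grad, mu):
    """Ordinary gradient step, as in Eq. (5)."""
    return theta - mu * grad

def natural_gd_step(theta, grad, A, mu):
    """Natural gradient step, as in Eq. (6): theta - mu * A^{-1} grad.
    Solving A d = grad avoids computing A^{-1} explicitly."""
    return theta - mu * np.linalg.solve(A, grad)

# Toy quadratic loss C(theta) = 0.5 * theta^T A theta with an
# ill-conditioned metric: one natural step with mu = 1 jumps straight
# to the minimum, while the plain step must use a small mu to be stable.
A = np.diag([100.0, 1.0])
theta = np.array([1.0, 1.0])
grad = A @ theta
print(natural_gd_step(theta, grad, A, mu=1.0))   # -> [0. 0.]
```

On this example the plain gradient step moves 100 times faster along the first coordinate than the second, which is exactly the imbalance the metric A⁻¹ corrects.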
When the negative log-likelihood function is chosen as the loss function, the natural gradient descent algorithm (6) gives a Fisher efficient on-line estimator (Amari, 1997), i.e., the asymptotic variance of θ_t driven by (6) satisfies \n\nE[(θ_t - θ*)(θ_t - θ*)^T] ≈ (1/t) G⁻¹(θ*),    (7) \n\nwhich gives the mean square error \n\nE[||θ_t - θ*||² | θ*] ≈ (1/t) Tr(G⁻¹(θ*)).    (8) \n\nThe main difficulty in implementing the natural gradient descent algorithm (6) is to compute the natural gradient on-line. To overcome this difficulty, we studied the structure of the matrix A(θ) in [5] and proposed an efficient scheme to represent this matrix. Here, we briefly describe this scheme. \n\nLet A(θ) = [A_{ij}]_{(m+2)×(m+2)} be a partition of A(θ) corresponding to the partition of θ = (w_1^T, ..., w_m^T, a^T, b^T)^T. Denote u_i = w_i/||w_i||, i = 1, ..., m, U₁ = [u₁, ..., u_m] and [v₁, ..., v_m] = U₁(U₁^T U₁)⁻¹. It has been proved in [5] that the blocks in A(θ) are divided into three classes: C₁ = {A_{ij}, i, j = 1, ..., m}, C₂ = {A_{i,m+1}, A^T_{m+1,i}, A_{i,m+2}, A^T_{m+2,i}, i = 1, ..., m} and C₃ = {A_{m+i,m+j}, i, j = 1, 2}. Each block in C₁ is a linear combination of the matrices u_k v_l^T, k, l = 1, ..., m, and Π₀ = I - Σ_{k=1}^{m} u_k v_k^T. Each block in C₂ is a matrix whose columns are linear combinations of {v_k, k = 1, ..., m}. The coefficients in these combinations are integrals with respect to the multivariate Gaussian distribution N(0, R₁), where R₁ = U₁^T U₁ is m × m. Each block in C₃ is an m × m matrix whose entries are also integrals with respect to N(0, R₁). Detailed expressions for these integrals are given in [5]. When φ(x) = erf(x/√2), using the techniques in (Saad and Solla, 1995), we can find analytic expressions for most of these integrals. \n\nThe dimension of A(θ) is (nm + 2m) × (nm + 2m).
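The representation above rests on the identity v_k^T u_l = δ_{kl} (since [v₁, ..., v_m] = U₁(U₁^T U₁)⁻¹), which makes Π₀ = I - Σ_k u_k v_k^T a projector that annihilates every u_i; only the m pairs (u_k, v_k) of length n plus the scalar coefficients need storing, not full n × n blocks. A small numerical check of these properties (random W and illustrative sizes; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 7, 3                                  # illustrative sizes, m <= n

W = rng.standard_normal((n, m))              # column weight vectors w_i
U1 = W / np.linalg.norm(W, axis=0)           # u_i = w_i / ||w_i||
V = U1 @ np.linalg.inv(U1.T @ U1)            # [v_1, ..., v_m] = U1 (U1^T U1)^{-1}

P = U1 @ V.T                                 # sum_k u_k v_k^T
Pi0 = np.eye(n) - P                          # Pi_0 = I - sum_k u_k v_k^T

assert np.allclose(V.T @ U1, np.eye(m))      # v_k^T u_l = delta_{kl}
assert np.allclose(Pi0 @ Pi0, Pi0)           # Pi_0 is idempotent (a projector)
assert np.allclose(Pi0 @ U1, 0.0)            # Pi_0 annihilates every u_i
print("projector checks passed")
```

Since Σ_k u_k v_k^T = U₁(U₁^T U₁)⁻¹U₁^T is the orthogonal projector onto span{u₁, ..., u_m}, Π₀ projects onto its orthogonal complement.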
When the input dimension n is much larger than the number of hidden neurons, by using the above scheme the space for storing this large matrix is reduced from O(n²) to O(n). We also gave a fast algorithm in [5] to compute A⁻¹(θ) and the natural gradient, with time complexity O(n²) and O(n) respectively. The trick is to make use of the structure of the matrix A⁻¹(θ). \n\n4 SIMULATION \n\nIn this section, we give some simulation results to demonstrate that the natural gradient descent algorithm is efficient and robust. \n\n4.1 Single-layer perceptron \n\nAssume 7-dimensional inputs x_t ~ N(0, I) and φ(u) = (1 - e^{-u})/(1 + e^{-u}). For the single-layer perceptron, z = φ(w^T x), the on-line gradient descent (GD) and the natural GD algorithms are, respectively, \n\nw_{t+1} = w_t + μ₀(t)(z_t - φ(w_t^T x_t)) φ'(w_t^T x_t) x_t,    (9) \n\nw_{t+1} = w_t + μ₁(t) A⁻¹(w_t)(z_t - φ(w_t^T x_t)) φ'(w_t^T x_t) x_t    (10) \n\nwhere \n\nA(w) = d₁(w)(I - u u^T) + d₂(w) u u^T, u = w/||w||,    (11) \n\nd₁(w) = E[φ'²(||w|| Z)],    (12) \n\nd₂(w) = E[Z² φ'²(||w|| Z)], Z ~ N(0, 1),    (13) \n\nand μ₀(t) and μ₁(t) are two learning rate schedules defined by μ_i(t) = μ(η_i, c_i, τ_i; t), i = 0, 1. Here, \n\nμ(η, c, τ; t) = η (1 + (c t)/(η τ)) / (1 + (c t)/(η τ) + t²/τ)    (14) \n\nis the search-then-converge schedule proposed by (Darken and Moody, 1992). Note that t < τ is a \"search phase\" and t > τ is a \"converge phase\". When τ_i = 1, the learning rate function μ_i(t) has no search phase but a weaker converge phase when η_i is small. When t is large, μ_i(t) decreases as c/t. \n\nRandomly choose a 7-dimensional vector as w* for the teacher network: \n\nw* = [-1.1043, 0.4302, 1.1978, 1.5317, -2.2946, -0.7866, 0.4428]^T. \n\nChoose η₀ = 1.25, η₁ = 0.05, c₀ = 8.75, c₁ = 1, and τ₀ = τ₁ = 1. These parameters are selected by trial and error to optimize the performance of the GD and the natural GD methods at the noise level σ = 0.2. The training examples {(x_t, z_t)} are generated by z_t = φ(w*^T x_t) + ξ_t, where ξ_t ~ N(0, σ²) and σ² is unknown to the algorithms.
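The schedule (14) is easy to check numerically: at t = 0 it returns η (the search-phase plateau), and for large t it behaves like c/t. A minimal sketch using the η₀, c₀, τ₀ values quoted above (the tolerance in the large-t check is an illustrative choice):

```python
def mu(eta, c, tau, t):
    """Search-then-converge schedule, Eq. (14):
    mu(eta, c, tau; t) = eta * (1 + c*t/(eta*tau)) / (1 + c*t/(eta*tau) + t*t/tau)
    """
    r = c * t / (eta * tau)
    return eta * (1.0 + r) / (1.0 + r + t * t / tau)

eta0, c0, tau0 = 1.25, 8.75, 1.0   # GD schedule parameters from the simulations

print(mu(eta0, c0, tau0, 0))       # -> 1.25, i.e. mu(0) = eta
t = 10 ** 6
# For large t the schedule decreases as c/t (the "converge" phase):
print(abs(mu(eta0, c0, tau0, t) * t / c0 - 1.0) < 1e-3)   # -> True
```

With τ = 1 as chosen here, the search phase is effectively absent and the schedule moves directly into its c/t decay, which matches the remark above.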
Let w_t and w̃_t be the weight vectors driven by equations (9) and (10) respectively. ||w_t - w*|| and ||w̃_t - w*|| are the error functions for the GD and the natural GD. \n\nDenote w* = ||w*||. From equation (11), we obtain the Cramer-Rao Lower Bound (CRLB) for the deviation at the true weight vector w*: \n\nCRLB(t) = (σ/√t) [(n - 1)/d₁(w*) + 1/d₂(w*)]^{1/2}.    (15) \n\nFigure 1: Performance of the GD and the natural GD at different noise levels σ = 0.2, 0.4, 1. \n\nIt is shown in Figure 1 that the natural GD algorithm reaches the CRLB at all of these noise levels, while the GD algorithm reaches the CRLB only at the noise level σ = 0.2. The robustness of the natural gradient descent against the additive noise in the training examples is clearly shown by Figure 1. When the teacher signal is non-stationary, our simulations show that the natural GD algorithm also reaches the CRLB. \n\nFigure 2: Performance of the GD and the natural GD when η₀ = 1.25, 1.75, 2.25, 2.75, η₁ = 0.05, 0.2, 0.4425, 0.443, and c₀ = 8.75 and c₁ = 1 are fixed. \n\nFigure 2 shows that the natural GD algorithm is more robust than the GD algorithm against changes in the learning rate schedule. The performance of the GD algorithm deteriorates when the constant η₀ in the learning rate schedule μ₀(t) differs from the optimal one. On the contrary, the natural GD algorithm performs almost the same for all η₁ within the interval [0.05, 0.4425].
Figure 2 also shows that the natural GD algorithm breaks down when η₁ is larger than the critical number 0.443. This means that the weak converge phase in the learning rate schedule is necessary. \n\n4.2 Multi-layer perceptron \n\nLet us consider a simple multi-layer perceptron with 2-dimensional input and 2 hidden neurons. The problem is to train the committee machine y =