{"title": "Adaptive Back-Propagation in On-Line Learning of Multilayer Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 323, "page_last": 329, "abstract": null, "full_text": "Adaptive Back-Propagation in On-Line \n\nLearning of Multilayer Networks \n\nAnsgar H. L. West 1,2 and David Saad2 \n\n1 Department of Physics, University of Edinburgh \n\nEdinburgh EH9 3JZ, U.K. \n\n2Neural Computing Research Group, University of Aston \n\nBirmingham B4 7ET, U.K. \n\nAbstract \n\nAn adaptive back-propagation algorithm is studied and compared \nwith gradient descent (standard back-propagation) for on-line \nlearning in two-layer neural networks with an arbitrary number \nof hidden units. Within a statistical mechanics framework , both \nnumerical studies and a rigorous analysis show that the adaptive \nback-propagation method results in faster training by breaking the \nsymmetry between hidden units more efficiently and by providing \nfaster convergence to optimal generalization than gradient descent. \n\n1 \n\nINTRODUCTION \n\nMultilayer feedforward perceptrons (MLPs) are widely used in classification and \nregression applications due to their ability to learn a range of complicated maps [1] \nfrom examples. When learning a map fo from N-dimensional inputs e to scalars ( \nthe parameters {W} of the student network are adjusted according to some training \nalgorithm so that the map defined by these parameters fw approximates the teacher \nfo as close as possible. 
The resulting performance is measured by the generalization \nerror ε_g, the average of a suitable error measure ε over all possible inputs: ε_g = ⟨ε⟩_ξ. \nThis error measure is normally defined as the squared distance between the output \nof the network and the desired output, i.e., \n\nε = (1/2) [f_W(ξ) - ζ]². \n\n(1) \n\nOne distinguishes between two learning schemes: batch learning, where training \nalgorithms are generally based on minimizing the above error on the whole set of \ngiven examples, and on-line learning, where single examples are presented serially \nand the training algorithm adjusts the parameters after the presentation of each \nexample. \n\n\f324 \n\nA. H. L. WEST, D. SAAD \n\nWe measure the efficiency of these training algorithms by how fast (or \nwhether at all) they converge to an \"acceptable\" generalization error. \n\nThis research has been motivated by recent work [2] investigating an on-line learning \nscenario of a general two-layer student network trained by gradient descent on a \ntask defined by a teacher network of similar architecture. It has been found that in \nthe early stages of training the student is drawn into a suboptimal symmetric phase, \ncharacterized by each student node imitating all teacher nodes with the same degree \nof success. Although the symmetry between the student nodes is eventually broken \nand the student converges to the minimal achievable generalization error, the \nmajority of the training time may be spent with the system trapped in the symmetric \nregime, as one can see in Fig. 1. To investigate possible improvements we introduce \nan adaptive back-propagation algorithm, which improves the ability of the student \nto distinguish between hidden nodes of the teacher. We compare its efficiency with \nthat of gradient descent in training two-layer networks following the framework \nof [2]. 
In this paper we present numerical studies and a rigorous analysis of both \nthe breaking of the symmetric phase and the convergence to optimal performance. \nWe find that adaptive back-propagation can significantly reduce training time in \nboth regimes by breaking the symmetry between hidden units more efficiently and \nby providing faster exponential convergence to zero generalization error. \n\n2 DERIVATION OF THE DYNAMICAL EQUATIONS \n\nThe student network we consider is a soft committee machine [3], consisting of \nK hidden units which are connected to N-dimensional inputs ξ by their weight \nvectors W = {W_i} (i = 1, ..., K). All hidden units are connected to the linear \noutput unit by couplings of unit strength and the implemented mapping is therefore \nf_W(ξ) = Σ_{i=1}^{K} g(x_i), where x_i = W_i·ξ is the activation of hidden unit i and g(·) \nis a sigmoidal transfer function. The map f_0 to be learned is defined by a teacher \nnetwork of the same architecture except for a possible difference in the number of \nhidden units M and is defined by the weight vectors B = {B_n} (n = 1, ..., M). \nTraining examples are of the form (ξ^μ, ζ^μ), where the components of the input \nvectors ξ^μ are drawn independently from a zero mean unit variance Gaussian \ndistribution; the outputs are ζ^μ = Σ_{n=1}^{M} g(y_n^μ), where y_n^μ = B_n·ξ^μ is the \nactivation of teacher hidden unit n. \nAn on-line training algorithm A is defined by the update of each weight in response \nto the presentation of an example (ξ^μ, ζ^μ), which can take the general form \nW_i^{μ+1} = W_i^μ + A_i({λ}, W^μ, ξ^μ, ζ^μ), where {λ} denotes the parameters adjustable by \nthe user. In the case of standard back-propagation, i.e., gradient descent on the \nerror function defined in Eq. (1): A_i^{GD}(η, W^μ, ξ^μ, ζ^μ) = (η/N) δ_i^μ ξ^μ with \n\nδ_i^μ = δ^μ g'(x_i^μ) = [ζ^μ - f_W(ξ^μ)] g'(x_i^μ), \n\n(2) \nwhere the only user-adjustable parameter is the learning rate η, scaled by 1/N. 
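The gradient-descent update of Eq. (2) can be sketched in a few lines of code. The following is a minimal illustration (Python/NumPy; the function names, network sizes and learning rate are illustrative choices, not taken from the paper), using the sigmoid g(x) = erf(x/√2) adopted later in the text:

```python
import math

import numpy as np

# sigmoidal transfer function g(x) = erf(x / sqrt(2)) and its derivative
g = np.vectorize(lambda x: math.erf(x / math.sqrt(2.0)))

def g_prime(x):
    return np.sqrt(2.0 / np.pi) * np.exp(-0.5 * x ** 2)

def f_w(W, xi):
    # soft committee machine output: sum of g over the activations x_i = W_i . xi
    return g(W @ xi).sum()

def gd_update(W, xi, zeta, eta):
    # one on-line gradient-descent step on the squared error of Eq. (1):
    # W_i <- W_i + (eta / N) delta_i xi with delta_i = [zeta - f_W(xi)] g'(x_i)
    N = xi.size
    x = W @ xi
    delta_i = (zeta - f_w(W, xi)) * g_prime(x)
    return W + (eta / N) * np.outer(delta_i, xi)
```

For a sufficiently small learning rate a single step along this direction reduces the squared error on the presented example, since the update is exactly the negative gradient of Eq. (1) scaled by η/N.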
\nOne can readily see that the only term that breaks the symmetry between different \nhidden units is g'(x_i^μ), i.e., the derivative of the transfer function g(·). The fact that \na prolonged symmetric phase can exist indicates that this term is not significantly \ndifferent over the hidden units for a typical input in the symmetric phase. \n\nThe rationale of the adaptive back-propagation algorithm defined below is therefore \nto alter the g'-term, in order to magnify small differences in the activation between \nhidden units. This can be easily achieved by altering g'(x_i) to g'(βx_i), where β \nplays the role of an inverse \"temperature\". Varying β changes the range of hidden \nunit activations relevant for training, e.g., for β > 1 learning is more confined to \nsmall activations, when compared to gradient descent (β = 1). The whole adaptive \nback-propagation training algorithm is therefore: \n\nA_i^{aBP}(η, β, W^μ, ξ^μ, ζ^μ) = (η/N) δ^μ g'(βx_i^μ) ξ^μ = (η/N) δ_i^μ(β) ξ^μ \n\n(3) \n\nwith δ^μ as in Eq. (2). To compare the adaptive back-propagation algorithm with \nnormal gradient descent, we follow the statistical mechanics calculation in [2]. Here \nwe will only outline the main ideas and present the results of the calculation. \nAs we are interested in the typical behaviour of our training algorithm we average \nover all possible instances of the examples ξ. We rewrite the update equations (3) \nin W_i as equations in the order parameters describing the overlaps between student \nnodes Q_ij = W_i·W_j, student and teacher nodes R_in = W_i·B_n, and teacher nodes \nT_nm = B_n·B_m. The generalization error ε_g, measuring the typical performance, can \nbe expressed in these variables only [2]. The order parameters Q_ij and R_in are the \nnew dynamical variables, which are self-averaging with respect to the randomness \nin the training data in the thermodynamic limit (N → ∞). 
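In code, the adaptive algorithm of Eq. (3) differs from gradient descent only in where the derivative g' is evaluated. A minimal sketch (Python/NumPy; names and parameter values are illustrative, not from the paper):

```python
import math

import numpy as np

g = np.vectorize(lambda x: math.erf(x / math.sqrt(2.0)))  # g(x) = erf(x/sqrt(2))

def g_prime(x):
    return np.sqrt(2.0 / np.pi) * np.exp(-0.5 * x ** 2)

def abp_update(W, xi, zeta, eta, beta):
    # adaptive back-propagation, Eq. (3): the symmetry-breaking factor g' is
    # evaluated at beta * x_i; beta = 1 recovers standard gradient descent
    N = xi.size
    x = W @ xi
    delta = zeta - g(x).sum()
    return W + (eta / N) * np.outer(delta * g_prime(beta * x), xi)
```

Since g'(βx) = √(2/π) exp(-β²x²/2), the ratio g'(βx_i)/g'(βx_j) grows with β whenever |x_i| < |x_j|, so β > 1 magnifies small activation differences between hidden units, which is the symmetry-breaking mechanism described above.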
\nIf we interpret the \nnormalized example number α = μ/N as a continuous time variable, the update \nequations for the order parameters become first order coupled differential equations \n\ndR_in/dα = η ⟨δ_i(β) y_n⟩, \n\ndQ_ij/dα = η ⟨δ_i(β) x_j + δ_j(β) x_i⟩ + η² ⟨δ_i(β) δ_j(β)⟩, \n\n(4) \n\nwhere δ_i(β) = δ g'(βx_i) as in Eq. (3) and ⟨·⟩ denotes the average over the input \ndistribution. All the integrals in Eqs. (4) and the generalization error can be calculated \nexplicitly if we choose g(x) = erf(x/√2) as the sigmoidal activation function [2]. The \nexact form of the resulting dynamical equations for adaptive back-propagation is similar to \nthe equations in [2] and will be presented elsewhere [4]. They can easily be integrated \nnumerically for any number K of student and M teacher hidden units. For the \nremainder of the paper, we will however focus on the realizable case (K = M) and \nuncorrelated isotropic teachers of unit length T_nm = δ_nm. \nThe dynamical evolution of the overlaps Q_ij and R_in follows from integrating the \nequations of motion (4) from initial conditions determined by the random initialization \nof the student weights W. Whereas the resulting norms Q_ii of the student \nvectors will be of order O(1), the overlaps Q_ij between student vectors, and student-\nteacher overlaps R_in, will be only of order O(1/√N). The random initialization of the \nweights is therefore simulated by initializing the norms Q_ii and the overlaps Q_ij and \nR_in from uniform distributions in the intervals [0, 0.5] and [0, 10^{-12}] respectively. \n\nIn Fig. 1 we show the difference of a typical evolution of the overlaps and the \ngeneralization error for β = 12 and β = 1 (gradient descent) for K = 3 and η = 0.01. \nIn both cases, the student is drawn quickly into a suboptimal symmetric phase, \ncharacterized by a finite generalization error (Fig. 1e) and no differentiation between \nthe hidden units of the student: the student norms Q_ii and overlaps Q_ij are similar \n(Figs. 1b,1d) and the overlaps of each student node with all teacher nodes R_in are \nnearly identical (Figs. 1a,1c). 
The student trained by gradient descent (Figs. 1c,1d) \nis trapped in this unstable suboptimal solution for most of the training time, whereas \nadaptive back-propagation (Figs. 1a,1b) breaks the symmetry significantly earlier. \nThe convergence phase is characterized by a specialization of the different student \nnodes and the evolution of the overlap matrices Q and R to their optimal value T, \nexcept for the permutational symmetry due to the arbitrary labeling of the student \nnodes. Clearly, the choice β = 12 is suboptimal in this regime. The student trained \nwith β = 1 converges faster to zero generalization error (Fig. 1e). In order to \noptimize β separately for both the symmetric and the convergence phase, we will \nexamine the equations of motion analytically in the following section. \n\nFigure 1: Dynamical evolution of the \nstudent-teacher overlaps R_in (a,c), the \nstudent-student overlaps Q_ij (b,d), and \nthe generalization error (e) as a function \nof the normalized example number α \nfor a student with three hidden nodes \nlearning an isotropic three-node teacher \n(T_nm = δ_nm). The learning rate η = 0.01 \nis fixed but the value of the inverse \ntemperature varies, (a,b): β = 12 and (c,d): \nβ = 1 (gradient descent). \n\n3 ANALYSIS OF THE DYNAMICAL EQUATIONS \n\nIn the case of a realizable learning scenario (K = M) and isotropic teachers \n(T_nm = δ_nm) the order parameter space can be very well characterized by \nsimilar diagonal and off-diagonal elements of the overlap matrices Q and R, i.e., \nQ_ij = Q δ_ij + C(1 - δ_ij) for the student-student overlaps and, apart from a \nrelabeling of the student nodes, by R_in = R δ_in + S(1 - δ_in) for the student-teacher \noverlaps. As one can see from Fig. 1, this approximation is particularly good in the \nsymmetric phase and during the final convergence to perfect generalization. \n\n3.1 SYMMETRIC PHASE AND ONSET OF SPECIALIZATION \n\nNumerical integration of the equations of motion for a range of learning scenarios \nshows that the length of the symmetric phase is especially prolonged by isotropic \nteachers and small learning rates η. We will therefore optimize the dynamics (4) in \nthe symmetric phase with respect to β for isotropic teachers in the small η regime, \nwhere terms proportional to η² can be neglected. 
The fixed point of the truncated \nequations of motion \n\nQ* = C* = 1/(2K - 1) and R* = S* = 1/√(K(2K - 1)) \n\n(5) \n\nis independent of β and thus identical to the one obtained in [2]. However, the \nsymmetric solution is an unstable fixed point of the dynamics and the small \nperturbations introduced by the generically nonsymmetric initial conditions will eventually \ndrive the student towards specialization. \n\nTo study the onset of specialization, we expand the truncated differential \nequations to first order in the deviations q = Q - Q*, c = C - C*, r = R - R*, and \ns = S - S* from the fixed point values (5). The linearized equations of motion \ntake the form dv/dα = M·v, where v = (q, c, r, s) and M is a 4×4 matrix whose \nelements are the first derivatives of the truncated update equations (4) at the fixed \npoint with respect to v. Perturbations or modes which are proportional to the \neigenvectors v_i of M will therefore decrease or increase exponentially depending on \nwhether the corresponding eigenvalue λ_i is negative or positive. For the onset of \nspecialization only those modes are relevant which are amplified by the dynamics, i.e., \nthe ones with positive eigenvalue. For them we can identify the inverse eigenvalue \nas a typical escape time τ_i from the symmetric phase. \nWe find only one relevant perturbation, for q = c = 0 and s = -r/(K - 1). This can \nbe confirmed by a closer look at Fig. 1. The onset of specialization is signaled by the \nbreaking of the symmetry between the student-teacher overlaps, whereas significant \ndifferences from the symmetric fixed point values of the student norms and overlaps \noccur later. 
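For concreteness, the symmetric fixed point of Eq. (5) is easily evaluated; a small sketch (Python; a plain transcription of Eq. (5), with the K = 3 values quoted as an illustrative example):

```python
import math

def symmetric_fixed_point(K):
    # symmetric-phase fixed point of the truncated equations of motion, Eq. (5):
    # Q* = C* = 1/(2K - 1), R* = S* = 1/sqrt(K(2K - 1))
    Q = 1.0 / (2 * K - 1)
    R = 1.0 / math.sqrt(K * (2 * K - 1))
    return Q, R

# e.g. K = 3 gives Q* = C* = 0.2 and R* = S* = 1/sqrt(15)
```

Note that R* = √(Q*/K), so every student node carries an identical overlap with every teacher node, consistent with the undifferentiated symmetric phase of Fig. 1.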
The escape time τ associated with the above perturbation is \n\nτ(β) = π √(2K - 1) (2K + β)^{3/2} / (2η K β) \n\n(6) \n\nMinimization of τ with respect to β yields β_opt = 4K, i.e., the optimal β scales \nwith the number of hidden units, and \n\nτ_opt = 9π √(2K - 1) / (2η √(6K)) \n\n(7) \n\nTrapping in the symmetric phase is therefore always inversely proportional to the \nlearning rate η. In the large K limit it is proportional to the number of hidden \nnodes K (τ ~ 2πK/η) for gradient descent, whereas τ_opt becomes independent \nof K. [...] 2λ_2 at the minimum of λ_1 (see Fig. 2a). \nWe can further optimize convergence to optimal generalization by minimizing the \ndecay rate λ_opt(β) with respect to β (see Fig. 2b). Numerically, we find that the \noptimal inverse temperature β_opt saturates for large K at β_opt ≈ 2.03. For large K, \nwe find an associated optimal convergence time τ_opt(β_opt) ~ 2.90K for adaptive \nback-propagation optimized with respect to η and β, which is an improvement \nby 17% when compared to τ_opt(1) ~ 3.48K for gradient descent optimized with \nrespect to η. The optimal and maximal learning rates show an asymptotic 1/K \nbehaviour, with η_opt(β_opt) ~ 4.78/K, which is an increase by 20% compared to \ngradient descent. Both algorithms are quite stable as the maximal learning rates, for \nwhich the learning process diverges, are about 30% higher than the optimal rates. \n\n4 SUMMARY AND DISCUSSION \n\nThis research has been motivated by the dominance of the suboptimal symmetric \nphase in on-line learning of two-layer feedforward networks trained by gradient \ndescent [2]. This trapping is emphasized for inappropriately small learning rates \nbut exists in all training scenarios, affecting the learning process considerably. We \nproposed an adaptive back-propagation training algorithm [Eq. 
(3)] parameterized \nby an inverse temperature β, which is designed to improve specialization of the \nstudent nodes by enhancing differences in the activation between hidden units. Its \nperformance has been compared to gradient descent for a soft committee student \nnetwork with K hidden units trying to learn a rule defined by an isotropic teacher \n(T_nm = δ_nm) of the same architecture. \nA linear analysis of the equations of motion around the symmetric fixed point for \nsmall learning rates has shown that optimized adaptive back-propagation \ncharacterized by β_opt = 4K breaks the symmetry significantly faster. The effect is especially \npronounced for large networks, where the trapping time of gradient descent grows \nas τ ∝ K/η compared to τ ∝ 1/η for β_opt. With increasing network size it seems to \nbecome harder for a student node trained by gradient descent to distinguish \nbetween the many teacher nodes and to specialize on one of them. In the adaptive \nback-propagation algorithm this effect can be eliminated by choosing β_opt ∝ K. \n\nAn open question is how the choice of the optimal inverse temperature is affected for \nlarge learning rates, where η²-terms cannot be neglected, as unbounded increase of \nthe learning rate causes uncontrolled growth of the student norms. However, the full \nequations of motion are very difficult to analyse in the symmetric phase. Numerical \nstudies indicate that β_opt is smaller but still scales with K and yields an overall \ndecrease in training time which is still significant. We also find that the optimal \nlearning rate η_opt, which exhibits the shortest symmetric phase, is significantly lower \nin this regime than during convergence [4]. 
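The scaling claims above can be checked directly against the escape-time formula. A small numerical sketch (Python; it assumes only Eq. (6) of Section 3.1 and recovers β_opt = 4K by grid search):

```python
import math

def escape_time(beta, K, eta):
    # tau(beta) of Eq. (6): pi sqrt(2K - 1) (2K + beta)^(3/2) / (2 eta K beta)
    return (math.pi * math.sqrt(2 * K - 1) * (2 * K + beta) ** 1.5
            / (2.0 * eta * K * beta))

def best_beta(K, eta, step=0.01, beta_max=100.0):
    # grid search for the beta minimizing the trapping time tau(beta);
    # Eq. (7) predicts beta_opt = 4K
    betas = [step * i for i in range(1, int(beta_max / step))]
    return min(betas, key=lambda b: escape_time(b, K, eta))
```

For K = 3 the grid search returns β ≈ 12 = 4K, and comparing τ(1) with τ(4K) at growing K reproduces the τ ∝ K/η versus τ ∝ 1/η scaling quoted above.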
\n\nDuring convergence, independent of which algorithm is used , the time constant for \ndecay to zero generalization error scales with J{, due to the necessary rescaling of \nthe learning rate by 1/1< as the typical quadratic deviation between teacher and \nstudent output increases proportional to 1<. The reduction in training time with \nadaptive back-propagation is 17% and independent of the number of hidden units in \ncontrast to the symmetric phase, where a factor 1< is gained. This can be explained \nby the fact that each student node is already specialized on one teacher node and \nthe effect of other nodes in inhibiting further specialization is negligible. In fact, \nat first it seems rather surprising that anything can be gained by not changing the \nweights of the network according to their error gradient . The optimal setting of \n/3 > 1, together with training at a larger learning rate, speeds up learning for small \nactivations and slows down learning for highly activated nodes. This is equivalent \nto favouring rotational changes of the weight vectors over pure length changes to a \ndegree determined by /3. \n\nWe believe that the adaptive back-propagation algorithm investigated here will \nbe beneficial for any multilayer feedforward network and hope that this work will \nmotivate further theoretical research into the efficiency of training algorithms and \ntheir systematic improvement. \n\nReferences \n\n[1] C. Cybenko, Math. Control Signals and Systems 2, 303 (1989). \n[2] D. Saad and S. A. Solla, Phys. Rev. E 52, 4225 (1995). \n[3] M. Biehl and H. Schwarze, 1. Phys. A 28, 643 (1995). \n[4] A. West and D. Saad, in preparation (1995). \n\n\f", "award": [], "sourceid": 1156, "authors": [{"given_name": "Ansgar", "family_name": "West", "institution": null}, {"given_name": "David", "family_name": "Saad", "institution": null}]}