{"title": "Learning with Noise and Regularizers in Multilayer Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 260, "page_last": 266, "abstract": null, "full_text": "Learning with Noise and Regularizers in \n\nMultilayer Neural Networks \n\nDavid Saad \n\nDept. of Comp. Sci. & App. Math. \n\nAston University \n\nBirmingham B4 7ET, UK \n\nD.Saad@aston.ac.uk \n\nSara A. Solla \n\nAT&T Research Labs \n\nHolmdel, NJ 07733, USA \n\nsolla@research.att.com \n\nAbstract \n\nWe study the effect of noise and regularization in an on-line gradient-descent learning scenario for a general two-layer student network with an arbitrary number of hidden units. Training examples are randomly drawn input vectors labeled by a two-layer teacher network with an arbitrary number of hidden units; the examples are corrupted by Gaussian noise affecting either the output or the model itself. We examine the effect of both types of noise and that of weight-decay regularization on the dynamical evolution of the order parameters and the generalization error in various phases of the learning process. \n\n1 Introduction \n\nOne of the most powerful and commonly used methods for training large layered neural networks is that of on-line learning, whereby the internal network parameters {J} are modified after the presentation of each training example so as to minimize the corresponding error. The goal is to bring the map f_J implemented by the network as close as possible to a desired map \tilde{f} that generates the examples. Here we focus on the learning of continuous maps via gradient descent on a differentiable error function. 
\n\nRecent work [1]-[4] has provided a powerful tool for the analysis of gradient-descent learning in a very general learning scenario [5]: that of a student network with N input units, K hidden units, and a single linear output unit, trained to implement a continuous map from an N-dimensional input \xi onto a scalar \zeta. Examples of the target task \tilde{f} are in the form of input-output pairs (\xi^\mu, \zeta^\mu). The output labels \zeta^\mu to independently drawn inputs \xi^\mu are provided by a teacher network of similar architecture, except that its number M of hidden units is not necessarily equal to K. \n\nHere we consider the possibility of a noise process \rho^\mu that corrupts the teacher output. Learning from corrupt examples is a realistic and frequently encountered scenario. Previous analyses of this case have been based on various approaches: Bayesian [6], equilibrium statistical physics [7], and nonequilibrium techniques for analyzing learning dynamics [8]. Here we adapt our previously formulated techniques [2] to investigate the effect of different noise mechanisms on the dynamical evolution of the learning process and the resulting generalization ability. \n\n2 The model \n\nWe focus on a soft committee machine [1], for which all hidden-to-output weights are positive and of unit strength. Consider the student network: hidden unit i receives information from input unit r through the weight J_{ir}, and its activation under presentation of an input pattern \xi = (\xi_1, ..., \xi_N) is x_i = J_i \cdot \xi, with J_i = (J_{i1}, ..., J_{iN}) defined as the vector of incoming weights onto the i-th hidden unit. The output of the student network is \sigma(J, \xi) = \sum_{i=1}^{K} g(J_i \cdot \xi), where g is the activation function of the hidden units, taken here to be the error function g(x) = erf(x/\sqrt{2}), and J = {J_i}_{1 \le i \le K} is the set of input-to-hidden adaptive weights. 
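The soft committee machine described above can be sketched in a few lines of code; the function name is ours, and the example weights are arbitrary illustrative values:

```python
import math
import random

def student_output(J, xi):
    # sigma(J, xi) = sum_i erf((J_i . xi) / sqrt(2));
    # all hidden-to-output weights are fixed to +1 (soft committee machine).
    total = 0.0
    for J_i in J:
        x_i = sum(j * x for j, x in zip(J_i, xi))   # activation x_i = J_i . xi
        total += math.erf(x_i / math.sqrt(2.0))     # g(x) = erf(x / sqrt 2)
    return total

# Example: K = 2 hidden units, N = 3 input units, random weights and input
random.seed(0)
N, K = 3, 2
J = [[random.gauss(0, 1) for _ in range(N)] for _ in range(K)]
xi = [random.gauss(0, 1) for _ in range(N)]
print(student_output(J, xi))
```

Since each erf term lies in (-1, 1), the network output is bounded by the number of hidden units K.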
\nThe components of the input vectors \xi^\mu are uncorrelated random variables with zero mean and unit variance. Output labels \zeta^\mu are provided by a teacher network of similar architecture: hidden unit n in the teacher network receives input information through the weight vector B_n = (B_{n1}, ..., B_{nN}), and its activation under presentation of the input pattern \xi^\mu is y_n^\mu = B_n \cdot \xi^\mu. In the noiseless case the teacher output is given by \zeta^\mu = \sum_{n=1}^{M} g(B_n \cdot \xi^\mu). Here we concentrate on the architecturally matched case M = K, and consider two types of Gaussian noise: additive output noise that results in \zeta^\mu = \rho^\mu + \sum_{n=1}^{K} g(B_n \cdot \xi^\mu), and model noise introduced as fluctuations in the activations y_n^\mu of the hidden units, \zeta^\mu = \sum_{n=1}^{K} g(\rho_n^\mu + B_n \cdot \xi^\mu). The random variables \rho^\mu and \rho_n^\mu are taken to be Gaussian with zero mean and variance \sigma^2. \n\nThe error made by a student with weights J on a given input \xi is given by the quadratic deviation \n\n\epsilon(J, \xi) = (1/2) [ \sum_{i=1}^{K} g(J_i \cdot \xi) - \sum_{n=1}^{M} g(B_n \cdot \xi) ]^2 ,   (1) \n\nmeasured with respect to the noiseless teacher (it is also possible to measure performance as deviations with respect to the actual output \zeta provided by the noisy teacher). Performance on a typical input defines the generalization error \epsilon_g(J) = < \epsilon(J, \xi) >_\xi, through an average over all possible input vectors \xi, to be performed implicitly through averages over the activations x = (x_1, ..., x_K) and y = (y_1, ..., y_K). These averages can be performed analytically [2] and result in a compact expression for \epsilon_g in terms of order parameters: Q_{ik} = J_i \cdot J_k, R_{in} = J_i \cdot B_n, and T_{nm} = B_n \cdot B_m, which represent student-student, student-teacher, and teacher-teacher overlaps, respectively. 
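The two noise mechanisms can be sketched as follows; the function name and argument convention are ours, but the two formulas are the ones given in the text:

```python
import math
import random

def teacher_label(B, xi, sigma2=0.0, noise='output'):
    # output noise: zeta = rho + sum_n erf((B_n . xi)/sqrt 2)
    # model  noise: zeta = sum_n erf((rho_n + B_n . xi)/sqrt 2)
    # with rho, rho_n ~ N(0, sigma2)
    sigma = math.sqrt(sigma2)
    if noise == 'output':
        clean = sum(math.erf(sum(b * x for b, x in zip(B_n, xi)) / math.sqrt(2.0))
                    for B_n in B)
        return clean + random.gauss(0.0, sigma)
    # model noise perturbs each hidden activation independently
    return sum(math.erf((random.gauss(0.0, sigma)
                         + sum(b * x for b, x in zip(B_n, xi))) / math.sqrt(2.0))
               for B_n in B)
```

At sigma2 = 0 both variants reduce to the same noiseless teacher output.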
The parameters T_{nm} are characteristic of the task to be learned and remain fixed during training, while the overlaps Q_{ik} among student hidden units and R_{in} between a student and a teacher hidden unit are determined by the student weights J and evolve during training. \n\nA gradient descent rule on the error made with respect to the actual output provided by the noisy teacher results in J_i^{\mu+1} = J_i^\mu + (\eta/N) \delta_i^\mu \xi^\mu for the update of the student weights, where the learning rate \eta has been scaled with the input size N, and \delta_i^\mu depends on the type of noise. The time evolution of the overlaps R_{in} and Q_{ik} can be written in terms of similar difference equations. We consider the large N limit, and introduce a normalized number of examples \alpha = \mu/N to be interpreted as a continuous time variable in the N \to \infty limit. The time evolution of R_{in} and Q_{ik} is thus described in terms of first-order differential equations. \n\n3 Output noise \n\nThe resulting equations of motion for the student-teacher and student-student overlaps are given in this case by: \n\ndR_{in}/d\alpha = \eta < \delta_i y_n > , \n\ndQ_{ik}/d\alpha = \eta < \delta_i x_k > + \eta < \delta_k x_i > + \eta^2 < \delta_i \delta_k > + \eta^2 \sigma^2 < g'(x_i) g'(x_k) > ,   (2) \n\nwhere \delta_i is evaluated for the noiseless teacher and each term is to be averaged over all possible ways in which an example \xi could be chosen at a given time step. These averages have been performed using the techniques developed for the investigation of the noiseless case [2]; the only difference due to the presence of additive output noise is the need to evaluate the fourth term in the equation of motion for Q_{ik}, proportional to both \eta^2 and \sigma^2. \n\nWe focus on isotropic uncorrelated teacher vectors: T_{nm} = T \delta_{nm}, and choose T = 1 in our numerical examples. The time evolution of the overlaps R_{in} and Q_{ik} follows from integrating the equations of motion (2) from initial conditions determined by a random initialization of the student vectors {J_i}; the dynamics pass through an initial symmetric phase, followed by specialization and convergence to an asymptotic regime for \alpha > \alpha_0. 
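A single on-line update can be sketched directly from the rule above. For the quadratic error (1), gradient descent gives \delta_i = [\zeta - \sigma(J, \xi)] g'(x_i) with g'(x) = sqrt(2/\pi) exp(-x^2/2); the function names below are ours:

```python
import math

def g(x):
    return math.erf(x / math.sqrt(2.0))

def g_prime(x):
    # derivative of erf(x / sqrt 2): sqrt(2/pi) * exp(-x^2 / 2)
    return math.sqrt(2.0 / math.pi) * math.exp(-x * x / 2.0)

def online_step(J, xi, zeta, eta):
    # One update J_i <- J_i + (eta/N) * delta_i * xi, with
    # delta_i = (zeta - sigma(J, xi)) * g'(x_i) and x_i = J_i . xi.
    N = len(xi)
    x = [sum(j * u for j, u in zip(J_i, xi)) for J_i in J]
    out = sum(g(v) for v in x)
    delta = [(zeta - out) * g_prime(v) for v in x]
    return [[j + (eta / N) * d * u for j, u in zip(J_i, xi)]
            for J_i, d in zip(J, delta)]
```

When the presented label already equals the student output, all delta_i vanish and the weights are left unchanged, as the rule requires.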
\n\nFor an isotropic teacher, the teacher-student and student-student overlaps can thus be fully characterized by four parameters: Q_{ik} = Q \delta_{ik} + C (1 - \delta_{ik}) and R_{in} = R \delta_{in} + S (1 - \delta_{in}). In the symmetric phase the additional constraint R = S reflects the lack of differentiation among student vectors and reduces the number of parameters to three. \n\nThe symmetric phase is characterized by a fixed point solution to the equations of motion (2) whose coordinates can be obtained analytically in the small noise approximation: R* = 1/\sqrt{K(2K-1)} + \eta \sigma^2 r_s, Q* = 1/(2K-1) + \eta \sigma^2 q_s, and C* = 1/(2K-1) + \eta \sigma^2 c_s, with r_s, q_s, and c_s given by relatively simple functions of K. The generalization error in this regime is given by: \n\n\epsilon_g* = (K/\pi) ( \pi/6 - K \arcsin(1/(2K)) ) + \eta \sigma^2 (2K-1)^{3/2} / ( 2\pi (2K+1)^{1/2} ) ;   (3) \n\nnote its increase over the corresponding noiseless value, recovered for \sigma^2 = 0. \n\nAsymptotically the secondary overlaps S decay to zero, while R_{nn} \to \sqrt{Q_{nn}} indicates full alignment for T_{nn} = 1. As specialization proceeds, the student weight vectors grow in length and become increasingly uncorrelated. It is interesting to observe that in the presence of noise the student vectors grow asymptotically longer than the teacher vectors: Q_{ii} \to Q_\infty > 1, and acquire a small negative correlation with each other. Another detectable difference in the presence of noise is a larger gap between the values of Q and C in the symmetric phase. Larger norms for the student vectors result in larger generalization errors: as shown in Figure 1.c, the generalization error increases monotonically with increasing noise level, both in the symmetric and asymptotic regimes. \n\nThe asymptotic phase is characterized by a fixed point solution with R* \neq S*. 
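The noiseless part of the plateau value in (3) can be checked numerically against the general order-parameter expression for the generalization error derived in [2], \epsilon_g = (1/\pi) [ \sum_{ik} \arcsin(Q_{ik}/L_{ik}) + \sum_{nm} \arcsin(T_{nm}/L_{nm}) - 2 \sum_{in} \arcsin(R_{in}/L_{in}) ], where each normalization L is built from the corresponding diagonal elements, e.g. L_{ik} = \sqrt{(1+Q_{ii})(1+Q_{kk})}. A minimal sketch (function name ours):

```python
import math

def eps_g(Q, R, T):
    # generalization error for g(x) = erf(x / sqrt 2) in terms of the
    # order parameters Q_ik, R_in, T_nm (averages performed as in [2])
    K, M = len(Q), len(T)
    s = 0.0
    for i in range(K):
        for k in range(K):
            s += math.asin(Q[i][k] / math.sqrt((1 + Q[i][i]) * (1 + Q[k][k])))
    for n in range(M):
        for m in range(M):
            s += math.asin(T[n][m] / math.sqrt((1 + T[n][n]) * (1 + T[m][m])))
    for i in range(K):
        for n in range(M):
            s -= 2 * math.asin(R[i][n] / math.sqrt((1 + Q[i][i]) * (1 + T[n][n])))
    return s / math.pi

# Symmetric plateau for K = 3: Q = C = 1/(2K-1), R = S = 1/sqrt(K(2K-1)), T = I
K = 3
q = 1.0 / (2 * K - 1)
r = 1.0 / math.sqrt(K * (2 * K - 1))
Q = [[q] * K for _ in range(K)]
R = [[r] * K for _ in range(K)]
T = [[1.0 if n == m else 0.0 for m in range(K)] for n in range(K)]
closed = (K / math.pi) * (math.pi / 6 - K * math.asin(1.0 / (2 * K)))
assert abs(eps_g(Q, R, T) - closed) < 1e-9
```

The assertion confirms that the symmetric fixed point reproduces the closed-form noiseless plateau value (K/\pi)(\pi/6 - K \arcsin(1/(2K))).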
The coordinates of the asymptotic fixed point can also be obtained analytically in the small noise approximation: R* = 1 + \eta \sigma^2 r_a, S* = -\eta \sigma^2 s_a, Q* = 1 + \eta \sigma^2 q_a, and C* = -\eta \sigma^2 c_a, with r_a, s_a, q_a, and c_a given by rational functions of K with corrections of order \eta. The asymptotic generalization error is given by \n\n\epsilon_g* = (\sqrt{3} / (6\pi)) \eta \sigma^2 K .   (4) \n\nExplicit expressions for the coefficients r_s, q_s, c_s, r_a, s_a, q_a, and c_a will not be given here for lack of space; suffice it to say that the fixed point coordinates predicted on the basis of the small noise approximation are found to be in excellent agreement with the values obtained from the numerical integration of the equations of motion for \sigma^2 up to 0.3. \n\nIt is worth noting in Figure 1.c that in the small noise regime the length of the symmetric plateau decreases with increasing noise. This effect can be investigated analytically by linearizing the equations of motion around the symmetric fixed point and identifying the positive eigenvalue responsible for the escape from the symmetric phase. This calculation has been carried out in the small noise approximation, to obtain \lambda = (2/\pi) K (2K-1)^{-1/2} (2K+1)^{-3/2} + \lambda_\sigma \sigma^2 \eta, where \lambda_\sigma is positive and increases monotonically with K for K > 1. A faster escape from the symmetric plateau is explained by this increase of the positive eigenvalue. The calculation is valid for \sigma^2 \eta << 1; we observe experimentally that the trend is reversed as \sigma^2 increases. A small level of noise assists in the process of differentiation among student vectors, while larger levels of noise tend to keep student vectors equally ignorant about the task to be learned. \n\nThe asymptotic value (4) for the generalization error indicates that learning at finite \eta will result in asymptotically suboptimal performance for \sigma^2 > 0. 
A monotonic decrease of the learning rate is necessary to achieve optimal asymptotic performance with \epsilon_g* = 0. Learning at small \eta results in long trapping times in the symmetric phase; we therefore suggest starting the training process with a relatively large value of \eta and switching to a decaying learning rate at \alpha = \alpha_0, after specialization begins. We propose \eta = \eta_0 for \alpha \le \alpha_0 and \eta = \eta_0 / (\alpha - \alpha_0)^z for \alpha > \alpha_0. Convergence to the asymptotic solution requires z \le 1. The value z = 1 corresponds to the fastest decay for \eta(\alpha); the question of interest is to determine the value of z which results in the fastest decay for \epsilon_g(\alpha). Results shown in Figure 1.d for \alpha > \alpha_0 = 4000 correspond to M = K = 3, \eta_0 = 0.7, and \sigma^2 = 0.1. Our numerical results indicate optimal decay of \epsilon_g(\alpha) for z = 1/2. A rigorous justification of this result remains to be found. \n\n4 Model noise \n\nThe resulting equations of motion for the student-teacher and student-student overlaps can also be obtained analytically in this case; they exhibit a structure remarkably similar to those for the noiseless case reported in [2], except for some changes in the relevant covariance matrices. \n\nFigure 2: Left - The generalization error for different values of the noise variance \sigma^2; training examples are corrupted by model noise. Right - \tilde{\gamma}_{max} as a function of K. 
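The proposed annealing schedule is simple to state in code; the defaults below follow the Figure 1.d experiment (\alpha_0 = 4000, \eta_0 = 0.7, z = 1/2), and the function name is ours:

```python
def learning_rate(alpha, alpha0=4000.0, eta0=0.7, z=0.5):
    # eta = eta0 while alpha <= alpha0 (large fixed rate to escape the
    # symmetric plateau), then eta0 / (alpha - alpha0)**z afterwards
    if alpha <= alpha0:
        return eta0
    return eta0 / (alpha - alpha0) ** z
```

Convergence requires z <= 1; z = 1 decays eta(alpha) fastest, while the numerical results quoted above favor z = 1/2 for the fastest decay of the generalization error itself.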
\n\nA numerical investigation of the dynamical evolution of the overlaps and generalization error reveals qualitative and quantitative differences with the case of additive output noise: 1) The sensitivity to noise is much higher for model noise than for output noise. 2) The application of independent noise to the individual teacher hidden units results in an effective anisotropic teacher and causes fluctuations in the symmetric phase; the various student hidden units acquire some degree of differentiation and the symmetric phase can no longer be fully characterized by unique values of Q and C. 3) The noise level does not affect the length of the symmetric phase. \n\nThe effect of model noise on the generalization error is illustrated in Figure 2 for M = K = 3, \eta = 0.2, and various noise levels. The generalization error increases monotonically with increasing noise level, both in the symmetric and asymptotic regimes, but there is no modification in the length of the symmetric phase. The dynamical evolution of the overlaps, not shown here for the case of model noise, exhibits qualitative features quite similar to those discussed in the case of additive output noise: we observe again a noise-induced widening of the gap between Q and C in the symmetric phase, while the asymptotic phase exhibits an enhancement of the norm of the student vectors and a small degree of negative correlation between them. \n\nApproximate analytic expressions based on a small noise expansion have been obtained for the coordinates of the fixed point solutions which describe the symmetric and asymptotic phases. In the case of model noise the expansions for the symmetric solution are independent of \eta and depend only on \sigma^2 and K. 
The coordinates of the asymptotic fixed point can be expressed as: R* = 1 + \sigma^2 r_a, S* = -\sigma^2 s_a, Q* = 1 + \sigma^2 q_a, C* = -\sigma^2 c_a, with coefficients r_a, s_a, q_a, and c_a given by rational functions of K with corrections of order \eta. The important difference with the output noise case is that the asymptotic fixed point is shifted from its noiseless position even for \eta = 0. It is therefore not possible to achieve optimal asymptotic performance even if a decaying learning rate is utilized. The asymptotic generalization error is given by \n\n\epsilon_g* = (\sqrt{3} / (12\pi)) \sigma^2 ( 2K + \eta \sigma^2 c(K, \eta) ) .   (5) \n\nNote that the asymptotic generalization error remains finite even as \eta \to 0. \n\n5 Regularizers \n\nA method frequently used in real world training scenarios to overcome the effects of noise and parameter redundancy (K > M) is the use of regularizers such as weight decay (for a review see [6]). \n\nWeight-decay regularization is easily incorporated within the framework of on-line learning; it leads to a rule for the update of the student weights of the form J_i^{\mu+1} = J_i^\mu + (\eta/N) \delta_i^\mu \xi^\mu - (\gamma/N) J_i^\mu. The corresponding equations of motion for the dynamical evolution of the teacher-student and student-student overlaps can again be obtained analytically and integrated numerically from random initial conditions. \n\nThe picture that emerges is basically similar to that described for the noisy case: the dynamical evolution of the learning process goes through the same stages, although specific values for the order parameters and generalization error at the symmetric phase and in the asymptotic regime are changed as a consequence of the modification in the dynamics. 
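The weight-decay update rule above amounts to multiplying each student vector by (1 - \gamma/N) before adding the usual gradient term; a minimal sketch (function name ours, delta_i supplied by the caller as in the noisy-teacher rule):

```python
def weight_decay_step(J, xi, delta, eta, gamma):
    # J_i <- J_i + (eta/N) * delta_i * xi - (gamma/N) * J_i
    N = len(xi)
    return [[(1.0 - gamma / N) * j + (eta / N) * d * u
             for j, u in zip(J_i, xi)]
            for J_i, d in zip(J, delta)]
```

With delta = 0 the rule reduces to a pure geometric shrinkage of the student weights at rate gamma/N per example.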
\n\nOur numerical investigations have revealed no scenario, either when training from noisy data or in the presence of redundant parameters, where weight decay improves the system performance or speeds up the training process. This lack of effect is probably a generic feature of on-line learning, due to the absence of an additive, stationary error surface defined over a finite and fixed training set. In off-line (batch) learning, regularization leads to improved performance through the modification of such an error surface. These observations are consistent with the absence of 'overfitting' phenomena in on-line learning. One of the effects that arises when weight-decay regularization is introduced in on-line learning is a prolongation of the symmetric phase, due to a decrease in the positive eigenvalue that controls the onset of specialization. This positive eigenvalue, which signals the instability of the symmetric fixed point, decreases monotonically with increasing regularization strength \gamma, and crosses zero at \gamma_{max} = \eta \tilde{\gamma}_{max}. The dependence of \tilde{\gamma}_{max} on K is shown in Figure 2; for \gamma > \gamma_{max} the symmetric fixed point is stable and the system remains trapped there forever. \n\nThe work reported here focuses on an architecturally matched scenario, with M = K. Over-realizable cases with K > M show a rich behavior that is rather less amenable to generic analysis. It will be of interest to examine the effects of different types of noise and regularizers in this regime. \n\nAcknowledgement: D.S. acknowledges support from EPSRC grant GR/L19232. \n\nReferences \n\n[1] M. Biehl and H. Schwarze, J. Phys. A 28, 643 (1995). \n[2] D. Saad and S.A. Solla, Phys. Rev. E 52, 4225 (1995). \n[3] D. Saad and S.A. Solla, preprint (1996). \n[4] P. Riegler and M. Biehl, J. Phys. A 28, L507 (1995). \n[5] G. Cybenko, Math. Control Signals and Systems 2, 303 (1989). \n[6] C.M. 
Bishop, Neural Networks for Pattern Recognition (Oxford University Press, Oxford, 1995). \n[7] T.L.H. Watkin, A. Rau, and M. Biehl, Rev. Mod. Phys. 65, 499 (1993). \n[8] K.R. Muller, M. Finke, N. Murata, K. Schulten, and S. Amari, Neural Computation 8, 1085 (1996). \n", "award": [], "sourceid": 1243, "authors": [{"given_name": "David", "family_name": "Saad", "institution": null}, {"given_name": "Sara", "family_name": "Solla", "institution": null}]}