{"title": "On-line Learning from Finite Training Sets in Nonlinear Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 357, "page_last": 363, "abstract": null, "full_text": "Online learning from finite training sets \n\nin nonlinear networks \n\nPeter Sollich* \n\nDavid Barbert \n\nDepartment of Physics \nUniversity of Edinburgh \nEdinburgh ERg 3JZ, U.K. \n\nP.Sollich~ed.ac.uk \n\nDepartment of Applied Mathematics \n\nAston University \n\nBirmingham B4 7ET, U.K. \nD.Barber~aston . ac.uk \n\nAbstract \n\nOnline learning is one of the most common forms of neural net(cid:173)\nwork training. We present an analysis of online learning from finite \ntraining sets for non-linear networks (namely, soft-committee ma(cid:173)\nchines), advancing the theory to more realistic learning scenarios. \nDynamical equations are derived for an appropriate set of order \nparameters; these are exact in the limiting case of either linear \nnetworks or infinite training sets. Preliminary comparisons with \nsimulations suggest that the theory captures some effects of finite \ntraining sets, but may not yet account correctly for the presence of \nlocal minima. \n\n1 \n\nINTRODUCTION \n\nThe analysis of online gradient descent learning, as one of the most common forms \nof supervised learning, has recently stimulated a great deal of interest [1, 5, 7, 3]. In \nonline learning, the weights of a network ('student') are updated immediately after \npresentation of each training example (input-output pair) in order to reduce the \nerror that the network makes on that example. One of the primary goals of online \nlearning analysis is to track the resulting evolution of the generalization error - the \nerror that the student network makes on a novel test example, after a given number \nof example presentations. In order to specify the learning problem, the training \noutputs are assumed to be generated by a teacher network of known architecture. 
\n* Royal Society Dorothy Hodgkin Research Fellow \n\u2020 Supported by EPSRC grant GR/J75425: Novel Developments in Learning Theory for Neural Networks \n\nPrevious studies of online learning have often imposed somewhat restrictive and unrealistic assumptions about the learning framework. These restrictions are either that the size of the training set is infinite, or that the learning rate is small [1, 5, 4]. Finite training sets present a significant analytical difficulty, as successive weight updates are correlated, giving rise to highly non-trivial generalization dynamics. For linear networks, the difficulties encountered with finite training sets and non-infinitesimal learning rates can be overcome by extending the standard set of descriptive ('order') parameters to include the effects of weight update correlations [7]. In the present work, we extend our analysis to nonlinear networks. The particular model we choose to study is the soft-committee machine, which is capable of representing a rich variety of input-output mappings. Its online learning dynamics has been studied comprehensively for infinite training sets [1, 5]. In order to carry out our analysis, we adapt tools originally developed in the statistical mechanics literature which have found application, for example, in the study of Hopfield network dynamics [2]. \n\n2 MODEL AND OUTLINE OF CALCULATION \n\nFor an N-dimensional input vector x, the output of the soft committee machine is given by \n\ny(x) = \u03a3_{l=1}^{L} g(h_l) (1) \n\nwhere the nonlinear activation function g(h_l) = erf(h_l/\u221a2) acts on the activations h_l = (1/\u221aN) w_l^T x (the factor 1/\u221aN is for convenience only). This is a neural network with L hidden units, input-to-hidden weight vectors w_l, l = 1..L, and all hidden-to-output weights set to 1. 
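The model of equation (1) can be sketched numerically. The following is a minimal illustration, not code from the paper; the function name `soft_committee` and the array layout are our own choices.

```python
import numpy as np
from scipy.special import erf

def soft_committee(W, x):
    """Output of a soft committee machine.

    W: (L, N) array of hidden weight vectors w_l (one per row).
    x: (N,) input vector.
    Activations are h_l = w_l^T x / sqrt(N); the hidden-to-output
    weights are fixed to 1, so the output is sum_l g(h_l).
    """
    N = x.shape[0]
    h = W @ x / np.sqrt(N)              # activations h_l
    return np.sum(erf(h / np.sqrt(2)))  # g(h) = erf(h / sqrt(2))
```

Since g is odd, the network output is an odd function of the input for any weights.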
\nIn online learning the student weights are adapted on a sequence of presented examples to better approximate the teacher mapping. The training examples are drawn, with replacement, from a finite set, {(x^\u03bc, y^\u03bc), \u03bc = 1..p}. This set remains fixed during training. Its size relative to the input dimension is denoted by \u03b1 = p/N. We take the input vectors x^\u03bc as samples from an N-dimensional Gaussian distribution with zero mean and unit variance. The training outputs y^\u03bc are assumed to be generated by a teacher soft committee machine with hidden weight vectors w*_m, m = 1..M, with additive Gaussian noise corrupting its activations and output. The discrepancy between the teacher and student on a particular training example (x, y), drawn from the training set, is given by the squared difference of their corresponding outputs, \n\nE = \u00bd [\u03a3_l g(h_l) \u2212 y]\u00b2 = \u00bd [\u03a3_l g(h_l) \u2212 \u03a3_m g(k_m + \u03be_m) \u2212 \u03be_0]\u00b2 (2) \n\nwhere the student and teacher activations are, respectively, \n\nh_l = (1/\u221aN) w_l^T x, k_m = (1/\u221aN) (w*_m)^T x, \n\nand \u03be_m, m = 1..M, and \u03be_0 are noise variables corrupting the teacher activations and output respectively. \nGiven a training example (x, y), the student weights are updated by a gradient descent step with learning rate \u03b7, \n\nw_l' \u2212 w_l = \u2212\u03b7 \u2207_{w_l} E = \u2212(\u03b7/\u221aN) x \u2202_{h_l} E (3) \n\nThe generalization error is defined to be the average error that the student makes on a test example selected at random (and uncorrelated with the training set), which we write as \u03b5_g = \u27e8E\u27e9. \nAlthough one could, in principle, model the student weight dynamics directly, this will typically involve too many parameters, and we seek a more compact representation for the evolution of the generalization error. 
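A hedged sketch of the update (3) for a single training example; `online_step` and its argument layout are illustrative choices of our own, with the noisy teacher output y supplied externally as in equation (2).

```python
import numpy as np
from scipy.special import erf

def g(h):
    return erf(h / np.sqrt(2))

def g_prime(h):
    # d/dh erf(h / sqrt(2)) = sqrt(2 / pi) * exp(-h^2 / 2)
    return np.sqrt(2.0 / np.pi) * np.exp(-h**2 / 2.0)

def online_step(W, x, y, eta):
    """One gradient descent step on E = (1/2) (sum_l g(h_l) - y)^2.

    Implements w_l' - w_l = -eta * grad_{w_l} E
                          = -(eta / sqrt(N)) x dE/dh_l.
    """
    N = x.shape[0]
    h = W @ x / np.sqrt(N)
    delta = np.sum(g(h)) - y      # student output minus target
    dEdh = delta * g_prime(h)     # dE/dh_l for each hidden unit
    return W - (eta / np.sqrt(N)) * np.outer(dEdh, x)
```

If the target already equals the student output on this example, the step leaves the weights unchanged.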
It is straightforward to show that the generalization error depends, not on a detailed description of all the network weights, but only on the overlap parameters Q_{ll'} = (1/N) w_l^T w_{l'} and R_{lm} = (1/N) w_l^T w*_m [1, 5, 7]. In the case of infinite \u03b1, it is possible to obtain a closed set of equations governing the overlap parameters Q, R [5]. For finite training sets, however, this is no longer possible, due to the correlations between successive weight updates [7]. \nIn order to overcome this difficulty, we use a technique developed originally to study statistical physics systems [2]. Initially, consider the dynamics of a general vector of order parameters, denoted by \u03a9, which are functions of the network weights w. If the weight updates are described by a transition probability T(w \u2192 w'), then an approximate update equation for \u03a9 is \n\n\u03a9' \u2212 \u03a9 = \u27e8 \u222b dw' (\u03a9(w') \u2212 \u03a9(w)) T(w \u2192 w') \u27e9_{P(w) \u221d \u03b4(\u03a9(w) \u2212 \u03a9)} (4) \n\nIntuitively, the integral in the above equation expresses the average change\u00b9 of \u03a9 caused by a weight update w \u2192 w', starting from (given) initial weights w. Since our aim is to develop a closed set of equations for the order parameter dynamics, we need to remove the dependency on the initial weights w. The only information we have regarding w is contained in the chosen order parameters \u03a9, and we therefore average the result over the 'subshell' of all w which correspond to these values of the order parameters. This is expressed as the \u03b4-function constraint in equation (4). It is clear that if the integral in (4) depends on w only through \u03a9(w), then the average is unnecessary and the resulting dynamical equations are exact. This is in fact the case for \u03b1 \u2192 \u221e and \u03a9 = {Q, R}, the standard order parameters mentioned above [5]. If this cannot be achieved, one should choose a set of order parameters to obtain approximate equations which are as close as possible to the exact solution. 
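For concreteness, the overlap order parameters can be read off directly from the weights; the helper below is our own illustration, not code from the paper.

```python
import numpy as np

def overlaps(W, W_star):
    """Overlap order parameters of student (W) and teacher (W_star) weights.

    Q_{ll'} = (1/N) w_l^T w_l'   (student-student overlaps)
    R_{lm}  = (1/N) w_l^T w*_m   (student-teacher overlaps)
    Both weight arrays have one N-dimensional weight vector per row.
    """
    N = W.shape[1]
    Q = W @ W.T / N
    R = W @ W_star.T / N
    return Q, R
```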
\nThe motivation for our choice of order parameters is based on the linear perceptron case where, in addition to the standard parameters Q and R, the overlaps projected onto eigenspaces of the training input correlation matrix A = (1/N) \u03a3_{\u03bc=1}^{p} x^\u03bc (x^\u03bc)^T are required\u00b2. We therefore split the eigenvalues of A into r equal blocks (\u03b3 = 1...r) containing N' = N/r eigenvalues each, ordering the eigenvalues such that they increase with \u03b3. We then define projectors P^\u03b3 onto the corresponding eigenspaces and take as order parameters: \n\nR^\u03b3_{lm} = (1/N') w_l^T P^\u03b3 w*_m, U^\u03b3_{ls} = (1/N') w_l^T P^\u03b3 b_s (5) \n\nwhere the b_s are linear combinations of the noise variables and training inputs, \n\n(6) \n\n\u00b9Here we assume that the system size N is large enough that the mean values of the parameters alone describe the dynamics sufficiently well (i.e., self-averaging holds). \n\u00b2The order parameters actually used in our calculation for the linear perceptron [7] are Laplace transforms of these projected order parameters. \n\nAs r \u2192 \u221e, these order parameters become functionals of a continuous variable\u00b3. The updates for the order parameters (5) due to the weight updates (3) can be found by taking the scalar products of (3) with either projected student or teacher weights, as appropriate. This then introduces the following activation 'components', \n\nh^\u03b3_l = (1/\u221aN') w_l^T P^\u03b3 x, k^\u03b3_m = (1/\u221aN') (w*_m)^T P^\u03b3 x (7) \n\nso that the student and teacher activations are h_l = (1/\u221ar) \u03a3_\u03b3 h^\u03b3_l and k_m = (1/\u221ar) \u03a3_\u03b3 k^\u03b3_m, respectively. For the linear perceptron, the chosen order parameters form a complete set - the dynamical equations close, without need for the average in (4). \nFor the nonlinear case, we now sketch the calculation of the order parameter update equations (4). 
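For finite N, the projected order parameters of equation (5) can be computed explicitly by diagonalizing A. The sketch below is our own illustration (function name and block construction are assumptions, and only the student-teacher overlaps R^\u03b3 are shown):

```python
import numpy as np

def projected_overlaps(W, W_star, X, r):
    """Student-teacher overlaps projected onto eigenspaces of A.

    X: (p, N) matrix of training inputs, so A = X^T X / N.
    The spectrum of A is split into r equal blocks of N' = N / r
    eigenvalues each, in increasing order; P^gamma projects onto
    block gamma, and R^gamma = W P^gamma W_star^T / N'.
    """
    N = X.shape[1]
    A = X.T @ X / N
    _, evecs = np.linalg.eigh(A)   # eigenvalues in ascending order
    Np = N // r                    # N' eigenvalues per block
    Rg = []
    for gamma in range(r):
        V = evecs[:, gamma * Np:(gamma + 1) * Np]
        P = V @ V.T                # projector onto block gamma
        Rg.append(W @ P @ W_star.T / Np)
    return Rg
```

Because the projectors sum to the identity, the blocks recombine to the unprojected overlap: \u03a3_\u03b3 N' R^\u03b3_{lm} = w_l^T w*_m = N R_{lm}.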
Taken together, the integral over w' (a sum of p discrete terms in our case, one for each training example) and the subshell average in (4) define an average over the activations (2), their components (7), and the noise variables \u03be_m, \u03be_0. These variables turn out to be Gaussian distributed with zero mean, and therefore only their covariances need to be worked out. One finds that these are in fact given by the naive training set averages. For example, \n\n\u27e8h^\u03b3_l h^\u03b3'_{l'}\u27e9 = (1/p) \u03a3_\u03bc h^\u03b3_l(x^\u03bc) h^\u03b3'_{l'}(x^\u03bc) = (1/(\u03b1N')) w_l^T P^\u03b3 A P^\u03b3' w_{l'} = \u03b4_{\u03b3\u03b3'} (a^\u03b3/\u03b1) Q^\u03b3_{ll'} (8) \n\nwhere we have used P^\u03b3 A = a^\u03b3 P^\u03b3 with a^\u03b3 'the' eigenvalue of A in the \u03b3-th eigenspace; this is well defined for r \u2192 \u221e (see [6] for details of the eigenvalue spectrum). The correlations of the activations and noise variables explicitly appearing in the error in (3) are calculated similarly to give \n\n\u27e8h_l h_{l'}\u27e9 = (1/r) \u03a3_\u03b3 (a^\u03b3/\u03b1) Q^\u03b3_{ll'} \n\u27e8h_l k_m\u27e9 = (1/r) \u03a3_\u03b3 (a^\u03b3/\u03b1) R^\u03b3_{lm} \n\u27e8h_l \u03be_s\u27e9 = (1/r) \u03a3_\u03b3 (1/\u03b1) U^\u03b3_{ls} \n\u27e8\u03be_s \u03be_{s'}\u27e9 = \u03c3_s\u00b2 \u03b4_{ss'} (9) \n\nwhere the final equation defines the noise variances. The T^\u03b3_{mm'} are projected overlaps between teacher weight vectors, T^\u03b3_{mm'} = (1/N') (w*_m)^T P^\u03b3 w*_{m'}. We will assume that the teacher weights and training inputs are uncorrelated, so that T^\u03b3_{mm'} is independent of \u03b3. The required covariances of the 'component' activations (writing c^\u03b3_s = (1/\u221aN') b_s^T P^\u03b3 x for the components associated with the b_s) are \n\n\u27e8k^\u03b3_m h^\u03b3'_l\u27e9 = \u03b4_{\u03b3\u03b3'} (a^\u03b3/\u03b1) R^\u03b3_{lm}, \u27e8k^\u03b3_m k^\u03b3'_{m'}\u27e9 = \u03b4_{\u03b3\u03b3'} (a^\u03b3/\u03b1) T^\u03b3_{mm'}, \u27e8k^\u03b3_m \u03be_s\u27e9 = 0, \n\u27e8c^\u03b3_s h^\u03b3'_l\u27e9 = \u03b4_{\u03b3\u03b3'} (a^\u03b3/\u03b1) U^\u03b3_{ls}, \u27e8c^\u03b3_s k^\u03b3'_{m'}\u27e9 = 0, \u27e8c^\u03b3_s \u03be_{s'}\u27e9 = (a^\u03b3/\u03b1) \u03c3_s\u00b2 \u03b4_{ss'}, \n\u27e8h^\u03b3_l h^\u03b3'_{l'}\u27e9 = \u03b4_{\u03b3\u03b3'} (a^\u03b3/\u03b1) Q^\u03b3_{ll'}, \u27e8h^\u03b3_l k^\u03b3'_{m'}\u27e9 = \u03b4_{\u03b3\u03b3'} (a^\u03b3/\u03b1) R^\u03b3_{lm'}, \u27e8h^\u03b3_l \u03be_s\u27e9 = (1/\u03b1) U^\u03b3_{ls} (10) \n\n\u00b3Note that the limit r \u2192 \u221e is taken after the thermodynamic limit, i.e., r \u226a N. This ensures that the number of order parameters is always negligible compared to N (otherwise self-averaging would break down). 
\nFigure 1: \u03b5_g vs t for student and teacher with one hidden unit (L = M = 1); \u03b1 = 2, 3, 4 from above, learning rate \u03b7 = 1. Noise of equal variance was added to both activations and output: (a) \u03c3_m\u00b2 = \u03c3_0\u00b2 = 0.01, (b) \u03c3_m\u00b2 = \u03c3_0\u00b2 = 0.1. Simulations for N = 100 are shown by circles; standard errors are of the order of the symbol size. The bottom dashed lines show the infinite training set result for comparison. r = 10 was used for calculating the theoretical predictions; the curve marked \"+\" in (b), with r = 20 (and \u03b1 = 2), shows that this is large enough to be effectively in the r \u2192 \u221e limit. \n\nUsing equation (3) and the definitions (7), we can now write down the dynamical equations, replacing the number of updates n by the continuous variable t = n/N in the limit N \u2192 \u221e: \n\n\u2202_t R^\u03b3_{lm} = \u2212\u03b7 \u27e8k^\u03b3_m \u2202_{h_l} E\u27e9 \n\u2202_t U^\u03b3_{ls} = \u2212\u03b7 \u27e8c^\u03b3_s \u2202_{h_l} E\u27e9 \n\u2202_t Q^\u03b3_{ll'} = \u2212\u03b7 \u27e8h^\u03b3_{l'} \u2202_{h_l} E\u27e9 \u2212 \u03b7 \u27e8h^\u03b3_l \u2202_{h_{l'}} E\u27e9 + \u03b7\u00b2 (a^\u03b3/\u03b1) \u27e8\u2202_{h_l} E \u2202_{h_{l'}} E\u27e9 (11) \n\nwhere the averages are over zero mean Gaussian variables with covariances (9, 10). Using the explicit form of the error E, we have \n\n\u2202_{h_l} E = g'(h_l) [\u03a3_{l'} g(h_{l'}) \u2212 \u03a3_m g(k_m + \u03be_m) \u2212 \u03be_0] (12) \n\nwhich, together with the equations (11), completes the description of the dynamics. The Gaussian averages in (11) can be straightforwardly evaluated in a manner similar to the infinite training set case [5], and we omit the rather cumbersome explicit form of the resulting equations. \nWe note that, in contrast to the infinite training set case, the student activations h_l and the noise variables \u03be_s and \u03be_0 are now correlated through equation (10). 
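The Gaussian averages appearing in (11) can be spot-checked by Monte Carlo sampling from the joint covariance. The helper below is our own illustration, verified against the closed form \u27e8g(h) k\u27e9 = C_{hk} \u221a(2/\u03c0) / \u221a(1 + C_{hh}) for zero-mean jointly Gaussian (h, k), which follows from Gaussian integration by parts.

```python
import numpy as np
from scipy.special import erf

def mc_average(cov, f, n_samples=1_000_000, seed=0):
    """Monte Carlo estimate of <f(z)> for z ~ N(0, cov).

    f acts on an (n_samples, dim) array of samples and returns
    one value per sample.
    """
    rng = np.random.default_rng(seed)
    dim = len(cov)
    z = rng.multivariate_normal(np.zeros(dim), cov, size=n_samples)
    return f(z).mean()

# Example: <g(h) k> for (h, k) with unit variances and covariance 0.5;
# the closed form above gives 0.5 * sqrt(2/pi) / sqrt(2) = 0.5 / sqrt(pi).
cov = np.array([[1.0, 0.5],
                [0.5, 1.0]])
estimate = mc_average(cov, lambda z: erf(z[:, 0] / np.sqrt(2)) * z[:, 1])
```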
\nIntuitively, this is reasonable, as the weights become correlated with the examples in the training set during training. In calculating the generalization error, on the other hand, such correlations are absent, and one has the same result as for infinite training sets. The dynamical equations (11), together with (9, 10), constitute our main result. They are exact in the limits of either a linear network (R, Q, T \u2192 0, so that g(x) \u221d x) or \u03b1 \u2192 \u221e, and can be integrated numerically in a straightforward way. In principle, the limit r \u2192 \u221e should be taken but, as shown below, relatively small values of r suffice in practice. \n\n3 RESULTS AND DISCUSSION \n\nWe now discuss the main consequences of our result (11), comparing the resulting predictions for the generalization dynamics, \u03b5_g(t), to the infinite training set theory and to simulations. Throughout, the teacher overlap matrix is set to T_{mm'} = \u03b4_{mm'} (orthogonal teacher weight vectors of length \u221aN). \n\nFigure 2: \u03b5_g vs t for two hidden units (L = M = 2). Left: \u03b1 = 0.5, with \u03b1 = \u221e shown by dashed line for comparison; no noise. Right: \u03b1 = 4, no noise (bottom) and noise on teacher activations and outputs of variance 0.1 (top). Simulations for N = 100 are shown by small circles; standard errors are less than the symbol size. Learning rate \u03b7 = 2 throughout. \n\nIn figure (1), we study the accuracy of our method as a function of the training set size for a nonlinear network with one hidden unit at two different noise levels. The learning rate was set to \u03b7 = 1 for both (a) and (b). 
For small activation and output noise (\u03c3\u00b2 = 0.01), figure (1a), there is good agreement with the simulations down to \u03b1 = 3, below which the theory begins to underestimate the generalization error compared to simulations. Our finite \u03b1 theory, however, is still considerably more accurate than the infinite \u03b1 predictions. For larger noise (\u03c3\u00b2 = 0.1, figure (1b)), our theory provides a reasonable quantitative estimate of the generalization dynamics for \u03b1 > 3. Below this value there is significant disagreement, although the qualitative behaviour of the dynamics is predicted quite well, including the overfitting phenomenon beyond t \u2248 10. The infinite \u03b1 theory in this case is qualitatively incorrect. \nIn the two hidden unit case, figure (2), our theory captures the initial evolution of \u03b5_g(t) very well, but diverges significantly from the simulations at larger t; nevertheless, it provides a considerable improvement on the infinite \u03b1 theory. One reason for the discrepancy at large t is that the theory predicts that different student hidden units will always specialize to individual teacher hidden units for t \u2192 \u221e, whatever the value of \u03b1. This leads to a decay of \u03b5_g from a plateau value at intermediate times t. In the simulations, on the other hand, this specialization (or symmetry breaking) appears to be inhibited, or at least delayed until very large t. This can happen even for zero noise and \u03b1 \u2265 L, where the training data should contain enough information to force student and teacher weights to be equal asymptotically. The reason for this is not clear to us, and deserves further study. Our initial investigations, however, suggest that symmetry breaking may be strongly delayed due to the presence of saddle points in the training error surface with very 'shallow' unstable directions. \n\nWhen our theory fails, which of its assumptions are violated? 
It is conceivable that multiple local minima in the training error surface could cause self-averaging to break down; however, we have found no evidence for this, see figure (3a). On the other hand, the simulation results in figure (3b) clearly show that the implicit assumption of Gaussian student activations - as discussed before eq. (8) - can be violated. \n\nFigure 3: (a) Variance of \u03b5_g(t = 20) vs input dimension N for student and teacher with two hidden units (L = M = 2), \u03b1 = 0.5, \u03b7 = 2, and zero noise. The bottom curve shows the variance due to different random choices of training examples from a fixed training set ('training history'); the top curve also includes the variance due to different training sets. Both are compatible with the 1/N decay expected if self-averaging holds (dotted line). (b) Distribution (over the training set) of the activation h_1 of the first hidden unit of the student. Histogram from simulations for N = 1000; all other parameter values as in (a). \n\nIn summary, the main theoretical contribution of this paper is the extension of online learning analysis for finite training sets to nonlinear networks. Our approximate theory does not require the use of replicas and yields ordinary first order differential equations for the time evolution of a set of order parameters. Its central implicit assumption (and its Achilles' heel) is that the student activations are Gaussian distributed. In comparison with simulations, we have found that it is more accurate than the infinite training set analysis at predicting the generalization dynamics for finite training sets, both qualitatively and also quantitatively for small learning times t. 
Future work will have to show whether the theory can be extended to cope with non-Gaussian student activations without incurring the technical difficulties of dynamical replica theory [2], and whether this will help to capture the effects of local minima and, more generally, 'rough' training error surfaces. \n\nAcknowledgments: We would like to thank Ansgar West for helpful discussions. \n\nReferences \n\n[1] M. Biehl and H. Schwarze. Journal of Physics A, 28:643-656, 1995. \n[2] A. C. C. Coolen, S. N. Laughton, and D. Sherrington. In NIPS 8, pp. 253-259, MIT Press, 1996; S. N. Laughton, A. C. C. Coolen, and D. Sherrington. Journal of Physics A, 29:763-786, 1996. \n[3] See for example: The dynamics of online learning. Workshop at NIPS'95. \n[4] T. Heskes and B. Kappen. Physical Review A, 44:2718-2762, 1994. \n[5] D. Saad and S. A. Solla. Physical Review E, 52:4225, 1995. \n[6] P. Sollich. Journal of Physics A, 27:7771-7784, 1994. \n[7] P. Sollich and D. Barber. In NIPS 9, pp. 274-280, MIT Press, 1997; Europhysics Letters, 38:477-482, 1997. \n", "award": [], "sourceid": 1390, "authors": [{"given_name": "Peter", "family_name": "Sollich", "institution": null}, {"given_name": "David", "family_name": "Barber", "institution": null}]}