{"title": "Statistical Dynamics of Batch Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 286, "page_last": 292, "abstract": null, "full_text": "Statistical Dynamics of Batch Learning \n\nDepartment of Physics, Hong Kong University of Science and Technology \n\ns. Li and K. Y. Michael Wong \n\nClear Water Bay, Kowloon, Hong Kong \n\n{phlisong, phkywong}@ust.hk \n\nAbstract \n\nAn important issue in neural computing concerns the description of \nlearning dynamics with macroscopic dynamical variables. Recen(cid:173)\nt progress on on-line learning only addresses the often unrealistic \ncase of an infinite training set. We introduce a new framework to \nmodel batch learning of restricted sets of examples, widely applica(cid:173)\nble to any learning cost function, and fully taking into account the \ntemporal correlations introduced by the recycling of the examples. \nFor illustration we analyze the effects of weight decay and early \nstopping during the learning of teacher-generated examples. \n\n1 \n\nIntroduction \n\nThe dynamics of learning in neural computing is a complex multi-variate process. \nThe interest on the macroscopic level is thus to describe the process with macro(cid:173)\nscopic dynamical variables. Recently, much progress has been made on modeling \nthe dynamics of on-line learning, in which an independent example is generated for \neach learning step [1, 2]. Since statistical correlations among the examples can be \nignored, the dynamics can be simply described by instantaneous dynamical vari(cid:173)\nables. \nHowever, most studies on on-line learning focus on the ideal case in which the net(cid:173)\nwork has access to an almost infinite training set, whereas in many applications, \nthe collection of training examples may be costly. A restricted set of examples \nintroduces extra temporal correlations during learning, and the dynamics is much \nmore complicated. 
Early studies briefly considered the dynamics of Adaline learning [3, 4, 5], which has recently been extended to linear perceptrons learning nonlinear rules [6, 7]. Recent attempts, using the dynamical replica theory, have been made to study the learning of restricted sets of examples, but so far exact results have been published only for simple learning rules such as Hebbian learning, beyond which appropriate approximations are needed [8]. \n\nIn this paper, we introduce a new framework to model batch learning of restricted sets of examples, widely applicable to any learning rule which minimizes an arbitrary cost function by gradient descent. It fully takes into account the temporal correlations during learning, and is therefore exact for large networks. \n\n2 Formulation \n\nConsider the single-layer perceptron with N >> 1 input nodes {ξ_j} connecting to a single output node by the weights {J_j}. For convenience we assume that the inputs ξ_j are Gaussian variables with mean 0 and variance 1, and the output state S is a function f(x) of the activation x at the output node, i.e. \n\nS = f(x); x = J·ξ. (1) \n\nThe network is assigned to \"learn\" p = αN examples which map inputs {ξ_j^μ} to the outputs {S_μ} (μ = 1, ..., p). The S_μ are the outputs generated by a teacher perceptron {B_j}, namely \n\nS_μ = f(y_μ); y_μ = B·ξ^μ. (2) \n\nBatch learning by gradient descent is achieved by adjusting the weights {J_j} iteratively so that a certain cost function in terms of the student and teacher activations {x_μ} and {y_μ} is minimized. Hence we consider a general cost function \n\nE = −Σ_μ g(x_μ, y_μ). (3) \n\nThe precise functional form of g(x, y) depends on the adopted learning algorithm. For the case of binary outputs, f(x) = sgn x. Early studies on the learning dynamics considered Adaline learning [3, 4, 5], where g(x, y) = −(S − x)²/2 with S = sgn y. For recent studies on Hebbian learning [8], g(x, y) = xS. 
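The formulation above can be made concrete with a short numerical sketch: Gaussian inputs, a teacher perceptron generating binary outputs, and the Adaline and Hebbian cost terms g(x, y). The unit teacher normalization and all parameter values here are our own illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500                 # number of input nodes (illustrative)
alpha = 0.5             # load parameter, p = alpha * N
p = int(alpha * N)

# Gaussian inputs xi_j^mu with mean 0 and variance 1
xi = rng.standard_normal((p, N))

# Teacher perceptron {B_j}; unit norm is our assumption
B = rng.standard_normal(N)
B /= np.linalg.norm(B)

y = xi @ B              # teacher activations y_mu = B . xi^mu
S = np.sign(y)          # binary teacher outputs S_mu = sgn(y_mu)

def g_adaline(x, y):
    """Adaline cost term g(x, y) = -(S - x)^2 / 2 with S = sgn(y)."""
    return -0.5 * (np.sign(y) - x) ** 2

def g_hebbian(x, y):
    """Hebbian cost term g(x, y) = x * S."""
    return x * np.sign(y)
```

Minimizing E = −Σ_μ g(x_μ, y_μ) with the Adaline term drives the student activations toward the binary teacher outputs; the Hebbian term simply rewards alignment with them.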
\nTo ensure that the perceptron is regularized after learning, it is customary to introduce a weight decay term. Furthermore, to avoid the system being trapped in local minima, noise is often added in the dynamics. Hence the gradient descent dynamics is given by \n\ndJ_j(t)/dt = (1/N) Σ_μ g′(x_μ(t), y_μ) ξ_j^μ − λJ_j(t) + η_j(t), (4) \n\nwhere, here and below, g′(x, y) and g″(x, y) respectively represent the first and second partial derivatives of g(x, y) with respect to x, λ is the weight decay strength, and η_j(t) is the noise term at temperature T with \n\n⟨η_j(t)η_k(s)⟩ = 2T δ_jk δ(t − s). (5) \n\n3 The Cavity Method \n\nOur theory is the dynamical version of the cavity method [9, 10, 11]. It uses a self-consistency argument to consider what happens when a new example is added to a training set. The central quantity in this method is the cavity activation, which is the activation of a new example for a perceptron trained without that example. Since the original network has no information about the new example, the cavity activation is stochastic. Specifically, denoting the new example by the label 0, its cavity activation at time t is \n\nh_0(t) = J(t)·ξ^0. (6) \n\nFor large N and independently generated examples, h_0(t) is a Gaussian variable. Its covariance is given by the correlation function C(t, s) of the weights at times t and s, that is, \n\n⟨h_0(t)h_0(s)⟩ = J(t)·J(s) ≡ C(t, s), (7) \n\nwhere ξ_j^0 and ξ_k^0 are assumed to be independent for j ≠ k. The distribution is further specified by the teacher-student correlation R(t), given by \n\n⟨h_0(t)y_0⟩ = J(t)·B ≡ R(t). (8) \n\nNow suppose the perceptron incorporates the new example at the batch-mode learning step at time s. Then the activation of this new example at a subsequent time t > s will no longer be a random variable. 
Furthermore, the activations of the original p examples at time t will also be adjusted from {x_μ(t)} to {x_μ^0(t)} because of the newcomer, which will in turn affect the evolution of the activation of example 0, giving rise to the so-called Onsager reaction effects. This makes the dynamics complex, but fortunately for large p ~ N, we can assume that the adjustment from x_μ(t) to x_μ^0(t) is small, and perturbative analysis can be applied. \n\nSuppose the weights of the original and new perceptron at time t are {J_j(t)} and {J_j^0(t)} respectively. Then a perturbation of (4) yields \n\n(d/dt + λ)(J_j^0(t) − J_j(t)) = (1/N) g′(x_0(t), y_0) ξ_j^0 + (1/N) Σ_{μk} ξ_j^μ g″(x_μ(t), y_μ) ξ_k^μ (J_k^0(t) − J_k(t)). (9) \n\nThe first term on the right hand side describes the primary effects of adding example 0 to the training set, and is the driving term for the difference between the two perceptrons. The second term describes the secondary effects due to the changes to the original examples caused by the added example, and is referred to as the Onsager reaction term. One should note the difference between the cavity and generic activations of the added example. The former is denoted by h_0(t) and corresponds to the activation in the perceptron {J_j(t)}, whereas the latter, denoted by x_0(t) and corresponding to the activation in the perceptron {J_j^0(t)}, is the one used in calculating the gradient in the driving term of (9). Since their notations are sufficiently distinct, we have omitted the superscript 0 in x_0(t), which appears in the background examples x_μ^0(t). \n\nThe equation can be solved by the Green's function technique, yielding \n\nJ_j^0(t) − J_j(t) = Σ_k ∫ ds G_jk(t, s) (1/N) g_0′(s) ξ_k^0, (10) \n\nwhere g_0′(s) = g′(x_0(s), y_0) and G_jk(t, s) is the weight Green's function satisfying \n\nG_jk(t, s) = G^(0)(t − s) δ_jk + (1/N) Σ_{μl} ∫ dt′ G^(0)(t − t′) ξ_j^μ g_μ″(t′) ξ_l^μ G_lk(t′, s), (11) \n\nwhere G^(0)(t − s) = Θ(t − s) exp(−λ(t − s)) is the bare Green's function, and Θ is the step function. The weight Green's function describes how the effects of example 0 propagate from weight J_k at learning time s to weight J_j at a subsequent time t, including both primary and secondary effects. Hence all the temporal correlations have been taken into account. \n\nFor large N, the equation can be solved by a diagrammatic approach similar to [5]. The weight Green's function is self-averaging over the distribution of examples and is diagonal, i.e. lim_{N→∞} G_jk(t, s) = G(t, s) δ_jk, where \n\nG(t, s) = G^(0)(t − s) + α ∫ dt₁ ∫ dt₂ G^(0)(t − t₁) ⟨g_μ″(t₁) D_μ(t₁, t₂)⟩ G(t₂, s). (12) \n\nD_μ(t, s) is the example Green's function given by \n\nD_μ(t, s) = δ(t − s) + ∫ dt′ G(t, t′) g_μ″(t′) D_μ(t′, s). (13) \n\nThis allows us to express the generic activations of the examples in terms of their cavity counterparts. Multiplying both sides of (10) by ξ_j^0 and summing over j, we get \n\nx_0(t) − h_0(t) = ∫ ds G(t, s) g_0′(s). (14) \n\nThis equation is interpreted as follows. At time t, the generic activation x_0(t) deviates from its cavity counterpart because its gradient term g_0′(s) was present in the batch learning step at previous times s. This gradient term propagates its influence from time s to t via the Green's function G(t, s). Statistically, this equation enables us to express the activation distribution in terms of the cavity activation distribution, thereby getting a macroscopic description of the dynamics. \n\nTo solve for the Green's functions and the activation distributions, we further need the fluctuation-response relation derived by linear response theory, \n\nC(t, s) = α ∫ dt′ G^(0)(t − t′) ⟨g_μ′(t′) x_μ(s)⟩ + 2T ∫ dt′ G^(0)(t − t′) G(s, t′). (15) \n\nFinally, the teacher-student correlation is given by \n\nR(t) = α ∫ dt′ G^(0)(t − t′) ⟨g_μ′(t′) y_μ⟩. (16) 
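The macroscopic quantities appearing above, C(t, t) and R(t), can also be estimated by directly simulating the microscopic gradient descent dynamics (4). The following sketch does this for the Adaline rule at T = 0 with a simple Euler discretization; the step size, the initial condition J(0) = 0, the unit teacher normalization, and all parameter values are our own assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, alpha, lam = 200, 0.5, 0.5      # network size, load p/N, weight decay (illustrative)
dt, steps = 0.05, 200              # Euler step and number of steps (illustrative)
p = int(alpha * N)

xi = rng.standard_normal((p, N))   # Gaussian inputs xi_j^mu
B = rng.standard_normal(N)
B /= np.linalg.norm(B)             # unit-norm teacher (assumption)
S = np.sign(xi @ B)                # binary teacher outputs S_mu

J = np.zeros(N)                    # tabula rasa initial weights (assumption)
R_traj, C_traj = [], []
for _ in range(steps):
    x = xi @ J                     # student activations x_mu(t)
    g_prime = S - x                # Adaline gradient g'(x, y) = S - x
    # Euler step of dJ_j/dt = (1/N) sum_mu g' xi_j^mu - lam * J_j at T = 0
    J += dt * ((g_prime @ xi) / N - lam * J)
    R_traj.append(float(J @ B))    # teacher-student correlation R(t)
    C_traj.append(float(J @ J))    # equal-time correlation C(t, t)
```

Averaging such trajectories over many random training sets gives the kind of simulation points against which the macroscopic theory is compared in Section 4.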
\n\n4 A Solvable Case \n\nThe cavity method can be applied to the dynamics of learning with an arbitrary cost function. When it is applied to the Hebb rule, it yields results identical to the exact results in [8]. Here we present the results for the Adaline rule to illustrate features of learning dynamics derivable from the study. This is a common learning rule and bears resemblance with the more common back-propagation rule. Theoretically, its dynamics is particularly convenient for analysis since g″(x, y) = −1, rendering the weight Green's function time translation invariant, i.e. G(t, s) = G(t − s). In this case, the dynamics can be solved by Laplace transform. \n\nTo monitor the progress of learning, we are interested in three performance measures: (a) Training error ε_t, which is the probability of error for the training examples. It is given by ε_t = ⟨Θ(−x sgn y)⟩_xy, where the average is taken over the joint distribution p(x, y) of the training set. (b) Test error ε_test, which is the probability of error when the inputs ξ_j^μ of the training examples are corrupted by an additive Gaussian noise of variance Δ². This is a relevant performance measure when the perceptron is applied to process data which are the corrupted versions of the training data. It is given by ε_test = ⟨H(x sgn y / (Δ√C(t, t)))⟩_xy. When Δ² = 0, the test error reduces to the training error. (c) Generalization error ε_g, which is the probability of error for an arbitrary input ξ_j when the teacher and student outputs are compared. It is given by ε_g = arccos[R(t)/√C(t, t)]/π. \n\nFigure 1(a) shows the evolution of the generalization error at T = 0. When the weight decay strength varies, the steady-state generalization error is minimized at the optimum \n\nλ_opt = π/2 − 1, (17) \n\nwhich is independent of α. 
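The three performance measures (a)-(c) can be evaluated numerically from sampled activation pairs (x, y) and the order parameters R and C = C(t, t), with H(u) the Gaussian tail probability. This is a sketch under the unit teacher normalization implicit in the formula for ε_g; the sample values at the end are purely illustrative.

```python
import numpy as np
from math import erfc, sqrt, acos, pi

def H(u):
    """Gaussian tail probability H(u) = integral from u to infinity of Dz."""
    return 0.5 * erfc(u / sqrt(2.0))

def performance_measures(x, y, R, C, Delta):
    """Training, test, and generalization errors from student/teacher
    activations {x_mu}, {y_mu} and order parameters R, C = C(t, t).
    Delta is the input-noise standard deviation used in the test error."""
    s = x * np.sign(y)
    eps_t = float(np.mean(s < 0))                          # <Theta(-x sgn y)>
    eps_test = float(np.mean([H(v / (Delta * sqrt(C))) for v in s]))
    eps_g = acos(R / sqrt(C)) / pi                         # arccos(R/sqrt(C))/pi
    return eps_t, eps_test, eps_g

# Illustrative values: a student that reproduces the teacher activations exactly
x = y = np.array([0.8, -1.5, 2.2, -0.3])
eps_t, eps_test, eps_g = performance_measures(x, y, R=1.0, C=1.0, Delta=1.0)
```

With a perfectly aligned student (R = C = 1) the training and generalization errors vanish, while the input noise Δ still produces a nonzero test error.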
It is interesting to note that in the case of the linear perceptron, the optimal weight decay strength is also independent of α and is only determined by the output noise and the unlearnability of the examples [5, 7]. Similarly, here the student is only provided the coarse-grained version of the teacher's activation in the form of binary bits. \n\nFor λ < λ_opt, the generalization error is a non-monotonic function of learning time. Hence the dynamics is plagued by overtraining, and it is desirable to introduce early stopping to improve the perceptron performance. Similar behavior is observed in linear perceptrons [5, 6, 7]. \n\nTo verify the theoretical predictions, simulations were done with N = 500 and using 50 samples for averaging. As shown in Fig. 1(a), the agreement is excellent. \n\nFigure 1(b) compares the generalization errors at the steady state and the early stopping point. It shows that early stopping improves the performance for λ < λ_opt, which becomes near-optimal when compared with the best result at λ = λ_opt. Hence early stopping can speed up the learning process without significant sacrifice in the generalization ability. However, it cannot outperform the optimal result at steady state. This agrees with a recent empirical observation that a careful control of the weight decay may be better than early stopping in optimizing generalization [12]. \n\nFigure 1: (a) The evolution of the generalization error at T = 0 for α = 0.5, 1.2 and different weight decay strengths λ. Theory: solid line, simulation: symbols. (b) Comparing the generalization error at the steady state (∞) and at the early stopping point (t_es) for α = 0.5, 1.2 and T = 0. \n\nIn the search for optimal learning algorithms, an important consideration is the environment in which the performance is tested. Besides the generalization performance, there are applications in which the test examples have inputs correlated with the training examples. Hence we are interested in the evolution of the test error for a given additive Gaussian noise Δ in the inputs. Figure 2(a) shows, again, that there is an optimal weight decay parameter λ_opt which minimizes the test error. Furthermore, when the weight decay is weak, early stopping is desirable. \n\nFigure 2(b) shows the value of the optimal weight decay as a function of the input noise variance Δ². To the lowest order approximation, λ_opt ∝ Δ² for sufficiently large Δ². The dependence of λ_opt on input noise is rather general, since it also holds in the case of random examples [13]. In the limit of small Δ², λ_opt vanishes as Δ² for α < 1, whereas λ_opt approaches a nonzero constant for α > 1. Hence for α < 1, weight decay is not necessary when the training error is optimized, but when the perceptron is applied to process increasingly noisy data, weight decay becomes more and more important in performance enhancement. \n\nFigure 2(b) also shows the phase line λ_ot(Δ²) below which overtraining occurs. Again, to the lowest order approximation, λ_ot ∝ Δ² for sufficiently large Δ². 
However, unlike the case of the generalization error, the line for the onset of overtraining does not coincide exactly with the line of optimal weight decay. In particular, for an intermediate range of input noise, the optimal line lies in the region of overtraining, so that the optimal performance can only be attained by tuning both the weight decay strength and the learning time. However, at least in the present case, computational results show that the improvement is marginal. \n\nFigure 2: (a) The evolution of the test error for Δ² = 3, T = 0 and different weight decay strengths λ (λ_opt ≈ 1.5, 3.6 for α = 0.5, 1.2 respectively). (b) The lines of the optimal weight decay and the onset of overtraining for α = 5. Inset: the same data with λ_ot − λ_opt (magnified) versus Δ². \n\n5 Conclusion \n\nBased on the cavity method, we have introduced a new framework for modeling the dynamics of learning, which is applicable to any learning cost function, making it a versatile theory. It takes into full account the temporal correlations generated by the use of a restricted set of examples, which is more realistic in many situations than theories of on-line learning with an infinite training set. \n\nWhile the Adaline rule is solvable by the cavity method, it is still a relatively simple model approachable by more direct methods. Hence the justification of the method as a general framework for learning dynamics hinges on its applicability to less trivial cases. In general, g_μ″(t′) in (13) is not a constant and D_μ(t, s) has to be expanded as a series. 
The dynamical equations can then be considered as the starting point of a perturbation theory, and results in various limits can be derived, e.g. the limits of small α, large α, large λ, or the asymptotic limit. Another area for the useful application of the cavity method is the case of batch learning with very large learning steps. Since it has been shown recently that such learning converges in a few steps [6], the dynamical equations remain simple enough for a meaningful study. Preliminary results along this direction are promising and will be reported elsewhere. \n\nAn alternative general theory for learning dynamics, the dynamical replica theory, has recently been developed [8]. It yields exact results for Hebbian learning, and approximate results for more non-trivial cases. Based on certain self-averaging assumptions, the theory is able to approximate the dynamics by the evolution of single-time functions, at the expense of having to solve a set of saddle point equations in the replica formalism at every learning instant. On the other hand, our theory retains the functions G(t, s) and C(t, s) with double arguments, but develops naturally from the stochastic nature of the cavity activations. Contrary to a suggestion [14], the cavity method can also be applied to on-line learning with restricted sets of examples. It is hoped that by adhering to an exact formalism, the cavity method can provide more fundamental insights when the studies are extended to more sophisticated multilayer networks of practical importance. \n\nThe method enables us to study the effects of weight decay and early stopping. It shows that the optimal strength of weight decay is determined by the imprecision in the examples, or the level of input noise in anticipated applications. For weaker weight decay, the generalization performance can be made near-optimal by early stopping. 
Furthermore, depending on the performance measure, optimality may only be attained by a combination of weight decay and early stopping. Though the performance improvement is marginal in the present case, the question remains open in the more general context. \n\nWe consider the present work as the beginning of an in-depth study of learning dynamics. Many interesting and challenging issues remain to be explored. \n\nAcknowledgments \n\nWe thank A. C. C. Coolen and D. Saad for fruitful discussions during NIPS. This work was supported by the grant HKUST6130/97P from the Research Grant Council of Hong Kong. \n\nReferences \n\n[1] D. Saad and S. Solla, Phys. Rev. Lett. 74, 4337 (1995). \n[2] D. Saad and M. Rattray, Phys. Rev. Lett. 79, 2578 (1997). \n[3] J. Hertz, A. Krogh and G. I. Thorbergsson, J. Phys. A 22, 2133 (1989). \n[4] M. Opper, Europhys. Lett. 8, 389 (1989). \n[5] A. Krogh and J. A. Hertz, J. Phys. A 25, 1135 (1992). \n[6] S. Bos and M. Opper, J. Phys. A 31, 4835 (1998). \n[7] S. Bos, Phys. Rev. E 58, 833 (1998). \n[8] A. C. C. Coolen and D. Saad, in On-line Learning in Neural Networks, ed. D. Saad (Cambridge University Press, Cambridge, 1998). \n[9] M. Mezard, G. Parisi and M. Virasoro, Spin Glass Theory and Beyond (World Scientific, Singapore, 1987). \n[10] K. Y. M. Wong, Europhys. Lett. 30, 245 (1995). \n[11] K. Y. M. Wong, Advances in Neural Information Processing Systems 9, 302 (1997). \n[12] L. K. Hansen, J. Larsen and T. Fog, IEEE Int. Conf. on Acoustics, Speech, and Signal Processing 4, 3205 (1997). \n[13] Y. W. Tong, K. Y. M. Wong and S. Li, to appear in Proc. of IJCNN'99 (1999). \n[14] A. C. C. Coolen and D. Saad, Preprint KCL-MTH-99-33 (1999). \n", "award": [], "sourceid": 1697, "authors": [{"given_name": "Song", "family_name": "Li", "institution": null}, {"given_name": "K. Y. Michael", "family_name": "Wong", "institution": null}]}