[x, y] and M^+[x|y] = M[x|y], leaving us exactly with the formalism of [6, 5] describing the case of noise-free teachers and restricted training sets (apart from some new terms due to the presence of weight decay, which was absent in [6, 5]).

P[x, z|y] = \delta[z-y]\, P[x|y]

Supervised Learning with Restricted Training Sets

Figure 1: On-line Hebbian learning: conditionally Gaussian approximation versus exact solution in [9] (\eta = 1, \lambda = 0.2). Left: \gamma = 0.1, right: \gamma = 0.5. Solid lines: approximated theory, dashed lines: exact result. Upper curves: E_g as functions of time (here the two theories agree), lower curves: E_t as functions of time.

6 Benchmark Tests: Hebbian Learning

The special case of Hebbian learning, i.e. \mathcal{G}[x, z] = {\rm sgn}(z), can be solved exactly at any time, for arbitrary \{\alpha, \lambda, \gamma\} [9], providing yet another excellent benchmark for our theory. For batch execution of Hebbian learning the macroscopic laws are obtained upon expanding (11,12) and retaining only those terms which are linear in \eta. All integrations can now be done and all equations solved explicitly, resulting in U = 0, Z = 1, W = (1-2\lambda)\sqrt{2/\pi}, and

Q = Q_0\, e^{-2\eta\gamma t} + \frac{2R_0(1-2\lambda)}{\gamma}\sqrt{\frac{2}{\pi}}\, e^{-\eta\gamma t}\left[1-e^{-\eta\gamma t}\right] + \left[\frac{2}{\pi}(1-2\lambda)^2 + \frac{1}{\alpha}\right]\frac{\left[1-e^{-\eta\gamma t}\right]^2}{\gamma^2}

R = R_0\, e^{-\eta\gamma t} + (1-2\lambda)\sqrt{2/\pi}\left[1-e^{-\eta\gamma t}\right]/\gamma

P^{\pm}[x|y] = \left[2\pi(Q-R^2)\right]^{-\frac{1}{2}}\, e^{-\frac{1}{2}\left[x - Ry \mp {\rm sgn}(y)\left[1-e^{-\eta\gamma t}\right]/\alpha\gamma\right]^2/(Q-R^2)}

q = \left[\alpha R^2 + \left(1-e^{-\eta\gamma t}\right)^2/\gamma^2\right]/\alpha Q    (19)

From these results, in turn, follow the performance measures E_g = \pi^{-1}\arccos\left[R/\sqrt{Q}\right] and

E_t = \frac{1}{2} - \frac{1}{2}(1-\lambda)\int\!Dy\; {\rm erf}\!\left[\frac{|y|R + \left[1-e^{-\eta\gamma t}\right]/\alpha\gamma}{\sqrt{2(Q-R^2)}}\right] + \frac{1}{2}\lambda\int\!Dy\; {\rm erf}\!\left[\frac{|y|R - \left[1-e^{-\eta\gamma t}\right]/\alpha\gamma}{\sqrt{2(Q-R^2)}}\right]

Comparison with the exact solution, calculated along the lines of [9] or, equivalently, obtained upon putting t \ll \eta^{-2} in [9], shows that the above expressions are all exact.

For on-line execution we cannot (yet) solve the functional saddle-point equation in general. However, some analytical predictions can still be extracted from (11,12,13):

Q = Q_0\, e^{-2\eta\gamma t} + \frac{2R_0(1-2\lambda)}{\gamma}\sqrt{\frac{2}{\pi}}\, e^{-\eta\gamma t}\left[1-e^{-\eta\gamma t}\right] + \left[\frac{2}{\pi}(1-2\lambda)^2 + \frac{1}{\alpha}\right]\frac{\left[1-e^{-\eta\gamma t}\right]^2}{\gamma^2} + \frac{\eta}{2\gamma}\left[1-e^{-2\eta\gamma t}\right]

R = R_0\, e^{-\eta\gamma t} + (1-2\lambda)\sqrt{2/\pi}\left[1-e^{-\eta\gamma t}\right]/\gamma

\int\!dx\; x\, P^{\pm}[x|y] = Ry \pm {\rm sgn}(y)\left[1-e^{-\eta\gamma t}\right]/\alpha\gamma

with U = 0, W = (1-2\lambda)\sqrt{2/\pi}, V = WR + \left[1-e^{-\eta\gamma t}\right]/\alpha\gamma, and Z = 1. Comparison with the results in [9] shows that the above expressions, and thus also that of E_g, are fully exact, at any time. Observables involving P[x, y, z] (including the training error) are not as easily solved from our equations. Instead we used the conditionally Gaussian approximation (found to be adequate for the noiseless Hebbian case [5, 6, 7]). The result is shown in figure 1. The agreement is reasonable, but significantly less than that in [6]; apparently teacher noise adds to the deformation of the field distribution away from a Gaussian shape.

A. C. C. Coolen and C. W. H. Mace

Figure 2: Large \alpha approximation versus numerical simulations (with N = 10,000), for \gamma = 0 and \lambda = 0.2. Top row: Perceptron rule, with \eta = 1/2. Bottom row: Adatron rule, with \eta = 1/2. Left: training errors E_t and generalisation errors E_g as functions of time, for \alpha \in \{1/2, 1, 2\}. Lines: approximated theory, markers: simulations (circles: E_t, squares: E_g). Right: joint distributions for student field and teacher noise P^{\pm}[x] = \int\!dy\; P[x, y, z = \pm y] (upper: P^+[x], lower: P^-[x]). Histograms: simulations, lines: approximated theory.

7 Non-Linear Learning Rules: Theory versus Simulations

In the case of non-linear learning rules no exact solution is known against which to test our formalism, leaving numerical simulations as the yardstick. We have evaluated numerically the large \alpha approximation of our theory for Perceptron learning, \mathcal{G}[x, z] = {\rm sgn}(z)\theta[-xz], and for Adatron learning, \mathcal{G}[x, z] = {\rm sgn}(z)|z|\theta[-xz]. This approximation leads to the following fully explicit equation for the field distributions:

\frac{d}{dt} P^{\pm}[x|y] = \frac{1}{\alpha}\int\!dx'\; P^{\pm}[x'|y]\left\{\delta\left[x-x'-\eta\mathcal{G}[x',\pm y]\right] - \delta[x-x']\right\} + \frac{1}{2}\eta^2 Z \frac{\partial^2}{\partial x^2} P^{\pm}[x|y]

\qquad - \eta\frac{\partial}{\partial x}\left\{P^{\pm}[x|y]\left[Wy - \gamma x + \frac{U\left[X^{\pm}(y)-Ry\right] + (V-RW)\left[x-X^{\pm}(y)\right]}{Q-R^2}\right]\right\}

with

U = \int\!Dy\,dx\; \left\{(1-\lambda)P^+[x|y]\left[x-X^+(y)\right]\mathcal{G}[x,y] + \lambda P^-[x|y]\left[x-X^-(y)\right]\mathcal{G}[x,-y]\right\}

V = \int\!Dy\,dx\; x\left\{(1-\lambda)P^+[x|y]\mathcal{G}[x,y] + \lambda P^-[x|y]\mathcal{G}[x,-y]\right\}

W = \int\!Dy\,dx\; y\left\{(1-\lambda)P^+[x|y]\mathcal{G}[x,y] + \lambda P^-[x|y]\mathcal{G}[x,-y]\right\}

Z = \int\!Dy\,dx\; \left\{(1-\lambda)P^+[x|y]\mathcal{G}^2[x,y] + \lambda P^-[x|y]\mathcal{G}^2[x,-y]\right\}

and with the short-hands X^{\pm}(y) = \int\!dx\; x\, P^{\pm}[x|y].
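The closed-form batch Hebbian solution of section 6 can be checked numerically: evaluating Q(t), R(t) and the generalisation error E_g = \pi^{-1}\arccos[R/\sqrt{Q}] shows E_g decaying from the random-guessing value 1/2 (for R_0 = 0) towards a finite asymptote set by the teacher-noise level \lambda and the training-set size \alpha. The sketch below assumes illustrative parameter values (Q_0, R_0, \eta, \lambda, \gamma, \alpha are free choices, not numbers taken from the paper):

```python
import numpy as np

def batch_hebbian(t, Q0=1.0, R0=0.0, eta=1.0, lam=0.2, gamma=0.5, alpha=4.0):
    """Closed-form batch Hebbian order parameters Q(t), R(t) (section 6)."""
    e = np.exp(-eta * gamma * t)          # common relaxation factor e^{-eta gamma t}
    c = 1.0 - 2.0 * lam                    # teacher-noise reduction factor (1 - 2 lambda)
    R = R0 * e + c * np.sqrt(2.0 / np.pi) * (1.0 - e) / gamma
    Q = (Q0 * e**2
         + 2.0 * R0 * c * e * (1.0 - e) * np.sqrt(2.0 / np.pi) / gamma
         + ((2.0 / np.pi) * c**2 + 1.0 / alpha) * (1.0 - e)**2 / gamma**2)
    return Q, R

def generalisation_error(Q, R):
    """E_g = arccos(R / sqrt(Q)) / pi; clip guards against rounding outside [-1, 1]."""
    return np.arccos(np.clip(R / np.sqrt(Q), -1.0, 1.0)) / np.pi

t = np.linspace(0.0, 10.0, 101)
Q, R = batch_hebbian(t)
Eg = generalisation_error(Q, R)
# With R0 = 0 the error starts at 1/2 and decays towards a non-zero
# asymptote: finite alpha and teacher noise lambda prevent perfect learning.
```

The non-zero large-t plateau is the qualitative point: unlike the noise-free, unrestricted case, E_g does not vanish, because the asymptotic ratio R/\sqrt{Q} stays strictly below one.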
The result of our comparison is shown in figure 2. Note: E_t increases monotonically with \alpha, and E_g decreases monotonically with \alpha, at any t. As in the noise-free formalism [7], the large \alpha approximation appears to capture the dominant terms both for \alpha \to \infty and for \alpha \to 0. The predictive power of our theory is mainly limited by numerical constraints. For instance, the Adatron learning rule generates singularities at x = 0 in the distributions P^{\pm}[x|y] (especially for small \gamma) which, although predicted by our theory, are almost impossible to capture in numerical solutions.

8 Discussion

We have shown how a recent theory describing the dynamics of supervised learning with restricted training sets in large layered neural networks (designed to apply in the data-recycling regime, and for arbitrary on-line and batch learning rules) [5, 6, 7] can be generalized successfully to deal also with noisy teachers. In our generalized approach the joint distribution P[x, y, z] for the fields of student, 'clean' teacher, and noisy teacher is taken to be a dynamical order parameter, in addition to the conventional observables Q and R. From the order-parameter set {Q, R, P} we derive the generalization error E_g and the training error E_t. Following the prescriptions of dynamical replica theory one finds a diffusion equation for P[x, y, z], which we have evaluated by making the replica-symmetric ansatz. We have carried out several orthogonal benchmark tests of our theory: (i) for \alpha \to \infty (no data recycling) our theory is exact, (ii) for \lambda \to 0 (no teacher noise) our theory reduces to that of [5, 6, 7], and (iii) for batch Hebbian learning our theory is exact.
For on-line Hebbian learning our theory is exact with regard to the predictions for Q, R, E_g and the y-dependent conditional averages \int dx\, x P^{\pm}[x|y], at any time, and a crude approximation of our equations already gives reasonable agreement with the exact results [9] for E_t. For non-linear learning rules (Perceptron and Adatron) we have compared numerical solution of a simple large \alpha approximation of our equations to numerical simulations, and found satisfactory agreement. This paper is a preliminary presentation of results obtained in the second stage of a research programme aimed at extending our theoretical tools in the arena of learning dynamics, building on [5, 6, 7]. Ongoing work is aimed at systematic application of our theory and its approximations to various types of non-linear learning rules, and at generalization of the theory to multi-layer networks.

References

[1] Mace C.W.H. and Coolen A.C.C. (1998), Statistics and Computing 8, 55
[2] Saad D. (ed.) (1998), On-Line Learning in Neural Networks (Cambridge: CUP)
[3] Hertz J.A., Krogh A. and Thorbergsson G.I. (1989), J. Phys. A 22, 2133
[4] Horner H. (1992a), Z. Phys. B 86, 291 and Horner H. (1992b), Z. Phys. B 87, 371
[5] Coolen A.C.C. and Saad D. (1998), in On-Line Learning in Neural Networks, Saad D. (ed.), (Cambridge: CUP)
[6] Coolen A.C.C. and Saad D. (1999), in Advances in Neural Information Processing Systems 11, Kearns M.S., Solla S.A., Cohn D.A. (eds.), (MIT Press)
[7] Coolen A.C.C. and Saad D. (1999), preprints KCL-MTH-99-32 & KCL-MTH-99-33
[8] Rae H.C., Sollich P. and Coolen A.C.C. (1999), in Advances in Neural Information Processing Systems 11, Kearns M.S., Solla S.A., Cohn D.A. (eds.), (MIT Press)
[9] Rae H.C., Sollich P. and Coolen A.C.C. (1999), J. Phys. A 32, 3321
[10] Inoue J.I. (1999), private communication
[11] Wong K.Y.M., Li S. and Tong Y.W. (1999), preprint cond-mat/9909004
[12] Biehl M., Riegler P. and Stechert M.
(1995), Phys. Rev. E 52, 4624