{"title": "Dynamics of Training", "book": "Advances in Neural Information Processing Systems", "page_first": 141, "page_last": 147, "abstract": null, "full_text": "Dynamics of Training \n\nSiegfried Bos* \n\nLab for Information Representation \nRIKEN, Hirosawa 2-1, Wako-shi \n\nSaitama 351-01, Japan \n\nManfred Opper \n\nTheoretical Physics III \nUniversity of Wiirzburg \n\n97074 Wiirzburg, Germany \n\nAbstract \n\nA new method to calculate the full training process of a neural net(cid:173)\nwork is introduced. No sophisticated methods like the replica trick \nare used. The results are directly related to the actual number of \ntraining steps. Some results are presented here, like the maximal \nlearning rate, an exact description of early stopping, and the neces(cid:173)\nsary number of training steps. Further problems can be addressed \nwith this approach. \n\nINTRODUCTION \n\n1 \nTraining guided by empirical risk minimization does not always minimize the ex(cid:173)\npected risk. This phenomenon is called overfitting and is one of the major problems \nin neural network learning. In a previous work [Bos 1995] we developed an approx(cid:173)\nimate description of the training process using statistical mechanics. To solve this \nproblem exactly, we introduce a new description which is directly dependent on the \nactual training steps. As a first result we get analytical curves for empirical risk and \nexpected risk as functions of the training time, like the ones shown in Fig. l. \n\nTo make the method tractable we restrict ourselves to a quite simple neural net(cid:173)\n\nwork model, which nevertheless demonstrates some typical behavior of neural nets. \nThe model is a single layer perceptron, which has one N -dim. layer of adjustable \nweights W between input x and output z. The outputs are linear, Le. \n\nZ = h = r;:;r L WiXi . \n\n1 N \n\nvN i=l \n\n(1) \n\nWe are interested in supervised learning, where examples xf (J-L = 1, ... 
, P) are given for which the correct output z_*^μ is known. To define the task more clearly and to monitor the training process, we assume that the examples are provided by another network, the so-called teacher network. The teacher is not restricted to linear outputs; it can have a nonlinear output function g_*(h_*). \n\n* email: boes@zoo.riken.go.jp and opper@physik.uni-wuerzburg.de \n\nLearning by examples attempts to minimize the error averaged over all examples, i.e. E_T := 1/2 ⟨(z_*^μ − z^μ)²⟩_{x^μ}, which is called training error or empirical risk. In fact, what we are interested in is a minimal error averaged over all possible inputs x, i.e. E_G := 1/2 ⟨(z_* − z)²⟩_{x ∈ Input}, called generalization error or expected risk. It can be shown [see Bös 1995] that for random inputs, i.e. all components x_i are independent and have zero means and unit variance, the generalization error can be described by the order parameters R and Q, \n\n$$E_G(t) = \frac{1}{2}\left[ G - 2 H R(t) + Q(t) \right]$$ (2) \n\nwith the two parameters G = ⟨[g_*(h)]²⟩_h and H = ⟨g_*(h) h⟩_h. The order parameters are defined as: \n\n$$R(t) = \Big\langle \frac{1}{N} \sum_{i=1}^{N} W_i^* W_i(t) \Big\rangle_{\{W^*\}}, \qquad Q(t) = \Big\langle \frac{1}{N} \sum_{i=1}^{N} W_i^2(t) \Big\rangle_{\{W^*\}} .$$ (3) \n\nAs a novelty in this paper we average the order parameters not, as usual in statistical mechanics, over many example realizations {x^μ}, but over many teacher realizations {W^*}, where we use a spherical distribution. This corresponds to a Bayesian average over the unknown teacher. A study of the static properties of this model was done by Saad [1996]. Further comments about the averages can be found in the appendix. \n\nIn the next section we introduce our new method briefly. Readers who do not wish to go into the technical details on first reading can turn directly to the results (15) and (16). The remainder of the section can be read later, as a proof. In the third section results will be presented and discussed. 
Finally, we conclude the paper with a summary and a perspective on further problems. \n\n2 DYNAMICAL APPROACH \n\nBasically we exploit the gradient descent learning rule, using the linear student, i.e. g(h) = h and z^μ = h^μ = (1/√N) Σ_i W_i x_i^μ, \n\n$$W_i(t+1) = W_i(t) + \frac{\eta}{\sqrt{N}} \sum_{\mu=1}^{P} \left( z_*^\mu - h^\mu \right) x_i^\mu .$$ (4) \n\nFor P < N, the weights are linear combinations of the example inputs x_i^μ, if W_i(0) = 0, \n\n$$W_i(t) = \frac{1}{\sqrt{N}} \sum_{\mu=1}^{P} a^\mu(t)\, x_i^\mu .$$ (5) \n\nAfter some algebra a recursion for a^μ(t) can be found, i.e. \n\n$$a^\mu(t+1) = \sum_{\nu=1}^{P} \left[ \delta^{\mu\nu} - \eta \left( \frac{1}{N} \sum_{i=1}^{N} x_i^\mu x_i^\nu \right) \right] a^\nu(t) + \eta\, z_*^\mu ,$$ (6) \n\nwhere the term in the round brackets defines the overlap matrix C^{μν}. From the geometric series we know the solution of this recursion, and therefore for the weights \n\n$$W_i(t) = \frac{1}{\sqrt{N}} \sum_{\mu,\nu=1}^{P} z_*^\nu \left[ C^{-1} \left( E - (E - \eta C)^t \right) \right]^{\nu\mu} x_i^\mu .$$ (7) \n\nIt fulfills the initial conditions W_i(0) = 0 and W_i(1) = (η/√N) Σ_μ z_*^μ x_i^μ (Hebbian), and yields after infinite time steps the so-called pseudo-inverse weights, i.e. \n\n$$W_i(\infty) = \frac{1}{\sqrt{N}} \sum_{\mu,\nu=1}^{P} z_*^\nu \left( C^{-1} \right)^{\nu\mu} x_i^\mu .$$ (8) \n\nThis is valid as long as the examples are linearly independent, i.e. P < N. Remarks about the other case (P > N) will follow later. \n\nWith the expression (7) we can calculate the behavior of the order parameters for the whole training process. For R(t) we get \n\n$$R(t) = \frac{1}{N} \sum_{\mu,\nu=1}^{P} \left[ \left( E - (E - \eta C)^t \right) C^{-1} \right]^{\mu\nu} \Big\langle z_*^\nu \Big( \frac{1}{\sqrt{N}} \sum_{i=1}^{N} W_i^* x_i^\mu \Big) \Big\rangle_{\{W^*\}} .$$ (9) \n\nFor the average we have used expression (21) from the appendix. Similarly we get for the other order parameter, \n\n$$Q(t) = \frac{1}{N} \sum_{\mu,\nu,\sigma,\tau=1}^{P} \left[ \left( E - (E - \eta C)^t \right) C^{-1} \right]^{\nu\mu} \left[ \left( E - (E - \eta C)^t \right) C^{-1} \right]^{\tau\sigma} \Big\langle z_*^\nu z_*^\tau \Big( \frac{1}{N} \sum_{i=1}^{N} x_i^\mu x_i^\sigma \Big) \Big\rangle_{\{W^*\}} .$$ (10) \n\nAgain we have applied an identity (20) from the appendix and we did some matrix algebra. Note that up to this point the order parameters were calculated without any assumption about the statistics of the inputs. The results hold even without the thermodynamic limit. 
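The convergence of batch gradient descent to the pseudo-inverse weights for P < N can be checked in a small numerical experiment. The following is a minimal sketch (our illustration, not the authors' code; the sizes N and P, the learning rate, the random seed, and the tanh teacher gain are arbitrary choices):

```python
import numpy as np

# Sketch: batch gradient descent for the linear student z = (1/sqrt(N)) W.x
# on P < N random examples, starting from W(0) = 0.  The weights should
# approach the pseudo-inverse solution of eq. (8).
rng = np.random.default_rng(1)
N, P, eta = 100, 20, 0.1                     # alpha = P/N = 0.2
X = rng.standard_normal((P, N))              # rows are the example inputs x^mu
h_star = X @ rng.standard_normal(N) / np.sqrt(N)
z_star = np.tanh(5.0 * h_star)               # teacher outputs, g*(h) = tanh(5h)

W = np.zeros(N)
for t in range(2000):                        # gradient descent rule, eq. (4)
    W = W + eta / np.sqrt(N) * X.T @ (z_star - X @ W / np.sqrt(N))

C = X @ X.T / N                              # overlap matrix C^{mu nu}
W_inf = X.T @ np.linalg.solve(C, z_star) / np.sqrt(N)   # pseudo-inverse, eq. (8)
gap = float(np.max(np.abs(W - W_inf)))       # should be tiny after many steps
```

Because all eigenvalues of C lie well inside (0, 2/η) for this α, the gap shrinks geometrically with t, and the training error at the fixed point vanishes (the P < N examples are interpolated exactly).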
\n\nThe trace can be calculated by an integration over the eigenvalues; thus we attain integrals of the following form, \n\n$$\frac{1}{P} \sum_{\mu=1}^{P} \left[ (E - \eta C)^l\, C^m \right]^{\mu\mu} = \int_{\lambda_{\min}}^{\lambda_{\max}} d\lambda\, \rho(\lambda)\, (1 - \eta\lambda)^l\, \lambda^m =: I_m^l(t, \alpha, \eta),$$ (11) \n\nwith l = {0, t, 2t} and m = {−1, 0, 1}. \n\nThese integrals can be calculated once we know the density of the eigenvalues ρ(λ). The determination of this density can be found in the recent literature, calculated by Opper [1989] using replicas, by Krogh [1992] using perturbation theory, and by Sollich [1995] with matrix identities. We should note that the thermodynamic limit and the special assumptions about the inputs enter the calculation here. All authors found \n\n$$\rho(\lambda) = \frac{\sqrt{(\lambda_{\max} - \lambda)(\lambda - \lambda_{\min})}}{2\pi\alpha\lambda}$$ (12) \n\nfor α < 1. The maximal and the minimal eigenvalues are λ_{max,min} := (1 ± √α)². So all that remains now is a numerical integration. \n\nSimilarly we can calculate the behavior of the training error from \n\n$$E_T(t) = \Big\langle \frac{1}{2P} \sum_{\mu=1}^{P} \left( z_*^\mu - h^\mu \right)^2 \Big\rangle_{\{W^*\}} .$$ (13) \n\nFor the overdetermined case (P > N) we can find a recursion analogous to (6), \n\n$$W_i(t+1) = \sum_{j=1}^{N} \left[ \delta_{ij} - \eta \left( \frac{1}{N} \sum_{\mu=1}^{P} x_i^\mu x_j^\mu \right) \right] W_j(t) + \frac{\eta}{\sqrt{N}} \sum_{\mu=1}^{P} z_*^\mu x_i^\mu .$$ (14) \n\nThe term in the round brackets now defines the matrix B_{ij}. The calculation is therefore quite similar to the one above, with the matrix B playing the role of matrix C. The density of the eigenvalues ρ(λ) for matrix B is the one from above (12) multiplied by α. \n\nAltogether, we find the following results in the case of α < 1, \n\n$$E_G(t, \alpha, \eta) = \frac{G}{2} + \frac{G - H^2}{2}\, \alpha \left( \frac{1}{1 - \alpha} - 2 I_{-1}^{t} + I_{-1}^{2t} \right) - \frac{H^2}{2}\, \alpha \left( 1 - I_0^{2t} \right),$$ (15) \n\n$$E_T(t, \alpha, \eta) = \frac{G - H^2}{2}\, I_0^{2t} + \frac{H^2}{2}\, I_1^{2t},$$ (16) \n\nand in the case of α > 1, \n\n$$E_G(t, \alpha, \eta) = \frac{G - H^2}{2} \left( 1 + \frac{1}{\alpha - 1} - 2 I_{-1}^{t} + I_{-1}^{2t} \right) + \frac{H^2}{2}\, I_0^{2t} ,$$ \n\n$$E_T(t, \alpha, \eta) = \frac{G - H^2}{2} \left( 1 - \frac{1}{\alpha} + \frac{I_0^{2t}}{\alpha} \right) + \frac{H^2}{2}\, \frac{I_1^{2t}}{\alpha} . $$
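The integrals (11) and the resulting learning curves (15) and (16) for α < 1 are easy to evaluate numerically. Below is a sketch (our illustration, not the paper's code): the teacher constants G and H are estimated by Monte Carlo for g_*(h) = tanh(5h), the gain used in the figures, and the density (12) is integrated on a fine grid between the edges (1 ± √α)²:

```python
import numpy as np

# Sketch: numerical evaluation of I_m^l (eq. 11) over the density (12),
# then the learning curves E_G (eq. 15) and E_T (eq. 16) for alpha < 1.
rng = np.random.default_rng(0)
h = rng.standard_normal(500_000)
g = np.tanh(5.0 * h)                   # teacher output function, gain 5
G, H = float(np.mean(g * g)), float(np.mean(g * h))   # G = <g*^2>, H = <g* h>

def I(l, m, alpha, eta, n=100_000):
    lo, hi = (1 - np.sqrt(alpha)) ** 2, (1 + np.sqrt(alpha)) ** 2
    lam = np.linspace(lo, hi, n + 2)[1:-1]             # open interval
    rho = np.sqrt((hi - lam) * (lam - lo)) / (2 * np.pi * alpha * lam)
    f = rho * (1 - eta * lam) ** l * lam ** m
    return float(np.sum(f) * (lam[1] - lam[0]))        # Riemann sum

def E_G(t, alpha, eta):                # eq. (15), valid for alpha < 1
    return (G / 2
            + (G - H * H) / 2 * alpha * (1 / (1 - alpha)
                                         - 2 * I(t, -1, alpha, eta)
                                         + I(2 * t, -1, alpha, eta))
            - H * H / 2 * alpha * (1 - I(2 * t, 0, alpha, eta)))

def E_T(t, alpha, eta):                # eq. (16), valid for alpha < 1
    return (G - H * H) / 2 * I(2 * t, 0, alpha, eta) \
           + H * H / 2 * I(2 * t, 1, alpha, eta)
```

At t = 0 both curves reduce to G/2, as they must for zero initial weights, and the training error then decreases monotonically with t.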
\n\nIf t → ∞, all the time-dependent integrals I_m^t and I_m^{2t} vanish. The remaining first two terms describe, in the limit α → ∞, the optimal convergence rate of the errors. In the next section we discuss the implications of this result. \n\n3 RESULTS \n\nFirst we illustrate how well the theoretical results describe the training process. If we compare the theory with simulations, we find a very good correspondence, see Fig. 1. \n\nTrying other values for the learning rate, we can see that there is a maximal learning rate. It is twice the inverse of the maximal eigenvalue of the matrix B, i.e. \n\n$$\eta_{\max} = \frac{2}{\lambda_{\max}} = \frac{2}{(1 + \sqrt{\alpha})^2} .$$ (17) \n\nThis is consistent with a more general result, that the maximal learning rate is twice the inverse of the maximal eigenvalue of the Hessian. In the case of the linear perceptron the matrix B is identical to the Hessian. \n\nAs our approach is directly related to the actual number of training steps, we can examine how the training time varies in different training scenarios. Training can be stopped if the training error reaches a certain minimal value, i.e. if E_T(t) ≤ E_T^{min} + ε. Or, in cross-validated early stopping, we will terminate training if the generalization error starts to increase, i.e. if E_G(t + 1) > E_G(t). \n\n[Figure 1 plots E_G and E_T against the number of training steps (0 to 150); the simulation points lie on the theoretical curves.] \n\nFigure 1: Behavior of the generalization error E_G (upper line) and the training error E_T (lower line) during the training process. As the loading rate α = P/N = 1.5 is near the storage capacity (α = 1) of the net, overfitting occurs. The theory describes the results of the simulations very well. Parameters: learning rate η = 0.1, system size N = 200, and g_*(h) = tanh(γh) with gain γ = 5. 
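The prediction (17) can be probed directly. In the sketch below (our illustration, not the paper's simulation code; sizes, seed, and the linear teacher are arbitrary choices) the largest eigenvalue of B is compared with (1 + √α)², and training is run just below and just above the predicted maximal learning rate:

```python
import numpy as np

# Sketch: for alpha = P/N > 1 the Hessian of the quadratic training error
# is B_ij = (1/N) sum_mu x_i^mu x_j^mu.  Eq. (17) predicts its largest
# eigenvalue ~ (1 + sqrt(alpha))^2, hence eta_max = 2/(1 + sqrt(alpha))^2.
rng = np.random.default_rng(3)
N, P = 400, 600                                  # alpha = 1.5, as in Fig. 1
X = rng.standard_normal((P, N))
W_star = rng.standard_normal(N)                  # linear teacher for simplicity
z_star = X @ W_star / np.sqrt(N)

lam_max = float(np.linalg.eigvalsh(X.T @ X / N).max())
eta_max = 2.0 / (1.0 + np.sqrt(P / N)) ** 2      # eq. (17)

def train_error(eta, steps):
    W = np.zeros(N)
    for _ in range(steps):                       # recursion (14)
        W = W + eta / np.sqrt(N) * X.T @ (z_star - X @ W / np.sqrt(N))
    return 0.5 * float(np.mean((z_star - X @ W / np.sqrt(N)) ** 2))

stable = train_error(0.9 * eta_max, 100)         # converges
unstable = train_error(1.1 * eta_max, 100)       # blows up: rate too large
```

Just below η_max the training error decays toward its minimum; just above it the component along the top eigenvector of B grows geometrically and the error explodes.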
\n\nFig. 2 shows that in exhaustive training the training time diverges for α near 1, in the region where the overfitting also occurs. In the same region early stopping shows only a slight increase in training time. \n\nFurthermore, we can guess from Fig. 2 that asymptotically only a few training steps are necessary to fulfill the stopping criteria. This has to be specified more precisely. First we study the behavior of E_G after only one training step, i.e. t = 1. Since we are interested in the limit of many examples (α → ∞), we choose the learning rate as a fraction of the maximal learning rate (17), i.e. η = η_0/λ_max. Then we can calculate the behavior of E_G(t = 1, α, η_0/λ_max) analytically. We find that only in the case of η_0 = 1 can the generalization error reach its asymptotic minimum E_G^{inf}. The rate of the convergence is α^{-1}, like in the optimal case, but the prefactor is different. \n\nHowever, already for t = 2 we find \n\n$$\Delta E_G\big(t = 2, \alpha, \eta = \lambda_{\max}^{-1}\big) := E_G - E_G^{\mathrm{inf}} = \frac{G - H^2}{2}\, \frac{1}{\alpha - 1} + O\!\left(\frac{1}{\alpha^2}\right).$$ (18) \n\nIf α is large, so that we can neglect the α^{-2} term, then two batch training steps are already enough to get the optimal convergence rate. These results are illustrated in Fig. 3. \n\n4 SUMMARY \n\nIn this paper we have calculated the behavior of the generalization and the training error during the whole training process. The novel approach relates the errors directly to the actual number of training steps. It was shown how well this theory describes the training process. Several results have been presented, such as the maximal learning rate and the training time in different scenarios, like early stopping. If the learning rate is chosen appropriately, then only two batch training steps are necessary to reach the optimal convergence rate for sufficiently large α. 
\n\n[Figure 2 plots the number of training steps t (log scale) against P/N (log scale) for three stopping criteria; the simulation marks follow the theoretical curves.] \n\nFigure 2: Number of necessary training steps to fulfill certain stopping criteria. The upper lines show the result if training is stopped when the training error is lower than E_T^{min} + ε, with ε = 0.001 (dotted line) and ε = 0.01 (dashed line). The solid line is the early stopping result, where training is stopped when the generalization error starts to increase, E_G(t + 1) > E_G(t). Simulation results are indicated by marks. Parameters: learning rate η = 0.01, system size N = 200, and g_*(h) = tanh(γh) with gain γ = 5. \n\nFurther problems, like the dynamical description of weight decay and the relation of the dynamical approach to the thermodynamic description of the training process [see Bös 1995], cannot be discussed here due to lack of space. These problems are examined in an extended version of this work [Bös and Opper 1996]. It would be very interesting if this method could be extended towards other, more realistic models. \n\nA APPENDIX \n\nHere we add some identities which are necessary for the averages over the teacher weight distributions, eqs. (9) and (10). In the statistical mechanics approach one assumes that the distribution of the local fields h is Gaussian. This becomes true if one averages over random inputs x_i with zero mean and unit variance, which is the usual approach [see Bös 1995 and references therein]. In principle it is also possible to average over many tasks, i.e. many teacher realizations W^*, which is done here. 
The Gaussian local fields h_*^μ fulfill \n\n$$\langle h_*^\mu \rangle = 0, \qquad \langle h_*^\mu h_*^\nu \rangle = C^{\mu\nu} .$$ (19) \n\nThis implies \n\n$$\langle z_*^\mu z_*^\nu \rangle_{\{W^*\}} = \int_{-\infty}^{\infty} Dh_*^\mu \int_{-\infty}^{\infty} Dh_*^\nu\; g_*\Big( \sqrt{1 - (C^{\mu\nu})^2}\; h_*^\mu + C^{\mu\nu} h_*^\nu \Big)\, g_*(h_*^\nu) = \delta^{\mu\nu} G + \left( C^{\mu\nu} - \delta^{\mu\nu} \right) H^2 .$$ (20) \n\nIn the second identity we first calculated the diagonal term, and for the non-diagonal term we made an expansion assuming small correlations. Similarly the following identity can be proved, \n\n$$\langle z_*^\mu h_*^\nu \rangle_{\{W^*\}} = \delta^{\mu\nu} H + \left( C^{\mu\nu} - \delta^{\mu\nu} \right) H .$$ (21) \n\n[Figure 3 plots ΔE_G = E_G − E_G^{inf} (log scale) against P/N (log scale) for exhaustive training, the optimal rate, and t = 1, 2, 3.] \n\nFigure 3: Behavior of ΔE_G = E_G − E_G^{inf} after t training steps. Results for t = 1, 2 and 3 are given. For large enough α it is already possible after t = 2 training steps to reach the optimal convergence (solid line). If t = 3, the optimal result is reached even faster. Parameters: learning rate η = λ_max^{-1} and g_*(h) = tanh(γh) with gain γ = 5. \n\nAcknowledgment: We thank Shun-ichi Amari for many discussions and E. Helle and A. Stevenin-Barbier for proofreading and valuable comments. \n\nReferences \n\nBös S. (1995), 'Avoiding overfitting by finite temperature learning and cross-validation', in Int. Conference on Artificial Neural Networks 95 (ICANN'95), edited by EC2 & Cie, Vol. 2, p.111-116. \n\nBös S., and Opper M. (1996), 'An exact description of early stopping and weight decay', submitted. \n\nKinzel W., and Opper M. (1995), 'Dynamics of learning', in Models of Neural Networks I, edited by E. Domany, J. L. van Hemmen and K. Schulten, Springer, p.157-179. \n\nKrogh A. (1992), 'Learning with noise in a linear perceptron', J. Phys. A 25, p.1135-1147. \n\nOpper M. (1989), 'Learning in neural networks: Solvable dynamics', Europhys. Lett. 8, p.389-392. \n\nSaad D. 
(1996), 'General Gaussian priors for improved generalization', submitted to Neural Networks. \n\nSollich P. (1995), 'Learning in large linear perceptrons and why the thermodynamic limit is relevant to the real world', in NIPS 7, p.207-214. \n", "award": [], "sourceid": 1225, "authors": [{"given_name": "Siegfried", "family_name": "B\u00f6s", "institution": null}, {"given_name": "Manfred", "family_name": "Opper", "institution": null}]}