{"title": "Two Approaches to Optimal Annealing", "book": "Advances in Neural Information Processing Systems", "page_first": 301, "page_last": 307, "abstract": null, "full_text": "Two Approaches to Optimal Annealing \n\nTodd K. Leen \n\nDept. of Comp. Sci. & Engineering \nOregon Graduate Institute of Science and Technology \nP.O. Box 91000, Portland, Oregon 97291-1000 \ntleen@cse.ogi.edu \n\nBernhard Schottky and David Saad \n\nNeural Computing Research Group \nDept. of Comp. Sci. & Appl. Math. \nAston University \nBirmingham, B4 7ET, UK \n{schottba, saadd}@aston.ac.uk \n\nAbstract \n\nWe employ both master equation and order parameter approaches to analyze the asymptotic dynamics of on-line learning with different learning rate annealing schedules. We examine the relations between the results obtained by the two approaches and obtain new results on the optimal decay coefficients and their dependence on the number of hidden nodes in a two layer architecture. \n\n1 Introduction \n\nThe asymptotic dynamics of stochastic on-line learning and its dependence on the annealing schedule adopted for the learning coefficients have been studied for some time in the stochastic approximation literature [1, 2] and more recently in the neural network literature [3, 4, 5]. The latter studies are based on examining the Kramers-Moyal expansion of the master equation for the weight space probability densities. A different approach, based on the deterministic dynamics of macroscopic quantities called order parameters, has been recently presented [6, 7]. This approach enables one to monitor the evolution of the order parameters and the system performance at all times. \n\nIn this paper we examine the relation between the two approaches and contrast the results obtained for different learning rate annealing schedules in the asymptotic regime. 
We employ the order parameter approach to examine the dependence of the dynamics on the number of hidden nodes in a multilayer system. In addition, we report some lesser-known results on non-standard annealing schedules. \n\n2 Master Equation \n\nMost on-line learning algorithms assume the form $w_{t+1} = w_t + (\\eta_0/t^p)\\, H(w_t, x_t)$, where $w_t$ is the weight at time $t$, $x_t$ is the training example, and $H(w,x)$ is the weight update. The description of the algorithm's dynamics in terms of weight space probability densities starts from the master equation \n\n$$ P(w', t+1) = \\int dw \\, \\left\\langle \\delta\\left( w' - w - \\frac{\\eta_0}{t^p} H(w,x) \\right) \\right\\rangle_x P(w,t) \\tag{1} $$ \n\nwhere $\\langle \\cdots \\rangle_x$ indicates averaging with respect to the measure on $x$, $P(w,t)$ is the probability density on weights at time $t$, and $\\delta(\\cdots)$ is the Dirac delta function. One may use the Kramers-Moyal expansion of Eq. (1) to derive a partial differential equation for the weight probability density (here in one dimension for simplicity) [3, 4] \n\n$$ \\partial_t P(w,t) = \\sum_{i=1}^{\\infty} \\frac{(-1)^i}{i!} \\left( \\frac{\\eta_0}{t^p} \\right)^i \\partial_w^i \\left[ \\langle H^i(w,x) \\rangle_x P(w,t) \\right] . \\tag{2} $$ \n\nFollowing [3], we make a small noise expansion for (2) by decomposing the weight trajectory into deterministic and stochastic pieces \n\n$$ w \\equiv \\phi(t) + \\left( \\frac{\\eta_0}{t^p} \\right)^{\\gamma} \\xi \\quad \\text{or} \\quad \\xi = \\left( \\frac{\\eta_0}{t^p} \\right)^{-\\gamma} ( w - \\phi(t) ) \\tag{3} $$ \n\nwhere $\\phi(t)$ is the deterministic trajectory, and $\\xi$ are the fluctuations. Apart from the factor $(\\eta_0/t^p)^{\\gamma}$ that scales the fluctuations, this is identical to the formulation for constant learning in [3]. The proper value for the unspecified exponent $\\gamma$ will emerge from homogeneity requirements. Next, the dependence of the jump moments $\\langle H^i(w,x) \\rangle_x$ on $\\eta_0$ is explicated by a Taylor series expansion about the deterministic path $\\phi$. The coefficients in this series expansion are denoted \n\n$$ \\alpha_j^{(i)} \\equiv \\partial^i \\langle H^j(w,x) \\rangle_x / \\partial w^i \\, \\big|_{w=\\phi} $$ \n\nFinally one rewrites (2) in terms of $\\phi$ and $\\xi$ and the expansion of the jump moments, taking care to transform the differential operators in accordance with (3). 
These transformations leave equations of motion for $\\phi$ and the density $\\Pi(\\xi, t)$ on the fluctuations \n\n$$ \\frac{d\\phi}{dt} = \\frac{\\eta_0}{t^p} \\alpha_1^{(0)}(\\phi) = \\frac{\\eta_0}{t^p} \\langle H(\\phi, x) \\rangle_x \\tag{4} $$ \n\n$$ \\partial_t \\Pi = -\\frac{\\gamma p}{t} \\partial_\\xi ( \\xi \\Pi ) + \\sum_{m=2}^{\\infty} \\sum_{i=1}^{m} \\frac{(-1)^i}{i!\\,(m-i)!} \\alpha_i^{(m-i)} \\left( \\frac{\\eta_0}{t^p} \\right)^{i + (m-2i)\\gamma} \\partial_\\xi^i ( \\xi^{m-i} \\Pi ) . \\tag{5} $$ \n\nFor stochastic descent $H(w,x) = -\\nabla_w E(w,x)$ and (4) describes the evolution of $\\phi$ as descent on the average cost. The fluctuation equation (5) requires further manipulation whose form depends on the context. For the usual case of descent in a quadratic minimum ($\\alpha_1^{(1)} = -G$, minus the cost function curvature), we take $\\gamma = 1/2$ to ensure that for any $m$, terms in the sum are homogeneous in $\\eta_0/t^p$. \n\nFor constant learning rate ($p = 0$), rescaling time as $t \\to \\eta_0 t$ allows (5) to be written in a form convenient for perturbative analysis in $\\eta_0$. Typically, the limit $\\eta_0 \\to 0$ is invoked and only the lowest order terms in $\\eta_0$ retained (e.g. [3]). These comprise a diffusion operator, which results in a Gaussian approximation for equilibrium densities. Higher order terms have been successfully used to calculate corrections to the equilibrium moments in powers of $\\eta_0$ [8]. \n\nOf primary interest here is the case of annealed learning, as required for convergence of the parameter estimates. Again assuming a quadratic bowl and $\\gamma = 1/2$, the first few terms of (5) are \n\n$$ \\partial_t \\Pi = -\\frac{p}{2t} \\partial_\\xi ( \\xi \\Pi ) - \\alpha_1^{(1)} \\frac{\\eta_0}{t^p} \\partial_\\xi ( \\xi \\Pi ) + \\frac{1}{2} \\alpha_2^{(0)} \\frac{\\eta_0}{t^p} \\partial_\\xi^2 \\Pi + O\\left( \\left( \\frac{\\eta_0}{t^p} \\right)^{3/2} \\right) . \\tag{6} $$ \n\nAs $t \\to \\infty$ the right hand side of (6) is dominated by the first three terms (since $0 < p \\le 1$). Precisely which terms dominate depends on $p$. \n\nWe will first review the classical case $p = 1$. Asymptotically $\\phi \\to w^*$, a local optimum. The first three leading terms on the right hand side of (6) are all of order $1/t$. For $t \\to \\infty$, we discard the remaining terms. From the resulting equation we recover a Gaussian equilibrium distribution for $\\xi$, or equivalently for $\\sqrt{t}\\,(w - w^*) \\equiv \\sqrt{t}\\, v$ where $v$ is called the weight error. 
The asymptotically normal distribution for $\\sqrt{t}\\, v$ has variance $\\sigma^2_{\\sqrt{t} v}$ from which the asymptotic expected squared weight error can be derived \n\n$$ \\lim_{t \\to \\infty} E[|v|^2] = \\sigma^2_{\\sqrt{t} v} \\frac{1}{t} = \\frac{\\eta_0^2 \\alpha_2^{(0)}}{2 \\eta_0 G^* - 1} \\frac{1}{t} \\tag{7} $$ \n\nwhere $G^* \\equiv G(w^*)$ is the curvature at the local optimum. \n\nPositive $\\sigma^2_{\\sqrt{t} v}$ requires $\\eta_0 > 1/(2G^*)$. If this condition is not met the expected squared weight offset converges as $(1/t)^{2 \\eta_0 G^*}$, slower than $1/t$ (see, for example, [5] and references therein). The above confirms the classical results [1] on asymptotic normality and convergence rate for $1/t$ annealing. \n\nFor the case $0 < p < 1$, the second and third terms on the right hand side of (6) will dominate as $t \\to \\infty$. Again, we have a Gaussian equilibrium density for $\\xi$. Consequently $\\sqrt{t^p}\\, v$ is asymptotically normal with variance $\\sigma^2_{\\sqrt{t^p} v}$ leading to the expected squared weight error \n\n$$ E[|v|^2] = \\sigma^2_{\\sqrt{t^p} v} \\frac{1}{t^p} = \\frac{\\eta_0 \\alpha_2^{(0)}}{2G} \\frac{1}{t^p} . \\tag{8} $$ \n\nNotice that the convergence is slower than $1/t$ and that there is no critical value of the learning rate to obtain a sensible equilibrium distribution. (See [9] for earlier results on $1/t^p$ annealing.) \n\nThe generalization error follows the same decay rate as the expected weight offset. In one dimension, the expected squared weight offset is directly related to excess generalization error (the generalization error minus the least generalization error achievable) $\\epsilon_g = G\\, E[v^2]$. In multiple dimensions, the expected squared weight offset, together with the maximum and minimum eigenvalues of $G^*$, provides upper and lower bounds on the excess generalization error proportional to $E[|v|^2]$, with the criticality condition on $G^*$ (for $p = 1$) replaced with an analogous condition on its eigenvalues. 
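As a numerical illustration of the annealing results (7) and (8), the second moment of the weight error can be iterated exactly for a one-dimensional quadratic cost. The sketch below is not part of the original analysis. It assumes a hypothetical linear update H(w, x) = -G w + noise with the optimum at w* = 0, so that the noise variance noise_var plays the role of the second jump moment at the optimum. The function names and parameter values are illustrative only.

```python
import math

# Assumed model: v_{t+1} = v_t + eta_t * (-G * v_t + n_t), with eta_t = eta0 / t**p
# and n_t zero-mean noise of variance noise_var, independent of v_t.
# Squaring and averaging gives the exact recursion for m_t = E[v_t^2]:
#   m_{t+1} = (1 - eta_t * G)**2 * m_t + eta_t**2 * noise_var


def second_moment(eta0, p=1.0, G=1.0, noise_var=1.0, m0=1.0, steps=100_000):
    # Returns E[v^2] at time steps + 1, starting from E[v^2] = m0 at time 1.
    m = m0
    for t in range(1, steps + 1):
        eta = eta0 / t**p
        m = (1.0 - eta * G) ** 2 * m + eta**2 * noise_var
    return m


def decay_exponent(eta0, p=1.0, G=1.0, t_lo=10_000, t_hi=100_000):
    # Log-log slope of E[v^2] between two late times (a crude rate estimate).
    m_lo = second_moment(eta0, p=p, G=G, steps=t_lo)
    m_hi = second_moment(eta0, p=p, G=G, steps=t_hi)
    return math.log(m_hi / m_lo) / math.log((t_hi + 1) / (t_lo + 1))


T = 100_000

# p = 1 with eta0 > 1/(2G): t * E[v^2] next to eta0^2 * noise_var / (2 * eta0 * G - 1)
eta0 = 2.0
print((T + 1) * second_moment(eta0, steps=T), eta0**2 / (2.0 * eta0 - 1.0))

# p = 1 with eta0 < 1/(2G): estimated decay exponent next to -2 * eta0 * G
eta0 = 0.25
print(decay_exponent(eta0), -2.0 * eta0)

# 0 < p < 1: t^p * E[v^2] next to eta0 * noise_var / (2 * G)
eta0, p = 1.0, 0.5
print((T + 1) ** p * second_moment(eta0, p=p, steps=T), eta0 / 2.0)
```

Iterating the moment recursion instead of sampling trajectories keeps the comparison deterministic. In each printed pair the first number is the iterated value and the second is the asymptotic prediction, which should agree up to corrections that vanish as t grows.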
\n\n3 Order parameters \n\nIn the master equation approach, one focuses attention on the weight space distribution $P(w,t)$ and calculates quantities of interest by averaging over this density. An alternative approach is to choose a smaller set of macroscopic variables that are sufficient for describing principal properties of the system such as the generalization error (in contrast to the evolution of the weights $w$, which are microscopic). \n\nFormally, one can replace the parameter dynamics presented in Eq. (1) by the corresponding equation for macroscopic observables, which can be easily derived from the corresponding expressions for $w$. By choosing an appropriate set of macroscopic variables and invoking the thermodynamic limit (i.e., looking at systems where the number of parameters is infinite), one obtains point distributions for the order parameters, rendering the dynamics deterministic. \n\nSeveral researchers [6, 7] have employed this approach for calculating the training dynamics of a soft committee machine (SCM). The SCM maps inputs $x \\in \\mathbb{R}^N$ to a scalar, through a model $\\rho(w,x) = \\sum_{i=1}^{K} g(w_i \\cdot x)$. The activation function of the hidden units is $g(u) \\equiv \\mathrm{erf}(u/\\sqrt{2})$ and $w_i$ is the set of input-to-hidden adaptive weights for the $i = 1 \\ldots K$ hidden nodes. The hidden-to-output weights are set to 1. This architecture preserves most of the properties of the learning dynamics and the evolution of the generalization error of a general two-layer network, and the formalism can be easily extended to accommodate adaptive hidden-to-output weights [10]. \n\nInput vectors $x$ are independently drawn with zero mean and unit variance, and the corresponding targets $y$ are generated by a deterministic teacher network corrupted by additive Gaussian output noise of zero mean and variance $\\sigma_\\nu^2$. The teacher network is also an SCM, with input-to-hidden weights $w_n^*$. 
The order parameters sufficient to close the dynamics, and to describe the network generalization error, are overlaps between various input-to-hidden vectors: $w_i \\cdot w_k \\equiv Q_{ik}$, $w_i \\cdot w_n^* \\equiv R_{in}$, and $w_n^* \\cdot w_m^* \\equiv T_{nm}$. \n\nNetwork performance is measured in terms of the generalization error $\\epsilon_g(w) \\equiv \\langle \\frac{1}{2} [ \\rho(w,x) - y ]^2 \\rangle_x$. The generalization error can be expressed in closed form in terms of the order parameters in the thermodynamic limit ($N \\to \\infty$). The dynamics of the latter are also obtained in closed form [7]. These dynamics are coupled nonlinear ordinary differential equations whose solution can only be obtained through numerical integration. However, the asymptotic behavior in the case of annealed learning is amenable to analysis, and this is one of the primary results of the paper. \n\nWe assume an isotropic teacher $T_{nm} = \\delta_{nm}$ and use this symmetry to reduce the system to a vector of four order parameters $u^T = (r, q, s, c)$ related to the overlaps by $R_{in} = \\delta_{in}(1 + r) + (1 - \\delta_{in}) s$ and $Q_{ik} = \\delta_{ik}(1 + q) + (1 - \\delta_{ik}) c$. \n\nWith learning rate annealing and $\\lim_{t \\to \\infty} u = 0$ we describe the dynamics in this vicinity by a linearization of the equations of motion in [7]. The linearization is \n\n$$ \\frac{d}{dt} u = \\eta M u + \\eta^2 \\sigma_\\nu^2 b , \\tag{9} $$ \n\nwhere $\\sigma_\\nu^2$ is the noise variance, $b^T = \\frac{2}{\\pi} (0, 1/\\sqrt{3}, 0, 1/2)$, $\\eta = \\eta_0/t^p$, and $M$ is \n\n2 \n\nM = 3V31T \n\n3 \n\n-4 \n\n4 \n\n3 \n--V3 \n2 \n3V3 \n\n4 3 \n3 2\n2 \n\n-~(K -lhI(3) \n-(K - 1)V3 \n\n2 \n--(K-2)+-\nJ3 \n3V3(K - 2) + ~ \n\n2 \n\n3 \n-(K -1)V3 \n~ --(K - 1)V3 \no \n-3V3(K - 2) + ~ V3 \n\n(10) \n\nThe asymptotic equations of motion (9) were derived by dropping terms of order $O(\\eta \\|u\\|^2)$ and higher, and terms of order $O(\\eta^2 u)$. While the latter are linear in the order parameters, they are dominated by the $\\eta u$ and $\\eta^2 \\sigma_\\nu^2 b$ terms in (9) as $t \\to \\infty$. 
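To make the order-parameter description concrete, the overlaps and the generalization error of a small soft committee machine can be evaluated numerically. The sketch below is illustrative only. The helper names are hypothetical, inputs are taken to be Gaussian, and the closed-form arcsin expression for the noise-free generalization error is quoted from the order-parameter literature on soft committee machines; it is assumed here, not derived in this text.

```python
import math
import random


def g(u):
    # Hidden-unit activation g(u) = erf(u / sqrt(2))
    return math.erf(u / math.sqrt(2.0))


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def scm_output(W, x):
    # Soft committee machine: sum of hidden activations, output weights fixed to 1
    return sum(g(dot(w, x)) for w in W)


def overlaps(A, B):
    # Matrix of dot products between the rows of A and the rows of B
    return [[dot(a, b) for b in B] for a in A]


def _arcsin_sum(C, A, B):
    # Sum over all entries of arcsin(C_ij / sqrt((1 + A_ii) * (1 + B_jj)))
    return sum(
        math.asin(C[i][j] / math.sqrt((1.0 + A[i][i]) * (1.0 + B[j][j])))
        for i in range(len(C))
        for j in range(len(C[0]))
    )


def gen_error_closed_form(W, W_star):
    # Assumed closed form for Gaussian inputs and noise-free targets
    Q = overlaps(W, W)
    R = overlaps(W, W_star)
    T = overlaps(W_star, W_star)
    total = _arcsin_sum(Q, Q, Q) + _arcsin_sum(T, T, T) - 2.0 * _arcsin_sum(R, Q, T)
    return total / math.pi


def gen_error_monte_carlo(W, W_star, samples=20000, seed=0):
    # Sample average of (1/2) * (student output - teacher output)^2
    rng = random.Random(seed)
    n = len(W[0])
    total = 0.0
    for _ in range(samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += 0.5 * (scm_output(W, x) - scm_output(W_star, x)) ** 2
    return total / samples


# Isotropic teacher (T_nm = delta_nm) with K = 2 hidden units and N = 4 inputs
W_star = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
# Student close to the teacher; the perturbation values are arbitrary
W = [[1.05, 0.10, 0.0, 0.05], [-0.05, 0.95, 0.10, 0.0]]

print(overlaps(W, W_star))  # student-teacher overlaps R_in
print(gen_error_closed_form(W, W_star), gen_error_monte_carlo(W, W_star))
```

The closed-form value depends on the weights only through the overlaps Q, R and T, which is the property that lets the order-parameter approach close the dynamics. The Monte Carlo estimate is included only as a consistency check on the assumed formula.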
\n\nThis choice of truncations sheds light on the approach to equilibrium that is not implicit in the master equation approach. In the latter, the dominant terms for the asymptotics of (6) were identified by the time scale of the coefficients; there was no identification of system observables that signal when the asymptotic regime is entered. For the order parameter approach, the conditions for validity of the asymptotic approximations are cast in terms of system observables: $\\eta u$ vs $\\eta^2 u$ vs $\\eta^2 \\sigma_\\nu^2$. \n\nThe solution to (9) is \n\n$$ u(t) = \\gamma(t, t_0)\\, u_0 + \\sigma_\\nu^2 \\beta(t, t_0)\\, b \\tag{11} $$ \n\nwhere $u_0 \\equiv u(t_0)$ and \n\n$$ \\gamma(t, t_0) = \\exp\\left\\{ M \\int_{t_0}^{t} d\\tau \\, \\eta(\\tau) \\right\\} \\quad \\text{and} \\quad \\beta(t, t_0) = \\int_{t_0}^{t} d\\tau \\, \\gamma(t, \\tau)\\, \\eta^2(\\tau) . \\tag{12} $$ \n\nThe asymptotic order parameter dynamics allow us to compute the generalization error (to first order in $u$) \n\n$$ \\epsilon_f = \\frac{K}{\\pi} \\left( \\frac{1}{\\sqrt{3}} (q - 2r) + \\frac{K-1}{2} (c - 2s) \\right) . \\tag{13} $$ \n\nUsing the solution of Eq. (11), the generalization error consists of two pieces: a contribution depending on the actual initial conditions $u_0$ and a contribution due to the second term on the r.h.s. of Eq. (11), independent of $u_0$. The former decays more rapidly than the latter, and we ignore it in what follows. Asymptotically, the generalization error is of the form $\\epsilon_f = \\sigma_\\nu^2 ( c_1 \\theta_1(t) + c_2 \\theta_2(t) )$, where $c_j$ are $K$-dependent coefficients, and $\\theta_j$ are eigenmodes that evolve as \n\n$$ \\theta_j = -\\frac{\\eta_0^2}{1 + \\alpha_j \\eta_0} \\left[ \\frac{1}{t} - t^{\\alpha_j \\eta_0} \\, t_0^{-(\\alpha_j \\eta_0 + 1)} \\right] \\tag{14} $$ \n\nwith eigenvalues (Fig. 1(a)) \n\n$$ \\alpha_1 = -\\frac{1}{\\pi} \\left( \\frac{4}{\\sqrt{3}} - 2 \\right) \\quad \\text{and} \\quad \\alpha_2 = -\\frac{1}{\\pi} \\left( \\frac{4}{\\sqrt{3}} + 2(K-1) \\right) . \\tag{15} $$ \n\nThe critical learning rate $\\eta_0^{\\mathrm{crit}}$, above which the generalization error decays as $1/t$, is, for $K \\ge 2$, \n\n$$ \\eta_0^{\\mathrm{crit}} = \\max\\left( -\\frac{1}{\\alpha_1}, -\\frac{1}{\\alpha_2} \\right) = \\frac{\\pi}{4/\\sqrt{3} - 2} . \\tag{16} $$ \n\nFor $\\eta_0 > \\eta_0^{\\mathrm{crit}}$ both modes $\\theta_j$, $j = 1, 2$, decay as $1/t$, and so \n\n$$ \\epsilon_f = -\\sigma_\\nu^2 \\eta_0^2 \\left( \\frac{c_1}{1 + \\alpha_1 \\eta_0} + \\frac{c_2}{1 + \\alpha_2 \\eta_0} \\right) \\frac{1}{t} \\equiv \\sigma_\\nu^2 f(\\eta_0, K) \\frac{1}{t} . \\tag{17} $$ \n\nMinimizing the prefactor $f(\\eta_0, K)$ in (17) minimizes the asymptotic error. The values $\\eta_0^{\\mathrm{opt}}(K)$ are shown in Fig. 1(b), where the special case of $K = 1$ (see below) is also included: there is a significant difference between the values for $K = 1$ and $K = 2$ and a rather weak dependence on $K$ for $K \\ge 2$. The sensitivity of the generalization error decay factor to the choice of $\\eta_0$ is shown in Fig. 1(c). \n\nThe influence of the noise strength on the generalization error can be seen directly from (17): the noise variance $\\sigma_\\nu^2$ is just a prefactor scaling the $1/t$ decay. Neither the critical nor the optimal value of $\\eta_0$ is influenced by it. \n\nThe calculation above holds for the case $K = 1$ (where $c$ and $s$ and the mode $\\theta_1$ are absent). In this case \n\n$$ \\eta_0^{\\mathrm{opt}}(K=1) = 2\\, \\eta_0^{\\mathrm{crit}}(K=1) = -\\frac{2}{\\alpha_2} = \\frac{\\sqrt{3}\\,\\pi}{2} $$ \n\n