{"title": "Weight Space Probability Densities in Stochastic Learning: I. Dynamics and Equilibria", "book": "Advances in Neural Information Processing Systems", "page_first": 451, "page_last": 458, "abstract": null, "full_text": "Weight Space Probability Densities in Stochastic Learning: I. Dynamics and Equilibria\n\nTodd K. Leen and John E. Moody\n\nDepartment of Computer Science and Engineering\nOregon Graduate Institute of Science & Technology\n19600 N.W. von Neumann Dr.\nBeaverton, OR 97006-1999\n\nAbstract\n\nThe ensemble dynamics of stochastic learning algorithms can be studied using theoretical techniques from statistical physics. We develop the equations of motion for the weight space probability densities for stochastic learning algorithms. We discuss equilibria in the diffusion approximation and provide expressions for special cases of the LMS algorithm. The equilibrium densities are not in general thermal (Gibbs) distributions in the objective function being minimized, but rather depend upon an effective potential that includes diffusion effects. Finally we present an exact analytical expression for the time evolution of the density for a learning algorithm with weight updates proportional to the sign of the gradient.\n\n1 Introduction: Theoretical Framework\n\nStochastic learning algorithms involve weight updates of the form\n\n$$w(n+1) = w(n) + \\mu(n)\\, H[w(n), x(n)] \\tag{1}$$\n\nwhere $w \\in \\mathbb{R}^m$ is the vector of $m$ weights, $\\mu$ is the learning rate, $H[\\cdot] \\in \\mathbb{R}^m$ is the update function, and $x(n)$ is the exemplar (input or input/target pair) presented to the network at the $n$th iteration of the learning rule. Often the update function is based on the gradient of a cost function, $H(w,x) = -\\partial \\mathcal{E}(w,x)/\\partial w$. We assume that the exemplars are i.i.d. 
with underlying probability density $p(x)$.\n\nWe are interested in studying the time evolution and steady state behavior of the weight space probability density $P(w, n)$ for ensembles of networks trained by stochastic learning. Stochastic process theory and classical statistical mechanics provide tools for doing this. As we shall see, the ensemble behavior of stochastic learning algorithms is similar to that of diffusion processes in physical systems, although significant differences do exist.\n\n1.1 Dynamics of the Weight Space Probability Density\n\nEquation (1) defines a Markov process on the weight space. Given the particular input $x$, the single time-step transition probability density for this process is a Dirac delta function whose arguments satisfy the weight update (1):\n\n$$W(w' \\to w \\mid x) = \\delta\\big( w - w' - \\mu\\, H[w', x] \\big). \\tag{2}$$\n\nFrom this conditional transition probability, we calculate the total single time-step transition probability (Leen and Orr 1992, Ritter and Schulten 1988)\n\n$$W(w' \\to w) = \\big\\langle\\, \\delta\\big( w - w' - \\mu\\, H[w', x] \\big) \\big\\rangle_x \\tag{3}$$\n\nwhere $\\langle \\ldots \\rangle_x$ denotes integration over the measure on the random variable $x$.\n\nThe time evolution of the density is given by the Kolmogorov equation\n\n$$P(w, n+1) = \\int dw'\\, P(w', n)\\, W(w' \\to w), \\tag{4}$$\n\nwhich forms the basis for our dynamical description of the weight space probability density (see footnote 1).\n\nStationary, or equilibrium, probability distributions are eigenfunctions of the transition probability\n\n$$P_s(w) = \\int dw'\\, P_s(w')\\, W(w' \\to w). \\tag{5}$$\n\nIt is particularly interesting to note that for problems in which there exists an optimal weight $w_*$ such that\n\n$$H(w_*, x) = 0 \\quad \\forall x,$$\n\none stationary solution is a delta function at $w = w_*$. 
An important class of such examples is noise-free mapping problems, for which weight values exist that realize the desired mapping over all possible input/target pairs. For such problems, the ensemble can settle into a sharp distribution at the optimal weights (for examples see Leen and Orr 1992, Orr and Leen 1993).\n\nFootnote 1: An alternative is to base the time evolution on a suitable master equation. Both approaches give the same results.\n\nAlthough the Kolmogorov equation can be integrated numerically, we would like to make further analytic progress. Towards this end we convert the Kolmogorov equation into a differential-difference equation by expanding (3) as a power series in $\\mu$. Since the transition probability is defined in the sense of generalized functions (i.e. distributions), the proper way to proceed is to smear (4) with a smooth test function of compact support $f(w)$ to obtain\n\n$$\\int dw\\, f(w)\\, P(w, n+1) = \\int dw\\, dw'\\, f(w)\\, P(w', n)\\, W(w' \\to w). \\tag{6}$$\n\nNext we use the transition probability (3) to perform the integration over $w$ and expand the resulting expression as a power series in $\\mu$. Finally, we integrate by parts to take derivatives off $f$, dropping the surface terms. This results in a discrete time version of the classic Kramers-Moyal expansion (Risken 1989)\n\n$$P(w, n+1) - P(w, n) = \\sum_{i=1}^{\\infty} \\frac{(-\\mu)^i}{i!} \\sum_{j_1, \\ldots, j_i = 1}^{m} \\frac{\\partial^i}{\\partial w_{j_1} \\cdots \\partial w_{j_i}} \\Big[ \\big\\langle H_{j_1} \\cdots H_{j_i} \\big\\rangle_x\\, P(w, n) \\Big] \\tag{7}$$\n\nwhere $H_{j_a}$ denotes the $j_a$th component of the $m$-component vector $H$.\n\nIn section 3, we present an algorithm for which the Kramers-Moyal expansion can be explicitly summed. In general the full expansion is not analytically tractable, and to make further analytic progress we will truncate it at second order to obtain the Fokker-Planck equation. 
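\n\nAs a sketch of the intermediate step behind the Kramers-Moyal expansion, assuming only the definitions of the transition probability and of the smeared Kolmogorov equation (6) given above, carrying out the $w$ integration against the delta function gives\n\n$$\\int dw\\, f(w)\\, P(w, n+1) = \\int dw'\\, P(w', n)\\, \\big\\langle f\\big( w' + \\mu H[w', x] \\big) \\big\\rangle_x .$$\n\nTaylor expanding $f$ about $w'$ (repeated indices summed) yields\n\n$$\\big\\langle f( w' + \\mu H ) \\big\\rangle_x = f(w') + \\mu\\, \\langle H_j \\rangle_x\\, \\frac{\\partial f}{\\partial w'_j} + \\frac{\\mu^2}{2}\\, \\langle H_j H_k \\rangle_x\\, \\frac{\\partial^2 f}{\\partial w'_j\\, \\partial w'_k} + \\cdots ,$$\n\nand each integration by parts that moves a derivative from $f$ onto the product of the moment and $P$ contributes one factor of $-1$. This accounts for the alternating sign $(-\\mu)^i / i!$ of the $i$th-order term, so the first two orders are $-\\mu\\, \\partial_j [\\langle H_j \\rangle_x P]$ and $(\\mu^2/2)\\, \\partial_j \\partial_k [\\langle H_j H_k \\rangle_x P]$. 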
\n\n1.2 The Fokker-Planck (Diffusion) Approximation\n\nFor small enough $|\\mu H|$, the Kramers-Moyal expansion (7) can be truncated to second order to obtain a Fokker-Planck equation (see footnote 2):\n\n$$P(w, n+1) - P(w, n) = -\\mu\\, \\frac{\\partial}{\\partial w_i} \\big[ A_i(w)\\, P(w, n) \\big] + \\frac{\\mu^2}{2}\\, \\frac{\\partial^2}{\\partial w_i\\, \\partial w_j} \\big[ B_{ij}(w)\\, P(w, n) \\big]. \\tag{8}$$\n\nIn (8), and throughout the remainder of the paper, repeated indices are summed over. In the Fokker-Planck approximation, only two coefficients appear: $A_i(w) = \\langle H_i \\rangle_x$, called the drift vector, and $B_{ij}(w) = \\langle H_i H_j \\rangle_x$, called the diffusion matrix. The drift vector is simply the average update applied at $w$. Since the diffusion coefficients can be strongly dependent on the position in weight space, the equilibrium densities will, in general, not be thermal (Gibbs) distributions in the potential corresponding to $\\langle H(w, x) \\rangle_x$. This is exemplified in our discussion of equilibrium densities for the LMS algorithm in section 2.1 below (see footnote 3).\n\nFootnote 2: Radons et al. (1990) independently derived a Fokker-Planck equation for backpropagation. Earlier, Ritter and Schulten (1988) derived a Fokker-Planck equation (for Kohonen's self-ordering feature map) that is valid in the neighborhood of a local optimum.\n\nFootnote 3: See Leen and Orr (1992) and Orr and Leen (1993) for further examples.\n\n2 Equilibrium Densities in the Fokker-Planck Approximation\n\nIn equilibrium the probability density is stationary, $P(w, n+1) = P(w, n) = P_s(w)$, so the Fokker-Planck equation (8) becomes\n\n$$0 = -\\frac{\\partial}{\\partial w_i} J_i(w) \\equiv -\\frac{\\partial}{\\partial w_i} \\left( \\mu\\, A_i(w)\\, P_s(w) - \\frac{\\mu^2}{2}\\, \\frac{\\partial}{\\partial w_j} \\big[ B_{ij}(w)\\, P_s(w) \\big] \\right). \\tag{9}$$\n\nHere, we have implicitly defined the probability density current $J(w)$. In equilibrium, its divergence is zero.\n\nIf the drift and diffusion coefficients satisfy potential conditions, then the equilibrium current itself is zero and detailed balance is obtained. 
The potential conditions are (Gardiner, 1990)\n\n$$\\frac{\\partial Z_k}{\\partial w_l} - \\frac{\\partial Z_l}{\\partial w_k} = 0, \\quad \\text{where} \\quad Z_k(w) = B^{-1}_{ki}(w) \\left[ \\frac{\\mu}{2}\\, \\frac{\\partial}{\\partial w_j} B_{ij}(w) - A_i(w) \\right]. \\tag{10}$$\n\nUnder these conditions the solution to (9) for the equilibrium density is\n\n$$P_s(w) = \\frac{1}{K}\\, e^{-2 F(w)/\\mu}, \\qquad F(w) = \\int^{w} dw_k\\, Z_k(w), \\tag{11}$$\n\nwhere $K$ is a normalization constant and $F(w)$ is called the effective potential.\n\nIn general, the potential conditions are not satisfied for stochastic learning algorithms in multiple dimensions (see footnote 4). In this respect, stochastic learning differs from most physical diffusion processes. However, for LMS with inputs whose correlation matrix is isotropic, the conditions are satisfied and the equilibrium density can be reduced to the quadrature in (11).\n\nFootnote 4: For one-dimensional algorithms, the potential conditions are trivially satisfied.\n\n2.1 Equilibrium Density for the LMS Algorithm\n\nThe best known on-line learning system is the LMS adaptive filter. For the LMS algorithm, the training examples consist of input/target pairs $x(n) = \\{s(n), t(n)\\}$, the model output is $u(n) = w \\cdot s(n)$, and the cost function is the squared error:\n\n$$\\mathcal{E}(w, x(n)) = \\frac{1}{2} [t(n) - u(n)]^2 = \\frac{1}{2} [t(n) - w \\cdot s(n)]^2. \\tag{12}$$\n\nThe resulting update equations (for constant learning rate $\\mu$) are\n\n$$w(n+1) = w(n) + \\mu\\, [t(n) - w \\cdot s(n)]\\, s(n). \\tag{13}$$\n\nWe assume that the training data are generated according to a \"signal plus noise\" model:\n\n$$t(n) = w_* \\cdot s(n) + \\epsilon(n), \\tag{14}$$\n\nwhere $w_*$ is the \"true\" weight vector and $\\epsilon(n)$ is i.i.d. noise with mean zero and variance $\\sigma^2$. We denote the correlation matrix of the inputs $s(n)$ by $R$ and the fourth order correlation tensor of the inputs by $S$. 
It is convenient to shift the origin of coordinates in weight space and define the weight error vector\n\n$$v = w - w_*.$$\n\nIn terms of $v$, the weight update is\n\n$$v(n+1) = v(n) - \\mu\\, [s(n) \\cdot v(n)]\\, s(n) + \\mu\\, \\epsilon(n)\\, s(n).$$\n\nThe drift vector and diffusion matrix are given by\n\n$$A_i = -\\langle s_i s_j \\rangle_s\\, v_j = -R_{ij} v_j \\tag{15}$$\n\nand\n\n$$B_{ij} = \\big\\langle s_i s_j s_k s_l\\, v_k v_l + \\epsilon^2 s_i s_j \\big\\rangle_{s, \\epsilon} = S_{ijkl}\\, v_k v_l + \\sigma^2 R_{ij} \\tag{16}$$\n\nrespectively. Notice that the diffusion matrix is quadratic in $v$. Thus as we move away from the global minimum at $v = 0$, diffusive spreading of the probability density is enhanced. Notice also that, in general, both terms of the diffusion matrix contribute an anisotropy.\n\nWe further assume that the inputs are drawn from a zero-mean Gaussian process. This assumption allows us to appeal to the Gaussian moment factoring theorem (Haykin, 1991, p318) to express the fourth-order correlation $S$ in terms of $R$:\n\n$$S_{ijkl} = R_{ij} R_{kl} + R_{ik} R_{jl} + R_{il} R_{jk}.$$\n\nThe diffusion matrix reduces to\n\n$$B_{ij} = \\big( \\sigma^2 + v_k R_{kl} v_l \\big) R_{ij} + 2\\, R_{ik} v_k\\, R_{jl} v_l.$$\n\nTo compute the effective potential (10 and 11) the diffusion matrix is inverted using the Sherman-Morrison formula (Press, 1987, p67). As a final simplification, we assume that the input distribution is spherically symmetric. Thus\n\n$$R = r I, \\tag{17}$$\n\nwhere $I$ denotes the identity matrix.\n\nTogether these assumptions ensure detailed balance, and we can integrate (11) in closed form. In figure 1, we compare the effective potential $F(v)$ (for 1-D LMS) with the potential corresponding to the quadratic cost function.\n\nFig. 1: Effective potential (dashed curve) and cost function (solid curve) for 1-D LMS (horizontal axis: $v$).\n\nThe spatial dependence of the diffusion coefficient forces the effective potential to soften relative to the cost function for large $|v|$. This accentuates the tails of the distribution relative to a Gaussian. 
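\n\nAs a worked one-dimensional check (a sketch that assumes only (10), (11), (15) and (16), with $r = \\langle s^2 \\rangle$ and $S = \\langle s^4 \\rangle$ denoting the scalar second and fourth input moments), the drift and diffusion coefficients are $A(v) = -r v$ and $B(v) = S v^2 + \\sigma^2 r$, so that\n\n$$Z(v) = \\frac{1}{B(v)} \\left[ \\frac{\\mu}{2}\\, \\frac{dB}{dv} - A(v) \\right] = \\frac{(r + \\mu S)\\, v}{S v^2 + \\sigma^2 r}, \\qquad F(v) = \\int_0^v Z(v')\\, dv' = \\frac{r + \\mu S}{2 S}\\, \\ln\\!\\left( 1 + \\frac{S v^2}{\\sigma^2 r} \\right).$$\n\nFor small $|v|$ this gives $F(v) \\approx (r + \\mu S)\\, v^2 / (2 \\sigma^2 r)$, which is quadratic like the cost function, while for large $|v|$ it grows only logarithmically. Inserting $F$ into (11) gives a density proportional to $\\left[ 1 + S v^2 / (\\sigma^2 r) \\right]^{-(1 + r/(\\mu S))}$, whose tails decay as a power law rather than as a Gaussian. 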
\n\nThe equilibrium density is\n\n$$P_s(v) = \\frac{1}{K} \\left[ 1 + \\frac{3r}{\\sigma^2}\\, |v|^2 \\right]^{-\\left( \\frac{1}{3 \\mu r} + \\frac{m+2}{3} \\right)}, \\tag{18}$$\n\nwhere, as before, $m$ and $K$ denote the dimension of the weight vector and the normalization constant for the density respectively. For a 1-D filter, the equilibrium density can be found in closed form without assuming Gaussian input data. We find\n\n$$P_s(v) = \\frac{1}{K} \\left[ 1 + \\frac{S}{\\sigma^2 r}\\, v^2 \\right]^{-\\left( 1 + \\frac{r}{\\mu S} \\right)}. \\tag{19}$$\n\nWith Gaussian inputs (for which $S = 3r^2$), (19) properly reduces to (18) with $m = 1$.\n\nThe equilibrium densities (18) and (19) are clearly not Gaussian; however, in the limit of very small $\\mu r$ they reduce to Gaussian distributions with variance $\\mu \\sigma^2 / 2$. Figure 2 shows a comparison between the theoretical result and a histogram of 200,000 values of $v$ generated by simulation with $\\mu = 0.005$ and $\\sigma^2 = 1.0$. The input data were drawn from a zero-mean Gaussian distribution with $r = 4.0$.\n\nFig. 2: Equilibrium density for 1-D LMS (horizontal axis: $v$).\n\n3 An Exactly Summable Model\n\nAs in the case of LMS learning above, stochastic gradient descent algorithms update weights based on an instantaneous estimate of the gradient of some average cost function $\\mathcal{E}(w) = \\langle \\mathcal{E}(w, x) \\rangle_x$. That is, the update is given by\n\n$$H_i(w, x) = -\\frac{\\partial}{\\partial w_i}\\, \\mathcal{E}(w, x).$$\n\nAn alternative is to increment or decrement each weight by a fixed amount depending only on the sign of $\\partial \\mathcal{E} / \\partial w_i$. We formulated this alternative update rule because it avoids a common problem for sigmoidal networks, getting stuck on \"flat spots\" or \"plateaus\". The standard gradient descent update rule yields very slow movement on plateaus, while second order methods such as Gauss-Newton can be unstable. 
The sign-of-gradient update rule suffers from neither of these problems (see footnote 5).\n\nFootnote 5: The use of the sign of the gradient has been suggested previously in the stochastic approximation literature by Fabian (1960) and in the neural network literature by Derthick (1984).\n\nIf at each iteration one chooses a weight at random for updating, then the Kramers-Moyal expansion can be exactly summed. Thus at each iteration we 1) choose a weight $w_i$ and an exemplar $x$ at random, and 2) update $w_i$ with\n\n$$H_i(w, x) = -\\operatorname{sign}\\!\\left( \\frac{\\partial \\mathcal{E}(w, x(n))}{\\partial w_i} \\right). \\tag{20}$$\n\nWith this update rule, $H_j = \\pm 1$ or $0$ and $H_i H_j = \\delta_{ij}$ (or $0$). All of the coefficients $\\langle H_i H_j H_k \\cdots \\rangle_x$ in the Kramers-Moyal expansion (7) vanish unless $i = j = k = \\cdots$. The remaining series can be summed by breaking it into odd and even parts. This leaves\n\n$$P(w, n+1) - P(w, n) = -\\frac{1}{2m} \\sum_{j=1}^{m} \\big\\{ P(w + \\mu_j, n)\\, A_j(w + \\mu_j) - P(w - \\mu_j, n)\\, A_j(w - \\mu_j) \\big\\} + \\frac{1}{2m} \\sum_{j=1}^{m} \\big\\{ P(w + \\mu_j, n)\\, B_{jj}(w + \\mu_j) - 2 P(w, n)\\, B_{jj}(w) + P(w - \\mu_j, n)\\, B_{jj}(w - \\mu_j) \\big\\} \\tag{21}$$\n\nwhere $\\mu_j$ denotes a displacement along $w_j$ of distance $\\mu$, $A_j(w) = \\langle H_j(w, x) \\rangle_x$, and $B_{jj}(w) = \\langle H_j^2(w, x) \\rangle_x$. Note that $B_{jj}(w) = 1$ unless $H(w, x) = 0$ for all $x$, in which case $B_{jj}(w) = 0$. Although exact, (21) curiously has the form of a second order finite difference approximation to the Fokker-Planck equation with diagonal diffusion matrix. This form is understandable, since the dynamics (20) restrict the weight values $w$ to a hypercubic lattice with cell length $\\mu$ and generate only nearest neighbor interactions. 
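\n\nAs an independent check on the exactly summed equation for the sign-of-gradient rule (a sketch; the hop probabilities $p^{\\pm}_j$ and $p^{0}_j$ are notation introduced here for illustration and are not used elsewhere in the paper), let $p^{+}_j(w)$, $p^{-}_j(w)$ and $p^{0}_j(w)$ be the probabilities over $x$ that $H_j(w, x)$ equals $+1$, $-1$ and $0$, so that $A_j = p^{+}_j - p^{-}_j$ and $B_{jj} = p^{+}_j + p^{-}_j = 1 - p^{0}_j$. Since weight $j$ is selected with probability $1/m$ and then moves by $\\pm \\mu$ or stays put, balancing the probability arriving at and leaving $w$ gives\n\n$$P(w, n+1) - P(w, n) = \\frac{1}{m} \\sum_{j=1}^{m} \\Big[ P(w - \\mu_j, n)\\, p^{+}_j(w - \\mu_j) + P(w + \\mu_j, n)\\, p^{-}_j(w + \\mu_j) - P(w, n)\\, B_{jj}(w) \\Big].$$\n\nSubstituting $p^{\\pm}_j = (B_{jj} \\pm A_j)/2$ reproduces the centered difference of $A_j P$ (entering with a minus sign, as the drift term does in (8)) and the second difference of $B_{jj} P$, each with prefactor $1/(2m)$. 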
\n\nFig. 3: Sequence of densities for the XOR problem (recoverable panel titles: n = 500 and n = 5000; horizontal axis: $v$).\n\nAs an example, figure 3 shows the cost function evaluated along a 1-D slice through the weight space for the XOR problem. Along this line are local and global minima at $v = 1$ and $v = 0$ respectively. Also shown is the probability density (vertical lines). The sequence shows the spreading of the density from its initialization at the local minimum, and its eventual collection at the global minimum.\n\n4 Discussion\n\nA theoretical approach that focuses on the dynamics of the weight space probability density, as we do here, provides powerful tools to extend understanding of stochastic search. Both transient and equilibrium behavior can be studied using these tools. We expect that knowledge of equilibrium weight space distributions can be used in conjunction with theories of generalization (e.g. Moody, 1992) to assess the influence of stochastic search on prediction error. Characterization of transient phenomena should facilitate the design and evaluation of search strategies such as data batching and adaptive learning rate schedules. Transient phenomena are treated in greater depth in the companion paper in this volume (Orr and Leen, 1993).\n\nAcknowledgements\n\nT. Leen was supported under grants N00014-91-J-1482 and N00014-90-J-1349 from ONR. J. Moody was supported under grants 89-0478 from AFOSR, ECS-9114333 from NSF, and N00014-89-J-1228 and N00014-92-J-4062 from ONR.\n\nReferences\n\nTodd K. Leen and Genevieve B. Orr (1992), Weight-space probability densities and convergence times for stochastic learning. In International Joint Conference on Neural Networks, pages IV 158-164. IEEE, June.\n\nH. Ritter and K. 
Schulten (1988), Convergence properties of Kohonen's topology conserving maps: Fluctuations, stability and dimension selection, Biol. Cybern., 60, 59-71.\n\nGenevieve B. Orr and Todd K. Leen (1993), Probability densities in stochastic learning: II. Transients and Basin Hopping Times. In Giles, C.L., Hanson, S.J., and Cowan, J.D. (eds.), Advances in Neural Information Processing Systems 5. San Mateo, CA: Morgan Kaufmann Publishers.\n\nH. Risken (1989), The Fokker-Planck Equation. Springer-Verlag, Berlin.\n\nG. Radons, H.G. Schuster and D. Werner (1990), Fokker-Planck description of learning in backpropagation networks, International Neural Network Conference - INNC 90, Paris, II 993-996, Kluwer Academic Publishers.\n\nC.W. Gardiner (1990), Handbook of Stochastic Methods, 2nd Ed. Springer-Verlag, Berlin.\n\nSimon Haykin (1991), Adaptive Filter Theory, 2nd edition. Prentice Hall, Englewood Cliffs, N.J.\n\nW.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling (1987), Numerical Recipes - the Art of Scientific Computing. Cambridge University Press, Cambridge / New York.\n\nV. Fabian (1960), Stochastic approximation methods. Czechoslovak Math J., 10, 123-159.\n\nMark Derthick (1984), Variations on the Boltzmann machine learning algorithm. Technical Report CMU-CS-84-120, Department of Computer Science, Carnegie-Mellon University, Pittsburgh, PA, August.\n\nJohn E. Moody (1992), The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems 4. Morgan Kaufmann Publishers, San Mateo, CA. 
", "award": [], "sourceid": 634, "authors": [{"given_name": "Todd", "family_name": "Leen", "institution": null}, {"given_name": "John", "family_name": "Moody", "institution": null}]}