{"title": "A Local Algorithm to Learn Trajectories with Stochastic Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 83, "page_last": 87, "abstract": null, "full_text": "A  Local Algorithm to Learn Trajectories \n\nwith Stochastic Neural Networks \n\nJavier R.  Movellan\u00b7 \n\nDepartment of Cognitive Science \nUniversity of California San  Diego \n\nLa Jolla, CA 92093-0515 \n\nAbstract \n\nThis paper presents  a simple algorithm to learn trajectories with a \ncontinuous  time,  continuous  activation  version  of the  Boltzmann \nmachine.  The  algorithm  takes  advantage  of intrinsic  Brownian \nnoise in the network to easily compute gradients using entirely local \ncomputations.  The  algorithm may  be  ideal  for  parallel hardware \nimplementations. \n\nThis  paper  presents  a  learning  algorithm to train  continuous stochastic  networks \nto  respond  with  desired  trajectories  in  the  output  units  to  environmental  input \ntrajectories.  This is a task, with potential applications to a variety of problems such \nas stochastic modeling of neural processes,  artificial motor control,  and continuous \nspeech  recognition .  For  example,  in  a  continuous speech  recognition  problem,  the \ninput trajectory  may be  a  sequence  of fast  Fourier  transform coefficients,  and  the \noutput  a  likely  trajectory  of phonemic  patterns  corresponding  to  the  input.  This \npaper was based on recent  work on diffusion networks by Movellan and McClelland \n(in  press)  and  by  recent  papers  by  Apolloni  and  de  Falco  (1991)  and  Neal  (1992) \non  asymmetric  Boltzmann  machines.  The  learning  algorithm  can  be  seen  as  a \ngeneralization of their work  to  the stochastic  diffusion  case  and  to  the problem of \nlearning continuous stochastic trajectories. \n\nDiffusion networks are governed by the standard connectionist differential equations \nplus  an  independent  additive  noise  component.  The resulting  process  is  governed \n\n\u00b7Pa.rt of this work was done  while a.t  Ca.rnegie  Mellon  University. \n\n83 \n\n\f84 \n\nMovellan \n\nby a set of Langevin stochastic differential equations \n\ndai(t) = Ai  dri/ti(t) dt + crdBi(t);  i E {I, ... , n} \n\n(1) \nwhere Ai  is the processing rate of the ith  unit, cr  is the diffusion constant, which con(cid:173)\ntrols the flow  of entropy  throughout the network,  and dBi(t)  is a  Brownian motion \ndifferential (Soon,  1973).  The drift function is the deterministic part of the process. \nFor consistency  I use  the same drift function  as  in Movellan and  McClelland,  1992 \nbut many other options are  possible:  dri/ti(t) = EJ=1 Wijaj(t) - /-l ai (t),  where \nWij  is  the  weight  from the  jth  to the  ith  unit,  and  /-1  is  the inverse  of a  logistic \nfunction  scaled in the  (min - max) interval:/- 1(a) = log m-;~!~' \n\nIn  practice  DNs  are  simulated  in  digital  computers  with  a  system  of stochastic \ndifference  equations \n\nai(t+dt)=ai(t)+Aidri/ti(t)dt+crzi(t)../Xi;  iE{I, ... ,n} \n\n(2) \nwhere  Zi(t)  is  a  standard  Gaussian  random  variable.  I  start  the  derivations  of \nthe learning algorithm for  the trajectory learning task using  the  discrete  time pro(cid:173)\ncess  (equation  2)  and  then  I  take  limits  to  obtain  the  continuous  diffusion  ex(cid:173)\npression.  
To simplify the derivations I adopt the following notation: a trajectory of states (input, hidden, and output units) is represented as $a = [a(t_0) \ldots a(t_m)] = [a_1(t_0) \ldots a_n(t_0) \ldots a_1(t_m) \ldots a_n(t_m)]$. The trajectory vector can be partitioned into 3 consecutive row vectors representing the trajectories of the input, hidden, and output units: $a = [x \, h \, y]$.

The key to the learning algorithm is obtaining the gradient of the probability of specific trajectories. Once we know this gradient we have all the information needed to increase the probability of desired trajectories and decrease the probability of unwanted trajectories. To obtain this gradient we first need to do some derivations on the transition probability densities. Using the discrete time approximation to the diffusion process, it follows that the conditional transition probability density functions are multivariate Gaussian:

$$ p(a(t+\Delta t) \mid a(t)) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi \sigma^2 \Delta t}} \exp\!\left[ -\frac{\big( a_i(t+\Delta t) - a_i(t) - \lambda_i \, \mathrm{drift}_i(t) \, \Delta t \big)^2}{2 \sigma^2 \Delta t} \right] \tag{3} $$

From equations 2 and 3 it follows that

$$ \frac{\partial}{\partial w_{ij}} \log p(a(t+\Delta t) \mid a(t)) = \frac{\lambda_i}{\sigma} z_i(t) \sqrt{\Delta t} \, a_j(t) \tag{4} $$

Since the network is Markovian, the probability of an entire trajectory can be computed from the product of the transition probabilities

$$ p(a) = p(a(t_0)) \prod_{t=t_0}^{t_m - \Delta t} p(a(t+\Delta t) \mid a(t)) \tag{5} $$

The derivative of the probability of a specific trajectory follows:

$$ \frac{\partial p(a)}{\partial w_{ij}} = p(a) \frac{\lambda_i}{\sigma} \sum_{t=t_0}^{t_m - \Delta t} z_i(t) \sqrt{\Delta t} \, a_j(t) \tag{6} $$

In practice, the above rule is all that is needed for discrete time computer simulations. We can obtain the continuous time form by taking limits as $\Delta t \to 0$, in which case the sum becomes Itô's stochastic integral of $a_j(t)$ with respect to the Brownian motion differential over the $[t_0, T]$ interval:

$$ \frac{\partial p(a)}{\partial w_{ij}} = p(a) \frac{\lambda_i}{\sigma} \int_{t_0}^{T} a_j(t) \, dB_i(t) \tag{7} $$

A similar equation may be obtained for the $\lambda_i$ parameters:

$$ \frac{\partial p(a)}{\partial \lambda_i} = p(a) \frac{1}{\sigma} \int_{t_0}^{T} \mathrm{drift}_i(t) \, dB_i(t) \tag{8} $$

For notational convenience I define the following random variables and refer to them as the delta signals:

$$ \delta_{w_{ij}}(a) = \frac{\partial \log p(a)}{\partial w_{ij}} = \frac{\lambda_i}{\sigma} \int_{t_0}^{T} a_j(t) \, dB_i(t) \tag{9} $$

and

$$ \delta_{\lambda_i}(a) = \frac{\partial \log p(a)}{\partial \lambda_i} = \frac{1}{\sigma} \int_{t_0}^{T} \mathrm{drift}_i(t) \, dB_i(t) \tag{10} $$

[Figure 1: A) A sample trajectory. B) The average trajectory. As time progresses, sample trajectories become statistically independent, dampening the average.]

The approach taken in this paper is to minimize the expected value of the error assigned to spontaneously generated trajectories, $\Phi = E(\rho(a))$, where $\rho(a)$ is a signal indicating the overall error of a particular trajectory and usually depends only on the output unit trajectory. The necessary gradients follow:

$$ \frac{\partial \Phi}{\partial w_{ij}} = E\big( \rho(a) \, \delta_{w_{ij}}(a) \big) \tag{11} $$

$$ \frac{\partial \Phi}{\partial \lambda_i} = E\big( \rho(a) \, \delta_{\lambda_i}(a) \big) \tag{12} $$

Since the above learning rule does not require calculating derivatives of the $\rho$ function, it provides great flexibility, making it applicable to a wide variety of situations.
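In discrete time the delta signals are just running sums: the noise increments $z_i(t)\sqrt{\Delta t}$ that the simulation already generates play the role of $dB_i(t)$, so equations 9-12 can be estimated while trajectories are sampled, using only quantities local to each connection. The sketch below, continuing the Python sketch above, is a Monte Carlo estimator in the style of equations 11 and 12; it assumes activations on the unit interval and a caller-supplied error function rho, and is an illustration of the rule rather than the paper's original implementation.

```python
import numpy as np

def estimate_gradients(w, lam, rho, sigma=0.1, dt=0.01,
                       n_steps=300, n_samples=400, rng=None):
    """Monte Carlo estimates of the gradients in equations 11 and 12.

    rho : error signal; maps a sampled trajectory of shape (n_steps, n)
          to a scalar (e.g. TSS to a desired output trajectory).
    """
    rng = rng or np.random.default_rng()
    n = len(lam)
    grad_w, grad_lam = np.zeros((n, n)), np.zeros(n)
    for _ in range(n_samples):
        a = np.full(n, 0.5)
        delta_w, delta_lam = np.zeros((n, n)), np.zeros(n)
        traj = np.empty((n_steps, n))
        for t in range(n_steps):
            f_inv = np.log(a / (1.0 - a))     # f^{-1} on the (0, 1) interval
            drift = w @ a - f_inv
            dB = rng.standard_normal(n) * np.sqrt(dt)   # z_i(t) sqrt(dt)
            # Discrete-time delta signals: the Ito integrals of equations
            # 9 and 10 become sums of integrand * dB.
            delta_w += (lam / sigma)[:, None] * np.outer(dB, a)
            delta_lam += drift * dB / sigma
            a = np.clip(a + lam * drift * dt + sigma * dB, 1e-6, 1 - 1e-6)
            traj[t] = a
        err = rho(traj)
        grad_w += err * delta_w / n_samples       # estimate of eq. 11
        grad_lam += err * delta_lam / n_samples   # estimate of eq. 12
    return grad_w, grad_lam
```

A learning epoch would then step the parameters down these estimated gradients, for example w -= eta * grad_w with some learning rate eta.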
For example, $\rho(a)$ can be the TSS (total sum of squares) between the desired and obtained output unit trajectories, or it could be a reinforcement signal indicating whether the trajectory is or is not desirable. Figure 1a shows a typical output of a network trained with TSS as the $\rho$ signal to follow a sinusoidal trajectory. The network consisted of 1 input unit, 3 hidden units, and 1 output unit. The input was constant through time and the network was trained only with the first period of the sinusoid. The expected values in equations 11 and 12 were estimated using 400 spontaneously generated trajectories at each learning epoch. It is interesting to note that although the network was trained for a single period, it continued oscillating without dampening. However, the expected value of the activations dampened, as Figure 1b shows. The dampening of the average activation is due to the fact that as time progresses, the effects of noise accumulate and the initially phase locked trajectories become independent oscillators.

[Figure 2: A) The hidden Markov emitter, a two-state model with transition probabilities 0.2 and 0.05, p(response 1) = 0.1 in hidden state 0 and p(response 1) = 0.8 in hidden state 1. B) Average error (negative log likelihood) throughout training; the Bayesian limit (best possible performance) is achieved at about 2000 epochs.]

The learning rule is also applicable in reinforcement situations where we just have an overall measure of fitness of the obtained trajectories, but we do not know what the desired trajectory looks like. For example, in a motor control problem we could use as fitness signal $(-\rho)$ the distance walked by a robot controlled by a DN network. Equations 11 and 12 could then be used to gradually improve the average distance walked by the robot. In trajectory recognition problems we could use an overall judgment of the likelihood of the obtained trajectories. I tried this last approach with a toy version of a continuous speech recognition problem. The "emitter" was a hidden Markov model (see Figure 2) that produced sequences of outputs (the equivalent of fast Fourier transform coefficients) fed as input to the receiver. The receiver was a DN network which received as input sequences of 10 outputs from the emitter Markov model. The network's task was to guess the sequence of hidden states of the emitter given the sequence of outputs from the emitter. The DN outputs were interpreted as the inferred state of the emitter. Output unit activations greater than 0.5 were evaluated as indicating that the emitter was in state 1 at that particular time. Outputs smaller than 0.5 were evaluated as state 0. To achieve optimal performance in this task the network had to combine two sources of information: top-down information about typical state transitions of the emitter, and bottom-up information about the likelihood of the hidden states of the emitter given its responses.
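For concreteness, here is a small sketch of such an emitter, together with the error signal used to train the receiver (described next): the negative log joint probability of the input and output trajectories under the emitter model. The pairing of the two transition probabilities with the two states, and the uniform initial-state probability, are assumptions read off Figure 2 rather than details stated in the text.

```python
import numpy as np

P_LEAVE = {0: 0.2, 1: 0.05}   # assumed: p(switch state | current state)
P_RESP1 = {0: 0.1, 1: 0.8}    # p(response = 1 | current state), from Figure 2

def sample_emitter(n_steps=10, rng=None):
    """Sample one (states, responses) pair from the two-state emitter."""
    rng = rng or np.random.default_rng()
    state, states, responses = 0, [], []
    for _ in range(n_steps):
        states.append(state)
        responses.append(int(rng.random() < P_RESP1[state]))
        if rng.random() < P_LEAVE[state]:
            state = 1 - state
    return states, responses

def neg_log_joint(responses, inferred_states):
    """Negative log joint probability of the response sequence (DN input)
    and an inferred state sequence (thresholded DN output). Uses only the
    emitter's probability tables, never its actual hidden states.
    """
    log_p = np.log(0.5)  # assumed uniform initial-state probability
    prev = None
    for x, y in zip(responses, inferred_states):
        if prev is not None:
            log_p += np.log(P_LEAVE[prev] if y != prev else 1 - P_LEAVE[prev])
        log_p += np.log(P_RESP1[y] if x == 1 else 1 - P_RESP1[y])
        prev = y
    return -log_p
```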
The network was trained with rules 11 and 12 using the negative log joint probability of the DN input trajectory and the DN output trajectory as the error signal. This signal was calculated using the transition probabilities of the emitter hidden Markov model and did not require knowledge of its actual state trajectories. The necessary gradients for equations 11 and 12 were estimated using 1000 spontaneous trajectories at each learning epoch. As Figure 2b shows, the network started out producing unlikely trajectories but continuously improved. The figure also shows the performance expected from an optimal classifier. As training progressed the network approached optimal performance.

Acknowledgements

This work was funded through NIMH grant MH47566 and a grant from the Pittsburgh Supercomputer Center.

References

B. Apolloni & D. de Falco. (1991) Learning by asymmetric parallel Boltzmann machines. Neural Computation, 3, 402-408.

R. Neal. (1992) Asymmetric parallel Boltzmann machines are belief networks. Neural Computation, 4, 832-834.

J. Movellan & J. McClelland. (1992) Learning continuous probability distributions with symmetric diffusion networks. Cognitive Science, in press.

T. T. Soong. (1973) Random Differential Equations in Science and Engineering. Academic Press, New York.