{"title": "Spiking Boltzmann Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 122, "page_last": 128, "abstract": null, "full_text": "Spiking Boltzmann Machines\n\nGeoffrey E. Hinton\nGatsby Computational Neuroscience Unit\nUniversity College London\nLondon WC1N 3AR, UK\nhinton@gatsby.ucl.ac.uk\n\nAndrew D. Brown\nDepartment of Computer Science\nUniversity of Toronto\nToronto, Canada\nandy@cs.utoronto.ca\n\nAbstract\n\nWe first show how to represent sharp posterior probability distributions using real-valued coefficients on broadly-tuned basis functions. Then we show how the precise times of spikes can be used to convey the real-valued coefficients on the basis functions quickly and accurately. Finally we describe a simple simulation in which spiking neurons learn to model an image sequence by fitting a dynamic generative model.\n\n1 Population codes and energy landscapes\n\nA perceived object is represented in the brain by the activities of many neurons, but there is no general consensus on how the activities of individual neurons combine to represent the multiple properties of an object. We start by focussing on the case of a single object that has multiple instantiation parameters such as position, velocity, size and orientation. We assume that each neuron has an ideal stimulus in the space of instantiation parameters and that its activation rate or probability of activation falls off monotonically in all directions as the actual stimulus departs from this ideal. The semantic problem is to define exactly what instantiation parameters are being represented when the activities of many such neurons are specified.\n\nHinton, McClelland and Rumelhart (1986) consider binary neurons with receptive fields that are convex in instantiation space. 
They assume that when an object is present it activates all of the neurons in whose receptive fields its instantiation parameters lie. Consequently, if it is known that only one object is present, the parameter values of the object must lie within the feasible region formed by the intersection of the receptive fields of the active neurons. This will be called a conjunctive distributed representation. Assuming that each receptive field occupies only a small fraction of the whole space, an interesting property of this type of \"coarse coding\" is that the bigger the receptive fields, the more accurate the representation. However, large receptive fields lead to a loss of resolution when several objects are present simultaneously.\n\nFigure 1: a) Energy landscape E(x) over a one-dimensional space. Each neuron adds a dimple (dotted line) to the energy landscape (solid line). b) The corresponding probability density P(x). Where dimples overlap the corresponding probability density becomes sharper. Since the dimples decay to zero, the location of a sharp probability peak is not affected by distant dimples and multimodal distributions can be represented.\n\nWhen the sensory input is noisy, it is impossible to infer the exact parameters of objects so it makes sense for a perceptual system to represent the probability distribution across parameters rather than just a single best estimate or a feasible region. The full probability distribution is essential for correctly combining information from different times or different sources. 
One obvious way to represent this distribution (Anderson and van Essen, 1994) is to allow each neuron to represent a fairly compact probability distribution over the space of instantiation parameters and to treat the activity levels of neurons as (unnormalized) mixing proportions. The semantics of this disjunctive distributed representation is precise, but the percepts it allows are not, because it is impossible to represent distributions that are sharper than the individual receptive fields and, in high-dimensional spaces, the individual fields must be broad in order to cover the space. Disjunctive representations are used in Kohonen's self-organizing map, which is why it is restricted to very low-dimensional latent spaces.\n\nThe disjunctive model can be viewed as an attempt to approximate arbitrary smooth probability distributions by adding together probability distributions contributed by each active neuron. Coarse coding suggests a multiplicative approach in which the addition is done in the domain of energies (negative log probabilities). Each active neuron contributes an energy landscape over the whole space of instantiation parameters. The activity level of the neuron multiplies its energy landscape and the landscapes for all neurons in the population are added (Figure 1). If, for example, each neuron has a full covariance Gaussian tuning function, its energy landscape is a parabolic bowl whose curvature matrix is the inverse of the covariance matrix. The activity level of the neuron scales the inverse covariance matrix. If there are k instantiation parameters then only k + k(k + 1)/2 real numbers are required to span the space of means and inverse covariance matrices. 
So the real-valued activities of O(k^2) neurons are sufficient to represent arbitrary full covariance Gaussian distributions over the space of instantiation parameters.\n\nTreating neural activities as multiplicative coefficients on additive contributions to energy landscapes has a number of advantages. Unlike disjunctive codes, vague distributions are represented by low activities so significant biochemical energy is only required when distributions are quite sharp. A central operation in Bayesian inference is to combine a prior term with a likelihood term or to combine two conditionally independent likelihood terms. This is trivially achieved by adding two energy landscapes.[1]\n\n[1] We thank Zoubin Ghahramani for pointing out that another important operation, convolving a probability distribution with Gaussian noise, is a difficult non-linear operation on the energy landscape.\n\n2 Representing the coefficients on the basis functions\n\nTo perform perception at video rates, the probability distributions over instantiation parameters need to be represented at about 30 frames per second. This seems difficult using relatively slow spiking neurons because it requires the real-valued multiplicative coefficients on the basis functions to be communicated accurately and quickly using all-or-none spikes. The trick is to realise that when a spike arrives at another neuron it produces a postsynaptic potential that is a smooth function of time. So from the perspective of the postsynaptic neuron, the spike has been convolved with a smooth temporal function. 
By adding a number of these smooth functions together, with appropriate temporal offsets, it is possible to represent any smoothly varying sequence of coefficient values on a basis function, and this makes it possible to represent the temporal evolution of probability distributions as shown in Figure 2. The ability to vary the location of a spike in the single dimension of time thus allows real-valued control of the representation of probability distributions over multiple spatial dimensions.\n\nFigure 2: a) Two spiking neurons centered at 0 and 1 can represent the time-varying mean and standard deviation on a single spatial dimension. The spikes are first convolved with a temporal kernel and the resulting activity values are treated as exponents on Gaussian distributions centered at 0 and 1. The ratio of the activity values determines the mean and the sum of the activity values determines the inverse variance. b) The same method can be used for two (or more) spatial dimensions. Time flows from top to bottom. Each spike makes a contribution to the energy landscape that resembles an hourglass (thin lines). The waist of the hourglass corresponds to the time at which the spike has its strongest effect on some post-synaptic population. By moving the hourglasses in time, it is possible to get whatever temporal cross-sections are desired (thick lines) provided the temporal sampling rate is comparable to the time course of the effect of a spike. 
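
As a rough illustration of the decoding described in the Figure 2 caption, the Python sketch below convolves two spike trains with a smooth kernel and reads off the mean and inverse variance encoded by the resulting activities. The alpha-function kernel, the time constant tau, the unit basis variance, the spike times and all function names are assumptions made for this example; the paper does not specify them.

```python
import math

def alpha_kernel(t, tau=2.0):
    # Assumed postsynaptic kernel (alpha function): zero before the spike,
    # rising smoothly to a peak of 1.0 at t == tau, then decaying.
    if t < 0:
        return 0.0
    return (t / tau) * math.exp(1.0 - t / tau)

def activity(spike_times, t, tau=2.0):
    # Convolving a spike train with the kernel is a sum of shifted kernels.
    return sum(alpha_kernel(t - s, tau) for s in spike_times)

def decode(a0, a1, centers=(0.0, 1.0), basis_var=1.0):
    # Activities act as exponents on Gaussians, so energies (and hence
    # precisions) add: the sum sets the inverse variance, the ratio the mean.
    total = a0 + a1
    if total <= 0:
        raise ValueError('at least one activity must be positive')
    mean = (a0 * centers[0] + a1 * centers[1]) / total
    inverse_variance = total / basis_var
    return mean, inverse_variance

# Hypothetical spike trains: neuron 0 fires at t=0, neuron 1 at t=1 and t=2.
for t in (2.0, 3.0, 4.0):
    a0 = activity([0.0], t)
    a1 = activity([1.0, 2.0], t)
    print(t, decode(a0, a1))
```

Shifting a spike earlier or later changes the activities continuously at each readout time, which is how a single real-valued spike time can steer both the mean and the sharpness of the encoded distribution.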
\n\nOur proposed use of spike timing to convey real values quickly and accurately does not require precise coincidence detection, sub-threshold oscillations, modifiable time delays, or any of the other paraphernalia that has been invoked to explain how the brain could make effective use of the single, real-valued degree of freedom in the timing of a spike (Hopfield, 1995).\n\nThe coding scheme we have proposed would be far more convincing if we could show how it was learned and could demonstrate that it was effective in a simulation. There are two ways to design a learning algorithm for such spiking neurons. We could work in the relatively low-dimensional space of the instantiation parameters and design the learning to produce the right representations and interactions between representations in this space. Or we could treat this space as an implicit emergent property of the network and design the learning algorithm to optimize some objective function in the much higher-dimensional space of neural activities in the hope that this will create representations that can be understood using the implicit space of instantiation parameters. We chose the latter approach.\n\n3 A learning algorithm for restricted Boltzmann machines\n\nHinton (1999) describes a learning algorithm for probabilistic generative models that are composed of a number of experts. Each expert specifies a probability distribution over the visible variables and the experts are combined by multiplying these distributions together and renormalizing:\n\n$$p(d \\mid \\theta_1, \\ldots, \\theta_n) = \\frac{\\prod_m p_m(d \\mid \\theta_m)}{\\sum_i \\prod_m p_m(d_i \\mid \\theta_m)} \\qquad (1)$$\n\nwhere $d$ is a data vector in a discrete space, $\\theta_m$ is all the parameters of individual model $m$, $p_m(d \\mid \\theta_m)$ is the probability of $d$ under model $m$, and $i$ is an index over all possible vectors $d_i$ in the data space. 
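
The combination rule described above (multiply the distributions of the experts, then renormalize over every vector in the data space) can be sketched for a tiny discrete space. This is only an illustrative example: the two experts, their probabilities and the function name product_of_experts are hypothetical and are not taken from the paper.

```python
from itertools import product

def product_of_experts(experts, space):
    # Multiply the probability each expert assigns to every vector in the
    # data space, then divide by the sum over the whole space.
    unnormalized = {}
    for vector in space:
        p = 1.0
        for expert in experts:
            p *= expert(vector)
        unnormalized[vector] = p
    normalizer = sum(unnormalized.values())
    return {vector: p / normalizer for vector, p in unnormalized.items()}

def expert_first_bit_on(d):
    # Hypothetical expert: prefers the first bit on, indifferent to the second.
    return (0.8 if d[0] == 1 else 0.2) * 0.5

def expert_bits_agree(d):
    # Hypothetical expert: prefers the two bits to take the same value.
    return 0.45 if d[0] == d[1] else 0.05

space = list(product((0, 1), repeat=2))
combined = product_of_experts([expert_first_bit_on, expert_bits_agree], space)
for vector in space:
    print(vector, round(combined[vector], 3))
```

Neither expert alone puts more than 0.45 of its mass on the vector (1, 1), but their product concentrates most of the mass there, which illustrates how multiplying broad distributions can yield a sharper one.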
\n\nThe coding scheme we have described is just a product of experts in which each spike is an expert. We first summarize the Product of Experts learning rule for a restricted Boltzmann machine (RBM), which consists of a layer of stochastic binary visible units connected to a layer of stochastic binary hidden units with no intralayer connections. We then extend RBMs to deal with temporal data.\n\nIn an RBM, each hidden unit is an expert. When it is off it specifies a uniform distribution over the states of the visible units. When it is on, its weight to each visible unit specifies the log odds that the visible unit is on. Multiplying together the distributions specified by different hidden units is achieved by adding the log odds. Inference in an RBM is much easier than in a causal belief net because there is no explaining away. The hidden states, $s_j$, are conditionally independent given the visible states, $s_i$, and the distribution of $s_j$ is given by the standard logistic function $\\sigma$: $p(s_j = 1) = \\sigma(\\sum_i w_{ij} s_i)$. Conversely, the hidden states of an RBM are marginally dependent so it is easy for an RBM to learn population codes in which units may be highly correlated. It is hard to do this in causal belief nets with one hidden layer because the generative model of a causal belief net assumes marginal independence.\n\nAn RBM can be trained by following the gradient of the log likelihood of the data:\n\n$$\\frac{\\partial \\log p(d)}{\\partial w_{ij}} = \\langle s_i s_j \\rangle_0 - \\langle s_i s_j \\rangle_\\infty \\qquad (2)$$\n\nwhere $\\langle s_i s_j \\rangle_0$ is the expected value of $s_i s_j$ when data is clamped on the visible units and the hidden states are sampled from their conditional distribution given the data, and $\\langle s_i s_j \\rangle_\\infty$ is the expected value of $s_i s_j$ after prolonged Gibbs sampling that alternates between sampling from the conditional distribution of the hidden states given the visible states and vice versa. 
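
The two expectations defined above can be estimated by sampling. The sketch below is a minimal, bias-free illustration under stated assumptions: it draws a single sample of each correlation for one data vector, whereas a usable gradient estimate would average over many data vectors and chains. The function names, the weight layout W[i][j] and the example weights are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_hidden(v, W, rng):
    # p(s_j = 1) is the logistic of the summed input from the visible units.
    n_hidden = len(W[0])
    probs = [sigmoid(sum(W[i][j] * v[i] for i in range(len(v))))
             for j in range(n_hidden)]
    return [1 if rng.random() < p else 0 for p in probs]

def sample_visible(h, W, rng):
    # The same weights are used in the reverse direction.
    probs = [sigmoid(sum(W[i][j] * h[j] for j in range(len(h))))
             for i in range(len(W))]
    return [1 if rng.random() < p else 0 for p in probs]

def outer(v, h):
    return [[vi * hj for hj in h] for vi in v]

def gradient_sample(v0, W, n_gibbs, rng):
    # Data-clamped correlations minus correlations after n_gibbs full steps
    # of alternating Gibbs sampling; large n_gibbs approximates equilibrium.
    h = sample_hidden(v0, W, rng)
    positive = outer(v0, h)
    v = list(v0)
    for _ in range(n_gibbs):
        v = sample_visible(h, W, rng)
        h = sample_hidden(v, W, rng)
    negative = outer(v, h)
    return [[p - n for p, n in zip(prow, nrow)]
            for prow, nrow in zip(positive, negative)]

rng = random.Random(0)
W = [[0.5, -0.5], [0.2, 0.1], [-0.3, 0.4]]  # 3 visible units, 2 hidden units
print(gradient_sample([1, 0, 1], W, 50, rng))
```

Each entry of the result is -1, 0 or 1, so a single sample is extremely noisy; the variance of the second term grows with the length of the Gibbs chain.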
\n\nThis learning rule does not work well because the sampling noise in the estimate of $\\langle s_i s_j \\rangle_\\infty$ swamps the gradient. It is far more effective to maximize the difference between the log likelihood of the data and the log likelihood of the one-step reconstructions of the data that are produced by first picking binary hidden states from their conditional distribution given the data and then picking binary visible states from their conditional distribution given the hidden states. The gradient of the log likelihood of the one-step reconstructions is complicated because changing a weight changes the probability distribution of the reconstructions:\n\n+\n\n(3)\n\nwhere $Q^1$ is the distribution of the one-step reconstructions of the training data and $Q^\\infty$ is the equilibrium distribution (i.e. the stationary distribution of prolonged Gibbs sampling). Fortunately, the cumbersome third term is sufficiently small that ignoring it does not prevent the vector of weight changes from having a positive cosine with the true gradient of the difference of the log likelihoods, so the following very simple learning rule works much better than Eq. 2:\n\n$$\\Delta w_{ij} \\propto \\langle s_i s_j \\rangle_0 - \\langle s_i s_j \\rangle_1 \\qquad (4)$$\n\nwhere $\\langle s_i s_j \\rangle_1$ is the expected value of $s_i s_j$ for the one-step reconstructions.\n\n4 Restricted Boltzmann machines through time\n\nUsing a restricted Boltzmann machine we can represent time by spatializing it, i.e. taking each visible unit, $i$, and hidden unit, $j$, and replicating them through time with the constraint that the weight $w_{ij\\tau}$ between replica $t$ of $i$ and replica $t + \\tau$ of $j$ does not depend on $t$. To implement the desired temporal smoothing, we also force the weights to be a smooth function of $\\tau$ that has the shape of the temporal kernel, shown in Figure 3. 
The only remaining degree of freedom in the weights between replicas of $i$ and replicas of $j$ is the scale of the temporal kernel and it is this scale that is learned. The replicas of the visible and hidden units still form a bipartite graph and the probability distribution over the hidden replicas can be inferred exactly without considering data that lies further into the future than the width of the temporal kernel.\n\nOne problem with the restricted Boltzmann machine when we spatialize time is that hidden units at one time step have no memory of their states at previous time steps; they only see the data. If we were to add undirected connections between hidden units at different time steps, then the architecture would return to a fully connected Boltzmann machine in which the hidden units are no longer conditionally independent given the data. A useful trick borrowed from Elman nets is to allow the hidden units to see their previous states, but to treat these observations like data that cannot be modified by future hidden states. Thus, the hidden states may still be inferred independently without resorting to Gibbs sampling. The weights on the connections between hidden units also follow the time course of the temporal kernel. These connections act as a predictive prior over the hidden units. It is important to note that these forward connections are not required for the network to model a sequence, but only for the purposes of extrapolating into the future.\n\nFigure 3: The form of the temporal kernel.\n\nNow the probability that $s_j(t) = 1$ given the states of the visible units is\n\n$$p(s_j(t) = 1) = \\sigma\\Big(\\sum_i w_{ij} h_i(t) + \\sum_k w_{kj} h_k(t)\\Big)$$ 
\n\nwhere $h_i(t)$ is the convolution of the history of visible unit $i$ with the temporal kernel $r$,\n\n$$h_i(t) = \\sum_{\\tau=0}^{\\infty} r(\\tau) s_i(t - \\tau),$$\n\nand $h_k(t)$, the convolution of the hidden unit history, is computed similarly.[2] Learning the weights follows immediately from this formula for doing inference. In the positive phase the visible units are clamped at each time step and the posterior of the hidden units conditioned on the data is computed (we assume zero boundary conditions for time before $t = 0$). Then in the negative phase we sample from the posterior of the hidden units, and compute the distribution over the visible units at each time step given these hidden unit states. In each phase the correlations between the hidden and visible units are computed and the learning rule is\n\n$$\\Delta w_{ij} = \\sum_{t=0}^{\\infty} \\sum_{\\tau=0}^{\\infty} r(\\tau) \\left( \\langle s_j(t) s_i(t - \\tau) \\rangle_0 - \\langle s_j(t) s_i(t - \\tau) \\rangle_1 \\right).$$\n\n5 Results\n\nWe trained this network on a sequence of 8x8 synthetic images of a Gaussian blob moving in a circular path. In the following diagrams we display the time sequence of images as a matrix. Each row of the matrix represents a single image with its pixels stretched out into a vector in scanline order, and each column is the time course of a single pixel. The intensity of the pixel is represented by the area of the white patch. We used 20 hidden units. Figure 4a shows a segment (200 time steps) of the time series which was used in training. In this sequence the period of the blob is 80 time steps.\n\nFigure 4b shows how the trained model reconstructs the data after we sample from the hidden layer units. Once we have trained the model it is possible to do forecasting by clamping visible layer units for a segment of a sequence and then doing iterative Gibbs sampling to generate future points in the sequence. 
Figure 4c shows that given 50 time steps from the series, the model can predict reasonably far into the future, before the pattern dies out. One problem with these simulations is that we are treating the real-valued intensities in the images as probabilities. While this works for the blob images, where the values can be viewed as the probabilities of pixels in a binary image being on, this is not true for more natural images.\n\nFigure 4: a) The original data, b) reconstruction of the data, and c) prediction of the data given 50 time steps of the sequence. The black line indicates where the prediction begins.\n\n[2] Computing the conditional probability distribution over the visible units given the hidden states is done in a similar fashion, with the caveat that the weights in each direction must be symmetric. Thus, the convolution is done using the reverse kernel.\n\n6 Discussion\n\nIn our initial simulations we used a causal sigmoid belief network (SBN) rather than a restricted Boltzmann machine. Inference in an SBN is much more difficult than in an RBM. It requires Gibbs sampling or severe approximations, and even if a temporal kernel is used to ensure that a replica of a hidden unit at one time has no connections to replicas of visible units at very different times, the posterior distribution of the hidden units still depends on data far in the future. The Gibbs sampling made our SBN simulations very slow and the sampling noise made the learning far less effective than in the RBM. Although the RBM simulations seem closer to biological plausibility, they too suffer from a major problem. 
To apply the learning procedure it is necessary to reconstruct the data from the hidden states and we do not know how to do this without interfering with the incoming data stream. In our simulations we simply ignored this problem by allowing a visible unit to have both an observed value and a reconstructed value at the same time.\n\nAcknowledgements\n\nWe thank Zoubin Ghahramani, Peter Dayan, Rich Zemel, Terry Sejnowski and Radford Neal for helpful discussions. This research was funded by grants from the Gatsby Foundation and NSERC.\n\nReferences\n\nAnderson, C. H. & van Essen, D. C. (1994). Neurobiological computational systems. In J. M. Zurada, R. J. Marks, & C. J. Robinson (Eds.), Computational Intelligence: Imitating Life, 213-222. New York: IEEE Press.\n\nHinton, G. E. (1999). Products of Experts. ICANN 99: Ninth International Conference on Artificial Neural Networks, Edinburgh, 1-6.\n\nHinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986). Distributed representations. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations. Cambridge, MA: MIT Press.\n\nHopfield, J. (1995). Pattern recognition computation using action potential timing for stimulus representation. Nature, 376, 33-36.\n", "award": [], "sourceid": 1724, "authors": [{"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}, {"given_name": "Andrew", "family_name": "Brown", "institution": null}]}