{"title": "Spiking Boltzmann Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 122, "page_last": 128, "abstract": null, "full_text": "Spiking Boltzmann Machines \n\nGeoffrey E. Hinton \n\nAndrew D. Brown \n\nGatsby Computational Neuroscience Unit \n\nDepartment of Computer Science \n\nUniversity College London \nLondon WCIN 3AR, UK \nhinton@gatsby. ucl. ac. uk \n\nUniversity of Toronto \n\nToronto, Canada \n\nandy@cs.utoronto.ca \n\nAbstract \n\nWe first show how to represent sharp posterior probability distribu(cid:173)\ntions using real valued coefficients on broadly-tuned basis functions. \nThen we show how the precise times of spikes can be used to con(cid:173)\nvey the real-valued coefficients on the basis functions quickly and \naccurately. Finally we describe a simple simulation in which spik(cid:173)\ning neurons learn to model an image sequence by fitting a dynamic \ngenerative model. \n\n1 Population codes and energy landscapes \n\nA perceived object is represented in the brain by the activities of many neurons, but \nthere is no general consensus on how the activities of individual neurons combine to \nrepresent the multiple properties of an object. We start by focussing on the case of \na single object that has multiple instantiation parameters such as position, velocity, \nsize and orientation. We assume that each neuron has an ideal stimulus in the space \nof instantiation parameters and that its activation rate or probability of activation \nfalls off monotonically in all directions as the actual stimulus departs from this ideal. \nThe semantic problem is to define exactly what instantiation parameters are being \nrepresented when the activities of many such neurons are specified. \n\nHinton, Rumelhart and McClelland (1986) consider binary neurons with receptive \nfields that are convex in instantiation space. 
They assume that when an object is present it activates all of the neurons in whose receptive fields its instantiation parameters lie. Consequently, if it is known that only one object is present, the parameter values of the object must lie within the feasible region formed by the intersection of the receptive fields of the active neurons. This will be called a conjunctive distributed representation. Assuming that each receptive field occupies only a small fraction of the whole space, an interesting property of this type of \"coarse coding\" is that the bigger the receptive fields, the more accurate the representation. However, large receptive fields lead to a loss of resolution when several objects are present simultaneously. \n\nWhen the sensory input is noisy, it is impossible to infer the exact parameters of objects so it makes sense for a perceptual system to represent the probability distribution across parameters rather than just a single best estimate or a feasible region. The full probability distribution is essential for correctly combining information from different times or different sources. \n\nFigure 1: a) Energy landscape E(x) over a one-dimensional space. Each neuron adds a dimple (dotted line) to the energy landscape (solid line). b) The corresponding probability density P(x). Where dimples overlap the corresponding probability density becomes sharper. Since the dimples decay to zero, the location of a sharp probability peak is not affected by distant dimples and multimodal distributions can be represented. \n\n
One obvious way to represent this distribution (Anderson and van Essen, 1994) is to allow each neuron to represent a fairly compact probability distribution over the space of instantiation parameters and to treat the activity levels of neurons as (unnormalized) mixing proportions. The semantics of this disjunctive distributed representation is precise, but the percepts it allows are not, because it is impossible to represent distributions that are sharper than the individual receptive fields and, in high-dimensional spaces, the individual fields must be broad in order to cover the space. Disjunctive representations are used in Kohonen's self-organizing map, which is why it is restricted to very low dimensional latent spaces. \n\nThe disjunctive model can be viewed as an attempt to approximate arbitrary smooth probability distributions by adding together probability distributions contributed by each active neuron. Coarse coding suggests a multiplicative approach in which the addition is done in the domain of energies (negative log probabilities). Each active neuron contributes an energy landscape over the whole space of instantiation parameters. The activity level of the neuron multiplies its energy landscape and the landscapes for all neurons in the population are added (Figure 1). If, for example, each neuron has a full covariance Gaussian tuning function, its energy landscape is a parabolic bowl whose curvature matrix is the inverse of the covariance matrix. The activity level of the neuron scales the inverse covariance matrix. If there are k instantiation parameters then only k + k(k + 1)/2 real numbers are required to span the space of means and inverse covariance matrices. So the real-valued activities of O(k^2) neurons are sufficient to represent arbitrary full covariance Gaussian distributions over the space of instantiation parameters. 
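For concreteness, the multiplicative combination can be sketched numerically: each neuron contributes a quadratic energy landscape whose curvature is its inverse covariance, scaled by its activity, and adding the landscapes multiplies the corresponding Gaussians. This is an illustrative sketch of ours, not part of the paper; the two example tuning functions are arbitrary.

```python
import numpy as np

def combine_gaussian_experts(mus, lams, activities):
    # Each neuron contributes a parabolic energy bowl whose curvature
    # matrix is its inverse covariance, scaled by its activity level.
    # Adding the scaled bowls gives the precision and (precision-weighted)
    # mean of the product Gaussian.
    precision = sum(a * L for a, L in zip(activities, lams))
    weighted_mean = sum(a * (L @ m) for a, L, m in zip(activities, lams, mus))
    mean = np.linalg.solve(precision, weighted_mean)
    return mean, precision

# Two neurons with unit-variance tuning functions centered at
# different points of a 2-D instantiation space.
mus = [np.array([0.0, 0.0]), np.array([1.0, 0.0])]
lams = [np.eye(2), np.eye(2)]
mean, precision = combine_gaussian_experts(mus, lams, [1.0, 1.0])
# Overlapping dimples sharpen the result: the combined precision is the
# sum of the activity-scaled precisions, and the mean lies between the two.
```

With equal activities the combined mean sits midway between the two centers and the precision doubles, which is the sharpening effect illustrated in Figure 1.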
\nTreating neural activities as multiplicative coefficients on additive contributions to energy landscapes has a number of advantages. Unlike disjunctive codes, vague distributions are represented by low activities so significant biochemical energy is only required when distributions are quite sharp. A central operation in Bayesian inference is to combine a prior term with a likelihood term or to combine two conditionally independent likelihood terms. This is trivially achieved by adding two energy landscapes.¹ \n\n¹We thank Zoubin Ghahramani for pointing out that another important operation, convolving a probability distribution with Gaussian noise, is a difficult non-linear operation on the energy landscape. \n\n2 Representing the coefficients on the basis functions \n\nTo perform perception at video rates, the probability distributions over instantiation parameters need to be represented at about 30 frames per second. This seems difficult using relatively slow spiking neurons because it requires the real-valued multiplicative coefficients on the basis functions to be communicated accurately and quickly using all-or-none spikes. The trick is to realise that when a spike arrives at another neuron it produces a postsynaptic potential that is a smooth function of time. So from the perspective of the postsynaptic neuron, the spike has been convolved with a smooth temporal function. By adding a number of these smooth functions together, with appropriate temporal offsets, it is possible to represent any smoothly varying sequence of coefficient values on a basis function, and this makes it possible to represent the temporal evolution of probability distributions as shown in Figure 2. The ability to vary the location of a spike in the single dimension of time thus allows real-valued control of the representation of probability distributions over multiple spatial dimensions. 
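The read-out of real values from spike times can be sketched as follows. The alpha-function kernel and all names here are our own illustrative assumptions; the paper only requires the postsynaptic effect of a spike to be a smooth function of time.

```python
import numpy as np

def coefficient_trace(spike_times, t, tau=1.0):
    # Each spike is convolved with a smooth temporal kernel; we assume an
    # alpha-function kernel (s/tau)*exp(1 - s/tau) for s > 0, which peaks
    # one time constant after the spike. The particular shape is our
    # assumption, not the paper's.
    total = np.zeros_like(t, dtype=float)
    for ts in spike_times:
        s = t - ts
        total += np.where(s > 0, (s / tau) * np.exp(1 - s / tau), 0.0)
    return total

t = np.linspace(0.0, 10.0, 1001)
# Shifting a spike slightly in time smoothly changes the real-valued
# coefficient seen at any fixed readout time.
a = coefficient_trace([2.0, 4.5], t)
b = coefficient_trace([2.1, 4.5], t)
```

Because the kernel is smooth, moving a spike by a small amount produces a small, graded change in the coefficient at every readout time, which is what lets a single spike time convey a real value.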
\nFigure 2: a) Two spiking neurons centered at 0 and 1 can represent the time-varying mean and standard deviation on a single spatial dimension. The spikes are first convolved with a temporal kernel and the resulting activity values are treated as exponents on Gaussian distributions centered at 0 and 1. The ratio of the activity values determines the mean and the sum of the activity values determines the inverse variance. b) The same method can be used for two (or more) spatial dimensions. Time flows from top to bottom. Each spike makes a contribution to the energy landscape that resembles an hourglass (thin lines). The waist of the hourglass corresponds to the time at which the spike has its strongest effect on some post-synaptic population. By moving the hourglasses in time, it is possible to get whatever temporal cross-sections are desired (thick lines) provided the temporal sampling rate is comparable to the time course of the effect of a spike. \n\nOur proposed use of spike timing to convey real values quickly and accurately does not require precise coincidence detection, sub-threshold oscillations, modifiable time delays, or any of the other paraphernalia that has been invoked to explain how the brain could make effective use of the single, real-valued degree of freedom in the timing of a spike (Hopfield, 1995). \n\nThe coding scheme we have proposed would be far more convincing if we could show how it was learned and could demonstrate that it was effective in a simulation. There are two ways to design a learning algorithm for such spiking neurons. We could work in the relatively low-dimensional space of the instantiation parameters and design the learning to produce the right representations and interactions between representations in this space. 
Or we could treat this space as an implicit emergent property of the network and design the learning algorithm to optimize some objective function in the much higher-dimensional space of neural activities in the hope that this will create representations that can be understood using the implicit space of instantiation parameters. We chose the latter approach. \n\n3 A learning algorithm for restricted Boltzmann machines \n\nHinton (1999) describes a learning algorithm for probabilistic generative models that are composed of a number of experts. Each expert specifies a probability distribution over the visible variables and the experts are combined by multiplying these distributions together and renormalizing: \n\np(d | θ_1, ..., θ_n) = Π_m p_m(d | θ_m) / Σ_i Π_m p_m(i | θ_m)   (1) \n\nwhere d is a data vector in a discrete space, θ_m is all the parameters of individual model m, p_m(d | θ_m) is the probability of d under model m, and i is an index over all possible vectors in the data space. \n\nThe coding scheme we have described is just a product of experts in which each spike is an expert. We first summarize the Product of Experts learning rule for a restricted Boltzmann machine (RBM), which consists of a layer of stochastic binary visible units connected to a layer of stochastic binary hidden units with no intralayer connections. We then extend RBMs to deal with temporal data. \n\nIn an RBM, each hidden unit is an expert. When it is off it specifies a uniform distribution over the states of the visible units. When it is on, its weight to each visible unit specifies the log odds that the visible unit is on. Multiplying together the distributions specified by different hidden units is achieved by adding the log odds. Inference in an RBM is much easier than in a causal belief net because there is no explaining away. 
The hidden states, s_j, are conditionally independent given the visible states, s_i, and the distribution of s_j is given by the standard logistic function σ: p(s_j = 1) = σ(Σ_i w_ij s_i). Conversely, the hidden states of an RBM are marginally dependent so it is easy for an RBM to learn population codes in which units may be highly correlated. It is hard to do this in causal belief nets with one hidden layer because the generative model of a causal belief net assumes marginal independence. \n\nAn RBM can be trained by following the gradient of the log likelihood of the data: \n\nΔw_ij ∝ ⟨s_i s_j⟩^0 - ⟨s_i s_j⟩^∞   (2) \n\nwhere ⟨s_i s_j⟩^0 is the expected value of s_i s_j when data is clamped on the visible units and the hidden states are sampled from their conditional distribution given the data, and ⟨s_i s_j⟩^∞ is the expected value of s_i s_j after prolonged Gibbs sampling that alternates between sampling from the conditional distribution of the hidden states given the visible states and vice versa. \n\nThis learning rule does not work well because the sampling noise in the estimate of ⟨s_i s_j⟩^∞ swamps the gradient. It is far more effective to maximize the difference between the log likelihood of the data and the log likelihood of the one-step reconstructions of the data that are produced by first picking binary hidden states from their conditional distribution given the data and then picking binary visible states from their conditional distribution given the hidden states. The gradient of the log likelihood of the one-step reconstructions is complicated because changing a weight changes the probability distribution of the reconstructions: \n\n∂L^1/∂w_ij = ⟨s_i s_j⟩^1 - ⟨s_i s_j⟩^∞ + (∂Q^1/∂w_ij)·(∂L^1/∂Q^1)   (3) \n\nwhere L^1 is the log likelihood of the one-step reconstructions, Q^1 is the distribution of the one-step reconstructions of the training data, and Q^∞ is the equilibrium distribution (i.e. the stationary distribution of prolonged Gibbs sampling). 
Fortunately, the cumbersome third term is sufficiently small that ignoring it does not prevent the vector of weight changes from having a positive cosine with the true gradient of the difference of the log likelihoods, so the following very simple learning rule works much better than Eq. 2: \n\nΔw_ij ∝ ⟨s_i s_j⟩^0 - ⟨s_i s_j⟩^1   (4) \n\n4 Restricted Boltzmann machines through time \n\nUsing a restricted Boltzmann machine we can represent time by spatializing it, i.e. taking each visible unit, i, and hidden unit, j, and replicating them through time with the constraint that the weight w_ijτ between replica t of i and replica t + τ of j does not depend on t. To implement the desired temporal smoothing, we also force the weights to be a smooth function of τ that has the shape of the temporal kernel, shown in Figure 3. The only remaining degree of freedom in the weights between replicas of i and replicas of j is the scale of the temporal kernel and it is this scale that is learned. The replicas of the visible and hidden units still form a bipartite graph and the probability distribution over the hidden replicas can be inferred exactly without considering data that lies further into the future than the width of the temporal kernel. \n\nOne problem with the restricted Boltzmann machine when we spatialize time is that hidden units at one time step have no memory of their states at previous time steps; they only see the data. If we were to add undirected connections between hidden units at different time steps, then the architecture would return to a fully connected Boltzmann machine in which the hidden units are no longer conditionally independent given the data. A useful trick borrowed from Elman nets is to allow the hidden units to see their previous states, but to treat these observations like data that cannot be modified by future hidden states. Thus, the hidden states may still be inferred independently without resorting to Gibbs sampling. 
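Before turning to the temporal case, the one-step learning rule of Eq. 4 can be sketched for a static RBM. This is an illustrative sketch of ours (the names, the sizes, the random data, and the omission of biases are our simplifications), not the authors' simulation code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(W, data, lr=0.1):
    # One weight update following Eq. 4 for a static RBM.
    # W is (visible, hidden); data is a (cases, visible) binary array.
    p_h0 = sigmoid(data @ W)                    # positive phase
    h0 = (rng.random(p_h0.shape) < p_h0) * 1.0  # sample binary hidden states
    p_v1 = sigmoid(h0 @ W.T)                    # one-step reconstruction
    v1 = (rng.random(p_v1.shape) < p_v1) * 1.0
    p_h1 = sigmoid(v1 @ W)
    n = data.shape[0]
    pos = data.T @ h0 / n    # <s_i s_j> with data clamped
    neg = v1.T @ p_h1 / n    # <s_i s_j> on the reconstructions
    return W + lr * (pos - neg)

W = 0.01 * rng.standard_normal((8, 4))
batch = (rng.random((20, 8)) < 0.5) * 1.0
W = cd1_step(W, batch)
```

The only sampling needed is one hidden step and one reconstruction step per update, which is why this rule avoids the noisy ⟨s_i s_j⟩^∞ estimate of Eq. 2.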
The weights on the connections between hidden layers also follow the time course of the temporal kernel. These connections act as a predictive prior over the hidden units. It is important to note that these forward connections are not required for the network to model a sequence, but only for the purposes of extrapolating into the future. \n\nFigure 3: The form of the temporal kernel. \n\nNow the probability that s_j(t) = 1 given the states of the visible units is \n\np(s_j(t) = 1) = σ(Σ_i w_ij h_i(t) + Σ_k w_kj h_k(t)) \n\nwhere h_i(t) is the convolution of the history of visible unit i with the temporal kernel, \n\nh_i(t) = Σ_{τ=0}^{∞} r(τ) s_i(t - τ) \n\nand h_k(t), the convolution of the hidden unit history, is computed similarly.² Learning the weights follows immediately from this formula for doing inference. In the positive phase the visible units are clamped at each time step and the posterior of the hidden units conditioned on the data is computed (we assume zero boundary conditions for time before t = 0). Then in the negative phase we sample from the posterior of the hidden units, and compute the distribution over the visible units at each time step given these hidden unit states. In each phase the correlations between the hidden and visible units are computed and the learning rule is \n\nΔw_ij = Σ_{t=0}^{∞} Σ_{τ=0}^{∞} r(τ) (⟨s_j(t) s_i(t - τ)⟩^0 - ⟨s_j(t) s_i(t - τ)⟩^1). \n\n5 Results \n\nWe trained this network on a sequence of 8x8 synthetic images of a Gaussian blob moving in a circular path. In the following diagrams we display the time sequence of images as a matrix. Each row of the matrix represents a single image with its pixels stretched out into a vector in scanline order, and each column is the time course of a single pixel. The intensity of the pixel is represented by the area of the white patch. We used 20 hidden units. Figure 4a shows a segment (200 time steps) of the time series which was used in training. 
In this sequence the period of the blob is 80 time steps. \n\nFigure 4b shows how the trained model reconstructs the data after we sample from the hidden layer units. Once we have trained the model it is possible to do forecasting by clamping the visible layer units for a segment of a sequence and then doing iterative Gibbs sampling to generate future points in the sequence. Figure 4c shows that given 50 time steps from the series, the model can predict reasonably far into the future before the pattern dies out. One problem with these simulations is that we are treating the real-valued intensities in the images as probabilities. While this works for the blob images, where the values can be viewed as the probabilities of pixels in a binary image being on, this is not true for more natural images. \n\n6 Discussion \n\nIn our initial simulations we used a causal sigmoid belief network (SBN) rather than a restricted Boltzmann machine. Inference in an SBN is much more difficult than in an RBM. It requires Gibbs sampling or severe approximations, and even if a temporal kernel is used to ensure that a replica of a hidden unit at one time has no connections to replicas of visible units at very different times, the posterior distribution of the hidden units still depends on data far in the future. The Gibbs sampling made our SBN simulations very slow and the sampling noise made the learning far less effective than in the RBM. \n\n²Computing the conditional probability distribution over the visible units given the hidden states is done in a similar fashion, with the caveat that the weights in each direction must be symmetric. Thus, the convolution is done using the reverse kernel. \n\nFigure 4: a) The original data, b) reconstruction of the data, and c) prediction of the data given 50 time steps of the sequence. The black line indicates where the prediction begins. 
Although the RBM simulations seem closer to biological plausibility, they too suffer from a major problem. To apply the learning procedure it is necessary to reconstruct the data from the hidden states and we do not know how to do this without interfering with the incoming data stream. In our simulations we simply ignored this problem by allowing a visible unit to have both an observed value and a reconstructed value at the same time. \n\nAcknowledgements \n\nWe thank Zoubin Ghahramani, Peter Dayan, Rich Zemel, Terry Sejnowski and Radford Neal for helpful discussions. This research was funded by grants from the Gatsby Foundation and NSERC. \n\nReferences \n\nAnderson, C.H. & van Essen, D.C. (1994). Neurobiological computational systems. In J.M. Zurada, R.J. Marks, & C.J. Robinson (Eds.), Computational Intelligence Imitating Life, 213-222. New York: IEEE Press. \n\nHinton, G. E. (1999). Products of Experts. ICANN 99: Ninth International Conference on Artificial Neural Networks, Edinburgh, 1-6. \n\nHinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986). Distributed representations. In Rumelhart, D. E. and McClelland, J. L., editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations. MIT Press, Cambridge, MA. \n\nHopfield, J. (1995). Pattern recognition computation using action potential timing for stimulus representation. Nature, 376, 33-36. \n", "award": [], "sourceid": 1724, "authors": [{"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}, {"given_name": "Andrew", "family_name": "Brown", "institution": null}]}