{"title": "Reinforcement Learning and Time Perception -- a Model of Animal Experiments", "book": "Advances in Neural Information Processing Systems", "page_first": 115, "page_last": 122, "abstract": null, "full_text": "Reinforcement Learning and Time \nPerception -\na Model of Animal \n\nExperiments \n\nJ. L. Shapiro \n\nDepartment of Computer Science \n\nUniversity of Manchester \nManchester, M13 9PL U.K. \n\njls@cs.man.ac.uk \n\nJohn Wearden \n\nDepartment of Psychology \nUniversity of Manchester \nManchester, M13 9PL U.K. \n\nAbstract \n\nAnimal data on delayed-reward conditioning experiments shows a \nstriking property -\nthe data for different time intervals collapses \ninto a single curve when the data is scaled by the time interval. \nThis is called the scalar property of interval timing. Here a simple \nmodel of a neural clock is presented and shown to give rise to the \nscalar property. The model is an accumulator consisting of noisy, \nlinear spiking neurons. It is analytically tractable and contains \nonly three parameters. When coupled with reinforcement learning \nit simulates peak procedure experiments, producing both the scalar \nproperty and the pattern of single trial covariances. \n\n1 \n\nIntroduction \n\nAn aspect of delayed-reward reinforcement learning problem which has a long his(cid:173)\ntory of study in animal experiments, but has been overlooked by theorists, is the \nlearning of the expected time to the reward. In a number of animal experiments, \nanimals need to wait a given time interval after a stimulus before performing an \naction in order to receive the reward. In order to be able to do this, the animal \nrequires an internal clock or mechanism for perceiving time intervals, as well as a \nlearning system which can tackle more familiar aspects of delayed reward reinforce(cid:173)\nment learning problem. 
In this paper it is shown that a simple connectionist model of an accumulator used to measure time duration, coupled to a standard TD(λ) reinforcement learning rule, reproduces the most prominent features of the animal experiments.\n\nThe reason it might be desirable for a learner to learn the expected time to a reward is that it allows the learner to perform the action for an appropriate length of time. An example described by Grossberg and Merrill [4] and modeled in animal experiments by Gibbon and Church [3] is foraging. An animal which had no sense of the typical time to find food might leave a patch too often, thereby spending an inordinate amount of time flying between patches. Alternatively, it could remain in a depleted patch and starve. The ability to learn times to rewards is an important aspect of intelligent behavior more generally.\n\n1.1 Peak Procedure Experiments\n\nA typical type of experiment which investigates how animals learn the time between stimulus and reward is the peak procedure. In this, the animal is trained to respond after a given time interval t_r has elapsed. Some stimulus (e.g. a light) is presented which stays on during the trial. The animal is able to respond at any time. The animal receives a reward for the first response after the length of time t_r. The trial ends when the animal receives the reward.\n\nOn some trials, however, no reward is given even when the animal responds appropriately. This is to see when the animal would stop responding. What happens in non-reward trials is that the animal typically will start responding at a certain time, will respond for a period, and then stop responding. Responses averaged over many trials, however, give a smooth curve. The highest response is at the time interval t_r, and there is variation around this. 
The inaccuracy in the response (as measured by the standard deviation in the average response curves for non-reward trials) is also proportional to the time interval. In other words, the ratio of the standard deviation to the mean response time (the coefficient of variation) is a constant independent of the time interval.\n\nA more striking property of the timing curves is the scalar property, of which the above are two consequences. When the average response rate for non-reward trials is multiplied by the time interval and plotted against the relative time (time divided by the time interval), the data from different time intervals collapse onto one curve.\n\nThis strong form of the scalar property can be expressed mathematically as follows. Let t be the actual time since the start of the trial and T be subjective time. Subjective time is the time duration which the animal perceives to have elapsed (or at least appears to perceive, judging from its behavior). The experiments show that T varies for a given t. This variation can be expressed as a conditional probability, the probability of acting as though the time is T given that the actual time is t, which is written P(T|t). The fact that the data collapse implies that this probability depends on T and t in a special way,\n\n    P(T|t) = (1/t) P_inv(T/t).    (1)\n\nHere P_inv is the function which describes the shape of the scaled curves. Thus, time acts as a scale factor. This is a strong and striking result. It has been seen in many species, including rats, pigeons, and turtles; humans will show similar results if the time intervals are short or if they are prevented from counting by distracting tasks. For reviews of interval timing phenomena, see [5] and [3].\n\nA key question which remains unanswered is: what is the origin of the scalar property? 
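The collapse in equation (1) can be illustrated numerically. The following is a minimal sketch, not code from the paper: it assumes, purely for illustration, that subjective time at interval t_r is gamma distributed with scale proportional to t_r (the shape parameter a = 25 and the sample size are arbitrary choices of ours). The coefficient of variation then comes out the same for every interval, so the scaled histograms would collapse onto one curve.

```python
import numpy as np

# Hypothetical illustration of the scalar property: subjective time T drawn
# from a gamma distribution whose scale grows with the interval t_r.
# The shape parameter a = 25 is an arbitrary choice, not a value from the paper.
rng = np.random.default_rng(0)
a = 25.0

cvs = []
for t_r in (40, 80, 160, 320):
    T = rng.gamma(shape=a, scale=t_r / a, size=100_000)  # mean response time t_r
    cvs.append(T.std() / T.mean())

# The coefficient of variation is ~1/sqrt(a) for every interval.
print(cvs)
```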
Since the scalar property is ubiquitous, it may be revealing something fundamental about the nature of an internal clock or time perception system. This is especially true if there are only a few known mechanisms which generate this phenomenon. It is well known that any model based on the accumulation of independent errors, such as a clock with a variable pulse rate, does not produce the scalar property. In such a model it would be the ratio of the variance to the mean response time which would be independent of the time interval (a consequence of the law of large numbers). In section 2, a simple stochastic process will be presented which gives rise to scalar timing. In section 3, simulations of the model on the peak procedure are presented. The model reproduces experimental results on the mean responses and the covariation between responses on non-reward trials.\n\n2 The model\n\n2.1 An accumulator network of spiking neurons\n\nHere it is shown that a simple connectionist model of an accumulator can give rise to the strong scalar property. The network consists of noisy, linear, spiking neurons which are connected in a random, spatially homogeneous way. The network encodes time as the total activity in the network, which grows during the measured time interval. Psychological aspects of the model will be presented elsewhere [8].\n\nThe network consists of N identical neurons. The connectivity between neurons is random and defined by a connection matrix C_ij which is random and sparse. The connection strength is the same between all connected neurons. An important parameter is the fan-out of the ith neuron, C_i; its average across the network is denoted C. Time is in discrete units of size τ, the time required for a spike produced by a neuron to invoke a spike in a connected neuron. There is no refractory period. 
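A sparse random connectivity of this kind is straightforward to construct. The sketch below is our own construction (the network size is arbitrary, not the paper's): it draws a 0/1 connection matrix with mean fan-out C and then sets the spike-transmission mean γ so that the balance condition Cγ = 1 holds on the realized network.

```python
import numpy as np

# Sketch of the network wiring (sizes are our own arbitrary choices).
rng = np.random.default_rng(1)
N, C = 500, 10                  # N neurons, target mean fan-out C
p = C / N                       # connection probability giving mean fan-out C
conn = rng.random((N, N)) < p   # conn[i, j] = 1 if neuron i feeds neuron j
fan_out = conn.sum(axis=1)      # C_i, the fan-out of the ith neuron
gamma = 1.0 / fan_out.mean()    # balance condition: C * gamma = 1

print(fan_out.mean(), fan_out.mean() * gamma)
```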
\nThe neurons are linear: the expected number of spikes produced by a neuron is γ times the number of pre-synaptic spikes. Let a_i(t) denote the number of spikes produced by neuron i at time t. This obeys\n\n    a_i(t + τ) = Σ_{a=1}^{h_i(t)} ν_a + I_i(t),    (2)\n\nwhere h_i(t) is the number of spikes feeding into neuron i, h_i(t) = Σ_j C_ji a_j(t). I_i(t) is the external input at i, and ν is a random variable which determines whether a pre-synaptic spike invokes one in a connected neuron. The mean of ν is γ and its variance is denoted σ_ν². So the spikes behave independently; saturation effects are ignored. The total activity of the network is\n\n    n(t) = Σ_{i=1}^N a_i(t).    (3)\n\nAt each time step, the number of spikes will grow due to the fan-out of the neurons. At the same time, the number of spikes will shrink due to the fact that a spike invokes another spike with a probability less than 1. An essential assumption of this work is that these two processes balance each other, Cγ = 1.\n\nFinally, in order for this network to act as an accumulator, it receives statistically stationary input during the time interval which is being measured; so I(t) is present only during the measured interval and is statistically stationary then.\n\n2.2 Derivation of the strong scalar property\n\nHere it is shown that the network activity obeys equation (1). Let y be the scaled network activity,\n\n    y(t) = n(t)/t.    (4)\n\nThe goal here is to derive the probability distribution for y as a function of time, P(y|t). In order to do this, we use the cumulant generating function (or characteristic function). For any probability distribution p(x), the generating function for cumulants is\n\n    G(θ) = log ∫_Ω e^{θx} p(x) dx    (5)\n         = Σ_{n=1}^∞ κ_n θ^n / n!,    (6)\n\nwhere Ω is the domain of p(x), κ_n is the nth cumulant of p(x), and θ is just a dummy variable. Taking the nth derivative of G(θ) with respect to θ and setting θ to 0 gives κ_n. 
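Before following the analytical derivation, the claimed scalar behaviour of the accumulator can be checked by direct simulation. Summing equation (2) over neurons, with Poisson spike transmission and the balance condition Cγ = 1, the total activity reduces to a critical branching process driven by the input, n(t+τ) ~ Poisson(n(t)) + Poisson(m_I). This population-level reduction is our own simplification, not code from the paper.

```python
import numpy as np

# Population-level sketch of the accumulator (our simplification): with
# Poisson transmission and C*gamma = 1, the total activity is a critical
# branching process plus stationary input.  The scaled activity y = n(t)/t
# should have a time-independent distribution, so its coefficient of
# variation is the same at different times (checked here at t = 100, 300).
rng = np.random.default_rng(2)
m_I, trials = 10.0, 4000
n = np.zeros(trials)

cv = {}
for t in range(1, 301):
    n = rng.poisson(n) + rng.poisson(m_I, size=trials)  # one step of size tau
    if t in (100, 300):
        y = n / t
        cv[t] = y.std() / y.mean()

# Both values are close to 1/sqrt(a), with a = 2*m_I/(C*sigma_nu^2) = 20 here.
print(cv[100], cv[300])
```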
Cumulants are like moments; see [1] for some definitions and properties.\n\nWe will derive a recursion relation for the cumulant generating function for y(t), denoted G_y(θ; t). Let G_ν(θ) denote the generating function for the distribution of ν and G_I(θ) denote the generating function for the distribution of inputs I(t). These latter two are assumed to be stationary, hence there is no time dependence. From equation (2) it follows that\n\n    G_y(θ; t + τ) = G_I(θ/(t + τ)) + (1/N) Σ_i G_y( t C_i G_ν(θ/(t + τ)); t ).    (7)\n\nIn deriving the above, it was assumed that the activity at each node is statistically the same, and that the fan-out at i is uncorrelated with the activity at i (this requires a sufficiently sparse connectivity, i.e. no tight loops).\n\nDifferentiating the last equation n times with respect to θ and setting θ to zero produces a set of recursion relations for the cumulants of y, denoted κ_n. It is necessary to take terms only up to first order in 1/t to find the fixed-point distribution. The recursion relations to this order are\n\n    κ_1(t + τ) = (1 - τ/t) κ_1(t) + m_I/t,    (8)\n    κ_n(t + τ) = (1 - nτ/t) κ_n(t) + (1/t) [n(n-1)/2] C σ_ν² κ_{n-1}(t) + O(1/t²),  n > 1.    (9)\n\nThe above depends upon the mean total input activity m_I ≡ G_I'(0), the average fan-out C, and the variance of the noise ν, σ_ν² ≡ G_ν''(0). In general it would depend upon the fan-out times the mean of the noise ν, but that is 1 by assumption. Higher-order statistics in C and ν contribute only to terms which are higher order in 1/t.\n\nThe above equations converge to a fixed point, which shows that n(t)/t has a time-independent distribution for large t. The fixed point is found to be\n\n    G_y(θ; ∞) = Σ_{n=1}^∞ (θ^n/n!) κ_n(∞) = -(2 m_I / (C σ_ν²)) log(1 - C σ_ν² θ / (2τ)).    (10)\n\nEquation (10) is the generating function for a gamma distribution,\n\n    Pr(x | a, b) = exp(-x/b) x^{a-1} / (b^a Γ(a)),    (11)\n\nwith\n\n    a = 2 m_I / (C σ_ν²);   b = C σ_ν² / (2τ).    (12)\n\nCorrections to the fixed point are O(1/t). What this shows is that for large t, the distribution of the neural activity n is scalar,\n\n    P(n|t) = (1/t) Pr(n/t | a, b),    (13)\n\nwith a and b defined above.\n\n2.3 Reinforcement learning of time intervals\n\nThe above model represents a way for a simple connectionist system to measure a time interval. In order to model behavior, the system must learn to associate the external stimulus and the clock with the response and the reward. To do this, some additional components are needed.\n\nThe ith stimulus is represented by a signal s_i. The output of the accumulator triggers a set of clock nodes which convert the quantity or activity encoding of time used by the accumulator into a \"spatial code\" in which particular nodes represent different network activities. This was done because it is difficult to use the accumulator activity directly, as it takes a wide range of values. Each clock node responds to a particular accumulator activity. The output of the ith clock node at time t is denoted x_i(t); it is one if the activity is i, zero otherwise. It would be more reasonable to use a coarse coding, but this fine-grained encoding is particularly simple. The components of the learning model are shown schematically in figure 1.\n\nFigure 1: The learning model. The accumulator feeds into a bank of clock nodes, x_i, which are tuned to accumulator activities. The response V_j is triggered by the simultaneous presence of both the stimulus s_i and the appropriate clock node. Solid lines denote weights which are fixed; dashed lines show weights which learn according to the TD(λ) learning rule.\n\nThe stimulus and the clock nodes feed into response nodes. 
The output of the jth response node, V_j(t), is given by\n\n    V_j(t) = Σ_i w_ij x_i(t) + Σ_i A_ij s_i + θ.    (14)\n\nHere θ is a threshold, A_ij is the association between the stimulus and the response, and w_ij is the association between a clock node and the response. Both the stimulus and the appropriate clock node must be present in order for there to be a reasonable probability of a response. The response probability is V_j(t), unless that is negative, in which case there is no response, or greater than 1, in which case there is definitely a response.\n\nBoth the A's and the w's learn via a TD(λ) learning rule. TD(λ) is an important learning rule for modeling associative conditioning; it has been used to model aspects of classical conditioning including Pavlovian conditioning and blocking. For example, a model which is very effective at modeling Pavlovian eye-blink experiments and other classical conditioning results has been proposed by Moore et al. [6], building on the model of Sutton, Barto, and Desmond (see description in [7]). This model represents time using a tapped delay line; at each time step, a different node in the delay line is activated. Time acts as one of the conditioned stimuli. Using temporal difference (TD) reinforcement learning, the conditioned stimulus is associated with the response through the unconditioned stimulus. These authors did not attempt to model the scalar property, and in their model time is represented accurately by the system. The model presented here is similar to these models. The clock nodes play the role of the tapped delay-line nodes in that model. However, here they are stimulated by the accumulator rather than by each other, and they follow a stochastic trajectory due to the fluctuating nature of the accumulator.\n\nThe learning rule for w_ij couples to an \"eligibility trace\" for the clock nodes x_i(t), which takes time to build up and decays after the node is turned off. 
They obey \nthe following equations, \n\n(15) \n\nThe standard TD-A learning parameters, \"( and A are used, see [9]. The learning \nequations are \n\nt:.Wij \nt:.Aij \n8(t) \n\na8(t + T)Xi(t), \na8(t + T)Si' \nR(t) + \"( Vj(t) - Vj(t - T). \n\n(16) \n(17) \n(18) \nHere a is a learning rate, 8 is the temporal difference component, R(t) is the re(cid:173)\ninforcement. The outputs Vj at both times use the current value of the weights. \nThe threshold is set to a constant value (-1 in the simulations). It would make no \ndifference if a eligibility trace were used for the stimulus Si, because that was held \non during the learning. \n\n3 Simulations \n\nThe model has been used to simulate peak procedure. In the simulations, the model \nis forced to respond for the first set of trials (50 trials in the simulations); otherwise \nthe model would never respond. This could represent shaping in real experiments. \nAfter that the model learns using reward trials for an additional number of trials \n(150 trials in these simulations). The system is then run for 1000 trials, every \n10th trial is a non-reward trial; the system continues to learn during these trials. \nFigure 2 shows average over non-reward trials for different time intervals. The scalar \nproperty clearly holds. \n\nGibbon and Church [3] have argued that the covariation between trials is a useful \ndiagnostic to distinguish models of scalar timing. The methodology which they \nproposed is to fit the results of single non-reward trials from peak procedure exper(cid:173)\niments to a break-run-break pattern of response The animal is assumed to respond \nat a low rate until a start time is reached. The animal then responds at a high rate \nuntil a stop time is reached, whence it returns to the low response rate. The covari(cid:173)\nation between the start and stop times between trials is measured and compared to \nthose predicted by theory. 
\nFigure 2: Left) Average response of the spatially encoded network for non-reward trials. The accumulator parameters are: m_I = 10, C σ_ν² = 1 (Poisson limit); the learning parameters are γ = 0.75, λ = 1, and the learning rate α is 0.5. Right) Relative time plotted against response rate times time interval for reinforcement times of 40τ, 80τ, 160τ, 240τ, and 320τ. All curves are averages over 100 non-reward trials, which were every 10th trial in 1000 learning trials.\n\nThe question Gibbon and Church asked was: how do the start and stop times covary across trials? For example, if the animal starts responding early, does it stop responding early, as though it has a shifted estimate of the time interval? Or does it stop responding late, as though it has a more liberal view about what constitutes the particular interval? The covariance between start and stop parameters addresses this question.\n\nComparable experiments can be carried out on the model proposed here. The procedure used is described in [2]. Figure 3 shows a comparison of data from reference [2] with simulations. The pattern of covariation found in the simulations is qualitatively similar to that of the animal data. The interesting quantity is the correlation between the start time and the spread (the difference between stop and start times). This is negative in both.\n\nFigure 3: Left) Covariances across individual trials in experiments on rats. Data are taken from Table 2 of reference [2], averaged over the four conditions. The covariances are shown in the following order: 1. start-stop, 2. start-spread, 3. spread-middle, 4. start-middle, 5. stop-spread, 6. stop-middle. 
The black, gray, and white bars are for times of reinforcement t_r of 15, 30, and 60 seconds respectively. Right) Covariances across individual trials simulated by the model. The reinforcement times are 40τ, 80τ, and 160τ. The covariances are given in the same order as in the left figure.\n\n4 Conclusion\n\nPrevious models of interval timing fail to explain its most striking feature: the collapse of the data when scaled by the time interval. We have presented a simple model of an accumulator clock based on spiking, noisy, linear neurons which produces this effect. It is a simple model, analytically tractable, based on a driven branching process. The parameters are: τ, the time for a spike on one neuron to excite spikes on connected neurons; m_I, the average number of spikes excited externally in each short time interval τ; and σ_ν², the variance of the spike transmission process. A weakness of this model is that it requires fine-tuning of a pair of parameters, so that the expected number of spikes grows with the external excitation only.\n\nOnce a scalar clock is produced, simple reinforcement learning can be used to associate the clock signal with appropriate responses. A set of intermediate clock nodes was used to encode time. TD(λ) reinforcement learning between the intermediate nodes and the response, together with an eligibility trace, simulates the peak procedure and the individual-trial covariances.\n\nReferences\n\n[1] M. Abramowitz and I. A. Stegun, editors. Handbook of Mathematical Functions. New York: Dover Publications, 1967.\n\n[2] Russell M. Church, Walter H. Meck, and John Gibbon. Application of scalar timing theory to individual trials. Journal of Experimental Psychology: Animal Behavior Processes, 20(2):135-155, 1994.\n\n[3] John Gibbon and Russell M. Church. Representation of time. Cognition, 37:23-54, 1990.\n\n[4] Stephen Grossberg and John W. L. Merrill. 
A neural network model of adaptively timed reinforcement learning and hippocampal dynamics. Cognitive Brain Research, 1:3-38, 1992.\n\n[5] S. C. Hinton and W. H. Meck. How time flies: Functional and neural mechanisms of interval timing. In C. M. Bradshaw and E. Szabadi, editors, Time and Behaviour: Psychological and Neurobehavioural Analyses. Amsterdam: Elsevier Science, 1997.\n\n[6] J. W. Moore, J. E. Desmond, and N. E. Berthier. Adaptively timed conditioned responses and the cerebellum: A neural network approach. Biological Cybernetics, 62:17-28, 1989.\n\n[7] John W. Moore, Neil D. Berthier, and Diana E. J. Blazis. Classical eye-blink conditioning: Brain systems and implementation of a computational model. In Michael Gabriel and John Moore, editors, Learning and Computational Neuroscience: Foundations of Adaptive Networks, A Bradford Book, pages 359-387. The MIT Press, 1990.\n\n[8] J. L. Shapiro and John Wearden. Modelling scalar timing by an accumulator network of spiking neurons. In preparation, 2001.\n\n[9] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. A Bradford Book. The MIT Press, 1998.\n", "award": [], "sourceid": 2000, "authors": [{"given_name": "Jonathan", "family_name": "Shapiro", "institution": null}, {"given_name": "J.", "family_name": "Wearden", "institution": null}]}