{"title": "Modeling Temporal Structure in Classical Conditioning", "book": "Advances in Neural Information Processing Systems", "page_first": 3, "page_last": 10, "abstract": null, "full_text": "Modeling Temporal Structure in Classical \n\nConditioning \n\nAaron C. Courville1,3 and David S. Touretzky 2,3 \n\n1 Robotics Institute, 2Computer Science Department \n\n3Center for the Neural Basis of Cognition \n\nCarnegie Mellon University, Pittsburgh, PA 15213-3891 \n\n{ aarone, dst} @es.emu.edu \n\nAbstract \n\nThe Temporal Coding Hypothesis of Miller and colleagues [7] sug(cid:173)\ngests that animals integrate related temporal patterns of stimuli \ninto single memory representations. We formalize this concept \nusing quasi-Bayes estimation to update the parameters of a con(cid:173)\nstrained hidden Markov model. This approach allows us to account \nfor some surprising temporal effects in the second order condition(cid:173)\ning experiments of Miller et al. [1 , 2, 3], which other models are \nunable to explain. \n\n1 \n\nIntroduction \n\nAnimal learning involves more than just predicting reinforcement. The well-known \nphenomena of latent learning and sensory preconditioning indicate that animals \nlearn about stimuli in their environment before any reinforcement is supplied. More \nrecently, a series of experiments by R. R. Miller and colleagues has demonstrated \nthat in classical conditioning paradigms, animals appear to learn the temporal struc(cid:173)\nture of the stimuli [8]. We will review three of these experiments. We then present \na model of conditioning based on a constrained hidden Markov model, using quasi(cid:173)\nBayes estimation to adjust the model parameters online. Simulation results confirm \nthat the model reproduces the experimental observations, suggesting that this ap(cid:173)\nproach is a viable alternative to earlier models of classical conditioning which can(cid:173)\nnot account for the Miller et al. experiments. 
Table 1 summarizes the experimental paradigms and the results. \n\nExpt. 1: Simultaneous Conditioning. Responding to a conditioned stimulus (CS) is impaired when it is presented simultaneously with the unconditioned stimulus (US) rather than preceding the US. The failure of the simultaneous conditioning procedure to demonstrate a conditioned response (CR) is a well-established result in the classical conditioning literature [9]. Barnet et al. [1] reported an interesting second-order extension of the classic paradigm. While a tone CS presented simultaneously with a footshock results in a minimal CR to the tone, a click train preceding the tone (in phase 2) does acquire associative strength, as indicated by a CR. \n\nExpt.      Phase 1             Phase 2           Test => Result   Test => Result \nExpt. 1    (4) T+US            (4) C -> T        T => -           C => CR \nExpt. 2A   (12) T -> C         (8) T -> US       C => - \nExpt. 2B   (12) T -> C         (8) T --> US      C => CR \nExpt. 3    (96) L -> US -> X   (8) B -> X        X => -           B => CR \n\nTable 1: Experimental paradigms. Phases 1 and 2 represent two stages of training trials, each presented (n) times. The plus sign (+) indicates simultaneous presentation of stimuli; the short arrow (->) indicates one stimulus immediately following another; and the long arrow (-->) indicates a 5 sec gap between stimulus offset and the following stimulus onset. For Expt. 1, the tone T, click train C, and footshock US were all of 5 sec duration. For Expt. 2, the tone and click train durations were 5 sec and the footshock US lasted 0.5 sec. For Expt. 3, the light L, buzzer B, and auditory stimulus X (either a tone or white noise) were all of 30 sec duration, while the footshock US lasted 1 sec. CR indicates a conditioned response to the test stimulus. \n\nExpt. 2: Sensory Preconditioning. Cole et al. [2] exposed rats to a tone T immediately followed by a click train C. 
In a second phase, the tone was paired with a footshock US that either immediately followed tone offset (variant A) or occurred 5 sec after tone offset (variant B). They found that when C and US both immediately follow T, little conditioned response is elicited by the presentation of C. However, when the US occurs 5 sec after tone offset, so that it occurs later than C (measured relative to T), then C does come to elicit a CR. \n\nExpt. 3: Backward Conditioning. In another experiment by Cole et al. [3], rats were presented with a flashing light L followed by a footshock US, followed by an auditory stimulus X (either a tone or white noise). In phase 2, a buzzer B was followed by X. Testing revealed that while X did not elicit a CR (in fact, it became a conditioned inhibitor), X did impart an excitatory association to B. \n\n2 Existing Models of Classical Conditioning \n\nThe Rescorla-Wagner model [11] is still the best-known model of classical conditioning, but as a trial-level model it cannot account for within-trial effects such as second-order conditioning or sensitivity to stimulus timing. Sutton and Barto developed V-dot theory [14] as a real-time extension of Rescorla-Wagner. Further refinements led to the Temporal Difference (TD) learning algorithm [14]. These extensions can produce second-order conditioning. And using a memory buffer representation (what Sutton and Barto call a complete serial compound), TD can represent the temporal structure of a trial. However, TD cannot account for the empirical data in Experiments 1-3 because it does not make inferences about temporal relationships among stimuli; it focuses solely on predicting the US. 
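The TD failure in Phase 1 of Expt. 2 can be made concrete with a toy simulation. The sketch below is our own illustration, not the authors' code; the representation, learning rate, and trial encoding are all hypothetical choices. It runs linear TD(0) over a complete-serial-compound encoding of the T -> C trial: because the US never occurs, every TD error is zero and no feature acquires any strength.

```python
# Minimal TD(0) sketch with a "complete serial compound" representation:
# each stimulus-timestep pair is its own feature. Illustrates why no
# learning occurs in Phase 1 of Expt. 2 (T -> C, no US): with reward
# fixed at 0 and all predictions starting at 0, every TD error is 0.

def td_trial(w, features, rewards, alpha=0.1, gamma=0.98):
    """One trial of linear TD(0). features[t] is the active feature
    name at time t (or None); w maps feature names to weights."""
    for t in range(len(features) - 1):
        v_now = w.get(features[t], 0.0) if features[t] is not None else 0.0
        v_next = w.get(features[t + 1], 0.0) if features[t + 1] is not None else 0.0
        delta = rewards[t + 1] + gamma * v_next - v_now   # TD error
        if features[t] is not None:
            w[features[t]] = v_now + alpha * delta
    return w

# Phase 1 of Expt. 2: tone (T) then click train (C), never a US.
features = ["T0", "T1", "C0", "C1", None]
rewards = [0.0] * 5
w = {}
for _ in range(12):            # 12 Phase-1 trials, as in Table 1
    w = td_trial(w, features, rewards)

print(w)   # every weight remains 0.0: no prediction error was ever generated
```

Running the same loop with a reward at the end of the trial would propagate credit backward through the compound, which is exactly why TD does learn in paradigms where the CS predicts the US.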
In Experiment 1, some versions of TD can account for the reduced associative strength of a CS whose onset occurs simultaneously with the US, but no version of TD can explain why the second-order stimulus C should acquire greater associative strength than T. In Experiment 2, no learning occurs in Phase 1 with TD because no prediction error is generated by pairing T with C. As a result, no CR is elicited by C after T has been paired with the US in Phase 2. In Experiment 3, TD fails to predict the results because X is not predictive of the US; thus X acquires no associative strength to pass on to B in the second phase. \n\nEven models that predict future stimuli have trouble accounting for Miller et al.'s results. Dayan's \"successor representation\" [4], the world model of Sutton and Pinette [15], and the basal ganglia model of Suri and Schultz [13] all attempt to predict future stimulus vectors. Suri and Schultz's model can even produce one form of sensory preconditioning. However, none of these models can account for the responses in any of the three experiments in Table 1, because they do not make the necessary inferences about relations among stimuli. \n\nTemporal Coding Hypothesis. The temporal coding hypothesis (TCH) [7] posits that temporal contiguity is sufficient to produce an association between stimuli. A CS does not need to predict reward in order to acquire an association with the US. Furthermore, the association is not a simple scalar quantity. Instead, information about the temporal relationships among stimuli is encoded implicitly and automatically in the memory representation of the trial. Most importantly, TCH claims that memory representations of trials with similar stimuli become integrated in such a way as to preserve the relative temporal information [3]. \n\nIf we apply the concept of memory integration to Experiment 1, we get the memory representation C -> T+US. 
If we interpret a CR as a prediction of imminent reinforcement, then we arrive at the correct prediction of a strong response to C and a weak response to T. Integrating the hypothesized memory representations of the two phases of Experiment 2 results in: A) T -> C+US and B) T -> C -> US. The stimulus C is only predictive of the US in variant B, consistent with the experimental findings. For Experiment 3, an integrated memory representation of the two phases produces L+B -> US -> X. Stimulus B is predictive of the US while X is not. Thus, the temporal coding hypothesis is able to account for the results of each of the three experiments by associating stimuli with a timeline. \n\n3 A Computational Model of Temporal Coding \n\nA straightforward formalization of a timeline is a Markov chain of states. For this initial version of our model, state transitions within the chain are fixed and deterministic. Each state represents one instant of time, and at each timestep a transition is made to the next state in the chain. This restricted representation is key to capturing the phenomena underlying the empirical results. Multiple timelines (or Markov chains) emanate from a single holding state. The transitions out of this holding state are the only probabilistic and adaptive transitions in the simplified model. These transition probabilities represent the frequency with which the timelines are experienced. Figure 1 illustrates the model structure used in all simulations. \n\nOur goal is to show that our model successfully integrates the timelines of the two training phases of each experiment. In the context of a collection of Markov chains, integrating timelines amounts to both phases of training becoming associated with a single Markov chain. Figure 1 shows the integration of the two phases of Expt. 2B. \n\nFigure 1: A depiction of the state and observation structure of the model. 
Shown are two timelines, one headed by state j and the other headed by state k. State i, the holding state, transitions to states j and k with probabilities a_ij and a_ik respectively. Below the timeline representations is a sequence of observations, represented here as the symbols T, C and US. The T and C stimuli appear for two time steps each to simulate their presentation for an extended duration in the experiment. \n\nDuring the second phase of the experiments, the second Markov chain (shown in Figure 1 starting with state k) offers an alternative to the chain associated with the first phase of learning. If we successfully integrate the timelines, this second chain is not used. \n\nAs suggested in Figure 1, associated with each state is a stimulus observation. \"Stimulus space\" is an n-dimensional continuous space, where n is the number of distinct stimuli that can be observed (tone, light, shock, etc.). Each state has an expectation concerning the stimuli that should be observed when that state is occupied. This expectation is modeled by a probability density function over this space, defined by a mixture of two multivariate Gaussians. The probability density at stimulus observation x^t in state i at time t is \n\np(x^t | s^t = i) = (1 - w_i) N(x^t; mu_i0, sigma^2_i0) + w_i N(x^t; mu_i1, sigma^2_i1),   (1) \n\nwhere w_i is a mixture coefficient for the two Gaussians associated with state i. The Gaussian means mu_i0 and mu_i1 and variances sigma^2_i0 and sigma^2_i1 are vectors of the same dimension as the stimulus vector x^t. Given knowledge of the state, the stimulus components are assumed to be mutually independent (covariance terms are zero). We chose a continuous model of observations over a discrete observation model to capture stimulus generalization effects; these are not pursued in this paper. \n\nFor each state, the first Gaussian pdf is non-adaptive: mu_i0 is fixed at a point in stimulus space representing the absence of stimuli, and sigma^2_i0 is fixed as well. For the second Gaussian, mu_i1 and sigma^2_i1 are adaptive. 
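The per-state observation density just described, one fixed "no stimulus" Gaussian plus one adaptive Gaussian with diagonal covariance, can be sketched as follows. This is our own minimal sketch, not the authors' implementation; the baseline value 1e-4 and variance 0.2 are borrowed from the initialization reported in Section 4, and the variable names are ours.

```python
# Sketch of the mixture-of-two-Gaussians observation density of Eq. (1):
# p(x | state i) = (1 - w) * N(x; mu0, var0) + w * N(x; mu1, var1),
# with diagonal covariance (stimulus dimensions independent given the state).

import math

def gauss_diag(x, mu, var):
    """Density of a diagonal-covariance multivariate Gaussian at x."""
    p = 1.0
    for xd, md, vd in zip(x, mu, var):
        p *= math.exp(-(xd - md) ** 2 / (2.0 * vd)) / math.sqrt(2.0 * math.pi * vd)
    return p

def obs_density(x, w, mu0, var0, mu1, var1):
    """p(x | state): fixed "absence" component plus adaptive component."""
    return (1.0 - w) * gauss_diag(x, mu0, var0) + w * gauss_diag(x, mu1, var1)

# Two stimulus dimensions (e.g. tone, shock); the fixed component sits at
# the near-zero baseline, the adaptive component has learned "tone present".
mu0, var0 = [1e-4, 1e-4], [0.2, 0.2]
mu1, var1 = [1.0, 1e-4], [0.2, 0.2]
tone = obs_density([1.0, 0.0], 0.5, mu0, var0, mu1, var1)
nothing = obs_density([0.0, 0.0], 0.5, mu0, var0, mu1, var1)
print(tone > 0 and nothing > 0)   # both are valid (positive) densities
```

Because the fixed component always assigns reasonable density to the no-stimulus point, a state's likelihood does not collapse when an expected stimulus is absent, which is the tolerance property discussed in the text.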
This mixture of one fixed and one adaptive Gaussian is an approximation to the animal's belief distribution about stimuli, reflecting the observed tolerance animals have to absent expected stimuli. Put another way, animals seem to be less surprised by the absence of an expected stimulus than by the presence of an unexpected stimulus. \n\nWe assume that knowledge of the current state s^t is inaccessible to the learner. This information must be inferred from the observed stimuli. In the case of a Markov chain, learning with hidden state is exactly the problem of parameter estimation in hidden Markov models. That is, we must update the estimates of w_i, mu_i1 and sigma^2_i1 for each state, and a_ij for each state transition (out of the holding state), in order to maximize the likelihood of the sequence of observations. \n\nThe standard algorithm for hidden Markov model parameter estimation is the Baum-Welch method [10]. Baum-Welch is an off-line learning algorithm that requires all observations used in training to be held in memory. In a model of classical conditioning, this is an unrealistic assumption about animals' memory capabilities. We therefore require an online learning scheme for the hidden Markov model, with only limited memory requirements. \n\nRecursive Bayesian inference is one possible online learning scheme. It offers the appealing property of combining prior beliefs about the world with current observations through the recursive application of Bayes' theorem: p(lambda | X^t) ∝ p(x^t | X^{t-1}, lambda) p(lambda | X^{t-1}). The prior distribution p(lambda | X^{t-1}) reflects the belief over the parameter lambda before the observation at time t, x^t. X^{t-1} is the observation history up to time t-1, i.e. X^{t-1} = {x^{t-1}, x^{t-2}, ...}. The likelihood p(x^t | X^{t-1}, lambda) is the probability density over x^t as a function of the parameter lambda. \n\nUnfortunately, the implementation of exact recursive Bayesian inference for a continuous density hidden Markov model (CDHMM) is computationally intractable. This is a consequence of there being missing data in the form of hidden state. With hidden state, the posterior distribution over the model parameters, after the observation, is given by \n\np(lambda | X^t) ∝ sum_{i=1}^{N} p(x^t | s^t = i, X^{t-1}, lambda) p(s^t = i | X^{t-1}, lambda) p(lambda | X^{t-1}),   (2) \n\nwhere we have summed over the N hidden states. Computing the recursion for multiple time steps results in an exponentially growing number of terms contributing to the exact posterior. \n\nWe instead use a recursive quasi-Bayes approximate inference scheme developed by Huo and Lee [5], who employ a quasi-Bayes approach [12]. The quasi-Bayes approach exploits the existence of a repeating distribution (natural conjugate) over the parameters for the complete-data CDHMM (i.e., where missing data such as the state sequence are taken to be known). Briefly, we estimate the value of the missing data. We then use these estimates, together with the observations, to update the hyperparameters governing the prior distribution over the parameters (using Bayes' theorem). This results in an approximation to the exact posterior distribution over CDHMM parameters within the conjugate family of the complete-data CDHMM. See [5] for a more detailed description of the algorithm. \n\nEstimating the missing data (hidden state) involves estimating transition probabilities between states, xi_ij^tau = Pr(s^tau = i, s^{tau+1} = j | X^t, lambda), and joint state and mixture component label probabilities zeta_ik^tau = Pr(s^tau = i, l^tau = k | X^t, lambda). Here l^tau = k is the mixture component label indicating which Gaussian, k in {0, 1}, is the source of the stimulus observation at time tau, and lambda is the current estimate of all model parameters. \n\nWe use an online version of the forward-backward algorithm [6] to estimate xi_ij^tau and zeta_ik^tau. 
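To make the quasi-Bayes updating concrete: each hyperparameter is an exponentially forgotten event count. The sketch below is our paraphrase, not the authors' code; it applies one recursive step to the transition counts kappa_ij out of the holding state, assuming the count recursion kappa^tau = rho*(kappa^{tau-1} - 1) + 1 + xi^tau and the Dirichlet-mode read-out for the transition probabilities.

```python
# One recursive quasi-Bayes step for the holding state's transition
# hyperparameters, followed by the mode of the Dirichlet posterior.
# The recursion discounts old evidence by rho and adds the current
# expected-transition estimate xi; values are illustrative.

def update_kappa(kappa, xi, rho=0.9975):
    """kappa[j]: exponentially weighted count of i -> j transitions.
    xi[j]: current forward-backward estimate of the i -> j transition."""
    return [rho * (k - 1.0) + 1.0 + x for k, x in zip(kappa, xi)]

def transition_modes(kappa):
    """Dirichlet-posterior mode: a_ij = (kappa_ij - 1) / (sum_j kappa_ij - N)."""
    n = len(kappa)
    denom = sum(kappa) - n
    return [(k - 1.0) / denom for k in kappa]

kappa = [39.0, 2.0, 2.0]                        # stay, enter chain 1, enter chain 2
kappa = update_kappa(kappa, [0.0, 1.0, 0.0])    # evidence for entering chain 1
a = transition_modes(kappa)
print(a)   # probabilities shift slightly toward chain 1
```

The initial counts here were chosen so that the starting modes are (0.95, 0.025, 0.025), matching the initialization reported in Section 4; after one step with evidence for chain 1, its probability rises above 0.025 while the row still sums to one.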
The forward pass computes the joint probability over state occupancy (taken to be both the state value and the mixture component label) at time tau and the sequence of observations up to time tau. The backward pass computes the probability of the observations in a memory buffer from time tau to the present time t, given the state occupancy at time tau. The forward and backward passes over state/observation sequences are combined to give an estimate of the state occupancy at time tau given the observations up to the present time t. In the simulations reported here the memory buffer was 7 time steps long (t - tau = 6). \n\nWe use the estimates from the forward-backward algorithm together with the observations to update the hyperparameters. For the CDHMM, this prior is taken to be a product of Dirichlet probability density functions (pdfs) for the transition probabilities (a_ij), beta pdfs for the observation model mixture coefficients (w_i), and normal-gamma pdfs for the Gaussian parameters (mu_i1 and sigma^2_i1). The basic hyperparameters are exponentially weighted counts of events, with recency weighting determined by a forgetting parameter rho. For example, kappa_ij is the number of expected transitions observed from state i to state j, and is used to update the estimate of parameter a_ij. The hyperparameter nu_ik estimates the number of stimulus observations in state i credited to Gaussian k, and is used to update the mixture parameter w_i. The remaining hyperparameters psi, phi, and theta serve to define the pdfs over mu_i1 and sigma^2_i1. The variable d in the equations below indexes over stimulus dimensions. S^2_i1d is an estimate of the sample variance, and is a constant in the present model. \n\nkappa_ij^tau = rho (kappa_ij^{tau-1} - 1) + 1 + xi_ij^tau \nnu_ik^tau = rho (nu_ik^{tau-1} - 1) + 1 + zeta_ik^tau \npsi_i1d^tau = rho psi_i1d^{tau-1} + zeta_i1^tau \nphi_i1d^tau = rho (phi_i1d^{tau-1} - 1) + 1 + zeta_i1^tau / 2 \ntheta_i1d^tau = rho theta_i1d^{tau-1} + zeta_i1^tau S^2_i1d / 2 + [rho psi_i1d^{tau-1} zeta_i1^tau / (2 (rho psi_i1d^{tau-1} + zeta_i1^tau))] (x_d^tau - mu_i1d^{tau-1})^2 \n\nIn the last step of our inference procedure, we update our estimate of the model parameters as the mode of their approximate posterior distribution. While this is an approximation to proper Bayesian inference on the parameter values, the mode of the approximate posterior is guaranteed to converge to a mode of the exact posterior. In the equations below, N is the number of states in the model. \n\na_ij^tau = (kappa_ij^tau - 1) / (sum_j kappa_ij^tau - N) \nw_i^tau = (nu_i1^tau - 1) / (nu_i0^tau + nu_i1^tau - 2) \n\n4 Results and Discussion \n\nThe model contained two timelines (Markov chains). Let i denote the holding state and j, k the initial states of the two chains. The transition probabilities were initialized as a_ij = a_ik = 0.025 and a_ii = 0.95. Adaptive Gaussian means mu_i1d were initialized to small random values around a baseline of 10^-4 for all states. The exponential forgetting factor was rho = 0.9975, and both the sample variances S^2_i1d and the fixed variances sigma^2_i0d were set to 0.2. \n\nWe trained the model on each of the experimental protocols of Table 1, using the same numbers of trials reported in the original papers. The model was run continuously through both phases of the experiments with a random intertrial interval. \n\n[Figure 2 appears here: three bar plots, one per experiment, comparing the expected reinforcement for each test stimulus (T vs. C; (A)C vs. (B)C; X vs. B), annotated with CR / no CR.] \n\nFigure 2: Results from 20 runs of the model simulation with each experimental paradigm. 
On the ordinate is the total reinforcement (US), on a log scale, above the baseline (an arbitrary perception threshold) expected to occur on the next time step. The error bars represent two standard deviations away from the mean. \n\nFigure 2 shows the simulation results from each of the three experiments. If we assume that the CR varies monotonically with the US prediction, then in each case the model's predicted CR agreed with the observations of Miller et al. \n\nThe CR predictions are the result of the model integrating the two phases of learning into one timeline. At the time of the presentation of the Phase 2 stimuli, the states forming the timeline describing the Phase 1 pattern of stimuli were judged more likely to have produced the Phase 2 stimuli than states in the other timeline, which served as a null hypothesis. In another experiment, not shown here, we trained the model on disjoint stimuli in the two phases. In that situation it correctly chose a separate timeline for each phase, rather than merging the two. \n\nWe have shown that under the assumption that observation probabilities are modeled by a mixture of Gaussians, and a very restrictive state transition structure, a hidden Markov model can integrate the memory representations of similar temporal stimulus patterns. \"Similarity\" is formalized in this framework as likelihood under the timeline model. We propose this model as a mechanism for the integration of memory representations postulated in the Temporal Coding Hypothesis. \n\nThe model can be extended in many ways. The current version assumes that event chains are long enough to represent an entire trial, but short enough that the model will return to the holding state before the start of the next trial. An obvious refinement would be a mechanism to dynamically adjust chain lengths based on experience. 
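The rigid structure that such refinements would relax can be sketched concretely. The code below builds the constrained transition matrix of Figure 1 as we understand it: one holding state with small adaptive exit probabilities (0.025 each, matching the initialization above) and deterministic chains that step forward and return to the holding state. The chain length and state indexing are illustrative choices, not values from the paper.

```python
# Sketch of the constrained HMM transition structure: state 0 is the
# holding state; each timeline is a deterministic chain of states that
# ends by returning to the holding state. Only the exits from state 0
# are probabilistic (and adaptive) in the simplified model.

def build_transitions(n_chains=2, chain_len=7, a_exit=0.025):
    """Return an (1 + n_chains*chain_len)-state row-stochastic matrix."""
    n = 1 + n_chains * chain_len
    A = [[0.0] * n for _ in range(n)]
    A[0][0] = 1.0 - n_chains * a_exit          # remain in the holding state
    for c in range(n_chains):
        head = 1 + c * chain_len
        A[0][head] = a_exit                    # adaptive entry into chain c
        for s in range(head, head + chain_len - 1):
            A[s][s + 1] = 1.0                  # deterministic progression
        A[head + chain_len - 1][0] = 1.0       # chain end returns to holding
    return A

A = build_transitions()
assert all(abs(sum(row) - 1.0) < 1e-9 for row in A)   # valid stochastic matrix
```

Learning in this structure touches only the first row of A (the holding-state exits); everything else stays fixed, which is what forces similar stimulus patterns to compete for the same timeline.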
We are also exploring a generalization of the model to the semi-Markov domain, where state occupancy duration is modeled explicitly as a pdf. State transitions would then be tied to changes in observations, rather than following a rigid progression as is currently the case. Finally, we are experimenting with mechanisms that allow new chains to be split off from old ones when the model determines that current stimuli differ consistently from the closest matching timeline. \n\nFitting stimuli into existing timelines serves to maximize the likelihood of current observations in light of past experience. But why should animals learn the temporal structure of stimuli as timelines? A collection of timelines may be a reasonable model of the natural world. If this is true, then learning with such a strong inductive bias may help the animal to bring experience of related phenomena to bear in novel situations, a desirable characteristic for an adaptive system in a changing world. \n\nAcknowledgments \n\nThanks to Nathaniel Daw and Ralph Miller for helpful discussions. This research was funded by National Science Foundation grants IRI-9720350 and IIS-9978403. Aaron Courville was funded in part by a Canadian NSERC PGS B fellowship. \n\nReferences \n\n[1] R. C. Barnet, H. M. Arnold, and R. R. Miller. Simultaneous conditioning demonstrated in second-order conditioning: Evidence for similar associative structure in forward and simultaneous conditioning. Learning and Motivation, 22:253-268, 1991. \n\n[2] R. P. Cole, R. C. Barnet, and R. R. Miller. Temporal encoding in trace conditioning. Animal Learning and Behavior, 23(2):144-153, 1995. \n\n[3] R. P. Cole and R. R. Miller. Conditioned excitation and conditioned inhibition acquired through backward conditioning. Learning and Motivation, 30:129-156, 1999. \n\n[4] P. Dayan. 
Improving generalization for temporal difference learning: the successor representation. Neural Computation, 5:613-624, 1993. \n\n[5] Q. Huo and C.-H. Lee. On-line adaptive learning of the continuous density hidden Markov model based on approximate recursive Bayes estimate. IEEE Transactions on Speech and Audio Processing, 5(2):161-172, 1997. \n\n[6] V. Krishnamurthy and J. B. Moore. On-line estimation of hidden Markov model parameters based on the Kullback-Leibler information measure. IEEE Transactions on Signal Processing, 41(8):2557-2573, 1993. \n\n[7] L. D. Matzel, F. P. Held, and R. R. Miller. Information and the expression of simultaneous and backward associations: Implications for contiguity theory. Learning and Motivation, 19:317-344, 1988. \n\n[8] R. R. Miller and R. C. Barnet. The role of time in elementary associations. Current Directions in Psychological Science, 2(4):106-111, 1993. \n\n[9] I. P. Pavlov. Conditioned Reflexes. Oxford University Press, 1927. \n\n[10] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-285, 1989. \n\n[11] R. A. Rescorla and A. R. Wagner. A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. In A. H. Black and W. F. Prokasy, editors, Classical Conditioning II. Appleton-Century-Crofts, 1972. \n\n[12] A. F. M. Smith and U. E. Makov. A quasi-Bayes sequential procedure for mixtures. Journal of the Royal Statistical Society, 40(1):106-112, 1978. \n\n[13] R. E. Suri and W. Schultz. Temporal difference model reproduces anticipatory neural activity. Neural Computation, 13(4):841-862, 2001. \n\n[14] R. S. Sutton and A. G. Barto. Time-derivative models of Pavlovian reinforcement. In M. Gabriel and J. Moore, editors, Learning and Computational Neuroscience: Foundations of Adaptive Networks, chapter 12, pages 497-537. 
MIT Press, 1990. \n\n[15] R. S. Sutton and B. Pinette. The learning of world models by connectionist networks. In Proceedings of the Seventh Annual Conference of the Cognitive Science Society, pages 54-64, Irvine, California, August 1985. Lawrence Erlbaum. \n", "award": [], "sourceid": 2026, "authors": [{"given_name": "Aaron", "family_name": "Courville", "institution": null}, {"given_name": "David", "family_name": "Touretzky", "institution": null}]}