{"title": "Timing and Partial Observability in the Dopamine System", "book": "Advances in Neural Information Processing Systems", "page_first": 99, "page_last": 106, "abstract": null, "full_text": "Timing and Partial Observability in the\n\nDopamine System\n\nNathaniel D. Daw1;3, Aaron C. Courville2;3, and David S. Touretzky1;3\n\n1Computer Science Department, 2Robotics Institute, 3Center for the Neural Basis of Cognition\n\nCarnegie Mellon University, Pittsburgh, PA 15213\n\nfdaw,aaronc,dstg@cs.cmu.edu\n\nAbstract\n\nAccording to a series of in\ufb02uential models, dopamine (DA) neurons sig-\nnal reward prediction error using a temporal-difference (TD) algorithm.\nWe address a problem not convincingly solved in these accounts: how to\nmaintain a representation of cues that predict delayed consequences. Our\nnew model uses a TD rule grounded in partially observable semi-Markov\nprocesses, a formalism that captures two largely neglected features of DA\nexperiments: hidden state and temporal variability. Previous models pre-\ndicted rewards using a tapped delay line representation of sensory inputs;\nwe replace this with a more active process of inference about the under-\nlying state of the world. The DA system can then learn to map these\ninferred states to reward predictions using TD. The new model can ex-\nplain previously vexing data on the responses of DA neurons in the face\nof temporal variability. By combining statistical model-based learning\nwith a physiologically grounded TD theory, it also brings into contact\nwith physiology some insights about behavior that had previously been\ncon\ufb01ned to more abstract psychological models.\n\n1 Introduction\n\nA series of models [1, 2, 3, 4, 5] based on temporal-difference (TD) learning [6] has ex-\nplained most responses of primate dopamine (DA) neurons during conditioning [7] as an\nerror signal for predicting reward, and has also identi\ufb01ed the DA system as a substrate for\nconditioning behavior [8]. 
We address a troublesome issue left open by these models: how to maintain a representation of cues that predict delayed consequences. For this, we use a formalism that extends the Markov processes in which previous models were grounded.\n\nEven in the laboratory, the world is often poorly described as Markov in immediate sensory observations. In trace conditioning, for instance, nothing observable spans the delay between a transient stimulus and the reward it predicts. For DA models, this raises problems of coping with hidden state and of tracking temporal intervals. Most previous models address these issues using a tapped delay line representation of the world's state. This augments the representation of current sensory observations with remembered past observations, dividing temporal intervals into a series of states to mark the passage of time. But linear combinations of tapped delay lines do not properly model variability in the intervals between events. Also, the augmented representation may poorly match the contingency structure of the experimental situation: for instance, depending on the amount of history retained, it may be insufficient to span delays, or it may contain old, irrelevant data.\n\nWe propose a model that better reflects experimental situations by using a formalism that explicitly incorporates hidden state and temporal variability: a partially observable semi-Markov process. The proposal envisions the interaction between a cortical perceptual system that infers the world's hidden state using an internal world model, and a dopaminergic TD system that learns reward predictions for these inferred states. 
This model improves on its predecessors' descriptions of neuronal firing in situations involving temporal variability, and suggests additional connections with animal behavior.\n\n2 DA models and temporal variability\n\nFigure 1: S: stimulus; R: reward. (a,b) State spaces for the Markov tapped delay line (a) and our semi-Markov (b) TD models of a trace conditioning experiment. (c,d) Modeled DA activity (TD error) when an expected reward is delivered early (top), on time (middle) or late (bottom). The tapped delay line model (c) produces spurious negative error after an early reward, while, in accord with experiments, our semi-Markov model (d) does not. Shaded stripes under (d) and (f) track the model's belief distribution over the world's hidden state (given a one-timestep backward pass), with the ISI in white, the ITI in black, and gray for uncertainty between the two. (e,f) Modeled DA activity when reward timing varies uniformly over a range. 
The tapped delay line model (e) incorrectly predicts identical excitation to rewards delivered at all times, while, in accord with experiment, our model (f) predicts a response that declines with delay.\n\nSeveral models [1, 2, 3, 4, 5] identify the firing of DA neurons with the reward prediction error signal δ_t of a TD algorithm [6]. In the models, DA neurons are excited by positive error in reward prediction (caused by unexpected rewards or reward-predicting stimuli) and inhibited by negative prediction error (caused by the omission of expected reward). If a reward arrives as expected, the models predict no change in firing rate. These characteristics have been demonstrated in recordings of primate DA neurons [7]. In idealized form (neglecting some instrumental contingencies), these experiments and the others that we consider here are all variations on trace conditioning, in which a phasic stimulus such as a flash of light signals that reward will be delivered after a delay.\n\nTD systems map a representation of the state of the world to a prediction of future reward, but previous DA modeling exploited few experimental constraints on the form of this representation. Houk et al. [1] computed values using only immediately observable stimuli and allowed learning about rewards to accrue to previously observed stimuli using eligibility traces. But in trace conditioning, DA neurons show a timed pause in their background firing when an expected reward fails to arrive [7]. Because the Houk et al. [1] model does not learn temporal relationships, it cannot produce well-timed inhibition. Montague et al. [2] and Schultz et al. [3] addressed these data using a tapped delay line representation of stimulus history [8]: at time t, each stimulus is represented by a vector whose nth element codes whether the stimulus was observed at time t - n. 
This representation allows the models to learn the temporal relationship between stimulus and reward, and to correctly predict phasic inhibition time-locked to omitted rewards.\n\nThese models, however, mispredict the behavior of DA neurons when the interval between stimulus and reward varies. In one experiment [9], animals were trained to expect a constant stimulus-reward interval, which was later varied. When a reward is delivered earlier than expected, the tapped delay line models correctly predict that it should trigger positive error (dopaminergic excitation), but also incorrectly predict a further burst of negative error (inhibition, not seen experimentally) when the reward fails to arrive at the time it was originally expected (Figure 1c, top). In part, this occurs because the models do not represent the reward as an observation, so its arrival can have no effect on later predictions. More fundamentally, this is a problem with how the models partition events into a state space.\n\nFigure 1a illustrates how the tapped delay lines mark time in the interval between stimulus and reward using a series of states, each of which learns its own reward prediction. After the stimulus occurs, the model's representation marches through each state in succession. But this device fails to capture a distribution over the interval between two events. If the second event has occurred, the interval is complete and the system should not expect reward again, but the tapped delay line continues to advance. This may be correctable, though awkwardly, by representing the reward with its own delay line, which can then learn to suppress further reward expectation after a reward occurs [10]. However, to our knowledge it is experimentally unclear whether the suppression of this response requires repeated experience with the situation, as this account predicts. 
Also, whether this works depends on how information from multiple cues is combined into an aggregate reward prediction (i.e. on the function approximator used: it is easy to verify that a standard linear combination of the delay lines does not suffice).\n\nThe models have a similar problem with a related experiment [11] (Figure 1e) where the stimulus-reward interval varied uniformly over a range of delays throughout training. In this case, all substates within the interval see reward with the same (low) probability, so each produces identical positive error when reward occurs there. In animal experiments, however, stronger dopaminergic activity is seen for earlier rewards [11].\n\n3 A new model\n\nBoth of these experiments demonstrate that current TD models of DA do not adequately treat variability in event timing. We address them with a TD model grounded in a formalism that incorporates temporal variability, a partially observable [12] semi-Markov [13] process. Such a process is described by three functions, O, Q, and D, operating over two sets: the hidden states S and observations O. Q associates each state with a probability distribution over possible successors. If the process is in state s ∈ S, then the next state is s' with probability Q_{ss'}. These discrete state transitions can occur irregularly in continuous time (which we approximate to arbitrarily fine discretization). The dwell time τ spent in s before making a transition is distributed with probability D_{sτ}; we define the indicator φ_t as one if the state transitioned between t and t + 1 and zero otherwise. On entering s, the process emits some observation o ∈ O with probability O_{so}. Some observations are distinguished as rewarding; we separately write the reward magnitude of an observation as r. 
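The generative process defined by O, Q, and D can be made concrete with a small simulation. The sketch below is ours, not the authors' implementation, and all names in it are hypothetical; it instantiates the two-state trace-conditioning process of Figure 1b, with observations emitted only on entering a state and null observations during dwells.

```python
import random

class POSemiMarkov:
    """Partially observable semi-Markov process: Q gives successor
    probabilities, D dwell-time probabilities, O entry-observation
    probabilities; some observations carry a reward magnitude r."""

    def __init__(self, Q, D, O, rewards):
        self.Q, self.D, self.O, self.rewards = Q, D, O, rewards

    def sample(self, start, n_transitions, seed=0):
        """Simulate; returns a per-timestep list of (observation, reward).
        Observations occur only when a state is entered; otherwise the
        observation is null (None) and the reward is zero."""
        rng = random.Random(seed)
        draw = lambda dist: rng.choices(list(dist), weights=list(dist.values()))[0]
        trace, s = [], start
        for _ in range(n_transitions):
            o = draw(self.O[s])                         # emitted on entering s
            trace.append((o, self.rewards.get(o, 0.0)))
            trace += [(None, 0.0)] * (draw(self.D[s]) - 1)  # dwell time tau in s
            s = draw(self.Q[s])                         # jump to a successor
        return trace

# Trace conditioning as alternation between ISI and ITI (Figure 1b).
process = POSemiMarkov(
    Q={"ISI": {"ITI": 1.0}, "ITI": {"ISI": 1.0}},
    D={"ISI": {2: 1.0}, "ITI": {3: 1.0}},               # deterministic dwells here
    O={"ISI": {"stimulus": 1.0}, "ITI": {"reward": 1.0}},
    rewards={"reward": 1.0},
)
trace = process.sample("ISI", n_transitions=4)
```

Making the dwell distributions stochastic (e.g. a spread of ISI durations) yields the variable stimulus-reward intervals of the experiments discussed above, with nothing in the observation stream marking the hidden transition times.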
Note that the processes we consider in this paper do not contain decisions.\n\nIn this formalism, a trace conditioning experiment can be treated as alternation between two states (Figure 1b). The states correspond to the intervals between stimulus and reward (interstimulus interval: ISI) and between reward and stimulus (intertrial interval: ITI). A stimulus is the likely observation when entering the ISI and a reward when entering the ITI.\n\nWe will index variables both by the time t and by a discrete index n which counts state transitions; e.g. the nth state, s_n, is entered at time t = Σ_{k=1}^{n-1} τ_k and can thus also be written as s_t. If φ_t = 0 (if the state did not transition between t and t + 1) then s_{t+1} = s_t, o_{t+1} is null and r_{t+1} = 0 (i.e., nonempty observations and rewards occur only on transitions). State transitions may be unsignaled: o_{t+1} may be null even if φ_t = 1. An unsignaled transition into the ITI state occurs in our model when reward is omitted, a common experimental manipulation [7]. This example demonstrates the relationship between temporal variability and partial observability: if reward timing can vary, nothing in the observable state reveals whether a late reward is still coming or has been omitted completely.\n\nTD algorithms [6] approximate a function mapping each state to its value, defined as the expectation (with respect to variability in reward magnitude, state succession, and dwell times) of summed, discounted future reward, starting from that state. In the semi-Markov case [13], a state's value is defined as the reward expectation at the moment it is entered; we do not count rewards received on the transition in. 
The value of the nth state entered is:\n\nV_{s_n} = E[γ^{τ_n} r_{n+1} + γ^{τ_n + τ_{n+1}} r_{n+2} + ...] = E[γ^{τ_n} (r_{n+1} + V_{s_{n+1}})]\n\nwhere γ < 1 is a discounting parameter.\n\nWe address partial observability by using model-based inference to determine a distribution over the hidden states, which then serves as a basis over which a modified TD algorithm can learn values. The approach is similar to the Q-learning algorithm of Chrisman [14]. In our setting, however, values can in principle be learned exactly, since without decisions, they are linear in the space of hidden states.\n\nFor state inference, we assume that the brain's sensory processing systems use an internal model of the semi-Markov process — that is, the functions O, Q, and D. Here we take the model as given, though we have treated parts of the problem of learning such models elsewhere [15]. A key assumption about this internal model is that its distributions over intervals, rewards and observations contain asymptotic uncertainty, that is, they are not arbitrarily sharp. When learning internal models, such uncertainty can result from an assumption that parameters of the world are constantly changing [16]. Thus, in the inference model for the trace conditioning experiment, the ISI duration is modeled with a probability distribution with some nonzero variance rather than an impulse function. The model likewise assigns a small probability to anomalous transitions and observations (e.g. unrewarded transitions into the ITI state). This uncertainty is present only in the internal model: most anomalous events never occur in our simulations.\n\nGiven the model and a series of observations o_1 ... o_t, we can determine the likelihood that each hidden state is active using a standard forward-backward algorithm for hidden semi-Markov models [17]. 
The important quantity is the probability, for each state, that the system left that state at time t. With a one-timestep backward pass (to match the one-timestep value backups in the TD rule), this is:\n\nβ_{s,t} = P(s_t = s, φ_t = 1 | o_1 ... o_{t+1})\n\nBy Bayes' theorem, β_{s,t} ∝ P(o_{t+1} | s_t = s, φ_t = 1) · P(s_t = s, φ_t = 1 | o_1 ... o_t). The first term can be computed by integrating over s_{t+1} in the model: P(o_{t+1} | s_t = s, φ_t = 1) = Σ_{s'∈S} Q_{ss'} · O_{s',o_{t+1}}; the second requires integrating over possible state sequences and dwell times:\n\nP(s_t = s, φ_t = 1 | o_1 ... o_t) = Σ_{τ=1}^{dlastO} D_{sτ} · O_{s,o_{t-τ+1}} · P(s_{t-τ+1} = s, φ_{t-τ} = 1 | o_1 ... o_{t-τ})\n\nwhere dlastO is the number of timesteps since the last non-null observation and P(s_{t-τ+1} = s, φ_{t-τ} = 1 | o_1 ... o_{t-τ}), the chance that the process entered s at t - τ + 1, equals Σ_{s'∈S} Q_{s's} · P(s_{t-τ} = s', φ_{t-τ} = 1 | o_1 ... o_{t-τ}), allowing recursive computation.\n\nβ is used for TD learning because it represents the probability of a transition, which is the event that triggers a value update in fully observable semi-Markov TD. Due to partial observability, we may not be certain when transitions have occurred or from which states, so we perform TD updates to every state at every timestep, weighted by β. We denote our estimate of the value of state s as V̂_s, to distinguish it from the true value V_s. 
The update to V̂_s at time t is proportional to the TD error:\n\nδ_{s,t} = β_{s,t} (E[γ^τ] · (r_{t+1} + E[V̂_{s'}]) - V̂_s)\n\nwhere E[γ^τ] = Σ_k γ^k P(τ_t = k | s_t = s, φ_t = 1, o_1 ... o_{t+1}) is the expected discounting (since dwell time may be uncertain) and E[V̂_{s'}] = Σ_{s'∈S} V̂_{s'} P(s_{t+1} = s' | s_t = s, φ_t = 1, o_{t+1}) is the expected subsequent value. Both expectations are conditioned on the process having left state s at time t, and computed using the internal world model.\n\nAs in previous models, we associate the error signal δ with DA activity. However, because of uncertainty as to the state of the world, the TD error signal is vector-valued rather than scalar. DA neurons could code this vector in a distributed manner, which might explain experimentally observed response variability between neurons [7]. Alternatively, δ_{s,t} can be approximated with a scalar, which performs well if the inferred state occupancy is sharply peaked. In our figures, we use such an approximation, plotting DA activity as the cumulative TD error over states (implicitly weighted by β): δ_t = Σ_{s∈S} δ_{s,t}. An approximate version of the vector signal could be reconstructed at target areas by multiplying by β_{s,t} / Σ_{s'∈S} β_{s',t}.\n\nNote that with full observability, the (vector) learning rule reduces to standard semi-Markov TD, and conversely with full unobservability, it nudges states in the direction of a value iteration backup. In fact, the algorithm is exact in that it has the same fixed point as value iteration, assuming the inference model matches the contingencies of the world. (Due to uncertainty it does so only approximately in our simulations.) 
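Before sketching that proof, note that the update rule itself is simple to state in code. The following is our own minimal sketch, not the authors' implementation (all names are ours); it assumes the belief weights β and the two model-based expectations have already been supplied by the inference step.

```python
def semi_markov_td_update(V, beta, r_next, exp_discount, exp_next_value, lr=0.1):
    """One timestep of the belief-weighted semi-Markov TD rule:
    delta[s] = beta[s] * (E[gamma^tau] * (r_{t+1} + E[V_s']) - V[s]),
    applied to every hidden state s and used to nudge V in place.
    All arguments are dicts keyed by hidden state; beta[s] is the
    inferred probability that the process left state s at time t."""
    delta = {}
    for s in V:
        target = exp_discount[s] * (r_next + exp_next_value[s])
        delta[s] = beta[s] * (target - V[s])
        V[s] += lr * delta[s]
    return delta

# Example: reward arrives while inference is nearly sure the ISI just ended.
V = {"ISI": 0.0, "ITI": 0.0}
delta = semi_markov_td_update(
    V, beta={"ISI": 0.9, "ITI": 0.0}, r_next=1.0,
    exp_discount={"ISI": 0.8, "ITI": 0.8},      # E[gamma^tau] per state
    exp_next_value={"ISI": 0.0, "ITI": 0.0},    # E[V_hat_s'] per state
)
scalar_da = sum(delta.values())                  # aggregate signal, as plotted
```

The last line is the scalar approximation used in the figures: summing δ_{s,t} over states, with the weighting by β already folded in.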
We sketch the proof. With each TD update, V̂_s is nudged toward some target value with some step size β_{s,t}; the fixed point is the average of the targets, weighted by their probabilities and their step sizes. Fixing some arbitrary t, the update targets and β are functions of the observations o_1 ... o_{t+1}, which are generated according to P(o_1 ... o_{t+1}). The fixed point is:\n\nV̂_s = [Σ_{o_1...o_{t+1}} P(o_1 ... o_{t+1}) · β_{s,t} · E[γ^τ] · (r_{t+1} + E[V̂_{s'}])] / [Σ_{o_1...o_{t+1}} P(o_1 ... o_{t+1}) · β_{s,t}]\n\nMarginalizing out the observations reduces this to Bellman's equation for V̂_s, which is also, of course, the fixed-point equation for value iteration.\n\n4 Results\n\nWhen expected reward is delivered early, the semi-Markov model assumes that this signals an early transition into the ITI state, and it thus does not expect further reward or produce spurious negative error (Figure 1d, top). Because of variability in the model's ISI estimate, an early transition, while improbable, better explains the data than some other path through the state space. The early reward is worth more than expected, due to reduced discounting, and is thus accompanied by positive error.\n\nThe model can also infer a state transition from the passage of time, absent any observations. In Figure 1d (bottom), when the reward is delivered late, the system infers that the world has entered the ITI state without reward, producing negative error.\n\nFigure 1f shows our model's behavior when the ISI is uniformly distributed [11]. (The dwell time distribution D in the inference model was changed to reflect this distribution, as an animal should learn a different model here.) Earlier-than-average rewards are worth more than expected (due to discounting) and cause positive prediction error, while later-than-average rewards cause negative error because they are more heavily discounted. 
This is broadly consistent with the experimental finding of decreasing response with increasing delay [11]. Inhibition at longer delays has not so far been observed in this experiment, though inhibition is in general difficult to detect. If discovered, such inhibition would support the semi-Markov model.\n\nBecause it combines a conditional probability model with TD learning, our approach can incorporate insights from previous behavioral theories into a physiological model. Our state inference approach is based on a hidden Markov model (HMM) account we previously advanced to explain animal learning about the temporal relationships of events [15]. The present theory (with the model learning scheme from that paper) would account for the same data. Our model also accommodates two important theoretical ideas from more abstract models of animal learning that previous TD models cannot. One is the notion of uncertainty in some of its internal parameters, which Kakade and Dayan [16] use to explain interval timing and attentional effects in learning. The second is Gallistel's suggestion that animal learning processes are timescale invariant. For example, altering the speed of events has no effect on the number of trials it takes animals to learn a stimulus-reward association [18]. This is not true of Markov TD models because their transitions are clocked to a fixed timescale. With tapped delay lines, timescale dilation increases the number of marker states in Figure 1a and slows learning. But our semi-Markov model is timescale invariant: learning is induced by state transitions which in turn are triggered by events or by the passage of time on a scale controlled by the internal model. 
(The form of temporal discounting we use is not timescale invariant, but this can be corrected as in [5].)\n\n5 Discussion\n\nWe have presented a model of the DA system that improves on previous models' accounts of data involving temporal variability and partial observability, because, unlike prior models, it is grounded in a formalism that explicitly incorporates these considerations. Like previous models, ours identifies the DA response with reward prediction error, but it differs in the representational systems driving the predictions. Previous models assumed that tapped delay lines transcribed raw sensory events; ours envisions that these events inform a more active process of inference about the underlying state of the world. This is a principled approach to the problem of representing state when events can be separated by delays.\n\nSimpler schemes may capture the neuronal data, which are sparse, but without addressing the underlying computational issues we identify, they are unlikely to generalize. For instance, Suri and Schultz [4] propose that reward delivery overrides stimulus representations, canceling pending predictions and eliminating the spurious negative error in Figure 1c (top). But this would disrupt the behaviorally demonstrated ability of animals to learn that a stimulus predicts a series of rewards. Such static representational rules are insufficient since different tasks have different mnemonic requirements. In our account, unlike more ad hoc theories, the problem of learning an appropriate representation for a task is well specified: it is the problem of modeling the task. Though we have not simulated model learning here (this is an important area for future work), it is possible using online HMM learning, and we have used this technique in a model of conditioning [15]. Another issue for the future is extending our theory to encompass action selection. 
DA models often assume an actor-critic framework [1] in which reward predictions are used to evaluate action selection policies. Partial observability complicates such an extension here, since policies must be defined over belief states (distributions over the hidden states S) to accommodate uncertainty; our use of S as a linear basis for value predictions is thus an oversimplification.\n\nPuzzlingly, the data we consider suggest that animals build internal models but also use sample-based TD methods to predict values. Given a full world model (which could in principle be solved directly for V), it seems unclear why TD learning should be necessary. But since the world model must be learned incrementally online, it may be infeasible to continually re-solve it, and parts of the model may be poorly specified. In this case, TD learning in the inferred state space could maintain a reasonably current and observationally grounded value function. (Our particular formulation, which relies extensively on the model in the TD rule, may not be ideal from this perspective.)\n\nSuri [19] and Dayan [20] have also proposed TD theories of DA that incorporate world models to explain behavioral effects, though they do not address the theoretical issues or dopaminergic data considered here. While those accounts use the world model for directly anticipating future events, we have proposed another role for it in state inference. Also unlike our theory, the others cannot explain the experiments discussed in [15] because their internal models cannot represent simultaneous or backward contingencies. However, they treat the two major issues we have neglected: world model learning and action planning.\n\nThe formal models in question have roughly equivalent explanatory power: a semi-Markov model can be simulated (to arbitrarily fine temporal discretization) by a Markov model that subdivides its states by dwell time. 
There is also an isomorphism between higher-order and partially observable Markov models. Thus it would be possible to devise a state representation for a Markov model that copes properly with temporal variability. But doing so by elaborating the tapped delay line architecture would amount to building a clockwork engine for the inference process we describe, without the benefit of useful abstractions such as distributions over intervals; a clearer approach would subdivide the states in our model.\n\nThough there exist isomorphisms between the formal models, there are algorithmic differences that may make our proposal experimentally distinguishable from others. The inhibitory responses in Figure 1f reflect the way semi-Markov models account for the costs of delays; they would not be seen in a Markov model with subdivided states. Such inhibition is somewhat parameter-dependent, since if inference parameters assign high probability to unsignaled transitions the decrease in reward value with delay can be mitigated by increasing uncertainty about the hidden state. Nonetheless, should data not uphold our prediction of inhibitory responses to late rewards, they would suggest a different definition of a state's value. One choice would be the subdivision of our semi-Markov states by dwell time discussed above, which in the experiment of Figure 1f would decrease TD error toward but not past zero for longer delays. In this case, later rewards are less surprising because the conditional probability of reward increases as time passes without reward.\n\nA related prediction suggested by our model is that DA responses not just to rewards but also to stimuli that signal reward might be modulated by their timing relative to expectation. Responses to reward-predicting stimuli disappear in overtrained animals, presumably because the stimuli come to be predicted by events in the previous trial [7]. 
In tapped delay line models, this is possible only for a constant ITI (since if expectancy is divided between a number of states, stimulus delivery in any one of them cannot be completely predicted away). But the response to a stimulus in the semi-Markov model can show behavior exactly analogous to the reward response in Figure 1f — positive or negative error depending on the time of delivery relative to expectation. So, even in an experiment involving a randomized ITI, the net stimulus response (averaged over the range of ITIs) could be attenuated. Such behavior occurred in our simulations; the modeled DA responses to the stimuli in Figures 1d and 1f are positive because they were taken after shorter-than-average ITIs. It is difficult to evaluate this observation against available data, since the experiment involving overtrained monkeys [7] contained minimal ITI variability.\n\nWe have suggested that the TD error may be a vector signal, with different neurons signaling errors for different elements of a state distribution. This could be investigated experimentally by recording DA neurons as a situation of ambiguous reward expectancy (e.g. one reward or three) is resolved into a situation of intermediate, determinate reward expectancy (e.g. two rewards). Neurons carrying an aggregate error should uniformly report no error, but with a vector signal, different neurons might report both positive and negative error.\n\nAcknowledgments\n\nThis work was supported by National Science Foundation grants IIS-9978403 and DGE-9987588. Aaron Courville was funded in part by a Canadian NSERC PGS B fellowship. We thank Sham Kakade and Peter Dayan for helpful discussions.\n\nReferences\n\n[1] JC Houk, JL Adams, and AG Barto. A model of how the basal ganglia generate and use neural signals that predict reinforcement. In JC Houk, JL Davis, and DG Beiser, editors, Models of Information Processing in the Basal Ganglia, pages 249–270. 
MIT Press, 1995.\n\n[2] PR Montague, P Dayan, and TJ Sejnowski. A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J Neurosci, 16:1936–1947, 1996.\n\n[3] W Schultz, P Dayan, and PR Montague. A neural substrate of prediction and reward. Science, 275:1593–1599, 1997.\n\n[4] RE Suri and W Schultz. A neural network with dopamine-like reinforcement signal that learns a spatial delayed response task. Neurosci, 91:871–890, 1999.\n\n[5] ND Daw and DS Touretzky. Long-term reward prediction in TD models of the dopamine system. Neural Comp, 14:2567–2583, 2002.\n\n[6] RS Sutton. Learning to predict by the method of temporal differences. Machine Learning, 3:9–44, 1988.\n\n[7] W Schultz. Predictive reward signal of dopamine neurons. J Neurophys, 80:1–27, 1998.\n\n[8] RS Sutton and AG Barto. Time-derivative models of Pavlovian reinforcement. In M Gabriel and J Moore, editors, Learning and Computational Neuroscience: Foundations of Adaptive Networks, pages 497–537. MIT Press, 1990.\n\n[9] JR Hollerman and W Schultz. Dopamine neurons report an error in the temporal prediction of reward during learning. Nature Neurosci, 1:304–309, 1998.\n\n[10] DS Touretzky, ND Daw, and EJ Tira-Thompson. Combining configural and TD learning on a robot. In ICDL 2, pages 47–52. IEEE Computer Society, 2002.\n\n[11] CD Fiorillo and W Schultz. The reward responses of dopamine neurons persist when prediction of reward is probabilistic with respect to time or occurrence. In Soc. Neurosci. Abstracts, volume 27: 827.5, 2001.\n\n[12] LP Kaelbling, ML Littman, and AR Cassandra. Planning and acting in partially observable stochastic domains. Artif Intell, 101:99–134, 1998.\n\n[13] SJ Bradtke and MO Duff. Reinforcement learning methods for continuous-time Markov Decision Problems. In NIPS 7, pages 393–400. MIT Press, 1995.\n\n[14] L Chrisman. 
Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In AAAI 10, pages 183–188, 1992.\n\n[15] AC Courville and DS Touretzky. Modeling temporal structure in classical conditioning. In NIPS 14, pages 3–10. MIT Press, 2001.\n\n[16] S Kakade and P Dayan. Acquisition in autoshaping. In NIPS 12, pages 24–30. MIT Press, 2000.\n\n[17] Y Guedon and C Cocozza-Thivent. Explicit state occupancy modeling by hidden semi-Markov models: Application of Derin's scheme. Comp Speech and Lang, 4:167–192, 1990.\n\n[18] CR Gallistel and J Gibbon. Time, rate and conditioning. Psych Rev, 107(2):289–344, 2000.\n\n[19] RE Suri. Anticipatory responses of dopamine neurons and cortical neurons reproduced by internal model. Exp Brain Research, 140:234–240, 2001.\n\n[20] P Dayan. Motivated reinforcement learning. In NIPS 14, pages 11–18. MIT Press, 2001.\n", "award": [], "sourceid": 2219, "authors": [{"given_name": "Nathaniel", "family_name": "Daw", "institution": null}, {"given_name": "Aaron", "family_name": "Courville", "institution": null}, {"given_name": "David", "family_name": "Touretzky", "institution": null}]}