{"title": "Hippocampal Model of Rat Spatial Abilities Using Temporal Difference Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 145, "page_last": 151, "abstract": "", "full_text": "Hippocampal Model of Rat Spatial Abilities \n\nUsing Temporal Difference Learning \n\nDavid J Foster* \n\nCentre for Neuroscience \nEdinburgh University \n\nRichard GM Morris \n\nCentre for Neuroscience \nEdinburgh University \n\nPeter Dayan \nE25-210, MIT \n\nCambridge, MA 02139 \n\nAbstract \n\nWe provide a model of the standard watermaze task, and of a more \nchallenging task involving novel platform locations, in which rats \nexhibit one-trial learning after a few days of training. The model \nuses hippocampal place cells to support reinforcement learning, \nand also, in an integrated manner, to build and use allocentric \ncoordinates. \n\n1 \n\nINTRODUCTION \n\nWhilst it has long been known both that the hippocampus of the rat is needed for \nnormal performance on spatial tasks l3 , 11 and that certain cells in the hippocampus \nexhibit place-related firing,12 it has not been clear how place cells are actually used \nfor navigation. One of the principal conceptual problems has been understanding \nhow the hippocampus could specify or learn paths to goals when spatially tuned \ncells in the hippocampus respond only on the basis of the rat's current location. \nThis work uses recent ideas from reinforcement learning to solve this problem in \nthe context of two rodent spatial learning results. \nReference memory in the watermazell (RMW) has been a key task demonstrating \nthe importance of the hippocampus for spatial learning. On each trial, the rat is \nplaced in a circular pool of cloudy water, the only escape from which is a platform \nwhich is hidden (below the water surface) but which remains in a constant position. \nA random choice of starting pOSition is used for each trial. 
Rats take asymptotically short paths after approximately 10 trials (see figure 1a). Delayed match-to-place (DMP) learning is a refined version in which the platform's location is changed on each day. Figure 1b shows escape latencies for rats given four trials per day for nine days, with the platform in a novel position on each day. On early days, acquisition is gradual but on later days, rats show one-trial learning, that is, near-asymptotic performance on the second trial to a novel platform position.\n\n*Crichton Street, Edinburgh EH8 9LE, United Kingdom. Funded by Edin. Univ. Holdsworth Scholarship, the McDonnell-Pew foundation and NSF grant IBN-9634339. Email: djf@cfn.ed.ac.uk\n\nFigure 1: a) Latencies for rats on the reference memory in the watermaze (RMW) task (N=8). b) Latencies for rats on the Delayed Match-to-Place (DMP) task (N=62).\n\nThe RMW task has been extensively modelled.6,4,5,20 By contrast, the DMP task is new and computationally more challenging. It is solved here by integrating a standard actor-critic reinforcement learning system,2,7 which guarantees that the rat will be competent to perform well in arbitrary mazes, with a system that learns spatial coordinates in the maze. Temporal difference learning17 (TD) is used for actor, critic and coordinate learning. TD learning is attractive because of its generality for arbitrary Markov decision problems and the fact that reward systems in vertebrates appear to instantiate it.14\n\n2 THE MODEL\n\nThe model comprises two distinct networks (figure 2): the actor-critic network and a coordinate learning network. The contribution of the hippocampus, for both networks, is to provide a state-space representation in the form of place cell basis functions.
Note that only the activities of place cells are required, by contrast with decoding schemes which require detailed information about each place cell.4\n\n[Figure 2 diagram: an ACTOR-CRITIC SYSTEM and a COORDINATE SYSTEM, the latter containing a Coordinate Representation, Remembered Goal coordinates and a VECTOR COMPUTATION stage.]\n\nFigure 2: Model diagram showing the interaction between actor-critic and coordinate system components.\n\n2.1 Actor-Critic Learning\n\nPlace cells are modelled as being tuned to location. At position p, place cell i has an output given by f_i(p) = exp{-||p - s_i||^2 / 2σ^2}, where s_i is the place field centre, and σ = 0.1 for all place fields. The critic learns a value function V(p) = Σ_i w_i f_i(p) which comes to represent the distance of p from the goal, using the TD rule Δw_i ∝ δ_t f_i(p_t), where\n\nδ_t = r(p_t, p_{t+1}) + γV(p_{t+1}) - V(p_t)    (1)\n\nis the TD error, p_t is position at time t, and the reward r(p_t, p_{t+1}) is 1 for any move onto the platform, and 0 otherwise. In a slight alteration of the original rule, the value V(p) is set to zero when p is at the goal, thus ensuring that the total future reward for moving onto the goal will be exactly 1. Such a modification improves stability in the case of TD learning with overlapping basis functions. The discount factor γ was set to 0.99. Simultaneously the rat refines a policy, which is represented by eight action cells. Each action cell (a_j in figure 2) receives a parameterised input at any position p: a_j(p) = Σ_i q_{ji} f_i(p). An action is chosen stochastically with probabilities given by P(a_j) = exp{2a_j} / Σ_k exp{2a_k}. Action weights are reinforced according to:2\n\nΔq_{ji} ∝ δ_t g_j(θ_t) f_i(p_t)    (2)\n\nwhere g_j(θ_t) is a gaussian function of the difference between the head direction θ_t at time t and the preferred direction of the jth action cell. Figure 3 shows the development of a policy over a few trials.
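The actor-critic machinery of section 2.1 can be sketched in a few lines of Python. This is an illustrative reconstruction rather than the authors' code: the 7x7 grid of place-field centres, the learning rate, the arena geometry, the platform location and the width of the head-direction tuning are our assumptions, while the place-cell basis, the softmax action choice with gain 2, and the TD error of equation (1) with V clamped to zero at the goal follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Place-cell basis (section 2.1). A 7x7 grid of centres over a 1 m arena is
# our assumption; sigma = 0.1 is from the text.
CENTRES = np.array([(x, y) for x in np.linspace(-0.5, 0.5, 7)
                           for y in np.linspace(-0.5, 0.5, 7)])
SIGMA = 0.1

def place_cells(p):
    # f_i(p) = exp(-|p - s_i|^2 / (2 sigma^2))
    return np.exp(-np.sum((CENTRES - p) ** 2, axis=1) / (2 * SIGMA ** 2))

N = len(CENTRES)
w = np.zeros(N)                 # critic weights: V(p) = sum_i w_i f_i(p)
q = np.zeros((8, N))            # actor weights:  a_j(p) = sum_i q_ji f_i(p)
PREF = np.linspace(0.0, 2 * np.pi, 8, endpoint=False)  # preferred directions
GAMMA, ALPHA, STEP = 0.99, 0.1, 0.05    # gamma from the text; rest assumed
GOAL, GOAL_RADIUS = np.array([0.25, 0.25]), 0.08       # platform (assumed)

def V(p):
    return place_cells(p) @ w

def trial(max_steps=400):
    # One swim from a fixed start; returns True if the platform is reached.
    global w, q
    p = np.array([-0.4, -0.4])
    for _ in range(max_steps):
        f = place_cells(p)
        a = q @ f
        probs = np.exp(2 * a) / np.exp(2 * a).sum()    # softmax P(a_j)
        j = rng.choice(8, p=probs)
        theta = PREF[j]                                # head direction
        p2 = np.clip(p + STEP * np.array([np.cos(theta), np.sin(theta)]),
                     -0.5, 0.5)
        at_goal = np.linalg.norm(p2 - GOAL) < GOAL_RADIUS
        r = 1.0 if at_goal else 0.0
        # Equation (1); V is clamped to zero at the goal, as in the text.
        delta = r + (0.0 if at_goal else GAMMA * V(p2)) - V(p)
        w = w + ALPHA * delta * f                      # critic update
        # g_j: gaussian head-direction tuning (angle wrap-around ignored)
        g = np.exp(-(PREF - theta) ** 2 / 0.5)
        q = q + ALPHA * delta * np.outer(g, f)         # actor update, eq. (2)
        if at_goal:
            return True
        p = p2
    return False
```

Run over repeated trials, this sketch should qualitatively reproduce figure 3: value spreads back from the platform and escape latencies fall.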
[Figure 3 panels: the value function V(p) after trials 1, 5 and 13.]\n\nFigure 3: The RMW task: the value function gradually disseminates information about reward proximity to all regions of the environment. Policies and paths are also shown.\n\nThere is no analytical guarantee for the convergence of TD learning with policy adaptation. However, our simulations show that the algorithm always converges for the RMW task. In a simulated arena of diameter 1m and with swimming speeds of 20cm/s, the simulation matched the performance of the real rats very closely (see figure 5). This demonstrates that TD-based reinforcement learning is adequately fast to account for the learning performance of real animals.\n\n2.2 Coordinate Learning\n\nAlthough the learning of a value function and policy is appropriate for finding a fixed platform, the actor-critic model does not allow the transfer of knowledge from the task defined by one goal position to that defined by any other; thus it could not generate the sort of one-trial learning that is shown by rats on the DMP task (see figure 1b). This requires acquisition of some goal-independent knowledge about space. A natural mechanism for this is the path integration or self-motion system.20,10 However, path integration presents two problems. First, since the rat is put into the maze in a different position for each trial, how can it learn consistent coordinates across the whole maze? Second, how can a general, powerful, but slow, behavioural learning mechanism such as TD be integrated with a specific, limited, but fast learning mechanism involving spatial coordinates?
Since TD critic learning is based on enforcing consistency in estimates of future reward, we can also use it to learn spatially consistent coordinates on the basis of samples of self-motion. It is assumed that the rat has an allocentric frame of reference.18 The model learns parameterised estimates of the x and y coordinates of all positions p: x(p) = Σ_i w_i^x f_i(p) and y(p) = Σ_i w_i^y f_i(p). Importantly, while place cells were again critical in supporting spatial representation, they do not embody a map of space. The coordinate functions, like the value function previously, have to be learned.\n\nAs the simulated rat moves around, the coordinate weights {w_i^x} are adjusted according to:\n\nΔw_i^x ∝ (Δx_t + x(p_{t+1}) - x(p_t)) Σ_{k=1}^{t} λ^{t-k} f_i(p_k)    (3)\n\nwhere Δx_t is the self-motion estimate in the x direction. A similar update is applied to {w_i^y}. In this case, the full TD(λ) algorithm was used (with λ = 0.9); however, TD(0) could also have been used, taking slightly longer. Figure 4a shows the x and y coordinates at early and late phases of learning. It is apparent that they rapidly become quite accurate; this is an extremely easy task in an open field maze.\n\nAn important issue in the learning of coordinates is drift, since the coordinate system receives no direct information about the location of the origin. It turns out that the three controlling factors over the implicit origin are: the boundary of the arena, the prior setting of the coordinate weights (in this case all were zero), and the position and prior value of any absorbing area (in this case the platform). If the coordinate system as a whole were to drift once coordinates have been established, this would invalidate coordinates that have been remembered by the rat over long periods. However, since the expected value of the prediction error at each time step should be zero for any self-consistent coordinate mapping, such a mapping should remain stable.
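Equation (3) amounts to TD(λ) with an eligibility trace over recently active place cells, driven by the mismatch between the change in the coordinate estimate and the self-motion signal. The following minimal sketch is again illustrative: the grid of centres, the step size, the learning rate and the sign convention for the self-motion term are our assumptions; λ = 0.9 is from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Same form of place-cell basis as in section 2.1 (grid layout assumed).
CENTRES = np.array([(x, y) for x in np.linspace(-0.5, 0.5, 7)
                           for y in np.linspace(-0.5, 0.5, 7)])
SIGMA = 0.1

def place_cells(p):
    return np.exp(-np.sum((CENTRES - p) ** 2, axis=1) / (2 * SIGMA ** 2))

wx = np.zeros(len(CENTRES))     # x(p) = sum_i w_i^x f_i(p)
wy = np.zeros(len(CENTRES))     # y(p) = sum_i w_i^y f_i(p)
ALPHA, LAM = 0.02, 0.9          # learning rate assumed; lambda from the text

def x_hat(p):
    return place_cells(p) @ wx

def y_hat(p):
    return place_cells(p) @ wy

def learn_coordinates(steps=30000):
    # Random walk in the arena with a noise-free self-motion signal.
    global wx, wy
    p = np.zeros(2)
    ex = np.zeros(len(CENTRES))  # eligibility trace sum_k lambda^(t-k) f(p_k)
    ey = np.zeros(len(CENTRES))
    for _ in range(steps):
        f = place_cells(p)
        ex = LAM * ex + f
        ey = LAM * ey + f
        theta = rng.uniform(0.0, 2 * np.pi)
        p2 = np.clip(p + 0.03 * np.array([np.cos(theta), np.sin(theta)]),
                     -0.5, 0.5)
        dx, dy = p2 - p          # idiothetic displacement estimate
        # Consistency error (cf. equation 3): past estimates are pushed
        # towards the new estimate minus the intervening self-motion.
        wx = wx + ALPHA * (x_hat(p2) - dx - x_hat(p)) * ex
        wy = wy + ALPHA * (y_hat(p2) - dy - y_hat(p)) * ey
        p = p2
```

After such a run, differences between coordinate estimates at two positions should approximate the true displacement between them, up to the arbitrary offset discussed above.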
This stability is demonstrated for a single run: figure 4b shows the mean value of the x coordinate evolving over trials, with little drift after the first few trials.\n\nWe modelled the coordinate system as influencing the choice of swimming direction in the manner of an abstract action.15 The (internally specified) coordinates of the most recent goal position are stored in short-term memory and used, along with the current coordinates, to calculate a vector heading. This vector heading is thrown into the stochastic competition with the other possible actions, governed by a single weight which changes in a similar manner to the other action weights (as in equation 2, see also fig 4d), depending on the TD error, and on the angular proximity of the current head direction to the coordinate direction. Thus, whether the coordinate-based direction is likely to be used depends upon its past performance.\n\nFigure 4: The evolution of the coordinate system for a typical simulation run: a.) coordinate outputs at early and late phases of learning, b.) the extent of drift in the coordinates, as shown by the mean coordinate value for a single run, c.) a measure of coordinate error for the same run, σ_E = [Σ_r Σ_k {x_r(p_k) - x̄_r - x(p_k)}^2 / ((N_p - 1)N_r)]^{1/2}, where k indexes measurement points (max N_p) and r indexes runs (max N_r), x_r(p_k) is the model estimate of x at position p_k, x(p_k) is the ideal estimate for a coordinate system centred on zero, and x̄_r is the mean value over all the model coordinates, d.) the increase during training of the probability of choosing the abstract action. This demonstrates the integration of the coordinates into the control system.\n\nOne simplification in the model is the treatment of extinction. In the DMP task, real rats extinguish fairly quickly to a platform that has moved, whereas the actor-critic model extinguishes far more slowly. To get around this, when a simulated rat reaches a goal that has just been moved, the value and action weights are reinitialised, but the coordinate weights w_i^x and w_i^y, and the weights for the abstract action, are not.\n\n3 RESULTS\n\nThe main results of this paper are the replication by simulation of rat performance on the RMW and DMP tasks. Figures 1a and b show the course of learning for the rats; figures 5a and b for the model. For the DMP task, one-shot acquisition is apparent by the end of training.\n\n4 DISCUSSION\n\nWe have built a model for one-trial spatial learning in the watermaze which uses a single TD learning algorithm in two separate systems. One system is based on reinforcement learning that can solve general Markovian decision problems, and the other is based on coordinate learning and is specialised for an open-field water maze. Place cells in the hippocampus offer an excellent substrate for learning the actor, the critic and the coordinates.\n\nThe model is explicit about the relationship between the general and specific learning systems, and the learning behaviour shows that they integrate seamlessly. As currently constituted, the coordinate system would fail if there were a barrier in the maze. We plan to extend the model to allow the coordinate system to specify abstract targets other than the most recent platform position; this could allow fast navigation around a larger class of environments. It is also important to improve the model of learning 'set' behaviour, that is, the information about the nature of
the DMP task that the rats acquire over the course of the first few days of training. Interestingly, learning set is incomplete: on the first trial of each day, the rats still aim for the platform position on the previous day, even though this is never correct.16\n\nFigure 5: a.) Performance of the actor-critic model on the RMW task, and b.) performance of the full model on the DMP task. The data for comparison are shown in figures 1a and b.\n\nThe significant differences in the path lengths on the first trial of each day (evident in figure 1b and figure 5b) come from the relative placements of the platforms. However, the model did not use the same positions as the empirical data, and, in any case, the model of exploration behaviour is rather simplistic.\n\nThe model demonstrates that reinforcement learning methods are perfectly fast enough to match empirical learning curves. This is fortunate, since, unlike most models specifically designed for open-field navigation,6,4,5,20 RL methods can provably cope with substantially more complicated tasks with arbitrary barriers, etc., since they solve the temporal credit assignment problem in its full generality. The model also addresses the problem that coordinates in different parts of the same environment need to be mutually consistent, even if the animal only experiences some parts on separate trials. An important property of the model is that there is no requirement for the animal to have any explicit knowledge of the relationship between different place cells or place field position, size or shape. Such a requirement is imposed in various models.9,4,6,20\n\nExperiments that are suggested by this model (as well as by certain others) concern the relationship between hippocampally dependent and independent spatial learning.
First, once the coordinate system has been acquired, we predict that merely placing the rat at a new location would be enough to let it find the platform in one shot, though it might be necessary to reinforce the placement, e.g. by first placing the rat in a bucket of cold water. Second, we know that the establishment of place fields in an environment happens substantially faster than establishment of one-shot or even ordinary learning to a platform.23 We predict that blocking plasticity in the hippocampus following the establishment of place cells (possibly achieved without a platform) would not block learning of a platform. In fact, new experiments show that after extensive pre-training, rats can perform one-trial learning in the same environment to new platform positions on the DMP task without hippocampal synaptic plasticity.16 This is in contrast to the effects of hippocampal lesion, which completely disrupts performance. According to the model, coordinates will have been learned during pre-training. The full prediction remains untested: that once place fields have been established, coordinates could be learned in the absence of hippocampal synaptic plasticity. A third prediction follows from evidence that rats with restricted hippocampal lesions can learn the fixed platform task, but much more slowly, based on a gradual \"shaping\" procedure.22 In our model, they may also be able to learn coordinates. However, a lengthy training procedure could be required, and testing might be complicated if expressing the knowledge required the use of hippocampus-dependent short-term memory for the last platform location.16\n\nOne way of expressing the contribution of the hippocampus in the model is to say that its function is to provide a behavioural state space for the solution of complex tasks.
Hence the contribution of the hippocampus to navigation is to provide place cells whose firing properties remain consistent in a given environment. It follows that in different behavioural situations, hippocampal cells should provide a representation based on something other than locations; and, indeed, there is evidence for this.8 With regard to the role of the hippocampus in spatial tasks, the model demonstrates that the hippocampus may be fundamentally necessary without embodying a map.\n\nReferences\n\n[1] Barto, AG & Sutton, RS (1981) Biol. Cybern., 43:1-8.\n[2] Barto, AG, Sutton, RS & Anderson, CW (1983) IEEE Trans. on Systems, Man and Cybernetics, 13:834-846.\n[3] Barto, AG, Sutton, RS & Watkins, CJCH (1989) Tech Report 89-95, CAIS, Univ. Mass., Amherst, MA.\n[4] Blum, KI & Abbott, LF (1996) Neural Computation, 8:85-93.\n[5] Brown, MA & Sharp, PE (1995) Hippocampus, 5:171-188.\n[6] Burgess, N, Reece, M & O'Keefe, J (1994) Neural Networks, 7:1065-1081.\n[7] Dayan, P (1991) NIPS 3, RP Lippmann et al, eds., 464-470.\n[8] Eichenbaum, HB (1996) Curr. Opin. Neurobiol., 6:187-195.\n[9] Gerstner, W & Abbott, LF (1996) J. Computational Neurosci., 4:79-94.\n[10] McNaughton, BL et al (1996) J. Exp. Biol., 199:173-185.\n[11] Morris, RGM et al (1982) Nature, 297:681-683.\n[12] O'Keefe, J & Dostrovsky, J (1971) Brain Res., 34:171-175.\n[13] Olton, DS & Samuelson, RJ (1976) J. Exp. Psych: A.B.P., 2:97-116.\nRudy, JW & Sutherland, RW (1995) Hippocampus, 5:375-389.\n[14] Schultz, W, Dayan, P & Montague, PR (1997) Science, 275:1593-1599.\n[15] Singh, SP (1992) Reinforcement learning with a hierarchy of abstract models. Proc. AAAI-92.\n[16] Steele, RJ & Morris, RGM, in preparation.\n[17] Sutton, RS (1988) Machine Learning, 3:9-44.\n[18] Taube, JS (1995) J. Neurosci., 15(1):70-86.\n[19] Tsitsiklis, JN & Van Roy, B (1996) Tech Report LIDS-P-2322, MIT.\n[20] Wan, HS, Touretzky, DS & Redish, AD (1993) Proc.
1993 Connectionist Models Summer School, Lawrence Erlbaum, 11-19.\n[21] Watkins, CJCH (1989) PhD Thesis, Cambridge.\n[22] Whishaw, IQ & Jarrard, LF (1996) Hippocampus.\n[23] Wilson, MA & McNaughton, BL (1993) Science, 261:1055-1058.", "award": [], "sourceid": 1381, "authors": [{"given_name": "David", "family_name": "Foster", "institution": null}, {"given_name": "Richard", "family_name": "Morris", "institution": null}, {"given_name": "Peter", "family_name": "Dayan", "institution": null}]}