{"title": "Hippocampal Model of Rat Spatial Abilities Using Temporal Difference Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 145, "page_last": 151, "abstract": "", "full_text": "Hippocampal Model of Rat Spatial Abilities \n\nUsing Temporal Difference Learning \n\nDavid J Foster* \n\nCentre for Neuroscience \nEdinburgh University \n\nRichard GM Morris \n\nCentre for Neuroscience \nEdinburgh University \n\nPeter Dayan \nE25-210, MIT \n\nCambridge, MA 02139 \n\nAbstract \n\nWe provide a model of the standard watermaze task, and of a more \nchallenging task involving novel platform locations, in which rats \nexhibit one-trial learning after a few days of training.  The model \nuses hippocampal place cells to support reinforcement learning, \nand also,  in an integrated manner,  to build  and use  allocentric \ncoordinates. \n\n1 \n\nINTRODUCTION \n\nWhilst it has long been known both that the hippocampus of the rat is needed for \nnormal performance on spatial tasks l3 , 11  and that certain cells in the hippocampus \nexhibit place-related firing,12  it has not been clear how place cells are actually used \nfor navigation.  One of the principal conceptual problems has been understanding \nhow the hippocampus could specify or learn paths to goals when spatially tuned \ncells in the hippocampus respond only on the basis of the rat's current location. \nThis work uses recent ideas from reinforcement learning to solve this problem in \nthe context of two rodent spatial learning results. \nReference memory in the watermazell (RMW) has been a key task demonstrating \nthe importance of the hippocampus for spatial learning.  On each trial, the rat is \nplaced in a circular pool of cloudy water, the only escape from which is a platform \nwhich is hidden (below the water surface) but which remains in a constant position. \nA random choice of starting pOSition is used for each trial.  
Rats take asymptotically \nshort paths after approximately 10 trials (see figure 1a).  Delayed match-to-place \n(DMP) learning is a refined version in which the platform's location is changed on \neach day.  Figure 1b shows escape latencies for rats given four trials per day for nine \ndays, with the platform in a novel position on each day.  On early days, acquisition \n\n*Crichton Street, Edinburgh EH8 9LE, United Kingdom.  Funded by Edin.  Univ. \nHoldsworth Scholarship, the McDonnell-Pew foundation and NSF grant IBN-9634339. \nEmail: djf@cfn.ed.ac.uk \n\nFigure 1:  a)  Latencies for rats on the reference memory in the watermaze (RMW) \ntask (N=8).  b)  Latencies for rats on the Delayed Match-to-Place (DMP) task (N=62). \n\nis gradual but on later days, rats show one-trial learning, that is, near asymptotic \nperformance on the second trial to a novel platform position. \nThe RMW task has been extensively modelled.6,4,5,20  By contrast, the DMP task \nis new and computationally more challenging.  It is solved here by integrating a \nstandard actor-critic reinforcement learning system,2,7 which guarantees that the \nrat will be competent to perform well in arbitrary mazes, with a system that learns \nspatial coordinates in the maze.  Temporal difference learning17 (TD) is used for actor, \ncritic and coordinate learning.  TD learning is attractive because of its generality for \narbitrary Markov decision problems and the fact that reward systems in vertebrates \nappear to instantiate it.14 \n\n2  THE MODEL \n\nThe model comprises two distinct networks (figure 2):  the actor-critic network and \na coordinate learning network.  
The contribution of the hippocampus, for both \nnetworks, is to provide a state-space representation in the form of place cell basis \nfunctions.  Note that only the activities of place cells are required, by contrast with \ndecoding schemes which require detailed information about each place cell.4 \n\nFigure 2:  Model diagram showing the interaction between actor-critic and coordinate system components (the coordinate system holds remembered goal coordinates and performs a vector computation on the coordinate representation). \n\n2.1  Actor-Critic Learning \n\nPlace cells are modelled as being tuned to location.  At position p, place cell \ni has an output given by f_i(p) = exp{-||p - s_i||^2 / 2\u03c3^2}, where s_i is the place \nfield centre, and \u03c3 = 0.1 for all place fields.  The critic learns a value function \nV(p) = \u03a3_i w_i f_i(p) which comes to represent the distance of p from the goal, using \nthe TD rule \u0394w_i \u221d \u03b4_t f_i(p^t), where \n\n\u03b4_t = r(p^t, p^{t+1}) + \u03b3 V(p^{t+1}) - V(p^t) \n\n(1) \n\nis the TD error, p^t is position at time t, and the reward r(p^t, p^{t+1}) is 1 for any \nmove onto the platform, and 0 otherwise.  In a slight alteration of the original rule, \nthe value V(p) is set to zero when p is at the goal, thus ensuring that the total \nfuture rewards for moving onto the goal will be exactly 1.  Such a modification \nimproves stability in the case of TD learning with overlapping basis functions. \nThe discount factor \u03b3 was set to 0.99.  Simultaneously the rat refines a policy, \nwhich is represented by eight action cells.  Each action cell (a_j in figure 2) receives \na parameterised input at any position p:  a_j(p) = \u03a3_i q_ji f_i(p).  An action is chosen \nstochastically with probabilities given by P(a_j) = exp{2a_j} / \u03a3_k exp{2a_k}.  
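The machinery described so far - the place-cell basis, the critic, and the stochastic action choice - can be sketched in Python.  This is a minimal illustration, not the original model code: \u03c3 = 0.1 and the eight action cells follow the text, while the number of place cells and the arena coordinates are assumptions of the sketch. \n\n```python
import numpy as np

rng = np.random.default_rng(0)

# sigma = 0.1 and eight action cells follow the text; the number of
# place cells and the arena extent are assumptions of this sketch.
N_PLACE, N_ACTIONS, SIGMA = 200, 8, 0.1

centres = rng.uniform(-0.5, 0.5, size=(N_PLACE, 2))  # place field centres s_i
w = np.zeros(N_PLACE)                                # critic weights w_i
q = np.zeros((N_ACTIONS, N_PLACE))                   # actor weights q_ji

def f(p):
    """Place cell outputs f_i(p) = exp(-||p - s_i||^2 / (2 sigma^2))."""
    d2 = np.sum((centres - p) ** 2, axis=1)
    return np.exp(-d2 / (2 * SIGMA ** 2))

def value(p):
    """Critic estimate V(p) = sum_i w_i f_i(p)."""
    return w @ f(p)

def choose_action(p):
    """Sample an action cell with P(a_j) = exp(2 a_j) / sum_k exp(2 a_k)."""
    a = q @ f(p)
    probs = np.exp(2 * a)
    return rng.choice(N_ACTIONS, p=probs / probs.sum())
``` \n\nWith both weight vectors initialised to zero, the value estimate is zero everywhere and the eight actions are chosen uniformly; learning shapes both, as described next. \n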
Action \nweights are reinforced according to:2 \n\n\u0394q_ji \u221d \u03b4_t g_j(\u03b8^t) f_i(p^t) \n\n(2) \n\nwhere g_j(\u03b8^t) is a gaussian function of the difference between the head direction \n\u03b8^t at time t and the preferred direction of the jth action cell.  Figure 3 shows the \ndevelopment of a policy over a few trials. \n\nFigure 3:  The RMW task:  the value function gradually disseminates information \nabout reward proximity to all regions of the environment (V(p) is shown at trials 1, \n5 and 13).  Policies and paths are also shown. \n\nThere is no analytical guarantee for the convergence of TD learning with policy \nadaptation.  However our simulations show that the algorithm always converges \nfor the RMW task.  In a simulated arena of diameter 1m and with swimming speeds \nof 20cm/s, the simulation matched the performance of the real rats very closely (see \nfigure 5).  This demonstrates that TD-based reinforcement learning is adequately \nfast to account for the learning performance of real animals. \n\n2.2  Coordinate Learning \n\nAlthough the learning of a value function and policy is appropriate for finding \na fixed platform, the actor-critic model does not allow the transfer of knowledge \nfrom the task defined by one goal position to that defined by any other; thus it \ncould not generate the sort of one-trial learning that is shown by rats on the DMP \ntask (see figure 1b).  This requires acquisition of some goal-independent knowledge \nabout space.  A natural mechanism for this is the path integration or self-motion \nsystem.20,10  However, path integration presents two problems.  First, since the rat \nis put into the maze in a different position for each trial, how can it learn consistent \ncoordinates across the whole maze? 
Second, how can a general, powerful, but slow, \nbehavioral learning mechanism such as TD be integrated with a specific, limited, \nbut fast learning mechanism involving spatial coordinates? \nSince TD critic learning is based on enforcing consistency in estimates of future \nreward, we can also use it to learn spatially consistent coordinates on the basis \nof samples of self-motion.  It is assumed that the rat has an allocentric frame of \nreference.18  The model learns parameterised estimates of the x and y coordinates \nof all positions p:  x(p) = \u03a3_i w_i^x f_i(p) and y(p) = \u03a3_i w_i^y f_i(p).  Importantly, while \nplace cells were again critical in supporting spatial representation, they do not embody \na map of space.  The coordinate functions, like the value function previously, have to \nbe learned. \nAs the simulated rat moves around, the coordinate weights {w_i^x} are adjusted \naccording to: \n\n\u0394w_i^x \u221d (\u0394x^t + x(p^{t+1}) - x(p^t)) \u03a3_{k=1}^{t} \u03bb^{t-k} f_i(p^k) \n\n(3) \n\nwhere \u0394x^t is the self-motion estimate in the x direction.  A similar update is applied \nto {w_i^y}.  In this case, the full TD(\u03bb) algorithm was used (with \u03bb = 0.9); however \nTD(0) could also have been used, taking slightly longer.  Figure 4a shows the x and \ny coordinates at early and late phases of learning.  It is apparent that they rapidly \nbecome quite accurate - this is an extremely easy task in an open field maze. \nAn important issue in the learning of coordinates is drift, since the coordinate \nsystem receives no direct information about the location of the origin.  It turns out \nthat the three controlling factors over the implicit origin are:  the boundary of the \narena, the prior setting of the coordinate weights (in this case all were zero) and \nthe position and prior value of any absorbing area (in this case the platform). 
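The coordinate update of equation 3 can be sketched in the same style.  Again this is a minimal illustration rather than the original code: \u03bb = 0.9 follows the text, while the basis size, arena extent and learning rate ALPHA are assumptions of the sketch. \n\n```python
import numpy as np

rng = np.random.default_rng(1)

# lambda = 0.9 follows the text; N_PLACE, SIGMA and the learning rate
# ALPHA are assumptions of this sketch.
N_PLACE, SIGMA, LAM, ALPHA = 200, 0.1, 0.9, 0.01

centres = rng.uniform(-0.5, 0.5, size=(N_PLACE, 2))
wx = np.zeros(N_PLACE)     # coordinate weights w_i^x
trace = np.zeros(N_PLACE)  # eligibility trace: sum_{k<=t} lambda^(t-k) f_i(p^k)

def f(p):
    d2 = np.sum((centres - p) ** 2, axis=1)
    return np.exp(-d2 / (2 * SIGMA ** 2))

def x_hat(p):
    """Learned coordinate estimate x(p) = sum_i w_i^x f_i(p)."""
    return wx @ f(p)

def coordinate_step(p_t, p_next, dx_t):
    """One update of equation 3: the self-motion estimate dx_t plays the
    role of the reward, and x_hat plays the role of the value function."""
    global trace, wx
    trace = LAM * trace + f(p_t)
    delta = dx_t + x_hat(p_next) - x_hat(p_t)
    wx = wx + ALPHA * delta * trace
``` \n\nDriving this with a random walk, where dx_t is the (noisy) x displacement between successive positions, moves x_hat towards a self-consistent x coordinate; the y weights are trained identically from \u0394y^t. \n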
If the \ncoordinate system as a whole were to drift once coordinates have been established, \nthis would invalidate coordinates that have been remembered by the rat over long \nperiods.  However, since the expected value of the prediction error at each time step \nshould be zero for any self-consistent coordinate mapping, such a mapping should \nremain stable.  This is demonstrated for a single run:  figure 4b shows the mean \nvalue of coordinates x evolving over trials, with little drift after the first few trials. \nWe modeled the coordinate system as influencing the choice of swimming direction \nin the manner of an abstract action.15  The (internally specified) coordinates of the \nmost recent goal position are stored in short term memory and used, along with the \ncurrent coordinates, to calculate a vector heading.  This vector heading is thrown \ninto the stochastic competition with the other possible actions, governed by a \nsingle weight which changes in a similar manner to the other action weights (as in \nequation 2, see also fig 4d), depending on the TD error, and on the angular proximity \nof the current head direction to the coordinate direction.  Thus, whether the \ncoordinate-based direction is likely to be used depends upon its past performance. \nOne simplification in the model is the treatment of extinction.  In the DMP task, \n\nFigure 4:  The evolution of the coordinate system for a typical simulation run:  a.) \ncoordinate outputs at early and late phases of learning, b.)  the extent of drift in the \ncoordinates, as shown by the mean coordinate value for a single run, c.)  
a measure \nof coordinate error for the same run, \u03c3_E^2 = \u03a3_r \u03a3_k {x_r(p_k) - x\u0304_r - x(p_k)}^2 / ((N_p - 1) N_r), where k \nindexes measurement points (max N_p) and r indexes runs (max N_r), x_r(p_k) is the \nmodel estimate of x at position p_k, x(p_k) is the ideal estimate for a coordinate \nsystem centred on zero, and x\u0304_r is the mean value over all the model coordinates, \nd.)  the increase during training of the probability of choosing the abstract action. \nThis demonstrates the integration of the coordinates into the control system. \n\nreal rats extinguish fairly quickly to a platform that has moved, whereas the actor-critic \nmodel extinguishes far more slowly.  To get around this, when a simulated \nrat reaches a goal that has just been moved, the value and action weights are \nreinitialised, but the coordinate weights w_i^x and w_i^y, and the weights for the abstract \naction, are not. \n\n3  RESULTS \n\nThe main results of this paper are the replication by simulation of rat performance \non the RMW and DMP tasks.  Figures 1a and b show the course of learning for \nthe rats; figures 5a and b for the model.  For the DMP task, one-shot acquisition is \napparent by the end of training. \n\n4  DISCUSSION \n\nWe have built a model for one-trial spatial learning in the watermaze which uses \na single TD learning algorithm in two separate systems.  One system is based on \nreinforcement learning that can solve general Markovian decision problems, and \nthe other is based on coordinate learning and is specialised for an open-field water \nmaze.  Place cells in the hippocampus offer an excellent substrate for learning the \nactor, the critic and the coordinates. \nThe model is explicit about the relationship between the general and specific learning \nsystems, and the learning behavior shows that they integrate seamlessly.  As \ncurrently constituted, the coordinate system would fail if there were a barrier in \nthe maze.  
We plan to extend the model to allow the coordinate system to specify \nabstract targets other than the most recent platform position - this could allow \nfast navigation around a larger class of environments.  It is also important to \nimprove the model of learning 'set' behavior - the information about the nature of \n\nFigure 5:  a.)  Performance of the actor-critic model on the RMW task, and b.) \nperformance of the full model on the DMP task.  The data for comparison is shown \nin figures 1a and b. \n\nthe DMP task that the rats acquire over the course of the first few days of training. \nInterestingly, learning set is incomplete - on the first trial of each day, the rats \nstill aim for the platform position from the previous day, even though this is never \ncorrect.16  The significant differences in the path lengths on the first trial of each \nday (evident in figure 1b and figure 5b) come from the relative placements of the \nplatforms.  However, the model did not use the same positions as the empirical \ndata, and, in any case, the model of exploration behavior is rather simplistic. \nThe model demonstrates that reinforcement learning methods are perfectly fast \nenough to match empirical learning curves.  This is fortunate, since, unlike most \nmodels specifically designed for open-field navigation,6,4,5,20 RL methods can \nprovably cope with substantially more complicated tasks with arbitrary barriers, \netc., since they solve the temporal credit assignment problem in its full generality. 
\nThe model also addresses the problem that coordinates in different parts of the \nsame environment need to be mutually consistent, even if the animal only experiences \nsome parts on separate trials.  An important property of the model is that \nthere is no requirement for the animal to have any explicit knowledge of the relationship \nbetween different place cells or place field position, size or shape.  Such a \nrequirement is imposed in various models.9,4,6,20 \n\nExperiments that are suggested by this model (as well as by certain others) concern \nthe relationship between hippocampally dependent and independent spatial \nlearning.  First, once the coordinate system has been acquired, we predict that \nmerely placing the rat at a new location would be enough to let it find the platform \nin one shot, though it might be necessary to reinforce the placement e.g. by first \nplacing the rat in a bucket of cold water.  Second, we know that the establishment \nof place fields in an environment happens substantially faster than establishment \nof one-shot or even ordinary learning to a platform.23  We predict that blocking \nplasticity in the hippocampus following the establishment of place cells (possibly \nachieved without a platform) would not block learning of a platform.  In fact, new \nexperiments show that after extensive pre-training, rats can perform one-trial learning \nin the same environment to new platform positions on the DMP task without \nhippocampal synaptic plasticity.16  This is in contrast to the effects of hippocampal \nlesion, which completely disrupts performance.  According to the model, coordinates \nwill have been learned during pre-training.  The full prediction remains \nuntested: that once place fields have been established, coordinates could be learned \nin the absence of hippocampal synaptic plasticity.  
A third prediction follows from \nevidence that rats with restricted hippocampal lesions can learn the fixed platform \ntask, but much more slowly, based on a gradual \"shaping\" procedure.22  In our \nmodel, they may also be able to learn coordinates.  However, a lengthy training \nprocedure could be required, and testing might be complicated if expressing the \nknowledge required the use of hippocampus dependent short-term memory for \nthe last platform location.16 \nOne way of expressing the contribution of the hippocampus in the model is to say \nthat its function is to provide a behavioural state space for the solution of complex \ntasks.  Hence the contribution of the hippocampus to navigation is to provide \nplace cells whose firing properties remain consistent in a given environment.  It \nfollows that in different behavioural situations, hippocampal cells should provide \na representation based on something other than locations - and, indeed, there \nis evidence for this.8  With regard to the role of the hippocampus in spatial tasks, \nthe model demonstrates that the hippocampus may be fundamentally necessary \nwithout embodying a map. \n\nReferences \n\n[1]  Barto, AG & Sutton, RS (1981) Biol. Cybern., 43:1-8. \n[2]  Barto, AG, Sutton, RS & Anderson, CW (1983) IEEE Trans. on Systems, Man \nand Cybernetics 13:834-846. \n[3]  Barto, AG, Sutton, RS & Watkins, CJCH (1989) Tech Report 89-95, COINS, Univ. \nMass., Amherst, MA. \n[4]  Blum, KI & Abbott, LF (1996) Neural Computation, 8:85-93. \n[5]  Brown, MA & Sharp, PE (1995) Hippocampus 5:171-188. \n[6]  Burgess, N, Recce, M & O'Keefe, J (1994) Neural Networks, 7:1065-1081. \n[7]  Dayan, P (1991) NIPS 3, RP Lippmann et al, eds., 464-470. \n[8]  Eichenbaum, HB (1996) Curr. Opin. Neurobiol., 6:187-195. \n[9]  Gerstner, W & Abbott, LF (1996) J. Computational Neurosci. 4:79-94. 
\n[10]  McNaughton, BL et al (1996) J. Exp. Biol., 199:173-185. \n[11]  Morris, RGM et al (1982) Nature, 297:681-683. \n[12]  O'Keefe, J & Dostrovsky, J (1971) Brain Res., 34:171-175. \n[13]  Olton, DS & Samuelson, RJ (1976) J. Exp. Psych: A.B.P., 2:97-116. \nRudy, JW & Sutherland, RW (1995) Hippocampus, 5:375-389. \n[14]  Schultz, W, Dayan, P & Montague, PR (1997) Science, 275:1593-1599. \n[15]  Singh, SP (1992) Reinforcement learning with a hierarchy of abstract models. Proc. AAAI-92. \n[16]  Steele, RJ & Morris, RGM, in preparation. \n[17]  Sutton, RS (1988) Machine Learning, 3:9-44. \n[18]  Taube, JS (1995) J. Neurosci. 15(1):70-86. \n[19]  Tsitsiklis, JN & Van Roy, B (1996) Tech Report LIDS-P-2322, M.I.T. \n[20]  Wan, HS, Touretzky, DS & Redish, AD (1993) Proc. 1993 Connectionist Models \nSummer School, Lawrence Erlbaum, 11-19. \n[21]  Watkins, CJCH (1989) PhD Thesis, Cambridge. \n[22]  Whishaw, IQ & Jarrard, LF (1996) Hippocampus \n[23]  Wilson, MA & McNaughton, BL (1993) Science 261:1055-1058. \n\n\f", "award": [], "sourceid": 1381, "authors": [{"given_name": "David", "family_name": "Foster", "institution": null}, {"given_name": "Richard", "family_name": "Morris", "institution": null}, {"given_name": "Peter", "family_name": "Dayan", "institution": null}]}