{"title": "Hippocampal Contributions to Control: The Third Way", "book": "Advances in Neural Information Processing Systems", "page_first": 889, "page_last": 896, "abstract": null, "full_text": "Hippocampal Contributions to Control:\n\nThe Third Way\n\nM\u00b4at\u00b4e Lengyel\n\nCollegium Budapest Institute for Advanced Study\n2 Szenth\u00b4aroms\u00b4ag u, Budapest, H-1014, Hungary\n\nand\n\nComputational & Biological Learning Lab\n\nCambridge University Engineering Department\nTrumpington Street, Cambridge CB2 1PZ, UK\n\nlmate@gatsby.ucl.ac.uk\n\nPeter Dayan\n\nGatsby Computational Neuroscience Unit, UCL\n\n17 Queen Square, London WC1N 3AR, UK\n\ndayan@gatsby.ucl.ac.uk\n\nAbstract\n\nRecent experimental studies have focused on the specialization of different neural\nstructures for different types of instrumental behavior. Recent theoretical work\nhas provided normative accounts for why there should be more than one control\nsystem, and how the output of different controllers can be integrated. Two par-\nticlar controllers have been identi\ufb01ed, one associated with a forward model and\nthe prefrontal cortex and a second associated with computationally simpler, habit-\nual, actor-critic methods and part of the striatum. We argue here for the normative\nappropriateness of an additional, but so far marginalized control system, associ-\nated with episodic memory, and involving the hippocampus and medial temporal\ncortices. We analyze in depth a class of simple environments to show that episodic\ncontrol should be useful in a range of cases characterized by complexity and in-\nferential noise, and most particularly at the very early stages of learning, long\nbefore habitization has set in. We interpret data on the transfer of control from the\nhippocampus to the striatum in the light of this hypothesis.\n\n1 Introduction\n\nWhat use is an episodic memory? 
It might seem that the possibility of a fulminant recreation of a\nformer experience plays a critical role in enabling us to act appropriately in the world [1]. However,\nwhy should it be better to act on the basis of the recollection of single happenings, rather than\nthe seemingly normative use of accumulated statistics from multiple events? The task of building\nsuch a statistical model is normally the dominion of semantic memory [2], the other main form of\ndeclarative memory. Issues of this kind are frequently discussed under the rubric of multiple memory\nsystems [3, 4]; here we consider them from a normative viewpoint in which memories are directly used\nfor control.\nOur answer to the initial question is the computational challenge of using a semantic memory as a\nforward model in sequential decision tasks in which many actions must be taken before a goal is\nreached [5]. Forward and backward search in the tree of actions and consequent states (ie model-\nbased reinforcement learning [6]) in such domains impose crippling demands on working memory\n(to store partial evaluations), and it may not even be possible to expand out the tree in a reasonable\ntime. If we think of the inevitable resulting errors in evaluation as a form of computational noise\nor uncertainty, then the use of the semantic memory for control will be expected to be subject to\nsubstantial error. The main task for this paper is to explore and understand the circumstances under\nwhich episodic control, although seemingly less ef\ufb01cient in its use of experience, should be expected\nto be more accurate, and therefore be evident both psychologically and neurally.\nThis argument about episodic control exactly parallels one recently made for habitual or cached\ncontrol [5]. 
Model-free reinforcement learning methods, such as Q-learning [7], are computationally\ntrivial (and therefore accurate) at the time of use, since they learn state-value functions or state-\naction-value functions that cache the results of the expensive and dif\ufb01cult search. However, model-\nfree methods learn through a form of bootstrapping, which is known to be inef\ufb01cient in the use of\nexperience. It is therefore optimal to employ cached control rather than model-based control only\nafter suf\ufb01cient experience, when the inaccuracy of the former during learning is outweighed by the\ncomputational noise induced in using the latter. The exact tradeoff depends on the prior statistics\nover the possible tasks.\nWe will show that in general, just as model-free control is better than model-based control after\nsubstantial experience, episodic control is better than model-based control after only very limited\nexperience. For some classes of environments, these two other controllers signi\ufb01cantly squeeze the\ndomain of optimal use of semantic control.\nThis analysis is purely computational. However, it has psychological and neural implications and\nassociations. It was argued [5] that the transition from model-based to model-free control explains\na wealth of psychological observations about the transition over the course of learning from goal-\ndirected control (which is considered to be model-based) to habitual control (which is model-free).\nIn turn, this is associated with an apparent functional segregation between the dorsolateral prefrontal\ncortex and dorsomedial striatum, implementing model-based control, and the dorsolateral striatum\n(and its neuromodulatory inputs), implementing model-free control. Exactly how the uncertainties\nassociated with these two types of control are calculated is not clear, although it is known that the\nprelimbic and infralimbic cortices are somehow involved in arbitration. 
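As a concrete illustration of the cached, model-free strategy, here is a minimal tabular Q-learning sketch on a toy chain environment. The environment, parameter values, and names are our own illustrative assumptions, not part of this paper's analysis; the point is only that, once learned, the cache makes action selection a trivial look-up.

```python
import random

# Illustrative four-state chain: action 1 moves right, action 0 moves left;
# reaching state 3 yields reward 1 and ends the episode.
N_STATES, ACTIONS, GOAL = 4, (0, 1), 3

def step(s, a):
    s2 = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s2 == GOAL else 0.0
    return s2, reward, s2 == GOAL

def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy exploration
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: Q[s][act])
            s2, r, done = step(s, a)
            # Bootstrapped update: cache the value of (s, a) directly,
            # with no forward model of the transition structure.
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q

Q = q_learning()
# At decision time the cache makes action selection computationally trivial:
policy = [max(ACTIONS, key=lambda act: Q[s][act]) for s in range(N_STATES)]
```

The bootstrapped update is exactly why such a controller is data-inefficient early on: values propagate backwards from the reward one step per visit.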
The psychological construct\nfor episodic control is obvious; its neural realization is likely to be the hippocampus and medial tem-\nporal cortical regions. How arbitration might work for this third controller is also not clear, although\nthere have been suggestions on how uncertainty may be represented neurally in the hippocampus\n[8]. There is also evidence for the transfer of control from hippocampal to striatal structures over\nthe course of learning [9, 10] suggesting that arbitration might happen, but unfortunately, in these\nstudies, the possibility of an additional step via dorsolateral prefrontal cortex was not fully tested.\nIn this paper, we explore the nature and (f)utility of episodic control. Section 2 describes the\nsimple tree-structured Markov decision problems that we use to illustrate and quantitate our ar-\nguments. Section 3 provides a detailed, albeit approximate, analysis of uncertainty and learning in\nthese environments. Finally, section 4 uses these analytical methods and simulations to study the\nepisodic/forward model tradeoff.\n\n2 Paradigm for analysis\n\nWe seek to analyse computational and statistical trade-offs that arise in choosing actions that max-\nimize long-term rewards in sequential decision making problems. The trade-offs originate in un-\ncertainties associated with learning and inference. We characterize these tasks as Markov decision\nprocesses (MDPs) [6] whose transition and reward structure are initially unknown by the subject,\nbut are drawn from a parameterized prior that is known.\nThe key question is how well different possible control strategies can perform given this prior and\na measured amount of experience. Like [11], we simplify exploration using a form of parallel sam-\npling model in order to focus on the ability of controllers to exploit knowledge extracted about an\nenvironment. 
Performance is naturally measured using the average reward that would be collected\nin a trial; this average is then itself averaged over draws of the MDP and the stochasticity associated\nwith the exploratory actions. We analyse three controllers: a model-based controller without com-\nputational noise, which provides a theoretical upper limit on performance, a realistic model-based\ncontroller with computational noise that we regard as the model of semantic memory-based control,\nand an \u2018episodic controller\u2019.\n\nFigure 1: A, An example tree-structured MDP, with depth D = 2, branching factor B = 3, and\nA = 4 available actions in each non-terminal state. The horizontal stacked bars in the boxes of the\nleft and middle column show the transition probabilities for different actions at non-terminal states,\ncolor coded by the successor states to which they lead (matching the color of the corresponding\narrows). Transition probability distributions are iid. according to a Dirichlet distribution whose\nparameters are all 1. Gaussians in the right column show the reward distributions at terminal states.\nEach has unit variance and a mean which is drawn iid. from a normal distribution of mean \u00b5\u00afr = 0\nand standard deviation \u03c3\u00afr = 5. All parameters in later \ufb01gures are the same, unless otherwise noted.\nB, Validating the analytical approximations by numerical simulations (A = 3).\n\nWe concentrate on a simple subset of MDPs, namely \u2018tree-structured MDPs\u2019 (tMDPs), which are\nillustrated in Figure 1A (and de\ufb01ned formally in the Supporting Material). 
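The generative process for this class can be sketched directly from the parameters in the caption of Figure 1 (Dirichlet(1, ..., 1) transition distributions; unit-variance terminal rewards whose means are drawn from a normal with mean \u00b5\u00afr and standard deviation \u03c3\u00afr). The function and variable names below are our own:

```python
import numpy as np

def sample_tmdp(D=2, B=3, A=4, mu_r=0.0, sigma_r=5.0, rng=None):
    """Draw one tree-structured MDP: at every non-terminal state, each of the
    A actions has a Dirichlet(1,...,1) distribution over its B successors;
    each terminal state carries a unit-variance Gaussian reward whose mean
    is itself drawn from N(mu_r, sigma_r**2)."""
    rng = rng or np.random.default_rng(0)
    # Non-terminal states per level: 1, B, B^2, ..., B^(D-1).
    # trans[d][s, a] is the length-B successor distribution of action a
    # at the s-th state of level d.
    trans = [rng.dirichlet(np.ones(B), size=(B**d, A)) for d in range(D)]
    reward_means = mu_r + sigma_r * rng.standard_normal(B**D)
    return trans, reward_means

trans, reward_means = sample_tmdp()
```

With the default parameters this yields the D = 2, B = 3, A = 4 structure of Figure 1A: one root state, three intermediate states, and nine terminal reward distributions.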
We expect the qualitative\ncharacteristics of our \ufb01ndings to apply for general MDPs; however, we used tMDPs since they\nrepresent a \ufb01rst-order, analytically tractable, approximation of the general problem presented by any\nMDP at a given decision point if it is unfolded in time (ie a decision tree with \ufb01nite time-horizon).\nActions lead to further states (and potentially rewards), from where further possible actions and\nthus states become available, and so on. The key difference is that in a general MDP, a state can\nbe revisited several times even within the same episode, which is impossible in a tMDP. Thus, our\napproach neglects correlations between values of future states. This is formally correct in the limit\nof in\ufb01nitely diluted MDPs, but is otherwise just an approximation.\n\n3 The model-based controller\n\nIn our paradigm, the task for the model-based controllers is to use the data from the exploratory\ntrials to work out posterior distributions over the unknown transition and reward structure of the\ntMDP, and then report the best action at each state. It is well known that actually doing this is rad-\nically intractable. However, to understand the tradeoffs between different controllers, we only need\nto analyze the expected return from doing so, averaging over all the random quantities. One of the\ntechnical contributions of this work is the set of analytically- and empirically-justi\ufb01ed approxima-\ntions to those averages (which are presented in the Supporting Material), based on the assumed\nknowledge of the parameters governing the generation of the tMDP, and as a function of the amount\nof exploratory experience.\nWe proceed in three stages. First, we consider the model-based controller in the case that it has\nexperienced so many samples that the parameters of the tMDP are known exactly. This provides an\n(approximate) upper bound on the expected performance of any controller. 
Second, we approximate\nthe impact of incomplete exploration by corrupting the controller by an aliquot of noise whose\nmagnitude is determined by the parameters of the problem. Finally, we approximate the additionally\ndeleterious effect of limited computational resources by adding an assumed induced bias and extra\nvariance.\nThe \ufb01rst step is to calculate the asymptotic performance when in\ufb01nitely many data have been col-\nlected. In this limit, transition probabilities and reward distributions can be treated as known quan-\ntities. Critical to our analysis is that the independence and symmetry properties of regular tMDPs\nimply that we mostly need only analyze a single \u2018sub-treelet\u2019 of the tree (one non-terminal state and\nits successor states), from which the results generalise to the whole tree by recursion. In the case\nof the asymptotic analysis, this recursive formulation turns out to allow for a closed-form solution\nfor the mean \u00b5 and variance \u03c32 of an approximate Gaussian distribution characterizing the average\nvalue of one full tree traversal starting from the root node:\n\n\\mu = \\mu_{\\bar{r}} + \\frac{1 - \\lambda_2^{D/2}}{1 - \\lambda_2^{1/2}} \\lambda_1 \\sigma_{\\bar{r}} \\qquad \\sigma^2 = \\lambda_2^D \\sigma_{\\bar{r}}^2 \\qquad (1)\n\nwhere \u00b5\u00afr and \u03c32\u00afr are the mean and variance of the normal distribution from which the means of the\nreward distributions at the terminal states are drawn, and 0 \u2264 \u03bb1, \u03bb2 \u2264 1 are constants that depend\non the other parameters of the tMDP. This calculation depends on characterizing order statistics\nof multivariate Gaussian distributions which are equicorrelated (all the off-diagonal terms of the\ncovariance matrix are the same) [12]. 
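The qualitative content of equation 1 \u2013 the optimal root value exceeds \u00b5\u00afr by an amount that scales with \u03c3\u00afr and grows with the depth D \u2013 can be checked by brute force. The Monte Carlo sketch below (our own construction, not the paper's analytical method) evaluates sampled tMDPs exactly by backward induction; sample sizes and names are illustrative assumptions:

```python
import numpy as np

def root_value(D, B, A, mu_r=0.0, sigma_r=5.0, rng=None):
    """Exact optimal value at the root of one sampled tMDP, computed by
    backward induction: V(s) = max_a sum_s' P(s'|s,a) V(s')."""
    rng = rng or np.random.default_rng()
    # Terminal mean rewards (the noise around them averages out).
    v = mu_r + sigma_r * rng.standard_normal(B**D)
    for d in reversed(range(D)):
        # Dirichlet(1,...,1) transitions at level d: shape (states, A, B).
        P = rng.dirichlet(np.ones(B), size=(B**d, A))
        q = (P * v.reshape(B**d, 1, B)).sum(axis=2)   # action values
        v = q.max(axis=1)                             # optimal state values
    return v[0]

rng = np.random.default_rng(0)
means = {D: np.mean([root_value(D, 3, 4, rng=rng) for _ in range(2000)])
         for D in (1, 2)}
```

Averaged over many draws, the root value sits above \u00b5\u00afr = 0 even at depth 1 (the controller exploits the max over actions), and rises further at depth 2, as equation 1 predicts.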
Equation 1 is actually an interesting result in and of itself \u2013\nit indicates the extent to which the controller can take advantage of the variability \u00b5 \u2212 \u00b5\u00afr \u221d \u03c3\u00afr in\nboosting its expected return from the root node as a function of the depth of the tree.\nThe second step is to observe that we expect the bene\ufb01ts of episodic control to be most apparent\ngiven very limited exploratory experience. To make analytical progress, we are forced to make the\nsigni\ufb01cant assumption that the effects of this can be modeled by granting the controller access not\nto the true values of actions, but only to \u2018noisy\u2019 versions. This \u2018noise\u2019 comes from\nthe fact that computing the values of different actions is based on estimates of transition probability\nand reward distributions. These estimates are inherently stochastic themselves, as they are based on\nstochastic experience. We have been able to show that the form of the resulting \u2018noise\u2019 in the action\nvalues can have the effect of scaling down the true values of actions at states by a factor \u03c61 and\nadding extra noise \u03c62. Although we were unable to \ufb01nd a closed-form solution for the effects of \u03c61\nand \u03c62 on the performance of the controller, a recursive analytical formulation, though involved, is\nstill possible (see Supporting Material).\nFigure 1B shows the learning curve for the model-based controller computed using our analytical\npredictions (blue line) and using exhaustive numerical simulations (red line, average performance\nin 100 sample tMDPs, with the learning process rerun 100 times in each). 
The inaccuracies entailed\nby our approximations are tolerable (also for other parameters; simulations not shown), and so from\nthis point we use them to analyse the performance of the optimal model-based controller.\nThe dark blue solid curve in \ufb01gure 2A (labelled \u03b72 = 0) shows the performance of model-based\ncontrol as a function of the number of exploration samples (the equivalent of the dark blue curve in\n\ufb01gure 1B, but for A = 4 rather than A = 3). For comparison, the dashed line shows the asymptotic\nexpected value. The slight decrease in the approximate value arises because the approximations\nbecome slightly looser as the noise decreases; however, once again we have been able to show (sim-\nulations not shown) that our analysis is highly accurate compared with extensive actual samples.\nThe \ufb01nal step is to model the effects of the computational complexity of the model-based controller\non performance arising from the severe demands it places on such facets as working memory. These\nnecessitate pruning (ie ignoring parts of the decision tree), or sub-sampling, or some other such\napproximation. We treat the effects of all approximations by forcing the controller to have access\nto only noisy versions of the (exploration-limited) action values. Just as for incomplete exploration,\nwe model the noise as a combination of downscaling the true action values by a parameter \u03b71 and\nadding excess variability \u03b72. Note that whereas the terms \u03c61, \u03c62 characterizing the effects of learning\nare determined by the number of samples, \u03b71, \u03b72 are set by hand to capture the assumed effects\non inference of the computational complexity. 
The asymptotic values of the curves in \ufb01gure 2A\nfor various values of \u03b72 (for all of them, \u03b71 = 1) demonstrate the effects of inferential noise on\nperformance.\n\nFigure 2: A, Learning curves for the model-based controller at different levels of computational\nnoise: \u03b71 = 1 and \u03b72 is increased from 0 to 3. The approximations used for computing these curves\nare less accurate in the low-noise limit, hence the paradoxical slight decrease in the performance\nof the perfect controller (without noise) at the end of learning. The dashed line shows the asymptotic\napproximation, which is more accurate in this limit, demonstrating that the inaccuracy of the\nexperience-dependent approximation is not disastrous. B, Performance of noisy controllers normalized\nby that of the perfect controller in the same environment at the same amount of experience. The brown\nline corresponds to a more dif\ufb01cult environment with greater depth (D = 12). Note that \u2018learning\ntime\u2019 is measured by the number of times every state-action pair has been sampled. Thus the decreased\nperformance shown in the more complex environment is not due to the increased sparsity of experience.\n\nSo far, we have separately considered the effects of computational noise and uncertainty due to lim-\nited experience. In reality, both factors affect the model-based controller. The full plots in \ufb01gure 2A,\nB show the interaction of these two factors (\ufb01gure 2B shows the same data as \ufb01gure 2A, but scaled\nto the performance of the noise-free controller for the given amount of experience). Computational\nnoise not only makes the asymptotic performance worse, by simply down-scaling average rewards,\nbut it also makes learning effectively slower. This is because the adverse effects of computational\nnoise depend on the differences between the values of possible actions. 
If these values appear to\nbe widely different, then computational noise will still preserve their order, and thus the one that\nis truly best is still likely to be chosen. However, if action values appear roughly the same, then a\nlittle noise can easily change their ordering and make the controller choose a suboptimal one. Little\nexperience only licenses small apparent differences between values, and this boosts the corrupting\neffect of the inferential noise. Given more experience, the controller increasingly learns to make\ndistinctions between different actions that looked the same a priori.\nThus, while earlier work suggested that model-based control will be superior at the limit of few\nexploratory samples due to the unsurpassable data-ef\ufb01ciency of optimal statistical inference [5], we\nshow here that in the very low data limit another factor cripples its performance: the ampli\ufb01ed\nin\ufb02uence of computational noise. How much experience constitutes \u2018little\u2019 and how much noise\ncounts as \u2018much\u2019 is of course relative to the complexity of the environment.\n\n4 Episodic control\n\nIf model-based control is indeed crippled by computational noise given limited exploration, could\nthere be an effective alternative? Although outside the scope of our formal analysis, this is partic-\nularly important given the ubiquity of non-stationary environments [13], for which the effects of\ncontinual change bound the effective number of exploratory samples. That the cache-based or ha-\nbitual controller is even worse in this limit (since it learns by bootstrapping) was a main rationale\nfor the uncertainty-based account of the transfer from goal-directed to habitual control suggested by\nDaw et al. [5]. Thus the habitual controller cannot step into the breach.\nIt is here that we expect episodic control to be most useful. 
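The ordering argument above \u2013 that a fixed amount of computational noise corrupts choice far more when the true action values are close \u2013 can be quantified in a two-action sketch. The gap sizes and parameter values below are our own illustrative choices, mirroring the \u03b71 (downscaling) and \u03b72 (excess variability) corruption model:

```python
import numpy as np

def p_correct_choice(gap, eta1=1.0, eta2=2.0, n=100_000, seed=0):
    """Probability that a controller whose action values are downscaled by
    eta1 and corrupted by N(0, eta2**2) noise still picks the better of two
    actions whose true values differ by `gap`."""
    rng = np.random.default_rng(seed)
    q = np.array([0.0, gap])                       # true action values
    noisy = eta1 * q + eta2 * rng.standard_normal((n, 2))
    return float(np.mean(noisy.argmax(axis=1) == 1))

wide = p_correct_choice(gap=5.0)    # well-separated values: order preserved
narrow = p_correct_choice(gap=0.5)  # nearly equal values: order easily flipped
```

With a wide gap the better action is chosen almost always, while with a narrow gap (as early in learning, when only small apparent differences are licensed) the choice is barely better than chance.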
Intuitively, if a subject has experienced a\ncomplex environment just a few times, and found a sequence of actions that works reasonably well,\nthen, provided that exploitation is at a premium over exploration, it seems obvious for the subject\njust to repeat exactly those actions, rather than trying to build and use a complex model. This act of\nreplaying a particular sequence of events from the past is exactly an instance of episodic control.\n\nFigure 3: Episodic vs. model-based control. Solid red line shows the performance of noisy model-\nbased control (\u03b72 = 2), blue line shows that of episodic control. Dashed red line shows the case of\nperfect model-based control, which constitutes the best performance that could possibly be achieved.\nThe branching factor of the environment increases from B = 2 (A), through B = 3 (B), to B = 4 (C).\n\nMore speci\ufb01cally, we employ an extremely simple model of episodic memory, and assume that each\ntime the subject experiences a reward that is considered large enough (larger than expected a priori),\nit stores the speci\ufb01c sequence of state-action pairs leading up to this reward, and tries to follow\nsuch a sequence whenever it stumbles upon a state included in it. If multiple successful sequences\nare available for the same state, the one that yielded maximal reward is followed. We expect such\na strategy to be useful in the low data limit because, unlike in cache-based control, there is no\nissue of bootstrapping and temporal credit assignment, and unlike in model-based control, there is\nno exhaustive tree-search involved in action selection. 
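The strategy just described can be written down almost verbatim. The class below is a minimal sketch (the interface and names are ours): it stores any episode whose reward beats the a-priori expectation, and when a state recurs it replays the action from the maximally rewarded stored sequence containing that state, which we index per state for convenience:

```python
class EpisodicController:
    """Minimal episodic controller: store the state-action pairs of any
    episode whose reward exceeds the a-priori expectation, and replay the
    action from the best-rewarded stored episode when a state recurs."""

    def __init__(self, prior_mean_reward=0.0):
        self.prior = prior_mean_reward
        self.best = {}   # state -> (reward of stored episode, action taken)

    def observe_episode(self, state_actions, reward):
        # Only episodes that beat the prior expectation are worth storing.
        if reward <= self.prior:
            return
        for s, a in state_actions:
            r_old, _ = self.best.get(s, (float("-inf"), None))
            if reward > r_old:      # keep the maximally rewarded sequence
                self.best[s] = (reward, a)

    def act(self, state):
        """Return the remembered action, or None to signal 'no episode
        stored here' (e.g. defer to another controller, or explore)."""
        entry = self.best.get(state)
        return entry[1] if entry else None

ctrl = EpisodicController()
ctrl.observe_episode([("s0", "left"), ("s1", "up")], reward=2.0)
ctrl.observe_episode([("s0", "right")], reward=5.0)
```

There is no bootstrapping and no tree search: acting is a single memory look-up, which is why the approach is attractive precisely when experience is scarce.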
Of course, its advantages will be ultimately\ncounteracted by the haphazardness of using single samples that are \u2018adequate\u2019, but by that time the\nother controllers can take over.\nAlthough we expect our approximate analytical methods to provide some insight into its character-\nistics, we have so far only been able to use simulations to study the episodic controller in the usual\nclass of tMDPs. Comparing the blue (episodic) and red (model-based, but noisy; \u03b72 = 2) curves in\n\ufb01gure 3A-C, it is apparent that episodic control indeed outperforms noisy model-based control in the\nlow data limit. The dashed curves show the performance of the idealized model-based controller that\nis noise-free. This emphasizes the arbitrariness of our choice of noise level \u2013 the greater the noise,\nthe longer the dominance of episodic control. However, in complicated environments, even very\nsmall amounts of noise are near catastrophic for model-based control (see brown line in Fig. 2B),\nand so this issue is not nugatory.\nThe progression of the learning curves in \ufb01gure 3A-C makes the same point in a different way. They\nshow what happens as the complexity of the environment is increased by increasing the branching\nfactor. At the same level of computational noise, episodic control supplants model-based control for\nincreasing volumes of exploratory samples. We expect that the same is true if the complexity of the\nenvironment is increased by increasing the depth of the tree (D) instead, or as well.\nFigure 3A-C also makes the point that the asymptotic performance of the episodic controller is\nrather poor, and is barely improved by extra learning. A smarter episodic strategy, perhaps involving\nreconsolidation to eliminate unfortunate sample trajectories, might perform more competently.\n\n5 Discussion\n\nAn episodic controller operates by remembering for each state the single action that led to the best\noutcome so far observed. 
Here, we studied the nature and bene\ufb01ts of episodic control. This con-\ntroller is statistically inef\ufb01cient for solving Markov decision problems compared with the normative\nstrategy of building a statistical forward model of the transitions and outcomes, and searching for\nthe optimal action. However, episodic control is computationally very straightforward, and there-\nfore does not suffer from any excess uncertainty or noise arising from the severe calculational and\nsearch complexities of the forward model. This implies that it can best forward-model control under\nvarious circumstances.\nTo explore this, we \ufb01rst characterized a class of regular tree-structured Markov decision problems\nusing four critical parameters \u2013 the depth of the tree, the fan-out from each state, the number of\nactions per state, and the characteristic (Dirichlet) statistics of the transitions consequent on each\naction. We then used theoretical and empirical methods to analyze the statistical structure of control\nbased on a forward model in the face of limited data. We showed that this control can readily be\noutperformed by an episodic controller, which does not suffer from computational inaccuracy, at\nleast in the particular limits of high task complexity and signi\ufb01cant inferential noise in the model-\nbased controller. We also showed how the noise in the latter has a particularly pernicious effect\non the course of learning, corrupting the choice between actions whose values appear, because of\nlimited experience, closer than they actually are.\nOur results are obviously partial. 
In particular, the constraint of using a regular tree-structured MDP\nis much too severe \u2013 given the intuition from the results above, we can now consider more conven-\ntional MDPs that better model the classes of experiments that have probed the transfer of control.\nFurther, it would be important to consider models of exploration more general than the parallel sam-\npler, which provides homogeneous sampling of state-action pairs. The particular challenge is when\nexploration and exploitation are coupled, as then all the samples become interdependent in a Gordian\nmanner.\nOur analysis paralleled that of [5], who showed that the noisy forward-model controller is also\nbeaten by a cached (actor-critic-like) controller in the opposite limit of substantial experience in an\nenvironment. The cached controller is also computationally straightforward, but relies on a com-\npletely different structure of learning and inference.\nIn psychological terms, the episodic controller is best thought of as being goal-directed, since the\nultimate outcome forms part of the episode that is recalled. Unfortunately, this makes it dif\ufb01cult\nto distinguish behaviorally from goal-directed control resulting from the forward model. In neural\nterms, the episodic controller is likely to rely on the very well-investigated systems involved in\nepisodic memory, namely the hippocampus and medial temporal cortices. Importantly, there is direct\nevidence of the transfer of control from hippocampal to striatal structures over the course of learning\n[9, 10], and there is some evidence that episodic and habitual control can be simultaneously active.\nUnfortunately, there are few data [14] on structures that might control the competition or transfer\nprocess, and no test as to whether there is an intermediate phase in which prefrontal mechanisms\ninstantiating the forward model might be dominant. 
Predictions from our work associated with this\nare the most ripe for experimental test.\nThis paper is an extended answer to the question of the computational bene\ufb01t of episodic memory,\nwhich, crudely speaking, stores particular samples, over semantic memory, which stores probability\ndistributions. It is, of course, not the only answer \u2013 for instance, subjects that cache are obviously\nbetter off remembering exactly where they stored food [15] than having to search all\nthe places that are likely under a (semantic) prior. Equally, in game theoretic interactions between\ncompetitors, Nash equilibria are typically stochastic, and therefore seemingly excellent candidates\nfor control based on a semantic memory. However, taking advantage of the \ufb02aws in an opponent\nrequires exactly remembering how its actions deviate from stationary statistics, for which an episodic\nmemory is a most useful tool [16].\nOne potential caveat to our results is that methods associated with memory-based reasoning [17]\n(such as kernel density estimation) create a semantic memory from an episodic one, for instance\nby recalling all episodes close to a cue, and weighting them by a statistically-appropriate measure\nof their distance. This form of semantic memory can be seen as arising without any consolidation\nprocess whatsoever. However, although this method has its computational attractions, it is psycho-\nlogically implausible since phenomena such as priming make it extremely dif\ufb01cult to recall multiple\nclosely related samples from an episodic memory, let alone to do so in a statistically unbiased way\n(but see [18]).\nIn sum, we have provided a normative justi\ufb01cation from the perspective of appropriate control for\nthe episodic component of a multiple memory system. 
Pressing from a theoretical perspective is a\nricher understanding of the integration, beyond mere competition, of the information residing in, and\nthe decisions made by, all the systems involved in choice.\n\nAcknowledgements\n\nWe are very grateful to Nathaniel Daw and Quentin Huys for helpful discussions. Funding was from\nthe Gatsby Charitable Foundation (ML and PD), and the EU Framework 6 (IST-FET 1940) (ML).\n\nReferences\n[1] Dudai, Y. & Carruthers, M. The Janus face of Mnemosyne. Nature 434, 567 (2005).\n[2] K\u00e1li, S. & Dayan, P. Off-line replay maintains declarative memories in a model of hippocampal-\nneocortical interactions. Nat. Neurosci. 7, 286\u2013294 (2004).\n\n[3] McClelland, J.L., McNaughton, B.L. & O\u2019Reilly, R.C. Why there are complementary learning systems\nin the hippocampus and neocortex: insights from the successes and failures of connectionist models of\nlearning and memory. Psychol. Rev. 102, 419\u2013457 (1995).\n\n[4] White, N.M. & McDonald, R.J. Multiple parallel memory systems in the brain of the rat. Neurobiol Learn\nMem 77, 125\u2013184 (2002).\n\n[5] Daw, N.D., Niv, Y. & Dayan, P. Uncertainty-based competition between prefrontal and dorsolateral\nstriatal systems for behavioral control. Nat. Neurosci. 8, 1704\u20131711 (2005).\n\n[6] Sutton, R.S. & Barto, A.G. Reinforcement Learning (MIT Press, 1998).\n[7] Watkins, C.J.C.H. Learning from Delayed Rewards. PhD thesis, Cambridge University (1989).\n[8] Lengyel, M. & Dayan, P. Uncertainty, phase and oscillatory hippocampal recall. in Advances in Neural\nInformation Processing Systems 19 (eds. Sch\u00f6lkopf, B., Platt, J. & Hoffman, T.) 833\u2013840 (MIT Press,\nCambridge, MA, 2007).\n\n[9] Packard, M.G. & McGaugh, J.L. Double dissociation of fornix and caudate nucleus lesions on acquisition\nof two water maze tasks: further evidence for multiple memory systems. Behav. Neurosci. 106, 439\u2013446\n(1992).\n\n[10] Poldrack, R.A. et al. 
Interactive memory systems in the human brain. Nature 414, 546\u2013550 (2001).\n[11] Kearns, M. & Singh, S. Finite-sample convergence rates for Q-learning and indirect algorithms. in\nAdvances in Neural Information Processing Systems 11 (eds. Kearns, M.S., Solla, S.A. & Cohn, D.A.)\n996\u20131002 (MIT Press, Cambridge, MA, 1999).\n\n[12] Owen, D.B. & Steck, G.P. Moments of order statistics from the equicorrelated multivariate normal distri-\nbution. Ann Math Stat 33, 1286\u20131291 (1962).\n\n[13] Kording, K.P., Tenenbaum, J.B. & Shadmehr, R. The dynamics of memory as a consequence of optimal\nadaptation to a changing body. Nat. Neurosci. 10, 779\u2013786 (2007).\n\n[14] Poldrack, R.A. & Rodriguez, P. How do memory systems interact? Evidence from human classi\ufb01cation\nlearning. Neurobiol Learn Mem 82, 324\u2013332 (2004).\n\n[15] Clayton, N.S. & Dickinson, A. Episodic-like memory during cache recovery by scrub jays. Nature 395,\n272\u2013274 (1998).\n\n[16] Clayton, N.S., Dally, J.M. & Emery, N.J. Social cognition by food-caching corvids. The western scrub-jay\nas a natural psychologist. Philos. Trans. R. Soc. Lond. B Biol. Sci. 362, 507\u2013522 (2007).\n\n[17] Stan\ufb01ll, C. & Waltz, D. Toward memory-based reasoning. Communications of the ACM 29, 1213\u20131228\n(1986).\n\n[18] Hintzman, D.L. MINERVA 2: A simulation model of human memory. Behav Res Methods Instrum\nComput 16, 96\u2013101 (1984).\n", "award": [], "sourceid": 927, "authors": [{"given_name": "M\u00e1t\u00e9", "family_name": "Lengyel", "institution": null}, {"given_name": "Peter", "family_name": "Dayan", "institution": null}]}