{"title": "Better Transfer Learning with Inferred Successor Maps", "book": "Advances in Neural Information Processing Systems", "page_first": 9029, "page_last": 9040, "abstract": "Humans and animals show remarkable flexibility in adjusting their behaviour when their goals, or rewards in the environment change. While such flexibility is a hallmark of intelligent behaviour, these multi-task scenarios remain an important challenge for machine learning algorithms and neurobiological models alike. We investigated two approaches that could enable this flexibility: factorized representations, which abstract away general aspects of a task from those prone to change, and nonparametric, memory-based approaches, which can provide a principled way of using similarity to past experiences to guide current behaviour. In particular, we combine the successor representation (SR), that factors the value of actions into expected outcomes and corresponding rewards, with evaluating task similarity through clustering the space of rewards. The proposed algorithm inverts a generative model over tasks, and dynamically samples from a flexible number of distinct SR maps while accumulating evidence about the current task context through amortized inference. It improves SR's transfer capabilities and outperforms competing algorithms and baselines in settings with both known and unsignalled rewards changes. Further, as a neurobiological model of spatial coding in the hippocampus, it explains important signatures of this representation, such as the \"flickering\" behaviour of hippocampal maps, and trajectory-dependent place cells (so-called splitter cells) and their dynamics. We thus provide a novel algorithmic approach for multi-task learning, as well as a common normative framework that links together these different characteristics of the brain's spatial representation.", "full_text": "Better transfer learning with inferred successor maps\n\nTamas J. 
Madarasz\nUniversity of Oxford\n\ntamas.madarasz@ndcn.ox.ac.uk\n\nTimothy E. Behrens\nUniversity of Oxford\n\nbehrens@fmrib.ox.ac.uk\n\nAbstract\n\nHumans and animals show remarkable \ufb02exibility in adjusting their behaviour when\ntheir goals, or rewards in the environment change. While such \ufb02exibility is a\nhallmark of intelligent behaviour, these multi-task scenarios remain an important\nchallenge for machine learning algorithms and neurobiological models alike. We\ninvestigated two approaches that could enable this \ufb02exibility: factorized represen-\ntations, which abstract away general aspects of a task from those prone to change,\nand nonparametric, memory-based approaches, which can provide a principled\nway of using similarity to past experiences to guide current behaviour. In par-\nticular, we combine the successor representation (SR), that factors the value of\nactions into expected outcomes and corresponding rewards, with evaluating task\nsimilarity through clustering the space of rewards. The proposed algorithm inverts\na generative model over tasks, and dynamically samples from a \ufb02exible number\nof distinct SR maps while accumulating evidence about the current task context\nthrough amortized inference. It improves SR\u2019s transfer capabilities and outperforms\ncompeting algorithms and baselines in settings with both known and unsignalled\nrewards changes. Further, as a neurobiological model of spatial coding in the\nhippocampus, it explains important signatures of this representation, such as the\n\"\ufb02ickering\" behaviour of hippocampal maps, and trajectory-dependent place cells\n(so-called splitter cells) and their dynamics. 
We thus provide a novel algorithmic approach for multi-task learning, as well as a common normative framework that links together these different characteristics of the brain\u2019s spatial representation.\n\n1 Introduction\n\nDespite recent successes seen in reinforcement learning (RL) [1, 2], some important gulfs remain between sophisticated reward-driven learning algorithms and the behavioural \ufb02exibility observed in biological agents. Humans and animals seem especially adept at the ef\ufb01cient transfer of knowledge between different tasks, and the adaptive reuse of successful past behaviours in new situations, an ability that has sparked renewed interest in machine learning in recent years.\nSeveral frameworks have been proposed to help move the two forms of learning closer together, by incorporating transfer and generalisation capabilities into RL agents. Here we focus on two such ideas: abstracting away the general aspects of a family of tasks and combining them with speci\ufb01c task features on the \ufb02y through factorisation [3, 4]; and nonparametric, memory-based approaches [5, 6, 7, 8] that may help transfer learning by providing a principled framework for reusing information, based on inference about the similarity between the agent\u2019s current situation and situations observed in the past.\nWe focus in particular on a speci\ufb01c instance of the transfer learning problem, where the agent acts in an environment with \ufb01xed dynamics, but changing reward function or goal locations (but see section 5 for more involved changes in a task). This setting is especially useful for developing an intuition about how an algorithm balances the retention of knowledge about the environment shared between tasks with specializing its policy for the current instantiation at hand. 
This is also a central challenge\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fin the related problem of continual learning that has been examined in terms of stability-plasticity\ntrade-off [9], or catastrophic forgetting [10, 11].\nDayan\u2019s SR [3] is well-suited for transfer learning in settings with \ufb01xed dynamics, as the decom-\nposition of the value function into representations of expected outcomes (future state occupancies)\nand corresponding rewards allows us to quickly recompute values under new reward settings. Im-\nportantly, however, SR also suffers from limitations when applied in transfer learning scenarios: the\nrepresentation of expected future states still implicitly encodes previous reward functions through its\ndependence on the behavioural policy under which it was learnt, which in turn was tuned to exploit\nthese previous rewards. This can make it dif\ufb01cult to approximate the new optimal value function\nfollowing a large change in the environment\u2019s reward function, as states that are not en route to\nprevious goals/rewards will be poorly represented in the SR. In such cases the agent will stick to\nvisiting old reward locations that are no longer the most desirable, or take suboptimal routes to new\nrewards [12, 13].\nTo overcome this limitation, we combine the successor representation with a nonparametric clustering\nof the space of tasks (in this case the space of possible reward functions), and compress the repre-\nsentation of policies for similar environments into common successor maps. We provide a simple\napproximation to the corresponding hierarchical inference problem and evaluate reward function\nsimilarity on a diffused, kernel-based reward representation, which allows us to link the policies of\nsimilar environments without imposing any limitations on the precision or entropy of the policy being\nexecuted on a speci\ufb01c task. 
This similarity-based policy recall, operating at the task level, allows us\nto outperform baselines and previous methods in simple navigation tasks. Our approach naturally\nhandles unsignalled changes in the reward function with no explicit task or episode boundaries, while\nalso imposing reasonable limits on storage and computational complexity. Further, the principles of\nour approach should readily extend to settings with different types of factorizations, as SR itself can\nbe seen as an example of generalized value functions [4] that can extend the dynamic programming\napproach usually applied to rewards and values to other environmental features.\nWe also aim to build a learning system whose components are neuroscienti\ufb01cally grounded, and can\nreproduce some of the empirically observed phenomena typical of this type of learning. The presence\nof parallel predictive representations in the brain has previously been proposed in the context of\nsimple associative learning in amygdala and hippocampal circuits [14, 15], as well as speci\ufb01cally in\nthe framework of nonparametric clustering of experiences into latent contexts using a computational\napproach on which we also build [16]. Simultaneous representation of dynamic and diverse averages\nof experienced rewards has also been reported in the anterior cingulate cortex [17] and other cortical\nareas, and a representation of a probability distribution over latent contexts has been observed in\nthe human orbitofrontal cortex [18]. The hippocampus itself has long been regarded as serving\nboth as a pattern separator, as well as an autoassociative network, with attractor dynamics enabling\npattern completion [19, 20]. 
This balance between generalizing over similar experiences and tasks by compressing them into a shared representation, while also maintaining task-speci\ufb01c specialization, is a key feature of our proposed hippocampal maps.\nOn the neurobiological level we thus aim to offer a framework that binds these ideas into a common representation, linking two putative, but disparate functions of the hippocampal formation: a prospective map of space [21, 22, 23], and an ef\ufb01cient memory processing organ, in this case compressing experiences to help optimal decision making. We simulate two different rodent spatial navigation tasks: in the \ufb01rst we show that our model gives insights into the emergence of fast, "\ufb02ickering" remapping of hippocampal maps [24, 25], seen when rodents navigate to changing reward locations [26, 27]. In the second task, we provide a quantitative account of trajectory-dependent hippocampal representations (so-called splitter cells) [21] during learning. Our model therefore links these phenomena as manifestations of a common underlying learning and control strategy.\n\n2 Reinforcement Learning and the Successor Representation\n\nIn RL problems an agent interacts with an environment by taking actions, and receiving observations and rewards. Formally, an MDP can be de\ufb01ned as the tuple T = (S, A, p, R, \u03b3), specifying a set of states S, actions A, the state transition dynamics p(s'|s, a), a reward function R(s, a, s'), and the discount factor \u03b3 \u2208 [0, 1] that reduces the weight of rewards obtained further in the future. For the\n\nFigure 1: Model components: (a) Overview of the model. A successor map/network is sampled according to \u03c9, the probability weight vector over contexts. This sampled map is used to select an action; one or more SR maps receive TD updates, while \u03c9 is also updated given the experienced reward, using inference in a generative model of expected rewards. 
(b) Dirichlet process Gaussian mixture model of these average, or convolved reward (CR) values. The Dirichlet process is de\ufb01ned by a base distribution H and concentration parameter \u03b1, giving a distribution over CR value distributions. (c) Computing CR maps by convolving discounted rewards along experienced trajectories. (d) Neural network architecture for continuous state space tasks. (e) Example navigation environment for our experiments.\n\ntransfer learning setting, we will consider a family of MDPs with shared dynamics but changing reward functions, {T (S, A, p, \u00b7)}, where the rewards are determined by some stochastic process R. Instead of solving an MDP by applying dynamic programming to value functions, as in Q-learning [28], it is possible to compute expected discounted sums over future state occupancies as proposed by Dayan\u2019s SR framework. Namely, the successor representation maintains an expectation over future state occupancies given a policy \u03c0:\n\nM^\u03c0_t(s, a, s') = E(\u2211_{k=0}^\u221e \u03b3^k I(s_{t+k+1} = s') | s_t = s, a_t = a)    (1)\n\nWe will make the simplifying assumption that rewards r(s, a, s') are functions only of the arrival state s'. This allows us to represent value functions purely in terms of future state occupancies, rather than future state-action pairs, which is more in line with what is currently known about prospective representations in the hippocampus [22, 23, 29]. Our proposed modi\ufb01cations to the representation, however, extend to the successor representation predicting state-action pairs as well.\nIn the tabular case, or if the rewards are linear in the state representation \u03c6(s), the SR or successor features can be used to compute the action-value function Q(s, a) exactly, given knowledge of the current reward weights w satisfying r(s, a, s') = \u03c6(s') \u00b7 w. In this case the SR and the value function
In this case the SR and the value function\nare given by\n\n\u03b3k\u03c6(st+k+1)|st = s, at = a), Q\u03c0\n\nt (s, a) =\n\nt (s, a, s(cid:48)) \u00b7 w(s(cid:48)).\n\nM \u03c0\n\n(2)\n\nM \u03c0\n\nt (s, a, :) = E(\n\n\u221e(cid:88)\n\nk=0\n\n(cid:88)\n\ns(cid:48)\n\nWe can therefore apply the Bellman updates to this representation, as follows, and reap the bene\ufb01ts\nof policy iteration.\n\nM (st, at, :) \u2190 M (st, at, :) + \u03b1(\u03c6(st+1) + \u03b3 \u00b7 M (st+1, a(cid:63), :) \u2212 M (st, at, :))\na(cid:63) = arg max\n\n(M (st+1, a, :) \u00b7 w)\n\na\n\n(3)\n\n3\n\n\f3 Motivation and learning algorithm\n\n3.1 Model setup\n\nOur algorithm, the Bayesian Successor Representation (BSR, Fig. 1a, Algorithm 1 in Supplementary\nInformation) extends the successor temporal difference (TD) learning framework in the \ufb01rst instance\nby using multiple successor maps. The agent then maintains a belief distribution \u03c9, over these maps,\nand samples one at every step, according to these belief weights. This sampled map is used to\ngenerate an action, one (e.g. the most likely, or the sampled SR) or all SR maps receive TD updates,\nwhile the reward and observation signals are used to perform inference over \u03c9.\nOur approach rests on the intuition that it is advantageous to transfer policies between task settings\nwith similar reward functions, where similarity in this case means encountering similar goals or\nrewards along the similar trajectories in state space. The aim is to transfer policies between envi-\nronments where similar rewards/goals are near each other, while avoiding negative transfer, and\nwithout relying on model-based knowledge of the environment\u2019s dynamics that can be hard to learn in\ngeneral and can introduce errors. To achieve this, BSR adjudicates between different successor maps\nusing online inference over latent clusters of reward functions. 
We evaluate reward similarity using a kernel-based average of rewards collected along trajectories, a type of local (though temporally bidirectional) approximation to the value function. These average, or convolved, reward (CR) values v^cr (Fig. 1c) are then used to perform inference using a nonparametric Dirichlet process mixture model [30]. The mixture components determining the likelihoods of the V^cr are parametrized using a CR map for each context, represented by the vector w^cr_i, giving a Gaussian linear basis function model.\n\nG \u223c DP(\u03b1, H),    CR_map_i = w^cr_i \u223c G    (4)\nV^cr_i(s) \u223c N(\u03c6(s) \u00b7 w^cr_i, \u03c3\u00b2_CR)    (5)\n\nWe regard the choice of successor map as part of the inference over these latent contexts, by attaching, for each context, a successor map M_i to the corresponding CR map w^cr_i, which allows us to use the appropriate M_i for a policy iteration step (3), and for action selection\n\na = arg max_a (M_i(s_t, a, :) \u00b7 w_t).    (6)\n\nNamely, we sample M_i from the distribution over contexts \u03c9 to choose an action while pursuing a suitable exploration strategy, and perform weighted updates either on all maps if updates are inexpensive (e.g. in the tabular case), or only on the most likely map M_argmax(\u03c9) or the sampled map M_i if updates are expensive (e.g. the function approximation case). Finally, we de\ufb01ne the observed CR values as a dot product between rewards received and a symmetric kernel of exponentiated discount factors K_\u03b3 = [\u03b3^\u2212f, . . . , \u03b3^f]^T, such that v^cr_i(s) = r_{t\u2212f:t+f} \u00b7 K_\u03b3. The common reward feature vector w is learned by regressing it onto experienced rewards, i.e. minimizing \u2016\u03c6(s_t)^T \u00b7 w \u2212 r(t)\u2016. 
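As a rough sketch, the CR values can be obtained by sliding a discount kernel over the rewards of a trajectory. We read the symmetric kernel as K_gamma[k] = gamma^|k| for k = -f, ..., f; the exact kernel form and the zero-padding at trajectory ends are our assumptions, not specified in the text:

```python
import numpy as np

def convolved_rewards(rewards, gamma=0.9, f=2):
    """Convolved-reward (CR) values: v_cr(t) = r_{t-f:t+f} . K_gamma,
    with a symmetric kernel K_gamma[k] = gamma^{|k|} (our reading)."""
    K = gamma ** np.abs(np.arange(-f, f + 1))
    padded = np.pad(np.asarray(rewards, dtype=float), f)  # zero-pad trajectory ends
    return np.array([padded[t:t + 2 * f + 1] @ K for t in range(len(rewards))])

# A single reward at the goal diffuses onto neighbouring steps of the trajectory,
# in both temporal directions:
r = [0.0, 0.0, 1.0, 0.0, 0.0]
v_cr = convolved_rewards(r, gamma=0.9, f=2)
# v_cr == [0.81, 0.9, 1.0, 0.9, 0.81]
```

The bidirectional smearing is what makes CR values a local, policy-dependent proxy for the value function that two contexts with nearby goals will agree on.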
This setup allows the agent to accumulate evidence during and across episodes, infer the current reward context as it takes steps in the environment, and use this inference to improve its policy and select actions.\n\n3.2 Inference\n\nInference in this setting is challenging for a number of reasons: like value functions, CR values are policy dependent, and will change as the agent improves its policy, even during periods when the reward function is stationary. Since we would like the agent to \ufb01nd the optimal policy for the current environment as quickly as possible, the sampling will be biased and the v^cr observation likelihoods will change. Further, the dataset of observed CR values expands with every step the agent takes, and action selection requires consulting the posterior at every step, making usual Markov Chain Monte Carlo (MCMC) approaches like Gibbs sampling computationally problematic.\nSequential Monte Carlo (SMC) methods, such as particle \ufb01ltering, offer a compromise of faster computation at the cost of the \u2018stickiness\u2019 of the sampler, where previously assigned clusters are not updated in light of new observations. We adopt such an approach for inference in our algorithm, as we believe it to be well-suited to this multi-task RL setup with dynamically changing reward functions and policies. The exchangeability assumption of the DP provides an intuitive choice for the proposal distribution for the particle \ufb01lter, so we extend each particle c^i of partitions by sampling from the prior implied by the Chinese Restaurant Process (CRP) view of the DP:\n\nP(c^i_t = k | c^i_{1:t\u22121}) = m_k / (t \u2212 1 + \u03b1) if k is an existing cluster, where m_k is the number of observations assigned to cluster k, and \u03b1 / (t \u2212 1 + \u03b1) if k is a new cluster.    (7)\n\nThe importance weight for a particle of context assignments p^i requires integrating over the prior (base) distribution H using all the CR value observations. 
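The CRP proposal in (7) is a small function; the cluster counts and concentration parameter below are illustrative:

```python
import numpy as np

def crp_prior(counts, alpha):
    """CRP predictive probabilities, Eq. (7): existing cluster k with probability
    m_k / (t - 1 + alpha); a new cluster with probability alpha / (t - 1 + alpha)."""
    counts = np.asarray(counts, dtype=float)  # m_k for each existing cluster
    norm = counts.sum() + alpha               # t - 1 + alpha
    return np.append(counts, alpha) / norm    # last entry: open a new cluster

# Four observations already assigned (3 in cluster 0, 1 in cluster 1), alpha = 1:
p = crp_prior([3, 1], alpha=1.0)
# p == [0.6, 0.2, 0.2] -- rich-get-richer, plus reserved mass for a new context
```

Sampling each particle's next context assignment from these probabilities is exactly the proposal step described above; the concentration parameter alpha controls how readily new reward contexts are instantiated.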
If we adopt a conjugate prior to the Gaussian observation likelihood in (5), this can be done analytically: for a multivariate Gaussian prior for the mixture components, the posterior CR maps and the posterior predictive distribution required to calculate the importance weights will both be Gaussian.\nThis procedure, which we term Gaussian SR (GSR), still requires computing separate posteriors over the CR maps for each particle, with each computation involving the inversion of a matrix with dimensions dim(\u03c6) (see S1 for details). We therefore developed a more computationally ef\ufb01cient alternative, with a single posterior for each CR map and a single update performed on the map corresponding to the currently most likely context. We also forgo the Gaussian representation and perform simple delta-rule updates, while annealing the learning rate. Though we incur the increased space complexity of using several maps, by limiting computation to TD updates and the approximate inference outlined above, BSR provides an ef\ufb01cient algorithm for handling multi-task scenarios with multiple SR representations.\nRecently proposed approaches by Barreto et al. also adjudicate between several policies using successor features [31, 32], and as such we directly compare our methods to their generalized policy iteration (GPI) framework. In GPI the agent directly evaluates the value functions of all stored policies to \ufb01nd the overall largest action-value, and in later work also builds a linear basis of the reward space by pre-learning a set of base tasks. This approach can lead to dif\ufb01culties with the selection of the right successor map, as it depends strongly on how sharply policies are tuned and which states the agent visits near important rewards. A sharply tuned policy for a previous reward setting with reward locations close to current rewards could lose out to a broadly tuned, but mostly unhelpful competing policy. 
On the other hand, keeping all stored policies diffuse, or otherwise\nregularising them can be costly, as it can hinder exploitation or the fast instantiation of optimal\npolicies given new rewards. Similarly, constructing a set of base tasks [32] can be dif\ufb01cult as it\nmight require encountering a large number of tasks before successful transfer could be guaranteed, as\ndemonstrated most simply in a tabular setting.\n\n4 Experiments\n\n4.1 Grid-world with signalled rewards and context-speci\ufb01c replay\n\nWe \ufb01rst tested the performance of the model in a tabular maze navigation task (Fig. 1e), where both\nthe start and goal locations changed every 20 trials, giving a large number of partially overlapping\ntasks. This was necessary to test for precise, task-speci\ufb01c action selection in every state, which is\nnot required under some other scenarios [31, 12]. In the \ufb01rst experiment, the reward function was\nprovided to directly test the algorithms\u2019 ability to map out routes to new goals. Episodes terminated\nwhen the agent reached the goal state and received a reward, or after 75 steps if the agent has failed\nto reach the goal. We added walls to the interior of the maze to make the environment\u2019s dynamics\nnon-trivial, such that a single SR representing the uniform policy (diffusion-SR) would not be able\nto select optimal actions. We compared BSR to a single SR representation (SSR), an adaptation of\nGPI (SFQL) from Barreto et al. [31] to state-state SR, as well as an agent that was provided with\na pre-designed clustering, using a speci\ufb01c map whenever the goal was in a particular quadrant of\nthe maze (Known Quadrant, KQ). Each algorithm, except SSR, was provided with four maps to\nuse, such that GPI, once all its maps were in use, randomly selected one to overwrite, otherwise\nfollowing its original speci\ufb01cations. 
We added a replay buffer to each algorithm, and replayed randomly sampled transitions for all of our experiments in the paper. Replay buffers had speci\ufb01c compartments for each successor map, with each transition added to the compartment corresponding to the map used to select the corresponding action. The replay process thus formed part of our overall nonparametric approach to continual learning. We tested each algorithm with different \u03b5-greedy exploration rates \u03b5 \u2208 [0, 0.05, . . . , 0.35] (after an initial period of high exploration) and SR learning rates \u03b1_SR \u2208 [0.001, 0.005, 0.01, 0.05, 0.1]. Notably, BSR performed best across all\n\nFigure 2: Simulation results for best parameter settings. Error bars represent mean \u00b1 s.e.m. (a) Total number of steps taken across episodes to reach a changing, but signalled goal in the tabular grid-world navigation task. (b) BSR adjusts better to new goals than the other algorithms, as illustrated by the average length of each episode. (c) BSR with two types of exploration bonuses performs best at navigation with unsignalled goals, puddles, and task boundaries in a puddle-world environment. Different shades represent different exploration offsets for the relevant algorithms. (d) Exploration bonuses help BSR \ufb01nd new goals faster. (e) The proposed improvements transfer to a navigation task in a continuous state-space maze with unsignalled goal locations and task boundaries. (f) Example trajectories from Experiment II showing how BSR, but not SSR, adjusts to the optimal route over the same episodes.\n\nparameter settings, but performed best with \u03b5 = 0, whereas the other algorithms performed better with higher exploration rates. The total number of steps taken to complete all episodes, using the best performing setting for each algorithm, is shown in Fig. 2a and Table S1. Figs. 2b and S2 show lengths of individual episodes. 
Increasing the number of maps in GPI to 10 led to worse performance by the agent, showing that it wasn\u2019t the lack of capacity, but the inability to generalize well, that impaired its performance. We also compared BSR directly with the matched full particle \ufb01ltering solution, GSR, which performed slightly worse (Fig. S1), suggesting that BSR could maintain competitive performance with decreased computational costs.\n\n4.2 Puddle-world with unsignalled reward changes and task boundaries\n\nIn the second experiment, we made the environment more challenging by introducing puddles, which carried a penalty of 1 every time the agent entered a puddle state. Reward function changes were also not signalled, except that GPI still received task change signals to reset its use of the map and reset its memory buffer. Negative rewards are known to be problematic and can prevent agents from properly exploring and reaching goals; we therefore tried exploration rates up to and including 0.65, and also evaluated an algorithm that randomly switched between the different successor maps at every step, corresponding to BSR with \ufb01xed and uniform belief weights (Equal Weights, EW). This provided a control showing that it was not increased entropy or dithering that drove the better performance of BSR (Fig. 2c).\n\n4.3 Directed exploration with reward space priors and curiosity offsets\n\nAs expected, optimal performance required high exploration rates from the algorithms in this task (Table S2), which afforded us the opportunity to test if CR maps could also act to facilitate exploration. Since they act as a kind of prior for rewards in a particular context, it should be possible to use them to infer likely reward features, and direct actions towards these rewards. Because of the linear nature of SR, we can achieve this simply by offsetting the now context-speci\ufb01c reward weight vector w_i using the CR maps w^cr (Algorithm 1, line 7). This can help the agent \ufb02exibly redirect its
This can help the agent \ufb02exibly redirect its\n\n6\n\n\fFigure 3: Multiple maps and \ufb02ickering representations in the hippocampus. Fast paced hippocampal\n\ufb02ickering as (a) animals, and (b) BSR adjust to a new reward environment. For every time step,\nvertical lines show the difference between the z-scored correlation coef\ufb01cients of the current \ufb01ring\npattern to the pre-probe and to the post-probe \ufb01ring patterns. (c) Average evolution of z-score\ndifferences across learning trials within a session.\n\nexploration following a change in the reward function, as switching to a new map will encourage\nit to move towards states where rewards generally occur in that particular context (Fig. 2d,f).\nWe also experimented with constant offsets to w before each episode, which in turn is related to\nupper con\ufb01dence bound (UCB) exploration strategies popular in the bandit task setting. Under the\nassumption e.g. that reward features drift with Gaussian white noise between episodes, using a simple\nconstant padding function gives an UCB for the reward features. While we saw strong improvements\nfrom these constant offsets for SSR and EW as well, BSR also showed further strong improvements\nwhen the two types of exploration offsets were combined. These offsets represent a type of curiosity\nin the multi-task setting, where they encourage the exploration of states or features not visited recently,\nbut they are unlike the traditional pseudo rewards often used in reward shaping. They only temporarily\nin\ufb02uence the agent\u2019s beliefs about rewards, but never the actual rewards themselves. This means that\nthis reward-guidance isn\u2019t expected to interfere with the agent\u2019s behaviour in the same manner as\npseudo rewards can [33], however, like any prior, it can potentially have detrimental effects as well\nas the bene\ufb01cial ones demonstrated here. 
We leave a fuller exploration of these, including applying\nsuch offsets stepwise and integrating it into approaches like GPI, for future work.\n\n4.4 Function approximation setting\n\nThe tabular setting enables us to test many components of the algorithm and compare the emerging\nrepresentations to their biological counterparts. However, it is important to validate that these can\nbe scaled up and used with function approximators, allowing the use of continuous state and action\nspaces and more complex tasks. As a proof of principle, we created a continuous version of the maze\nfrom Experiment I, where steps were perturbed by 2D Gaussian noise, such that agents could end up\nvisiting any point inside the continuous state space of the maze. State embeddings were provided in\nthe form of Gaussian radial basis functions and agents used an arti\ufb01cial neural network equivalent of\nAlgorithm 1, where the Matrix components M (:, a, :) were replaced by a multi-layer perceptron \u03c8a\n(Fig. 1d). We tested BSR-4 with two different update strategies vs. SSR and GPI-4 in this setting,\nwith exploration rates up to 0.45, with BSR outperforming the other two again. Fig. 2c shows the\nperformance of the algorithms in terms of total reward collected by timestep in the environment, to\nemphasize the connection with continual learning.\n\n5 Neural data analysis\n\n5.1 Hippocampal \ufb02ickering during navigation to multiple, changing rewards\n\nOur algorithm draws inspiration from biology, where animals face similar continual learning tasks\nwhile foraging or evading predators in familiar environments. We performed two sets of analyses\n\n7\n\n\fFigure 4: Splitter cell representations from animals and arti\ufb01cial agents completing a Y-maze\nnavigation task with changing goals. (a) Outline of the Y-maze, adapted from [21]. Three possible\ngoal locations de\ufb01ne four possible routes. 
(b) Information about trial type in the animals\u2019 neural representation is similar to that in BSR. Horizontal axes show the actual trial type, vertical axes the decoded trial type at the start of the trial. Known Goal (KG) uses the same principle as KQ before.\n\nto test if the brain uses similar mechanisms to BSR, comparing our model to experimental data from rodent navigation tasks with changing reward settings. We used the successor representation as a proxy for neural \ufb01ring rates as suggested in [23], with \ufb01ring rates proportional to the expected discounted visitation of a state, r_t(s') \u221d M(s_t, a_t, s'). Our framework predicts the continual, fast-paced remapping of hippocampal prospective representations, in particular in situations where changing rewards increase the entropy of the distribution over maps. Intriguingly, such \u2018\ufb02ickering\u2019 of hippocampal place cells has indeed been reported, though a normative framework accounting for this phenomenon has been missing. Dupret et al. and Boccara et al. [27, 26] recorded hippocampal neurons in a task where rodents had to collect three hidden food items in an open maze, with the locations of the three rewards changing between sessions. Both papers report the \ufb02ickering of distinct hippocampal representations, gradually moving from being similar to the old one to being similar to the new one as measured by z-scored similarity (Fig. 3a, adapted with permission from [26]). BSR naturally and consistently captured the observed \ufb02ickering behaviour as shown for a representative session in Fig. 3b (with further sessions shown in Figs. S6-S9). Further, it was the only tested algorithm that captured the smooth, monotonic evolution of the z-scores across trials (Fig. 
3c), and gave the closest match of 0.90 to the empirically measured correlation coefficient of 0.95 characterizing this progression [26].

5.2 Splitter cells

Another intriguing phenomenon of spatial tuning in the hippocampus is the presence of so-called splitter cells, whose route-dependent firing at the current location is conditional on both previous and future states [34]. While the successor representation is purely prospective, in BSR the inference over the reward context also depends on the route taken, predicting exactly this type of past-and-future-dependent representation. Further, rather than a hard assignment to a particular map, our model predicts switching back and forth between representations in the face of uncertainty. We analysed data from rats performing a navigation task in a double Y-maze with 3 alternating reward locations [21] (Fig. 4a, adapted with permission). The central goal location has two possible routes (Routes 2 and 3), one of which is blocked every time this goal is rewarded, giving 4 trial types. These blockades momentarily change the dynamics of the environment, a challenge for SR [13]. Our inference framework, however, overcomes this challenge by 'recognizing' the lack of convolved rewards along the blocked route when the animal cannot access the goal location, allowing the algorithm to switch to using the alternate route. Other algorithms, notably GPI, struggle with this problem (Fig. S4), as GPI has to explicitly map the policy of going around the maze to escape the barrier. To further test our model's correspondence with hippocampal spatial coding, we followed the approach adopted in Grieves et al. [21] of trying to decode the trial type from hippocampal activity as animals begin the trial in the start box.
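The evidence accumulation described above, in which observed rewards along a route gradually shift a belief over candidate reward maps, can be illustrated with a minimal Bayesian update. The Gaussian observation model, function names, and per-step structure below are our simplifying assumptions for illustration, not the paper's exact amortized-inference scheme:

```python
import numpy as np

def update_map_posterior(log_w, reward_maps, state, observed_reward, noise_sd=0.5):
    """One evidence-accumulation step over candidate reward maps.

    log_w           : log-weights over K candidate maps
    reward_maps     : (K, n_states) reward each map predicts at each state
    state           : index of the state just visited
    observed_reward : reward actually received there
    """
    # Gaussian observation model: each map scores the observed reward
    predicted = reward_maps[:, state]
    log_lik = -0.5 * ((observed_reward - predicted) / noise_sd) ** 2
    log_w = log_w + log_lik
    log_w -= np.logaddexp.reduce(log_w)  # renormalize in log space
    return log_w

def sample_map(log_w, rng):
    """Sample which SR map to act with, rather than hard-assigning one."""
    return int(rng.choice(len(log_w), p=np.exp(log_w)))
```

With two maps predicting opposite reward locations, a single rewarded visit already concentrates the posterior on the consistent map, while residual probability mass on the other map produces the sampled "switching" behaviour discussed above.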
The analysis was only performed on successful trials, and thus a simple prospective representation would result in large values on the diagonal, as in the case of GPI and KQ. In contrast, the empirical data resembles the pattern predicted by BSR, where sampling of maps results in a more balanced representation, while still providing route-dependent encoding that differentiates all four possible routes already in the start box.

6 Related work

A number of recent or concurrent papers have proposed algorithms for introducing transfer into RL/deep RL settings by using multiple policies in some way, though none of them uses an inferential framework similar to ours, provides a principled way to deal with unsignalled task boundaries, or explains biological phenomena. We extensively discuss the work of Barreto et al. [31, 32] in the paper. Our approach shares elements with earlier work on probabilistic policy reuse [35], which also samples from a distribution over policies; however, it does so only at the beginning of each episode, does not follow a principled approach for inferring the weight distribution, and is limited in performance by its use of Q-values rather than factored representations. Wilson et al. [36] and Lazaric et al. [37] employed hierarchical Bayesian inference for generalization in multi-task RL using Gibbs sampling; however, neither used the flexibility afforded by the successor representation nor integrated online inference and control as we do in our method. [36] uses value iteration to solve the currently most likely MDP, while [37] applies the inference directly on state-value functions. Other approaches tackle the tension between generality and specialization by regularizing specialized policies in some way with a central general policy [38, 39], which we instead expressly tried to avoid here.
General value functions in the Horde architecture [4] also calculate several value functions in parallel, corresponding to a set of different pseudo-reward functions, and in this sense are closely related to, though more general than, the SR. Schaul et al. combined the Horde architecture with a factorization of value functions into state and goal embeddings to create universal value function approximators (UVFAs) [40]. Unlike our approach, UVFA-Horde first learns goal-specific value functions before trying to learn flexible goal and state embeddings through matrix factorization, such that successful transfer to unseen goals and states depends on the success of this implicit mapping. More recent work from Ma et al. [41] and Borsa et al. [42] combines the idea of universal value functions and SR to try to learn an approximate universal SR. Similarly to UVFAs, however, [41] relies on a neural network architecture implicitly learning the (smooth) structure of the value function and SR for different goals, in a setting where this smoothness is supported by the structure in the input state representation (visual input of nearby goals). Further, this representation is then used only as a critic to train a goal-conditioned policy for a new signalled goal location. [42] proposes to overcome some of these limitations by combining the UVFA approach with GPI. However, it does not formulate a general framework for choosing base policies and their embeddings when learning a particular task space, or for sampling these policies, nor does it address the issue of task boundaries and the online adjustment of policy use. Other recent work on continual learning has also mixed learning from current experience with selectively replaying old experiences that are relevant to the current task [43, 44].
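For reference, the GPI operation compared against throughout this paper has a compact form: given successor features for a set of base policies and a reward-weight vector for the current task, act greedily with respect to the per-action maximum over the base policies' values (Barreto et al. [31]). The array shapes and names in this minimal sketch are illustrative assumptions:

```python
import numpy as np

def gpi_action(psi, w):
    """Generalised policy improvement over successor features.

    psi : (n_policies, n_actions, d) successor features psi_i(s, a) at the
          current state, one set per base policy
    w   : (d,) reward weights of the current task, so Q_i(s, a) = psi_i(s, a) @ w
    """
    q = psi @ w                           # (n_policies, n_actions)
    return int(np.argmax(q.max(axis=0)))  # greedy over the policy-wise max
```

Because the max is taken per action across all base policies, the resulting behaviour can outperform every individual base policy on the new task, but, as discussed above, it still requires a base policy whose successor features cover the needed behaviour (e.g. going around a barrier).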
Our approach naturally incorporates, though does not depend on, such replay: relevant memories are sampled from SR-specific replay buffers, thus forming part of the overall clustering process. [45] also develops a nonparametric Bayesian approach to avoid relearning old tasks while identifying task boundaries for sequence prediction tasks, with possible applications to model-based RL, while [46] explored the relative advantages of clustering transition and reward functions jointly or independently for generalization. [12, 13] also discuss the limitations of SR in transfer scenarios, and [47] found evidence of these difficulties in certain policy revaluation settings in humans.

7 Conclusion and future work

In this paper we proposed an extension to the SR framework, coupling it with nonparametric clustering of the task space and amortized inference using diffuse, convolved reward values. We have shown that this can improve the representation's transfer capabilities by overcoming a major limitation, the policy dependence of the representation, and turning it instead into a strength through policy reuse. Our algorithm is naturally well-suited to continual learning settings where rewards in the environment persistently change. While in the current setting we only inferred a weight distribution over the different maps, with separate pairs of SR bases and CR maps, the framework opens the possibility of approaches that create essentially new successor maps from limited experience. One such avenue is a hierarchical approach similar to hierarchical DP mixtures [48], together with the composition of submaps or maplets, which could allow the agent to combine different skills according to the task's demands. We leave this for future work.
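The basic nonparametric ingredient behind such a flexible number of maps is a Chinese-restaurant-process prior, which trades off reusing an existing map against creating a new one. The sketch below is the textbook DP-mixture construction, not the paper's full generative model (in practice this prior is combined with a reward likelihood such as the one BSR infers); names are illustrative:

```python
import numpy as np

def crp_assignment_probs(counts, alpha):
    """CRP prior over assigning a new task context to an SR map.

    counts : number of past tasks already assigned to each existing map
    alpha  : concentration parameter; larger alpha favours new maps
    """
    counts = np.asarray(counts, dtype=float)
    # Existing maps in proportion to their popularity; last entry is the
    # probability of spawning a brand-new map.
    return np.append(counts, alpha) / (counts.sum() + alpha)
```

For example, with two existing maps used 3 and 1 times and alpha = 1, the prior reuses the first map with probability 0.6 and creates a new one with probability 0.2, letting the number of distinct SR maps grow with the diversity of tasks encountered.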
Further, in BSR we only represent uncertainty as a distribution over the different SR maps, but it is straightforward to extend the framework to represent uncertainty within the SR maps (over the SR associations) as well, and ultimately to incorporate these ideas into a more general framework of RL as inference [49].

8 Acknowledgements

We would like to thank Rex Liu for his very detailed and helpful comments and suggestions during the development of the manuscript, as well as Evan Russek and James Whittington for helpful comments and discussions. Thanks also to Roddy Grieves and Paul Dudchenko for generously sharing data from their experiments.

References

[1] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, 2015.

[2] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of Go with deep neural networks and tree search," Nature, 2016.

[3] P. Dayan, "Improving Generalization for Temporal Difference Learning: The Successor Representation," Neural Comput., 1993.

[4] R. S. Sutton, J. Modayil, M. Delp, T. Degris, P. M. Pilarski, A. White, and D. Precup, "Horde: A Scalable Real-time Architecture for Learning Knowledge from Unsupervised Sensorimotor Interaction," in Proc. AAMAS, 2011.

[5] A. Pritzel, B. Uria, S. Srinivasan, A. Puigdomènech, O. Vinyals, D. Hassabis, D. Wierstra, and C. Blundell, "Neural Episodic Control," arXiv preprint 1703.01988, 2017.

[6] S. J.
Gershman and N. D. Daw, \u201cReinforcement learning and episodic memory in humans and\nanimals: An integrative framework,\u201d Annual Review of Psychology, vol. 68, no. 1, pp. 101\u2013128,\n2017.\n\n[7] M. Botvinick, S. Ritter, J. X. Wang, Z. Kurth-Nelson, C. Blundell, and D. Hassabis, \u201cReinforce-\n\nment Learning, Fast and Slow,\u201d Trends Cogn. Sci., 2019.\n\n[8] M. Lengyel and P. Dayan, \u201cHippocampal Contributions to Control: The Third Way,\u201d in Adv.\n\nNeural Inf. Process. Syst., 2007.\n\n[9] G. A. Carpenter and S. Grossberg, \u201cA massively parallel architecture for a self-organizing\n\nneural pattern recognition machine,\u201d Comput. Vision, Graph. Image Process., 1987.\n\n[10] M. McCloskey and N. J. Cohen, \u201cCatastrophic Interference in Connectionist Networks: The\n\nSequential Learning Problem,\u201d Psychol. Learn. Motiv. - Adv. Res. Theory, 1989.\n\n[11] B. Ans and S. Rousset, \u201cAvoiding catastrophic forgetting by coupling two reverberating neural\n\nnetworks,\u201d Comptes Rendus l\u2019Academie des Sci. - Ser. III, 1997.\n\n[12] L. Lehnert, S. Tellex, and M. L. Littman, \u201cAdvantages and limitations of using successor\n\nfeatures for transfer in reinforcement learning,\u201d arXiv preprint 1708.00102, 2017.\n\n[13] E. M. Russek, I. Momennejad, M. M. Botvinick, S. J. Gershman, and N. D. Daw, \u201cPredictive\nrepresentations can link model-based reinforcement learning to model-free mechanisms,\u201d PLOS\nComputational Biology, 2017.\n\n[14] A. Courville, N. D. Daw, G. Gordon, and D. S. Touretzky, \u201cModel uncertainty in classical\n\nconditioning,\u201d in Adv. Neural Inf. Process. Syst., vol. 16, 2004.\n\n[15] T. J. Madarasz, L. Diaz-Mataix, O. Akhand, E. A. Ycu, J. E. LeDoux, and J. P. Johansen, \u201cEval-\nuation of ambiguous associations in the amygdala by learning the structure of the environment,\u201d\nNat. Neurosci., 2016.\n\n[16] S. J. Gershman, D. M. Blei, and Y. 
Niv, "Context, learning, and extinction," Psychol. Rev., vol. 117, 2010.

[17] D. Meder, N. Kolling, L. Verhagen, M. K. Wittmann, J. Scholl, K. H. Madsen, O. J. Hulme, T. E. Behrens, and M. F. Rushworth, "Simultaneous representation of a spectrum of dynamically changing value estimates during decision making," Nat. Commun., 2017.

[18] S. C. Y. Chan, Y. Niv, and K. A. Norman, "A Probability Distribution over Latent Causes, in the Orbitofrontal Cortex," J. Neurosci., 2016.

[19] M. A. Yassa and C. E. Stark, "Pattern separation in the hippocampus," Trends in Neurosciences, 2011.

[20] E. T. Rolls, "A theory of hippocampal function in memory," Hippocampus, 1996.

[21] R. M. Grieves, E. R. Wood, and P. A. Dudchenko, "Place cells on a maze encode routes rather than destinations," eLife, 2016.

[22] T. I. Brown, V. A. Carr, K. F. LaRocque, S. E. Favila, A. M. Gordon, B. Bowles, J. N. Bailenson, and A. D. Wagner, "Prospective representation of navigational goals in the human hippocampus," Science, 2016.

[23] K. L. Stachenfeld, M. M. Botvinick, and S. J. Gershman, "The hippocampus as a predictive map," Nat. Neurosci., 2017.

[24] K. Jezek, E. J. Henriksen, A. Treves, E. I. Moser, and M. B. Moser, "Theta-paced flickering between place-cell maps in the hippocampus," Nature, 2011.

[25] K. Kay, J. E. Chung, M. Sosa, J. S. Schor, M. P. Karlsson, M. C. Larkin, D. F. Liu, and L. M. Frank, "Regular cycling between representations of alternatives in the hippocampus," bioRxiv, 2019.

[26] C. N. Boccara, M. Nardin, F. Stella, J. O'Neill, and J. Csicsvari, "The entorhinal cognitive map is attracted to goals," Science, 2019.

[27] D. Dupret, J. O'Neill, and J. Csicsvari, "Dynamic reconfiguration of hippocampal interneuron circuits during spatial learning," Neuron, 2013.

[28] C. J. C. H.
Watkins, Learning from delayed rewards. PhD thesis, University of Cambridge, 1989.

[29] J. O'Doherty, P. Dayan, J. Schultz, R. Deichmann, K. Friston, and R. J. Dolan, "Dissociable Roles of Ventral and Dorsal Striatum in Instrumental Conditioning," Science, 2004.

[30] C. E. Antoniak, "Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems," Ann. Stat., 1974.

[31] A. Barreto, W. Dabney, R. Munos, J. J. Hunt, T. Schaul, H. van Hasselt, and D. Silver, "Successor Features for Transfer in Reinforcement Learning," in Adv. Neural Inf. Process. Syst., 2017.

[32] A. Barreto, D. Borsa, J. Quan, T. Schaul, D. Silver, M. Hessel, D. Mankowitz, A. Žídek, and R. Munos, "Transfer in Deep Reinforcement Learning Using Successor Features and Generalised Policy Improvement," 2019.

[33] A. Y. Ng, D. Harada, and S. J. Russell, "Policy invariance under reward transformations: Theory and application to reward shaping," in Proceedings of the Sixteenth International Conference on Machine Learning, 1999.

[34] P. A. Dudchenko and E. R. Wood, "Splitter cells: Hippocampal place cells whose firing is modulated by where the animal is going or where it has been," in Space, Time Mem. Hippocampal Form., 2014.

[35] F. Fernández and M. Veloso, "Probabilistic policy reuse in a reinforcement learning agent," in Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems, 2006.

[36] A. Wilson, A. Fern, S. Ray, and P. Tadepalli, "Multi-task reinforcement learning: a hierarchical Bayesian approach," in Proceedings of the 24th International Conference on Machine Learning, 2007.

[37] A. Lazaric and M. Ghavamzadeh, "Bayesian multi-task reinforcement learning," in Proceedings of the 27th International Conference on Machine Learning, 2010.

[38] Y. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J.
Kirkpatrick, R. Hadsell, N. Heess, and R. Pascanu, "Distral: Robust multitask reinforcement learning," in Adv. Neural Inf. Process. Syst., 2017.

[39] C. Finn, P. Abbeel, and S. Levine, "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks," arXiv preprint 1703.03400, 2017.

[40] T. Schaul, D. Horgan, K. Gregor, and D. Silver, "Universal Value Function Approximators," in Proceedings of the 32nd International Conference on Machine Learning, 2015.

[41] C. Ma, J. Wen, and Y. Bengio, "Universal successor representations for transfer reinforcement learning," in International Conference on Learning Representations, 2018.

[42] D. Borsa, A. Barreto, J. Quan, D. J. Mankowitz, H. van Hasselt, R. Munos, D. Silver, and T. Schaul, "Universal successor features approximators," in International Conference on Learning Representations, 2019.

[43] D. Rolnick, A. Ahuja, J. Schwarz, T. P. Lillicrap, and G. Wayne, "Experience Replay for Continual Learning," arXiv preprint 1811.11682, 2018.

[44] D. Isele and A. Cosgun, "Selective Experience Replay for Lifelong Learning," in AAAI Conference on Artificial Intelligence, 2018.

[45] K. Milan, J. Veness, J. Kirkpatrick, M. Bowling, A. Koop, and D. Hassabis, "The Forget-me-not Process," in Adv. Neural Inf. Process. Syst., 2016.

[46] N. T. Franklin and M. J. Frank, "Compositional clustering in task structure learning," PLOS Computational Biology, 2018.

[47] I. Momennejad, E. M. Russek, J. H. Cheong, M. M. Botvinick, N. D. Daw, and S. J. Gershman, "The successor representation in human reinforcement learning," Nat. Hum. Behav., 2017.

[48] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, "Hierarchical Dirichlet processes," Journal of the American Statistical Association, 2006.

[49] S.
Levine, "Reinforcement learning and control as probabilistic inference: Tutorial and review," CoRR, 2018.