{"title": "Fast deep reinforcement learning using online adjustments from the past", "book": "Advances in Neural Information Processing Systems", "page_first": 10567, "page_last": 10577, "abstract": "We propose Ephemeral Value Adjusments (EVA): a means of allowing deep reinforcement learning agents to rapidly adapt to experience in their replay buffer.\nEVA shifts the value predicted by a neural network with an estimate of the value function found by prioritised sweeping over experience tuples from the replay buffer near the current state. EVA combines a number of recent ideas around combining episodic memory-like structures into reinforcement learning agents: slot-based storage, content-based retrieval, and memory-based planning.\nWe show that EVA is performant on a demonstration task and Atari games.", "full_text": "Fast deep reinforcement learning using online\n\nadjustments from the past\n\nSteven S. Hansen \u2217, Pablo Sprechmann \u2217, Alexander Pritzel \u2217, Andr\u00e9 Barreto, Charles Blundell\n\n{stevenhansen,psprechmann,apritzel,andrebarreto,cblundell}@google.com\n\nDeepMind\n\nAbstract\n\nWe propose Ephemeral Value Adjusments (EVA): a means of allowing deep re-\ninforcement learning agents to rapidly adapt to experience in their replay buffer.\nEVA shifts the value predicted by a neural network with an estimate of the value\nfunction found by planning over experience tuples from the replay buffer near\nthe current state. EVA combines a number of recent ideas around combining\nepisodic memory-like structures into reinforcement learning agents: slot-based\nstorage, content-based retrieval, and memory-based planning. 
We show that EVA\nis performant on a demonstration task and Atari games.\n\n1\n\nIntroduction\n\nComplementary learning systems [McClelland et al., 1995, CLS] combine two mechanisms for\nlearning: one, fast learning and highly adaptive but poor at generalising, the other, slow at learning\nand consequentially better at generalising across many examples. The need for two systems re\ufb02ects\nthe typical trade-off between the sample ef\ufb01ciency and the computational complexity of a learning\nalgorithm. We argue that the majority of contemporary deep reinforcement learning systems fall into\nthe latter category: slow, gradient-based updates combined with incremental updates from Bellman\nbackups result in systems that are good at generalising, as evidenced by many successes [Mnih et al.,\n2015, Silver et al., 2016, Morav\u02c7c\u00edk et al., 2017], but take many steps in an environment to achieve\nthis feat.\nRL methods are often categorised as either model-free methods or model-based RL methods [Sutton\nand Barto, 1998]. In practice, model-free methods are typically fast at acting time, but computationally\nexpensive to update from experience, whilst model-based methods can be quick to update but\nexpensive to act with (as on-the-\ufb02y planning is required). Recently there has been interest in\nincorporating episodic memory-like structures into reinforcement learning algorithms [Blundell et al., 2016a,\nPritzel et al., 2017], potentially providing increases in \ufb02exibility and learning speed, driven by\nmotivations from the neuroscience literature known as Episodic Control [Dayan and Daw, 2008,\nGershman and Daw, 2017]. 
Episodic Control uses episodic memory in lieu of a learnt model of\nthe environment, aiming for a different computational trade-off to model-free and model-based\napproaches.\nWe will be interested in a hybrid approach, motivated by the observations of CLS [McClelland et al.,\n1995], where we will build an agent with two systems: one slow and general (model-free) and the\nother fast and adaptive (episodic control-like). Similar to previous proposals for agents, the fast,\nadaptive subsystem of our agent uses episodic memories to remember and later mimic previously\nexperienced rewarding sequences of states and actions. This can be seen as a memory-based form of\nplanning [Silver et al., 2008], in which related experiences are recalled to inform decisions. Planning\n\n\u2217denotes equal contribution.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fin this context can be thought of as the re-evaluation of the past experience using current knowledge to\nimprove model-free value estimates.\nCritical to many approaches to deep reinforcement learning is the replay buffer [Mnih et al., 2015,\nEspeholt et al., 2018]. The replay buffer stores previously seen tuples of experience: state, action,\nreward, and next state. These stored experience tuples are then used to train a value function\napproximator using gradient descent. Typically one step of gradient descent on data from the replay\nbuffer is taken per action in the environment, as (with the exception of [Barth-Maron et al., 2018])\na greater reliance on replay data leads to unstable performance. Consequently, we propose that the\nreplay buffer may frequently contain information that could signi\ufb01cantly improve the policy of an\nagent but never be fully integrated into the decision making of an agent. 
We posit that this happens\nfor three reasons: (i) the slow, global gradient updates to the value function due to noisy gradients and\nthe stability of learning dynamics, (ii) the replay buffer is of limited size and experience tuples are\nregularly removed (thus limiting the opportunity for gradient descent to learn from it), (iii) training\nfrom experience tuples neglects the trajectory nature of an agent\u2019s experience: one tuple occurs after\nanother and so information about the value of the next state should be quickly integrated into the\nvalue of the current state.\nIn this work we explore a method of allowing deep reinforcement learning agents to simultaneously:\n(i) learn the parameters of the value function approximation slowly, and (ii) adapt the value function\nquickly and locally within an episode. Adaptation of the value function is achieved by planning over\npreviously experienced trajectories (sequences of temporally adjacent tuples) that are grounded in\nestimates from the value function approximation. This process provides a complementary way of\nestimating the value function.\nInterestingly, our approach requires very little modi\ufb01cation of existing replay-based deep reinforce-\nment learning agents: in addition to storing the current state and next state (which are typically large:\nfull inputs to the network), we propose to also store trajectory information (pointers to successor\ntuples) and one layer of current hidden activations (typically much smaller than the state). Using\nthis information our method adapts the value function prediction using memory-based rollouts of\nprevious experience based on the hidden representation. The adjustment to the value function is not\nstored after it is used to take an action (thus it is ephemeral). 
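The augmented replay entry just described can be pictured as a small record: besides the usual state, action and reward fields, it keeps a pointer to the successor tuple (the trajectory information) and one layer of hidden activations. A minimal sketch, with illustrative names of our own choosing rather than the authors' implementation:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Transition:
    """One replay-buffer entry, augmented as described in the text.

    Besides the usual (state, action, reward) fields, it stores a pointer
    to the successor tuple (trajectory information) and the hidden
    activations used as an embedding for nearest-neighbour lookup.
    Field names are illustrative, not the authors' implementation.
    """
    state: object            # full input to the network (e.g. pixels), large
    action: int
    reward: float
    embedding: List[float]   # one layer of hidden activations, much smaller
    successor: Optional["Transition"] = None  # pointer to the next tuple

def trajectory_from(t: Transition, max_len: int) -> List[Transition]:
    """Follow successor pointers to recover a stored trajectory."""
    out: List[Transition] = []
    while t is not None and len(out) < max_len:
        out.append(t)
        t = t.successor
    return out
```

Storing only a pointer and a low-dimensional embedding keeps the overhead per tuple small relative to the full state.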
We call our method Ephemeral Value\nAdjustment (EVA).\n\n2 Background\n\nThe action-value function of a policy \u03c0 is de\ufb01ned as Q\u03c0(s, a) = E\u03c0 [\u03a3t \u03b3t rt | s, a] [Sutton and\nBarto, 1998], where s and a are the initial state and action respectively, \u03b3 \u2208 [0, 1] is a discount factor,\nand the expectation denotes that \u03c0 is followed thereafter. Similarly, the value function under\nthe policy \u03c0 at state s is given by V \u03c0(s) = E\u03c0 [\u03a3t \u03b3t rt | s] and is simply the expected return for\nfollowing policy \u03c0 starting at state s.\nIn value-based model-free reinforcement learning methods, the action-value function is represented\nusing a function approximator. Deep Q-Network agents [Mnih et al., 2015, DQN] use Q-learning\n[Watkins and Dayan, 1992] to learn an action-value function Q\u03b8(st, at) to rank which action at is\nbest to take in each state st at step t. Q\u03b8 is parameterised by a convolutional neural network (CNN),\nwith parameters collectively denoted by \u03b8, that takes a 2D pixel representation of the state st as input,\nand outputs a vector containing the value of each action at that state. The agent executes an \u03b5-greedy\npolicy to trade-off exploration and exploitation: with probability \u03b5 the agent picks an action uniformly\nat random, otherwise it picks the action at = arg maxa Q(st, a).\nWhen the agent observes a transition, DQN stores the (st, at, rt, st+1) tuple in a replay buffer, the\ncontents of which are used for training. This neural network is trained by minimizing the squared\nerror between the network\u2019s output and the Q-learning target yt = rt + \u03b3 maxa \u02dcQ(st+1, a), for a\nsubset of transitions sampled at random from the replay buffer. The target network \u02dcQ(st+1, a) is an\nolder version of the value network that is updated periodically. It was shown by Mnih et al. 
[2015]\nthat both the use of a target network and sampling uncorrelated transitions from the replay buffer\nare critical for stable training.\n\n\fFigure 1: Left: Trajectory-centric planning over memories in the replay buffer. Right: adjusting the\nparametric policy at action selection time using EVA.\n\n3 Ephemeral Value Adjustments\n\nEphemeral value adjustments are a way to augment an arbitrary value-based off-policy agent. This\nis accomplished through a trace computation algorithm, which rapidly produces value estimates\nby combining previously encountered trajectories with parametric estimates. Our agent consists of\nthree components: a standard parametric reinforcement learner with its replay buffer augmented to\nmaintain trajectory information, a trace computation algorithm that periodically plans over subsets\nof data in the replay buffer, and a small value buffer which stores the value estimates resulting from the\nplanning process. The overall policy of EVA is dictated by the action-value function,\n\nQ(s, a) = \u03bbQ\u03b8(s, a) + (1 \u2212 \u03bb)QNP(s, a)\n\n(1)\n\nwhere Q\u03b8 is the value estimate from the parametric model and QNP is the value estimate from the trace\ncomputation algorithm (non-parametric). Figure 1 (Right) shows a block diagram of the method. The\nparametric component of EVA consists of the standard DQN-style architecture, Q\u03b8, a feedforward\nconvolutional neural network: several convolution layers followed by two linear layers that ultimately\nproduce action-value function estimates. Training is done exactly as in DQN, brie\ufb02y reviewed in\nSection 2 and fully described in [Mnih et al., 2015].\n\n3.1 Trajectory selection and planning\n\nThe second to \ufb01nal layer of the DQN network is used to embed the currently observed state (pixels)\ninto a lower dimensional space. Note that similarity in this space has been optimised for action-value\nestimation by the parametric model. 
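The blending in Equation 1, combined with the nearest-neighbour lookup over the value buffer, can be sketched in a few lines. This is a minimal illustrative sketch (names are ours, not the authors' code), assuming the value buffer stores embeddings alongside their trace-computed action values:

```python
import numpy as np

def eva_q_values(q_theta, buffer_h, buffer_q, h_t, lam, k=5):
    """Blend parametric and non-parametric action values (Equation 1).

    q_theta:  (A,) parametric estimates Q_theta(s_t, .)
    buffer_h: (N, d) embeddings stored in the value buffer
    buffer_q: (N, A) trace-computed estimates Q_NP for those embeddings
    h_t:      (d,) embedding of the current state
    lam:      mixing hyper-parameter lambda in [0, 1]
    """
    dists = np.linalg.norm(buffer_h - h_t, axis=1)  # l2 distances in embedding space
    knn = np.argsort(dists)[:k]                     # indices of the k nearest neighbours
    q_np = buffer_q[knn].mean(axis=0)               # average their Q_NP estimates
    return lam * q_theta + (1.0 - lam) * q_np
```

With lam = 1 this reduces to the purely parametric policy; with lam = 0 the behaviour is driven entirely by the trace-computed values.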
Periodically (every 20 steps in all the reported experiments),\nthe k nearest neighbours in the global buffer are queried from the current state embedding (on the\nbasis of their \u21132 distance). Using the stored trajectory information, the 50 subsequent steps are also\nretrieved for each neighbour. Each of these k trajectories is passed to a trace computation algorithm\n(described below), and all of the resulting Q values are stored into the value buffer alongside their\nembedding. Figure 1 (Left) shows a diagram of this procedure. The non-parametric nature of this\nprocess means that while these estimates are less reliant on the accuracy of the parametric model,\nthey are more relevant locally. This local buffer is meant to cache the results of the trace computation\nfor states that are likely to be nearby the current state.\n\n3.2 Computing value estimates on memory traces\n\nBy having the replay buffer maintain trajectory information, values can be propagated through time\nto produce trajectory-centric value estimates QNP(s, a). Figure 1 (Right) shows how the value buffer\nis used to derive the action-value estimate. There are several methods for estimating this value\nfunction; we shall describe the n-step, trajectory-centric planning (TCP) and kernel-based RL (KBRL)\ntrace computation algorithms. N-step estimates for trajectories from the replay buffer are calculated\nas follows,\n\nVNP(st) = maxa Q\u03b8(st, a) if t = T; rt + \u03b3VNP(st+1) otherwise,\n\n(2)\n\nwhere T is the length of the trajectory and st and rt are the states and rewards of the trajectory. These\nestimates utilise information in the replay buffer that might not be consolidated into the parametric\nmodel, and thus should be complementary to the purely parametric estimates. 
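The n-step backup of Equation 2 is a single backward pass over a stored trajectory, bootstrapping from the parametric model only at the trajectory's end. A minimal sketch (illustrative names, not the authors' code):

```python
def n_step_values(rewards, q_theta_last, gamma):
    """N-step estimates along one stored trajectory (Equation 2).

    rewards:      [r_0, ..., r_{T-1}] received along the trajectory
    q_theta_last: parametric action values Q_theta(s_T, .) at the final state
    gamma:        discount factor
    Returns V_NP(s_t) for t = 0..T.
    """
    v = [0.0] * (len(rewards) + 1)
    v[-1] = max(q_theta_last)                 # V_NP(s_T) = max_a Q_theta(s_T, a)
    for t in reversed(range(len(rewards))):   # backward pass through time
        v[t] = rewards[t] + gamma * v[t + 1]  # V_NP(s_t) = r_t + gamma V_NP(s_{t+1})
    return v
```

The cost is linear in the trajectory length, which is what makes recomputing these estimates every few acting steps affordable.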
\fAlgorithm 1: Ephemeral Value Adjustments\nInput: Replay buffer D\nValue buffer L\nMixing hyper-parameter \u03bb\nMaximum roll-out hyper-parameter \u03c4\nfor e := 1, \u221e do\nfor t := 1, T do\nReceive observation st from environment with embedding ht\nCollect trace computed values from k nearest neighbours: QNP(sk, \u00b7) | h(sk) \u2208 KNN(h(st), L)\nQEVA(st, \u00b7) := \u03bbQ\u03b8(st, \u00b7) + (1 \u2212 \u03bb) (1/K) \u03a3K k=0 QNP(sk, \u00b7)\nat \u2190 \u03b5-greedy policy based on QEVA(st, \u00b7)\nTake action at, receive reward rt+1\nAppend (st, at, rt+1, ht, e) to D\nTm := (st:t+\u03c4, at:t+\u03c4, rt+1:t+\u03c4+1, ht:t+\u03c4, et:t+\u03c4) | h(sm) \u2208 KNN(h(st), D)\nQNP \u2190 computed over Tm via the TCP algorithm\nAppend (ht, QNP) to L\nend\nend\n\nWhile this process will serve as a useful baseline, the n-step return just evaluates the policy de\ufb01ned by the sampled trajectory;\nonly the initial parametric bootstrap involves an estimate of the optimal value function. Ideally,\nthe values at all time-steps should estimate the optimal value function,\n\nQ(s, a) \u2190 r(s, a) + \u03b3 maxa\u2032 Q(s\u2032, a\u2032).\n\n(3)\n\nThus another way to estimate QNP(s, a) is to apply the Bellman policy improvement operator at\neach time step, as shown in (3). While (2) could be applied recursively, traversing the trajectory\nbackwards, this improvement operator requires knowing the value of the counter-factual actions. We\ncall this trajectory-centric planning. 
We propose using the parametric model for these off-trajectory\nvalue estimates, constructing the complete set of action-conditional value-estimates, called\ntrajectory-centric planning (TCP):\n\nQNP(st, a) = rt + \u03b3VNP(st+1) if at = a; Q\u03b8(st, a) otherwise.\n\n(4)\n\nThis allows for the same recursive application as before,\n\nVNP(st) = maxa Q\u03b8(st, a) if t = T; maxa QNP(st, a) otherwise.\n\n(5)\n\nThe trajectory-centric estimates for the k nearest neighbours are then averaged with the parametric\nestimate on the basis of a hyper-parameter \u03bb, as shown in Algorithm 1 and represented graphically\non Figure 1 (Left). Refer to the supplementary material for a detailed algorithm.\n\n3.3 From trajectory-centric to kernel-based planning\n\nThe above method may seem ad hoc \u2013 why trust the on-trajectory samples completely and only utilise\nthe parametric estimates for the counter-factual actions? Why not analyse the trajectories together,\nrather than treating them independently? To address these concerns, we propose a generalisation of\nthe trajectory-centric method which extends kernel-based reinforcement learning (KBRL) [Ormoneit\nand Sen, 2002]. KBRL is a non-parametric approach to planning with strong theoretical guarantees.2\nFor each action a, KBRL stores experience tuples (st, rt, st+1) \u2208 Sa. Since Sa is \ufb01nite (equal to the\nnumber of stored transitions), and these states have known transitions, we can perform value iteration\n\n2Convergence to a global optimum assuming that underlying MDP dynamics are Lipschitz continuous, and the\n\nkernel is appropriately shrunk as a function of data.\n\n\fto obtain value estimates for all resultant states st+1 (the values of the origin states st are not needed,\nas the Bellman equation only evaluates states after a transition). 
We can obtain an approximate\nversion of the Bellman equation by using the kernel to compare all resultant states to all origin states,\nas shown in Equation 6. We de\ufb01ne a similarity kernel on states (in fact, embeddings of the current\nstate, as described above), \u03ba(s, s\u2032), typically a Gaussian kernel. The action-value function of KBRL\nis then estimated using:\n\nQNP(st, at) = \u03a3(s,r,s\u2032)\u2208Sa \u03ba(st, s) [r + \u03b3 maxa\u2032 QNP(s\u2032, a\u2032)]\n\n(6)\n\nIn effect, the stored \u2018origin\u2019 states (s \u2208 S) transition to some \u2018resultant state\u2019 (s\u2032 \u2208 S\u2032) and get the\nstored reward. By using a similarity kernel \u03ba(x0, x1), we can map resultant states to a distribution\nover the origin states. This makes the state transitions from S \u2192 S instead of S \u2192 S\u2032, meaning that\nall transitions only involve states that have been previously encountered.\nIn the context of trajectory-centric planning, KBRL can be seen as an alternative way of dealing with\ncounter-factual actions: estimate their effects using nearby transitions. Additionally, KBRL is not\nconstrained to dealing with individual trajectories, since it treats all transitions independently.\nWe propose to add an absorbing pseudo-state \u02c6s to KBRL\u2019s model whose similarity to the other pseudo-\nstates is \ufb01xed, that is, \u03ba(st, \u02c6s) = C for some C > 0 for all st. Using this de\ufb01nition we can make\nKBRL softly blend similarity and parametric counter-factual action evaluation. This is accomplished\nby setting the pseudo-state\u2019s value to be equal to the parametric value function evaluated at the state\nunder comparison: when st is being evaluated, QNP(\u02c6s, a) \u2248 Q\u03b8(st, a); thus by setting C appropriately,\nwe can guarantee that the parametric estimates will dominate when data density is low. 
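To make the two trace computations concrete, here is a minimal sketch of both the trajectory-centric backup of Equations 4 and 5 and the parametrically-augmented KBRL estimate of Equation 6 with the absorbing pseudo-state. This is an illustrative reconstruction under our own naming (embeddings as 1-D numpy arrays, q_theta a callable returning parametric action values), not the authors' implementation:

```python
import numpy as np

def tcp_values(actions, rewards, q_theta_seq, gamma):
    """Trajectory-centric planning over one stored trajectory (Eqs. 4 and 5).

    actions/rewards: the T actions taken and rewards received along it;
    q_theta_seq:     parametric estimates Q_theta(s_t, .) for t = 0..T.
    Counter-factual actions fall back on the parametric model; the taken
    action gets the stored reward plus the improved value of the next state.
    """
    v_next = max(q_theta_seq[-1])            # V_NP(s_T) = max_a Q_theta(s_T, a)
    q_np = [None] * len(actions)
    for t in reversed(range(len(actions))):  # backward pass through the trajectory
        q_t = list(q_theta_seq[t])           # off-trajectory: parametric estimates
        q_t[actions[t]] = rewards[t] + gamma * v_next
        q_np[t] = q_t
        v_next = max(q_t)                    # V_NP(s_t) = max_a Q_NP(s_t, a)
    return q_np

def kbrl_q_fn(transitions, q_theta, gamma, bandwidth, c, n_iters=50):
    """Parametrically-augmented KBRL (Eq. 6) on a small batch of transitions.

    transitions: dict mapping each action to a list of (s, r, s_next) tuples;
    c:           similarity of the absorbing pseudo-state, whose value is the
                 parametric estimate, so Q_theta dominates off-data.
    Returns a callable q(s, a) evaluating Equation 6 at a query state.
    """
    acts = sorted(transitions)
    offsets, resultants = {}, []
    for a in acts:                           # resultant states are the unknowns
        offsets[a] = len(resultants)
        resultants += [sn for (_, _, sn) in transitions[a]]

    def kappa(x, y):                         # Gaussian similarity kernel
        return np.exp(-np.sum((x - y) ** 2) / bandwidth)

    def q_at(s, a, v):                       # one application of Equation 6
        w = np.array([kappa(s, s0) for (s0, _, _) in transitions[a]] + [c])
        w = w / w.sum()                      # similarities -> distribution
        backups = [r + gamma * v[offsets[a] + i]
                   for i, (_, r, _) in enumerate(transitions[a])]
        backups.append(float(q_theta(s)[a]))  # pseudo-state -> parametric value
        return float(np.dot(w, backups))

    v = np.array([max(q_theta(sn)) for sn in resultants])  # parametric init
    for _ in range(n_iters):                 # value iteration over resultant states
        v = np.array([max(q_at(s, a, v) for a in acts) for s in resultants])
    return lambda s, a: q_at(s, a, v)
```

In the TCP sketch the cost is linear in the trajectory length, whereas the KBRL sketch pays for kernel comparisons between all stored states, mirroring the complexity gap discussed next.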
Note that this\nis in addition to the blending of value functions described in Equation 1.\nKBRL can be made numerically identical to trajectory-centric planning by shrinking the kernel\nbandwidth (i.e., the length scale of the Gaussian kernel) and pseudo-state similarity.3 With the\nappropriate values, this will result in value estimates being dominated by exact matches (on-trajectory)\nand parametric estimates when none are found. This reduction is of interest as KBRL is signi\ufb01cantly\nmore expensive than trajectory-centric planning. KBRL\u2019s computational complexity is O(AN\u00b2) and\ntrajectory-centric planning has a complexity of O(N), where N is the number of stored transitions\nand A is the cardinality of the action space. We can thus think of this parametrically augmented\nversion of KBRL as the theoretical foundation for trajectory-centric planning. In practice, we use the\nTCP trace computation algorithm (Equations 4 and 5) unless otherwise noted.\n\n4 Related work\n\nThere has been a lot of recent work on using memory-augmented neural networks as function\napproximators for RL agents: using LSTMs [Bakker et al., 2003, Hausknecht and Stone, 2015], or\nmore sophisticated architectures [Graves et al., 2016, Oh et al., 2016, Wayne et al., 2018]. However,\nthe motivation behind these works is to obtain a better state representation in partially observable or\nnon-Markovian environments, in which feed-forward models would not be appropriate. The focus of\nthis work is on data ef\ufb01ciency, which is improved in a representation agnostic manner.\nThe main use of long term episodic memory is the replay buffer introduced by DQN.\nWhile it is central to stable training, it also allows one to signi\ufb01cantly improve the data ef\ufb01ciency of the\nmethod, compared with online counterparts that achieve stable training by having several actors\n[Mnih et al., 2016]. 
The replay frequency is a hyper-parameter that has been carefully tuned in DQN.\nLearning cannot be sped up by increasing the frequency of replay without harming end performance.\nThe problem is that the network would over\ufb01t to the content of the replay buffer, affecting its ability\nto learn a better policy. An alternative approach is prioritised experience replay [Schaul et al., 2015],\nwhich changes the data distribution used during training by biasing it toward transitions with high\ntemporal difference error. These works use the replay buffer during training time only. Our approach\naims at leveraging the replay buffer at decision time and thus is complementary to prioritisation, as\nit impacts the behaviour policy but not how the replay buffer is sampled from (see the supplementary\nmaterial for a preliminary comparison).\n\n3Modulo the fact that KBRL would still be able to \ufb01nd \u2018shortcuts\u2019 between or within trajectories owing to its\n\nexhaustive similarity comparisons between states.\n\n\fFigure 2: Left: Performance of EVA run on a single episode using a pre-trained DQN agent (and\ncorresponding replay buffer) for 300K steps and 4 coins, see text for detailed description. Results\nare the average over 200 runs. EVA provides an immediate boost in performance. We can see that the\nbene\ufb01ts saturate as \u03bb increases. Center and Right: Performance when using EVA throughout training.\n\u03bb = 0 corresponds to the DQN baseline with 1 (Center) and 2 coins (Right)\n\nUsing previous experience at decision time is closely related to non-parametric approaches for Q-\nfunction approximation [Santamar\u00eda et al., 1997, Munos and Moore, 1998, Gabel and Riedmiller,\n2005]. Our work is particularly related to techniques following the ideas of episodic control. Blundell\net al. [2016b, MFEC] recently used local regression for Q-function estimation using the mean of\nthe k-nearest neighbours searched over random projections of the pixel inputs. 
Pritzel et al. [2017]\nextended this line of work with NEC, using the reward signal to learn an embedding space in which\nto perform the local-regression. These works showed dramatic improvements in data ef\ufb01ciency,\nespecially in early stages of training. This work differs from these approaches in that, rather than using\nmemory for local regression, memory is used as a form of local planning, which is made possible by\nexploiting the trajectory structure of the memories in the replay buffer. Furthermore, the memory\nrequirements of NEC are signi\ufb01cantly larger than those of EVA. NEC uses a large memory buffer per\naction in addition to a replay buffer. Our work only adds a small overhead over the standard DQN\nreplay buffer and needs to query a single replay buffer once every several acting steps (20 in our\nexperiments) during training. In addition, NEC and MFEC fundamentally change the structure of\nthe model, whereas EVA is strictly supplemental. More recent works have looked at including NEC-style\narchitectures to aid the learning of a parametric model [Nishio and Yamane, 2018, Jain and\nLindsey, 2018], sharing memory requirements with NEC.\nThe memory-based planning aspect of our approach also has precedent in the literature. Brea [2017]\nexplicitly compares a local regression approach (NEC) to prioritised sweeping and \ufb01nds that the latter\nis preferable, but fails to show scalable results. Savinov et al. [2018] build a memory-based graph and\nplan over it, but rely on a \ufb01xed exploration policy. Xiao et al. [2018] combine MCTS planning with\nNEC, but rely on a built-in model of the environment.\nIn the context of supervised learning, several works have looked at using non-parametric\napproaches to improve the performance of models using neural networks. Kaiser et al. [2016]\nintroduced a differentiable layer of key-value pairs that can be plugged into a neural network to help\nit remember rare events. 
Works in the context of language modelling have augmented prediction with\nattention over recent examples to account for the distributional shift between training and testing\nsettings, such as neural cache [Grave et al., 2016] and pointer sentinel networks [Merity et al., 2016].\nThe work by Sprechmann et al. [2018] is also motivated by the CLS framework. However, they use\nan episodic memory to improve a parametric model in the context of supervised learning and do not\nconsider reinforcement learning.\n\n5 Experiments\n\n5.1 A simple example\n\nWe begin the experimental section by showing how EVA works on a simple \u201cgridworld\u201d environment\nimplemented with the pycolab game engine [Stepleton, 2017]. The task is to collect a given number\nof coins in the minimum number of steps possible, which can be thought of as a very simple variant of the\ntravelling salesman problem. At the beginning of each episode, the agent and the coins are placed at a\n\n\fFigure 3: Comparison of the learning curves averaged over three random seeds of the EVA agent and the\nbaseline according to the mean (Left) and median (Right) human normalised score. The x-axis is in\nmillions of environment frames. We also included the original DQN results from [Mnih et al., 2015].\n\nrandom location of a grid with size 5 \u00d7 13, see the supplementary material for a screen-shot. The\nagent can take four possible actions {left, right, up, down} and receives a reward of 1 when collecting\na coin and a reward of \u22120.01 at every step. If the agent takes an action that would move it into a wall,\nit stays at its current position. We restrict the maximum length of an episode to 500 steps. 
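The environment just described is small enough to sketch directly. The following is a minimal illustrative reimplementation of the task's rules as stated in the text (class and method names are ours; this is not the pycolab version used in the paper):

```python
import random

class CoinGrid:
    """Sketch of the coin-collecting gridworld described above: a 5 x 13
    grid, reward +1 per coin and -0.01 per step, walls block movement,
    and episodes are capped at 500 steps. Illustrative only."""
    MOVES = {"left": (0, -1), "right": (0, 1), "up": (-1, 0), "down": (1, 0)}

    def __init__(self, n_coins=1, height=5, width=13, seed=0):
        rng = random.Random(seed)
        self.h, self.w = height, width
        cells = [(r, c) for r in range(height) for c in range(width)]
        picks = rng.sample(cells, n_coins + 1)  # random agent and coin placement
        self.agent, self.coins = picks[0], set(picks[1:])
        self.t = 0

    def step(self, action):
        r, c = self.agent
        dr, dc = self.MOVES[action]
        nr, nc = r + dr, c + dc
        if 0 <= nr < self.h and 0 <= nc < self.w:  # moving into a wall: stay put
            self.agent = (nr, nc)
        self.t += 1
        reward = -0.01                              # per-step penalty
        if self.agent in self.coins:
            self.coins.discard(self.agent)
            reward += 1.0                           # coin collected
        done = not self.coins or self.t >= 500
        return reward, done
```

With one coin the state space is small enough to fit in the replay buffer, which matches the observation below about the single-coin case.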
We use\nan agent featuring a two-layer convolutional neural network, followed by a fully connected layer\nproducing a 64-dimensional embedding which is then used for the look-ups in the replay buffer of\nsize 50K. The input is an RGB image of the maze. Results are reported in Figure 2.\n\nEvaluation of a single episode We use the same pre-trained network (with its corresponding replay\nbuffer) and run a single episode with and without using EVA, see Figure 2 (Left). We can see that, by\nleveraging the trajectories in the replay buffer, EVA immediately boosts performance of the baseline.\nNote that the weights of the network are exactly the same in both cases. The bene\ufb01ts saturate around\n\u03bb = 0.4, which suggests that the policy of the non-parametric component alone is unable to generalise\nproperly.\n\nEvaluation of the full EVA algorithm Figure 2 (Center and Right) shows the performance of EVA on\nfull episodes using one and two coins, evaluating different values of the mixing parameter \u03bb. \u03bb = 0\ncorresponds to the standard DQN baseline. We show the hyper-parameters that lead to the highest end\nperformance of the baseline DQN. We can see that EVA provides a signi\ufb01cant boost in data ef\ufb01ciency.\nFor the single coin case, it requires slightly more than half of the data to obtain \ufb01nal performance and\na higher value of \u03bb is better. This is likely due to the fact that there are only 4K unique states,\nthus all states are likely to be in the replay buffer. In the two coin setting, however, the number of\npossible states is approximately 195K, which is signi\ufb01cantly larger than the\nreplay buffer size. Again here, performance saturates around \u03bb = 0.4.\n\n5.2 EVA and Atari games\n\nIn order to validate whether EVA leads to gains in complex domains we evaluated our approach\non the Arcade Learning Environment (ALE; Bellemare et al., 2013). 
We used the set of 55 Atari\ngames; please see the supplementary material for details. The hyper-parameters were tuned using a\nsubset of 5 games (Pong, H.E.R.O., Frostbite, Ms Pacman and Qbert). The hyper-parameters shared\nbetween the baseline and EVA (e.g. learning rate) were chosen to maximise the performance of\nthe baseline (\u03bb = 0) on a run over 20M frames on the selected subset of games. The in\ufb02uence of\nthese hyper-parameters on EVA and the baseline is highly correlated. Performance saturates around\n\u03bb = 0.4 as in the simple example. We chose the lowest frequency that would not harm performance\n(20 steps), the rollout length was set to 50 and the number of neighbours used for estimating QNP\nwas set to 5. We observed that performance decreases as the number of neighbours increases. See the\nsupplementary material for details on all hyper-parameters used.\n\n\fFigure 4: Comparison of the learning curves averaged over three random seeds of the EVA agent with\ndifferent trace computations according to the mean (Left) and median (Right) human normalised\nscore. The x-axis is in 10s of millions of environment frames.\n\nWe compared absolute performance of agents according to human normalised score as in Mnih et al.\n[2015]. Figure 3 summarises the obtained results, where we ran three random seeds for \u03bb = 0 (which\nis our version of DQN) and EVA with \u03bb = 0.4. In order to obtain uncertainty estimates, we report the\nmean and standard deviation per time step of the curves obtained by randomly selecting one random\nseed per game (that is, one out of three possible seeds for each of the 55 games). For reference,\nwe also included the original DQN results from [Mnih et al., 2015]. 
EVA is able to improve the\nlearning speed as well as the \ufb01nal performance level using exactly the same architecture and learning\nparameters as our baseline. It is able to achieve the end performance of the baseline in 40 million\nframes.\n\nEffect of trace computation To understand how EVA helps performance, we compare three\ndifferent versions of the trace computation at the core of the EVA approach. The standard (trajectory-\ncentric) trace computation can be simpli\ufb01ed by removing the parametric evaluations of counter-factual\nactions. This ablation results in the n-step trace computation (as shown in Equation (2)). Since the standard trace\ncomputation can be seen as a special case of parametrically-augmented KBRL, we also consider this\ntrace computation. Due to the increased cost of this trace computation, these experiments are\nonly run for 40 million frames. For parametrically-augmented KBRL, a Gaussian similarity kernel is\nused with a bandwidth parameter of 10\u22124 and a parametric similarity of 10\u22122.\nEVA is signi\ufb01cantly worse than the baseline with the n-step trace computation. This can be seen as\nevidence for the importance of the parametric evaluation of counter-factual actions. Without this\nadditional computation, EVA\u2019s policy is too dependent on the quality of the policy expressed in the\ntrajectories, a negative feedback loop that results in divergence on several games. Interestingly, the\nstandard trace computation is as good as, if not better than, the much more costly KBRL method.\nWhile KBRL is capable of merging the data from the different trajectories into a global plan, it does\nnot give on-trajectory information a privileged status without an extremely small bandwidth.4 In\nnear-deterministic environments like Atari, this privileged status is appropriate and acts as a strong\nprior, as can be seen in the lower variance of this method.\n\nConsolidation EVA relies on TCP at decision time. 
However, one would expect that after\ntraining, the parametric model would be able to consolidate the information available in the episodic\nmemory and be capable of acting without relying on the planning process. We veri\ufb01ed that annealing\nthe value of \u03bb to zero over two million steps leads to no degradation in performance on our Atari\nexperiments. Note that when \u03bb = 0 our agent reduces to the standard DQN agent.\n\n4To achieve this privileged status for on-trajectory information, the minimum off-trajectory similarity must\n\nbe known, and typically results in a bandwidth so small as to be numerically unstable.\n\n\f6 Discussion\n\nDespite only changing the value function underlying the behaviour policy, EVA improves the overall\nrate of learning. This is due to two factors. The \ufb01rst is that the adjusted policy should be closer to\nthe optimal policy by better exploiting the information in the replay data. The second is that this\nimproved policy should \ufb01ll the replay buffer with more useful data. This means that the ephemeral\nadjustments indirectly impact the parametric value function by changing the distribution of data that\nit is trained on.\nDuring the training process, as the agent explores the environment, knowledge about value functions\nis extracted gradually from the interactions with the environment. 
Since the value function drives the data acquisition process, the ability to quickly incorporate highly rewarded experiences could significantly boost the sample efficiency of the learning process.

Acknowledgments

The authors would like to thank Melissa Tan, Paul Komarek, Volodymyr Mnih, Alistair Muldal, Adrià Badia, Hado van Hasselt, Yotam Doron, Ian Osband, Daan Wierstra, Demis Hassabis, Dharshan Kumaran, Siddhant Jayakumar, Razvan Pascanu, and Oriol Vinyals. Finally, we thank the anonymous reviewers for their comments and suggestions to improve the paper.

References

James L McClelland, Bruce L McNaughton, and Randall C O'Reilly. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102(3):419, 1995.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513, 2017.

Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

Charles Blundell, Benigno Uria, Alexander Pritzel, Yazhe Li, Avraham Ruderman, Joel Z Leibo, Jack Rae, Daan Wierstra, and Demis Hassabis. Model-free episodic control.
arXiv preprint arXiv:1606.04460, 2016a.

Alexander Pritzel, Benigno Uria, Sriram Srinivasan, Adrià Puigdomènech, Oriol Vinyals, Demis Hassabis, Daan Wierstra, and Charles Blundell. Neural episodic control. ICML, 2017.

Peter Dayan and Nathaniel D Daw. Decision theory, reinforcement learning, and the brain. Cognitive, Affective, & Behavioral Neuroscience, 8(4):429–453, 2008.

Samuel J Gershman and Nathaniel D Daw. Reinforcement learning and episodic memory in humans and animals: an integrative framework. Annual Review of Psychology, 68:101–128, 2017.

David Silver, Richard S Sutton, and Martin Müller. Sample-based learning and search with permanent and transient memories. In Proceedings of the 25th International Conference on Machine Learning, pages 968–975. ACM, 2008.

Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.

Gabriel Barth-Maron, Matthew W Hoffman, David Budden, Will Dabney, Dan Horgan, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap. Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617, 2018.

Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

Dirk Ormoneit and Śaunak Sen. Kernel-based reinforcement learning. Machine Learning, 49(2-3):161–178, 2002.

Bram Bakker, Viktor Zhumatiy, Gabriel Gruener, and Jürgen Schmidhuber. A robot that reinforcement-learns to identify and memorize important previous observations. In Intelligent Robots and Systems, 2003 (IROS 2003), Proceedings of the 2003 IEEE/RSJ International Conference on, volume 1, pages 430–435. IEEE, 2003.

Matthew Hausknecht and Peter Stone. Deep recurrent q-learning for partially observable MDPs.
arXiv preprint arXiv:1507.06527, 2015.

Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476, 2016.

Junhyuk Oh, Valliappa Chockalingam, Honglak Lee, et al. Control of memory, active perception, and action in minecraft. In Proceedings of the 33rd International Conference on Machine Learning, pages 2790–2799, 2016.

Greg Wayne, Chia-Chun Hung, David Amos, Mehdi Mirza, Arun Ahuja, Agnieszka Grabska-Barwinska, Jack Rae, Piotr Mirowski, Joel Z Leibo, Adam Santoro, et al. Unsupervised predictive memory in a goal-directed agent. arXiv preprint arXiv:1803.10760, 2018.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy P Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016.

Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. CoRR, abs/1511.05952, 2015.

Juan C Santamaría, Richard S Sutton, and Ashwin Ram. Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive Behavior, 6(2):163–217, 1997.

Remi Munos and Andrew W Moore. Barycentric interpolators for continuous space and time reinforcement learning. In NIPS, pages 1024–1030, 1998.

Thomas Gabel and Martin Riedmiller. CBR for state value function approximation in reinforcement learning. In International Conference on Case-Based Reasoning, pages 206–221. Springer, 2005.

Charles Blundell, Benigno Uria, Alexander Pritzel, Yazhe Li, Avraham Ruderman, Joel Z Leibo, Jack Rae, Daan Wierstra, and Demis Hassabis. Model-free episodic control.
arXiv preprint arXiv:1606.04460, 2016b.

Daichi Nishio and Satoshi Yamane. Faster deep q-learning using neural episodic control. arXiv preprint arXiv:1801.01968, 2018.

Mika Sarkin Jain and Jack Lindsey. Semiparametric reinforcement learning. ICLR 2018 Workshop, 2018.

Johanni Brea. Is prioritized sweeping the better episodic control? arXiv preprint arXiv:1711.06677, 2017.

Nikolay Savinov, Alexey Dosovitskiy, and Vladlen Koltun. Semi-parametric topological memory for navigation. arXiv preprint arXiv:1803.00653, 2018.

Chenjun Xiao, Jincheng Mei, and Martin Müller. Memory-augmented Monte Carlo tree search. AAAI, 2018.

Lukasz Kaiser, Ofir Nachum, Aurko Roy, and Samy Bengio. Learning to remember rare events. 2016.

Edouard Grave, Armand Joulin, and Nicolas Usunier. Improving neural language models with a continuous cache. arXiv preprint arXiv:1612.04426, 2016.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.

Pablo Sprechmann, Siddhant M Jayakumar, Jack W Rae, Alexander Pritzel, Adrià Puigdomènech Badia, Benigno Uria, Oriol Vinyals, Demis Hassabis, Razvan Pascanu, and Charles Blundell. Memory-based parameter adaptation. ICLR, 2018.

Tom Stepleton. The pycolab game engine. https://github.com/deepmind/pycolab/tree/master/pycolab, 2017.

Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. J. Artif. Intell.
Res. (JAIR), 47:253–279, 2013.