{"title": "When to use parametric models in reinforcement learning?", "book": "Advances in Neural Information Processing Systems", "page_first": 14322, "page_last": 14333, "abstract": "We examine the question of when and how parametric models are most useful in reinforcement learning. In particular, we look at commonalities and differences between parametric models and experience replay. Replay-based learning algorithms share important traits with model-based approaches, including the ability to plan: to use more computation without additional data to improve predictions and behaviour. We discuss when to expect benefits from either approach, and interpret prior work in this context. We hypothesise that, under suitable conditions, replay-based algorithms should be competitive to or better than model-based algorithms if the model is used only to generate fictional transitions from observed states for an update rule that is otherwise model-free. We validated this hypothesis on Atari 2600 video games. The replay-based algorithm attained state-of-the-art data efficiency, improving over prior results with parametric models. Additionally, we discuss different ways to use models. We show that it can be better to plan backward than to plan forward when using models to perform credit assignment (e.g., to directly learn a value or policy), even though the latter seems more common. Finally, we argue and demonstrate that it can be beneficial to plan forward for immediate behaviour, rather than for credit assignment.", "full_text": "When to use parametric models\n\nin reinforcement learning?\n\nHado van Hasselt\n\nDeepMind\nLondon, UK\n\nhado@google.com\n\nMatteo Hessel\n\nDeepMind\nLondon, UK\n\nmtthss@google.com\n\nJohn Aslanides\n\nDeepMind\nLondon, UK\n\njaslanides@google.com\n\nAbstract\n\nWe examine the question of when and how parametric models are most useful in\nreinforcement learning. 
In particular, we look at commonalities and differences between parametric models and experience replay. Replay-based learning algorithms share important traits with model-based approaches, including the ability to plan: to use more computation without additional data to improve predictions and behaviour. We discuss when to expect benefits from either approach, and interpret prior work in this context. We hypothesise that, under suitable conditions, replay-based algorithms should be competitive with or better than model-based algorithms if the model is used only to generate fictional transitions from observed states for an update rule that is otherwise model-free. We validated this hypothesis on Atari 2600 video games. The replay-based algorithm attained state-of-the-art data efficiency, improving over prior results with parametric models. Additionally, we discuss different ways to use models. We show that it can be better to plan backward than to plan forward when using models to perform credit assignment (e.g., to directly learn a value or policy), even though the latter seems more common. Finally, we argue and demonstrate that it can be beneficial to plan forward for immediate behaviour, rather than for credit assignment.

The general setting we consider is learning to make decisions from finite interactions with an environment. Although the distinction is not entirely clear-cut, there exist two prototypical families of algorithms: those that learn without an explicit model of the environment (model free), and those that first learn a model and then use it to plan a solution (model based).
There are good reasons for building the capability to learn some sort of model of the world into artificial agents.
Models may allow transfer of knowledge in ways that policies and scalar value predictions do not, and may allow agents to acquire rich knowledge about the world before knowing how this knowledge is best used. In addition, models can be used to plan: to use additional computation, without requiring additional experience, to improve the agent's predictions and decisions.
In this paper, we discuss commonalities and differences between parametric models and experience replay [Lin, 1992]. Although replay-based agents are not always thought of as model-based, replay shares many characteristics that we often associate with parametric models. In particular, we can 'plan' with the experience stored in the replay memory, in the sense that we can use additional computation to improve the agent's predictions and policies in between interactions with the real environment.
Our work was partially inspired by recent work by Kaiser et al. [2019], who showed that planning with a parametric model allows for data-efficient learning on several Atari video games. A main comparison was to Rainbow DQN [Hessel et al., 2018a], which uses replay. We explain why their results may perhaps be considered surprising, and show that in a like-for-like comparison Rainbow DQN outperformed the scores of the model-based agent, with less experience and computation.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Algorithm 1 Model-based reinforcement learning
1: Input: state sample procedure d
2: Input: model m
3: Input: policy π
4: Input: predictions v
5: Input: environment E
6: Get initial state s ← E
7: for iteration ∈ {1, 2, . . . , K} do
8:   for interaction ∈ {1, 2, . . . , M} do
9:     Generate action: a ← π(s)
10:    Generate reward, next state: r, s′ ← E(a)
11:    m, d ← UPDATEMODEL(s, a, r, s′)
12:    π, v ← UPDATEAGENT(s, a, r, s′)
13:    Update current state: s ← s′
14:  end for
15:  for planning step ∈ {1, 2, . . . , P} do
16:    Generate state, action: s̃, ã ← d
17:    Generate reward, next state: r̃, s̃′ ← m(s̃, ã)
18:    π, v ← UPDATEAGENT(s̃, ã, r̃, s̃′)
19:  end for
20: end for

We place this result in the context of a broader discussion of parametric models and experience replay. We examine equivalences between them, potential failure modes of planning with parametric models, and how to exploit parametric models in addition to, or instead of, using them to provide imagined experiences to an otherwise model-free algorithm.
In particular, we will discuss three different ways to use a learnt, imperfect, model. First, we can plan forward for credit assignment. This means we roll the model forward from real states, for instance stored in a replay buffer, and use the resulting modelled transitions to learn predictions or policies. We argue that this can be worse than a second option, planning backward from real states, because the former involves updating real states with fictional experiences, whereas the latter only involves updating fictional states, which seems safer. This hypothesis is validated empirically. Finally, a third use of a model is to plan forward from the current state, to help determine the immediate behaviour.
We believe that planning backward for credit assignment or planning forward for behaviour may be more beneficial than planning forward for credit assignment.
To see why, consider an inaccurate\nmodel that for instance predicts a transition to some magical world that is not truly there, thereby\nproviding a \ufb01ctional path to high rewards. Planning forward for credit assignment may then result in\nincorrect predictions and policies that assume this \ufb01ction is real. Instead, planning backward when\nthe model is inaccurate may lead to updates to \ufb01ctional states that are unreachable, which may or\nmay not be useful, but is less likely to be harmful than updating real states with \ufb01ctional transitions.\nHowever, if we do plan forward but only use the inaccurate model to inform our behaviour rather\nthan trusting its transitions as if they are real, then we might expect an agent to go and see whether\nthere is in fact a magical world around the corner. This may result in useful data, and perhaps even\nuseful exploration, regardless of whether the modelled transition in fact exists or not.\n\n1 Model-based reinforcement learning\n\nWe now de\ufb01ne the terminology that we use in the paper, and present a generic algorithm that\nencompasses both model-based and replay-based algorithms.\nWe consider the reinforcement learning setting [Sutton and Barto, 2018] in which an agent interacts\nwith an environment, in the sense that the agent outputs actions and then obtains observations and\nrewards from the environment. We consider the control setting, in which the goal is to optimise the\naccumulation of the rewards over time by picking appropriate sequences of actions. The action an\nagent outputs typically depends on its state. This state is a function of past observations; in some\ncases it is suf\ufb01cient to just use the immediate observation as state, in other cases a more sophisticated\nagent state is required to yield suitable decisions. 
The state of the agent should not be confused with the state of the environment, which is typically not fully observable to the agent, and is also typically much too large to reason about directly.
We use the word planning to refer to any algorithm that uses additional computation to improve its predictions or behaviour without consuming additional data. Conversely, we reserve the term learning for updates that depend on newly observed experience.
The term model will refer to functions that take a state and action as input, and that output a reward and next state. Sometimes we may have a perfect model, as in board games (e.g., chess and go); sometimes the model needs to be learnt before it can be used. Models can be stochastic, to approximate inherently stochastic transition dynamics, or to model the agent's uncertainty about the future. Expectation models are deterministic, and output (an approximation of) the expected reward and state. If the true dynamics are stochastic, iterating expectation models multiple steps may be unhelpful, as an expected state may itself not be a valid state; the output of a model may not have useful semantics when using an expected state as input rather than a real state [cf. Wan et al., 2019]. Planning is associated with models, because a common way to use computation to improve predictions and policies is to search using a model. For instance, in Dyna [Sutton, 1990], learning and planning are combined by using new experience to learn both the model and the agent's predictions, and then planning to further improve the predictions.
Experience replay [Lin, 1992] refers to storing previously observed transitions to replay later for additional updates to the predictions and policy. Replay may be used for planning and, when queried at state-action pairs we have observed, experience replay may be indistinguishable from an accurate model.
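To make this correspondence concrete, here is a minimal sketch (ours, not from the paper; the class and its names are illustrative) of a replay buffer behind a model-like sampling interface:

```python
import random

class ReplayModel:
    """A replay buffer behind a model-like interface: 'querying' it yields a
    (state, action, reward, next state) tuple that was actually observed."""

    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self.transitions = []  # list of (s, a, r, s_next)

    def add(self, s, a, r, s_next):
        if len(self.transitions) >= self.capacity:
            self.transitions.pop(0)  # replaced transitions are forgotten completely
        self.transitions.append((s, a, r, s_next))

    def sample(self):
        # Uniformly replay an observed transition, as a model would generate one.
        return random.choice(self.transitions)
```

Unlike a parametric model, this object can never be queried at an arbitrary state-action pair: only stored transitions can ever be returned.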
Sometimes, there may be no practical differences between replay and models, depending on\nhow they are used. On the other hand, a replay memory is less \ufb02exible than a model, since we cannot\nquery it at arbitrary states that are not present in the replay memory.\n\n1.1 A generic algorithm\n\nAlgorithm 1 is a generic model-based learning algorithm. It runs for K iterations, in each of which\nM interactions with the environment occur. The total number of interactions is thus T \u2261 K \u00d7 M.\nThe experience is used to update a model (line 11) and the policy or predictions of the agent (line\n12). Then, P steps of planning are performed, where transitions sampled from the model are used to\nupdate the agent (line 18). For P = 0, the model is not used, hence the algorithm is model-free (we\ncould then also skip line 11). If P > 0, and the agent update in line 12 does not do anything, we have\na purely model-based algorithm. The agent updates in lines 12 and 18 could differ, or they could treat\nreal and modelled transitions equivalently.\nMany known algorithms from the model-based literature are instances of algorithm 1. If lines 12 and\n18 both update the agent\u2019s predictions in the same way, the resulting algorithm is known as Dyna\n[Sutton, 1990] \u2013 for instance, if predictions v include action values (normally denoted with q) and we\nupdate using Q-learning [Watkins, 1989, Watkins and Dayan, 1992], we obtain Dyna-Q [Sutton and\nBarto, 2018]. One can extend Algorithm 1 further, for instance by allowing planning and model-free\nlearning to happen simultaneously. Such extensions are orthogonal to our discussion and we do not\ndiscuss them further.\nSome algorithms typically thought of as being model-free also \ufb01t into this framework. For instance,\nDQN [Mnih et al., 2013, 2015] and neural-\ufb01tted Q-iteration [Riedmiller, 2005] match Algorithm 1, if\nwe stretch the de\ufb01nitions of \u2018model\u2019 to include the more limited replay buffers. 
DQN learns from transitions sampled from a replay buffer by using Q-learning with neural networks. In Algorithm 1, this corresponds to updating a non-parametric model, in line 11, by storing observed transitions in the buffer (perhaps overwriting old transitions); line 17 then retrieves a transition from this buffer. The policy is only updated with transitions sampled from the replay buffer (i.e., line 12 has no effect).

2 Model properties

A main advantage of using models is the ability to plan: to use additional computation, but no new data, to improve the agent's policy or predictions. Sutton and Barto [2018] illustrate the benefits of planning in a simple grid world (Figure 1, on the left), where the agent must learn to navigate along the shortest path to a fixed goal location. On the right of Figure 1 we use this domain to show how the performance of a replay-based Q-learning agent (blue) and that of a Dyna-Q agent (red) scale similarly with the amount of planning (measured in terms of the number of updates per real environment step). Both agents use a multi-layer perceptron to approximate action values, but Dyna-Q also uses identical networks to model transitions, terminations and rewards. The algorithm is called 'forward Dyna' in the figure, because it samples states from the replay and then steps forward one step using the model.

Figure 1: Left: the layout of the grid world [Sutton and Barto, 2018], 'S' and 'G' denote the start and goal state, respectively. Right: Q-learning with replay (blue) or Dyna-Q with a parametric model (red); y-axis: the total number of steps to complete 25 episodes of experience, x-axis: the number of updates per step in the environment. Both axes are on a logarithmic scale.

Later we will consider a variant that, instead, steps backward with an inverse model.
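To make the correspondence with Algorithm 1 concrete, a tabular Dyna-Q agent in this style might look as follows (our illustrative sketch, not the paper's implementation, which used multi-layer perceptrons; the line numbers in the comments refer to Algorithm 1):

```python
import random
from collections import defaultdict

ACTIONS = (0, 1)  # illustrative action set

def q_update(q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # Lines 12 and 18 of Algorithm 1: the same model-free Q-learning update
    # is applied to real and to modelled transitions.
    target = r + gamma * max(q[(s_next, b)] for b in ACTIONS)
    q[(s, a)] += alpha * (target - q[(s, a)])

def dyna_q(env_step, s, num_interactions=1000, planning_steps=10, eps=0.5):
    q = defaultdict(float)
    model = {}  # deterministic tabular model: (s, a) -> (r, s')
    for _ in range(num_interactions):
        if random.random() < eps:                  # epsilon-greedy behaviour
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda b: q[(s, b)])
        r, s_next = env_step(s, a)                 # line 10: real transition
        model[(s, a)] = (r, s_next)                # line 11: update model
        q_update(q, s, a, r, s_next)               # line 12: learn from real data
        s = s_next
        for _ in range(planning_steps):            # lines 15-19: planning
            (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
            q_update(q, ps, pa, pr, ps_next)       # line 18: fictional update
    return q
```

Replacing the tabular `model` with a buffer of stored transitions turns this into replay-based Q-learning; replacing it with a learnt parametric model yields forward Dyna.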
The appendix contains further details on the experiments.\n\n2.1 Computational properties\n\nThere are clear computational differences between using parametric models and replay. For instance,\nKaiser et al. [2019] use a fairly large deep neural network to model the pixel dynamics in Atari,\nwhich means predicting a single transition can require non-trivial computation. In general, parametric\nmodels typically require more computations than it takes to sample from a replay buffer.\nOn the other hand, replay tightly couples model capacity and memory requirements: each transition\nthat is stored takes up a certain amount of memory. If we do not remove any transitions, the memory\ncan grow unbounded. If we limit the memory usage, then this implies that the effective capacity of\nthe replay is limited as any transitions we replace are forgotten completely. In contrast, parametric\nmodels may be able to achieve good accuracy with a \ufb01xed and comparatively small memory footprint.\n\n2.2 Equivalences\n\nSuppose we manage to learn a model that perfectly matches the transitions observed thus far. If\nwe would then use such a perfect model to generate experiences only from states that were actually\nobserved, the resulting updates would be indistinguishable from doing experience replay. In that\nsense, replay matches a perfect model, albeit only from the states we have observed.1 Therefore,\nall else being equal, we would expect that using an imperfect (e.g., parametric) model to generate\n\ufb01ctional experiences from truly observed states should probably not result in better learning.\nThere are some subtleties to this argument. First, the argument can be made even stronger in some\ncases. 
When making linear predictions with least-squares temporal-difference learning [LSTD, Bradtke and Barto, 1996, Boyan, 1999], the model-free algorithm on the original data does not require (or indeed benefit from) planning: the solution will already be a best fit (in a least-squares sense) even with a single pass through the data. In fact, if we fit a linear model to the data and then fully solve this model, the solution is equal to the LSTD solution [Parr et al., 2008]. One can also show that exhaustive replay with linear TD(λ) [Sutton, 1988] is equivalent to a one-time pass through the data with LSTD(λ) [van Seijen and Sutton, 2015], because replay similarly allows us to solve the empirical 'model' that is implicitly defined by the observed data.
These full equivalences are however limited to linear prediction, and do not extend straightforwardly to non-linear functions, or to the control setting. This leaves open the question of when to use a parametric model rather than replay, or vice versa.

1One could go one step further and extend replay to full non-parametric models. For instance Pan et al. [2018] use kernel methods to allow querying the replay-based model at states that are not stored in the buffer.

Figure 2: Left: four rooms grid world [Sutton et al., 1998]. Center-left: planning forward from the current state to update the current behaviour (0 steps corresponds to Q-learning); y-axis: total number of steps required to complete 100 episodes, x-axis: search depth. Center-right: comparing replay (blue), forward Dyna (red), and backward Dyna (black); y-axis: episode length (logarithmic scale), x-axis: number of episodes.
Right: adding stochasticity to the transition dynamics (in the form of a 20% probability of transitioning to a random adjacent cell irrespective of the action), then comparing again replay (blue), forward Dyna (red), and backward Dyna (black); y-axis: episode length (logarithmic scale), x-axis: number of episodes.

2.3 When do parametric models help learning?

When should we expect benefits from learning and using a parametric model, rather than using the actual data? We discussed important computational differences above. Here we focus on learning efficiency: when do parametric models help learning?
First, parametric models may be useful to plan into the future to help determine our policy of behaviour. The ability to generalise to unseen or counter-factual transitions can be used to plan from the current state into the future (sometimes called planning 'in the now' [Kaelbling and Lozano-Pérez, 2010]), even if this exact state has never before been observed. This is commonly and successfully employed in model-predictive control [Richalet et al., 1978, Morari and Lee, 1999, Mayne, 2014, Wagener et al., 2019]. Classically, the model is constructed by hand rather than learnt directly from experience, but the principle of planning forward to find suitable behaviour is the same. It is not possible to replicate this with standard replay, because in interesting rich domains the current state will typically not exactly appear in the replay. Even if it did, replay does not allow easy generation of possible next states, in addition to the one trajectory that actually happened.
If we use a model to select actions, rather than trusting its imagined transitions to update the policy or predictions, it may be less essential to have a highly accurate model.
For instance, the model may predict a shortcut that does not actually exist; using this to then steer behaviour results in experience that is both suitable to correct the error in the model, and that yields the kind of directed, temporally consistent behaviour typically sought for exploration purposes [Lowrey et al., 2019].
We illustrate this with an experiment on a classic four room grid-world [Sutton et al., 1998]. We learnt a tabular forward model that generates transitions (s, a) → (r, γ, s′), where s and s′ are states, a is an action, r is a reward, and γ ∈ [0, 1] is a discount factor. We then used this model to plan via a simple breadth-first search up to a fixed depth, bootstrapping from a value function q(s, a) learnt via standard Q-learning. We then use the resulting planned values of the actions at the current state to behave. This process can be interpreted as using a multi-step greedy policy [Efroni et al., 2018] to determine behaviour, instead of the more standard one-step greedy policy. The results are illustrated in the second plot in Figure 2: more planning was beneficial.
In addition to planning forward to improve behaviour, models may be useful for credit assignment through backward planning. Consider an algorithm where, as before, we sample real visited states from a replay buffer, but instead of planning one step into the future from these states we plan one step backward. One motivation is that if the model is poor then planning a step forward will update the real sampled state with a misleading imagined transition. This will potentially cause harmful updates to the value at these real states. Conversely, if we plan backwards we update an imagined state. If the model is poor this imagined state perhaps does not resemble any real state. Updating such fictional states seems less harmful.
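The contrast between the two update directions can be sketched as follows (hypothetical tabular updates, not the exact agents used in the experiments; for simplicity the backward model here returns a single predecessor state-action pair):

```python
from collections import defaultdict

ACTIONS = (0, 1)

def forward_planning_update(q, forward_model, s, alpha=0.1, gamma=0.9):
    # Forward: roll the model one step from a real state s. The value of the
    # *real* pair (s, a) is updated with a possibly fictional transition.
    for a in ACTIONS:
        r, s_next = forward_model(s, a)
        target = r + gamma * max(q[(s_next, b)] for b in ACTIONS)
        q[(s, a)] += alpha * (target - q[(s, a)])

def backward_planning_update(q, backward_model, s_next, r, alpha=0.1, gamma=0.9):
    # Backward: step an inverse model from a real state s_next. Only the
    # *imagined* predecessor pair (s_prev, a) is updated; if the model is
    # poor, s_prev may resemble no real state, which seems less harmful.
    s_prev, a = backward_model(s_next)
    target = r + gamma * max(q[(s_next, b)] for b in ACTIONS)
    q[(s_prev, a)] += alpha * (target - q[(s_prev, a)])
```

Both updates use the same Q-learning target; they differ only in whether the entry being written belongs to a real or to an imagined state.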
When the model becomes very accurate, forward and backward planning both start to be equally useful. For a purely data-driven (partial) model, such as a replay buffer, there is no meaningful distinction. But with a learnt model that is at times inaccurate, backward planning may be less error-prone than forward planning for credit assignment.
We illustrate potential benefits of backward planning with a simple experiment on the four-room environment. In the two right-most plots of Figure 2, we compare the performance of applying tabular Q-learning to transitions generated by a forward model (red), a backward model (black), or replay (blue). The forward model learns distributions over states, rewards, and terminations Pr(r, γ, s′ | s, a). The backward model learns the inverse Pr(s, a | r, γ, s′). Both use a Dirichlet(1) prior. We evaluated the algorithms in the deterministic four-room environment, as well as in a stochastic variant where on each step there is a 20% probability of transitioning to a random adjacent cell irrespective of the action. In both cases, backward planning resulted in faster learning than forward planning. In the deterministic case, the forward model catches up later in learning, reaching the same performance as replay after 2000 episodes; instead, planning with a backward model is competitive with replay in early learning but performs slightly worse later in training.
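For completeness, the depth-limited forward search used earlier in this section to select behaviour (center-left panel of Figure 2) might be sketched as follows (our simplified version, assuming a deterministic learnt model and bootstrapping from q at the leaves):

```python
def plan_value(model, q, s, depth, actions=(0, 1), gamma=0.9):
    # Breadth-first lookahead to a fixed depth, bootstrapping from the learnt
    # value function q at the leaves; depth 0 reduces to the plain Q-values.
    if depth == 0:
        return max(q.get((s, a), 0.0) for a in actions)
    values = []
    for a in actions:
        r, s_next = model(s, a)
        values.append(r + gamma * plan_value(model, q, s_next, depth - 1, actions, gamma))
    return max(values)

def multi_step_greedy_action(model, q, s, depth, actions=(0, 1), gamma=0.9):
    # Behave with the planned values at the current state: a multi-step
    # greedy policy (depth >= 1). Only behaviour uses the model here; the
    # imagined transitions are never trusted for credit assignment.
    def planned(a):
        r, s_next = model(s, a)
        return r + gamma * plan_value(model, q, s_next, depth - 1, actions, gamma)
    return max(actions, key=planned)
```

The search is exhaustive, so its cost grows exponentially with depth; in the four-room experiment only small depths were needed.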
We conjecture that the slower convergence\nin later stages of training may be due to the fact that predicting the source state and action in a\ntransition is a non-stationary problem (as it depends on the agent\u2019s policy), and given that early\nepisodes include many more transitions than later ones, it can take many episodes for a Bayesian\nmodel to forget policies observed early in training. The lack of convergence to the optimal policy for\nthe forward planning algorithm in the stochastic setting may be due to the independent sampling of\nthe successor state and reward, which may result in inconsistent transitions. Both these issues may be\naddressed by a suitable choice of the model. More detailed investigations are out of scope for this\npaper, but it is good to recognise that such modelling choices have measurable effects on learning.\n\n3 A failure to learn\n\nWe now describe how planning in a Dyna-style learning algorithm can, perhaps surprisingly easily,\nlead to catastrophic learning updates.\nAlgorithms that combine function approximation (e.g., neural networks), bootstrapping (as in temporal\ndifference methods [Sutton, 1988]), and off-policy learning [Sutton and Barto, 2018, Precup et al.,\n2000] can be unstable [Williams and Baird III, 1993, Baird, 1995, Sutton, 1995, Tsitsiklis and Van\nRoy, 1997, Sutton et al., 2009, 2016] \u2014 this is sometimes called the deadly triad [Sutton and Barto,\n2018, van Hasselt et al., 2018].\nThis has implications for Dyna-style learning, as well as for replay methods [cf. van Hasselt et al.,\n2018]. When using replay it is sometimes relatively straightforward to determine how off-policy the\nstate sampling distribution is, and the sampled transitions will always be real transitions under that\ndistribution (assuming the transition dynamics are stationary). In contrast, the projected states given\nby a parametric model may differ from the states that would occur under the real dynamics, due to\nmodelling error. 
The update rule will then be solving a predictive question for the MDP induced by the model, but with a state distribution that does not match the on-policy distribution in that MDP.
To understand this issue better, consider using Algorithm 1 to estimate expected cumulative discounted rewards vπ(s) = E[Rt+1 + γRt+2 + . . . | St = s, π] for a policy π by updating vw(s) ≈ vπ(s) with temporal difference (TD) learning [Sutton, 1988]:

w ← w + αδt∇wvw(St) ,  with  δt ≡ Rt+1 + γt+1vw(St+1) − vw(St) ,  (1)

where Rt+1 ∈ R and γt+1 ∈ [0, 1] are the reward and discount on the transition from St to St+1, and α > 0 is a small step size. Consider linear predictions vw(St) = w⊤xt ≈ vπ(St), where xt ≡ x(St) is a feature vector for state St. The expected TD update is then w ← (I − αA)w + αb, with b = E[Rt+1xt] and A = E[xtxt⊤ − γxtxt+1⊤] = X⊤D(I − γP⊤)X, where the expectation is over the transition dynamics and over the sampling distribution d of the states. The transition dynamics can be written as a matrix P, that contains the probabilities [P]ij = p(St+1 = i | St = j, π) of transitioning from any state j to any state i under policy π. The diagonal matrix D contains the probabilities [D]ii = d(i) = P(St = i | π) of sampling each state i on its diagonal. The matrix X contains the feature vectors x(s) of all states on its rows, and maps between state and feature space. Note that both P and D are linear operators in state space, not feature space.
These updates are guaranteed to be stable (i.e., converge) if A = X⊤D(I − γP⊤)X is positive semi-definite [Sutton et al., 2016], with spectral radius ρ(A) smaller than 1/α.
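To make the stability condition tangible, consider a small hypothetical example in the spirit of classic divergence constructions: two states with a single feature each, x(1) = 1 and x(2) = 2, where both states transition to state 2. Whether A is positive then depends only on the sampling distribution D:

```python
import numpy as np

# One feature per state: x(state 1) = 1, x(state 2) = 2.
X = np.array([[1.0], [2.0]])
gamma = 0.99

# [P]ij = p(next = i | current = j): both states transition to state 2.
P = np.array([[0.0, 0.0],
              [1.0, 1.0]])

def key_matrix(d):
    # A = X^T D (I - gamma P^T) X; TD is stable if this is positive (semi-)definite.
    return X.T @ np.diag(d) @ (np.eye(2) - gamma * P.T) @ X

A_on = key_matrix([0.0, 1.0]).item()    # sample the steady state (state 2)
A_off = key_matrix([1.0, 0.0]).item()   # only ever sample state 1

assert A_on > 0    # 2 * (2 - 2 * gamma) = 0.04: stable
assert A_off < 0   # 1 - 2 * gamma = -0.98: negative, so updates can diverge
```

With α = 0.1, the spectral radius of I − αA is 0.996 in the on-policy case but 1.098 in the mismatched case, so the weight grows geometrically.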
The deadly triad occurs when D and P do not match: then A can be negative definite, the spectral radius ρ(I − αA) can be larger than one, and the weights can diverge. This can happen when D does not correspond to the steady-state distribution of the policy that conditions P — that is, if we update off-policy.
Proposition 1. Consider uniformly replaying transitions from a buffer containing full episodes (e.g., add new full episodes on termination, potentially remove an old full episode), and using these transitions in the TD algorithm defined by update (1). This algorithm is stable.

Proof. The replay buffer defines an empirical model, where the induced policy is the empirical distribution of actions: π̃(a|s) = n(s, a)/n(s), where n(s) and n(s, a) are the number of times s and the pair (s, a) show up in the replay. (The behaviour policy can change while filling the replay; the resulting empirical policy is then a sample of a mixture of these policies.) The empirical transitions [P̃]ij = n(i, j)/n(i) and state distributions [D̃]ii = n(i)/N, where N is the total size of the replay buffer, then both correspond to the same empirical policy. Therefore, X̃⊤D̃(I − γP̃⊤)X̃ is positive semi-definite, and TD will be stable and will not diverge.

This proposition can be extended to the case where transitions are added to the replay one at a time, rather than in full episodes. If, however, we sample states according to a non-uniform distribution (e.g., using prioritised replay) this can make replay-based algorithms less stable and potentially divergent [cf. van Hasselt et al., 2018].
We now show that a very similar algorithm that uses models in place of replay can diverge.
Proposition 2. Consider uniformly replaying states from a replay buffer, then generating transitions with a learnt model p̂m, and using these transitions in a TD update (1).
This algorithm can diverge.

Proof. The learnt dynamics P̂m ≈ P do not necessarily match the empirical dynamics of the replay, which means that the empirical replay distribution d, used in the updates, does not necessarily correspond to the steady-state distribution of these dynamics. Then the model error could lead to a negative definite Â ≡ X⊤D̃(I − γP̂m⊤)X, resulting in a spectral radius ρ(I − αÂ) > 1, and divergence of the parameters w.

Intuitively, the issue is that the model m can lead to states that are uncommon, or impossible, under the sampling distribution d. Those states are not sampled to be updated directly, but do change through generalisation when sampled states are updated. This can lead to divergent learning dynamics.
There are ways to mitigate the failure described above. First, we could repeatedly iterate the model, and sample transitions from the states the model generates as well as to those states, to induce a state distribution that is consistent with the model. This is not fully satisfactory, as states typically become ever-more unrealistic when iterating a learnt model, although there is some indication this may be helpful [Holland et al., 2018]. Second, we could rely less on bootstrapping by using multi-step returns [Sutton, 1988, van Hasselt and Sutton, 2015, Sutton and Barto, 2018]. This mitigates the instability [cf. van Hasselt et al., 2018]. In the extreme, full Monte-Carlo updates do not diverge, though they would have high variance. Third, we could employ algorithms specifically for stable off-policy learning, although these are often specific to the linear setting [Sutton et al., 2008, 2009, van Hasselt et al., 2014] or assume the sampling is done on trajectory [Sutton et al., 2016].
Note that several algorithms exist that correct the return towards a desired policy [Harutyunyan et al., 2016, Munos et al., 2016], which is a separate issue from off-policy sampling of states. Although off-policy learning algorithms may be part of the long-term answer, we do not yet have a definitive solution. To quote Sutton and Barto [2018]: The potential for off-policy learning remains tantalising, the best way to achieve it still a mystery.
Understanding such failures to learn is important to understand and improve our algorithms. However, just because divergence can occur does not mean it does occur [cf. van Hasselt et al., 2018]. Indeed, in the next section we compare a replay-based algorithm to a model-based algorithm which was stable enough to achieve impressive sample-efficiency on the Atari benchmark.

4 Model-based algorithms at scale

We now discuss two algorithms in more detail: first SimPLe [Kaiser et al., 2019], which uses a parametric model, then Rainbow DQN [Hessel et al., 2018a], which uses experience replay (and was used as baseline by Kaiser et al.).

SimPLe Kaiser et al. [2019] showed data-efficient learning is possible in Atari 2600 video games from the arcade learning environment [Bellemare et al., 2013] with a purely model-based approach: only updating the policy with data sampled from a learnt parametric model m. The resulting "simulated policy learning" (SimPLe) algorithm performed relatively well after just 102,400 interactions (409,600 frames — two hours of simulated play) within each game. In Algorithm 1, this corresponds to setting K × M = 16 × 6400 = 102,400. Although SimPLe used limited data, it used a large number of samples from the model, similar to using P = 800,000.2

Rainbow DQN One of the main results by Kaiser et al.
[2019] was to compare SimPLe to Rainbow DQN [Hessel et al., 2018a], which combines the DQN algorithm [Mnih et al., 2013, 2015] with double Q-learning [van Hasselt, 2010, van Hasselt et al., 2016], dueling network architectures [Wang et al., 2016], prioritised experience replay [Schaul et al., 2016], noisy networks for exploration [Fortunato et al., 2017], and distributional reinforcement learning [Bellemare et al., 2017]. Like DQN, Rainbow DQN uses mini-batches of transitions sampled from experience replay [Lin, 1992] and uses Q-learning [Watkins, 1989] to learn the action-value estimates which determine the policy. Rainbow DQN uses multi-step returns [cf. Sutton, 1988, Sutton and Barto, 2018] rather than the one-step return used in the original DQN algorithm.

4.1 A data-efficient Rainbow DQN

In the notation of Algorithm 1, the total number of transitions sampled from replay during learning will be K × P, while the total number of interactions with the environment will be K × M. Originally, in both DQN and Rainbow DQN, a batch of 32 transitions was sampled every 4 real interactions, so M = 4 and P = 32. The total number of interactions was 50M (200 million frames), which means K = 50M/4 = 12.5M.

In our experiments below, we trained Rainbow DQN for a total number of real interactions comparable to that of SimPLe, by setting K = 100,000, M = 1 and P = 32. The total number of replayed samples (3.2 million) is then still less than the total number of model samples used in SimPLe (15.2 million). Rainbow DQN is also more efficient computation-wise, since sampling from a replay buffer is faster than generating a transition with a learnt model.

The other changes we made to make Rainbow DQN more data efficient were to increase the number of steps in the multi-step returns from 3 to 20, and to reduce the number of steps before we start sampling from replay from 20,000 to 1600.
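The sample-budget bookkeeping above can be summarised in a short sketch, using the K, M, P notation of Algorithm 1. The helper name `budgets` is ours, introduced purely for illustration; the constants are those quoted in the text.

```python
# Sample-budget bookkeeping in the notation of Algorithm 1:
# K learning iterations, each with M real interactions and P sampled transitions.
def budgets(K, M, P):
    return {"real_interactions": K * M, "sampled_transitions": K * P}

# Original Rainbow DQN: a batch of 32 transitions sampled every 4 real
# interactions, over 200 million frames = 50 million interactions.
original = budgets(K=12_500_000, M=4, P=32)

# Data-efficient variant used here: one batch of 32 per real interaction.
efficient = budgets(K=100_000, M=1, P=32)

print(original["real_interactions"])     # 50000000
print(efficient["real_interactions"])    # 100000 (comparable to SimPLe's 102,400)
print(efficient["sampled_transitions"])  # 3200000 (below SimPLe's 15.2M model samples)
```

Note that the replayed-sample count (3.2 million) stays well below SimPLe's model-sample count even though the real-interaction budgets are matched.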
We used the fairly standard convolutional Q-network from Hessel et al. [2018b]. We have not tried to exhaustively tune the algorithm, and we do not doubt that it can be made even more data efficient by further tuning its hyper-parameters.

4.2 Empirical results

We ran Rainbow DQN on the same 26 Atari games reported by Kaiser et al. [2019]. In Figure 3, we plotted the performance of our version of Rainbow DQN as a function of the number of interactions with the environment. Performance was measured in terms of episode returns, normalised using human and random scores [van Hasselt et al., 2016], and then aggregated across the 26 games by taking their median. Error bars are computed over the 5 independent replicas of each experiment. The final performance of SimPLe, according to the same metric, is shown in Figure 3 as a dashed horizontal line.

As expected, the hyper-parameters proposed by Hessel et al. [2018a] for the larger-data regime of 50 million interactions are not well suited to a regime of extreme data-efficiency (purple line in Figure 3). Performance was better for our slightly-tweaked data-efficient version of Rainbow DQN (red), which matched the performance of SimPLe after just 70,000 interactions with the environment, reaching roughly 25% higher performance by 100,000 interactions. The performance of our agent was superior to that of SimPLe in 17 out of 26 games. More detailed results are included in the appendix, including ablations and per-game performance.

² The actual number of reported model samples was 19 × 800,000 = 15.2 million, because P was varied depending on the iteration.

Figure 3: Median human-normalised episode returns of a tuned Rainbow, as a function of environment interactions (= frames / action repeats). The horizontal dashed line corresponds to the performance of SimPLe [Kaiser et al., 2019].
Error bars are computed over 5 seeds.

5 Conclusions

We discussed commonalities and differences between replay and model-based methods. In particular, we discussed how model errors may cause issues when we use a parametric model in a replay-like setting, where we sample observed states from the past. We note that model-based learning can be unstable in theory, and hypothesised that replay is likely a better strategy under that state sampling distribution. This is confirmed by at-scale experiments on Atari 2600 video games, where our replay-based agent attained state-of-the-art data efficiency, besting the impressive model-based results by Kaiser et al. [2019].

We further hypothesised that parametric models are perhaps more useful when used either 1) to plan backward for credit assignment, or 2) to plan forward for behaviour. Planning forward for credit assignment was hypothesised and shown to be less effective, even though the approach is quite common. The intuitive reasoning was that when the model is inaccurate, planning backward with a learnt model may lead to updating fictional states, which seems less harmful than updating real states with inaccurate transitions, as would happen in forward planning for credit assignment. Forward planning for behaviour, rather than for credit assignment, was deemed potentially useful and less likely to be harmful for learning, because the resulting plan is not trusted as real experience by the prediction or policy updates. Empirical results supported these conclusions.

There is a rich literature on model-based reinforcement learning, and this paper cannot cover all the potential ways to plan with learnt models. One notable topic that is out of scope for this paper is the consideration of abstract models [Silver et al., 2017] and alternative ways to use these models in addition to classic planning [cf.
Weber et al., 2017].

Finally, we note that our discussion focused mostly on the distinction between parametric models and replay, because these are the most common, but it is good to acknowledge that one can also consider non-parametric models. For instance, one could apply a nearest-neighbours or kernel approach to a replay buffer, and thereby obtain a non-parametric model that can be equivalent to replay when sampled at the observed states, but that can interpolate and generalise to unseen states when sampled at other states [Pan et al., 2018]. This is conceptually an appealing alternative, although it comes with practical algorithmic questions of how best to define distance metrics in high-dimensional state spaces. This seems another interesting avenue for future work.

Acknowledgments

The authors benefitted greatly from feedback from Tom Schaul, Adam White, Brian Tanner, Richard Sutton, Theophane Weber, Arthur Guez, and Lars Buesing.

References

L. Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pages 30-37, 1995.

M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research (JAIR), 47:253-279, 2013.

M. G. Bellemare, W. Dabney, and R. Munos. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pages 449-458, 2017.

J. A. Boyan. Least-squares temporal difference learning. In Proc. 16th International Conf. on Machine Learning, pages 49-56. Morgan Kaufmann, 1999.

S. J. Bradtke and A. G. Barto.
Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33-57, 1996.

Y. Efroni, G. Dalal, B. Scherrer, and S. Mannor. Multiple-step greedy policies in approximate and online reinforcement learning. In Advances in Neural Information Processing Systems, pages 5238-5247, 2018.

M. Fortunato, M. G. Azar, B. Piot, J. Menick, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis, O. Pietquin, C. Blundell, and S. Legg. Noisy networks for exploration. CoRR, abs/1706.10295, 2017.

A. Harutyunyan, M. G. Bellemare, T. Stepleton, and R. Munos. Q(λ) with off-policy corrections. In Proceedings of the 27th International Conference on Algorithmic Learning Theory (ALT-2016), volume 9925 of Lecture Notes in Artificial Intelligence, pages 305-320. Springer, 2016.

M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver. Rainbow: Combining improvements in deep reinforcement learning. AAAI, 2018a.

M. Hessel, H. van Hasselt, J. Modayil, and D. Silver. On inductive biases in deep reinforcement learning. OpenReview, https://openreview.net/forum?id=rJgvf3RcFQ, 2018b.

G. Z. Holland, E. Talvitie, and M. Bowling. The effect of planning shape on dyna-style planning in high-dimensional state spaces. CoRR, abs/1806.01825, 2018.

L. P. Kaelbling and T. Lozano-Pérez. Hierarchical planning in the now. In Workshops at the Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.

L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, R. H. Campbell, K. Czechowski, D. Erhan, C. Finn, P. Kozakowski, S. Levine, R. Sepassi, G. Tucker, and H. Michalewski. Model-based reinforcement learning for Atari. arXiv preprint arXiv:1903.00374, 2019.

L. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3):293-321, 1992.

K. Lowrey, A. Rajeswaran, S. Kakade, E.
Todorov, and I. Mordatch. Plan online, learn offline: Efficient learning and exploration via model-based control. In International Conference on Learning Representations, 2019.

D. Q. Mayne. Model predictive control: Recent developments and future promise. Automatica, 50(12):2967-2986, 2014.

V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

M. Morari and J. H. Lee. Model predictive control: past, present and future. Computers & Chemical Engineering, 23(4-5):667-682, 1999.

R. Munos, T. Stepleton, A. Harutyunyan, and M. Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pages 1054-1062, 2016.

Y. Pan, M. Zaheer, A. White, A. Patterson, and M. White. Organizing experience: a deeper look at replay mechanisms for sample-based planning in continuous state domains. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 4794-4800. International Joint Conferences on Artificial Intelligence Organization, 2018.

R. Parr, L. Li, G. Taylor, C. Painter-Wakefield, and M. L. Littman. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the 25th International Conference on Machine Learning, pages 752-759, 2008.

D. Precup, R. S. Sutton, and S. P. Singh. Eligibility traces for off-policy policy evaluation.
In Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), pages 766-773, Stanford University, Stanford, CA, USA, 2000. Morgan Kaufmann.

J. Richalet, A. Rault, J. Testud, and J. Papon. Model predictive heuristic control. Automatica (Journal of IFAC), 14(5):413-428, 1978.

M. Riedmiller. Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method. In J. Gama, R. Camacho, P. Brazdil, A. Jorge, and L. Torgo, editors, Proceedings of the 16th European Conference on Machine Learning (ECML'05), pages 317-328. Springer, 2005.

T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. In International Conference on Learning Representations, Puerto Rico, 2016.

D. Silver, H. van Hasselt, M. Hessel, T. Schaul, A. Guez, T. Harley, G. Dulac-Arnold, D. Reichert, N. Rabinowitz, A. Barreto, and T. Degris. The predictron: End-to-end learning and planning. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3191-3199, International Convention Centre, Sydney, Australia, 06-11 Aug 2017. PMLR.

R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9-44, 1988.

R. S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, pages 216-224, 1990.

R. S. Sutton. On the virtues of linear learning and trajectory distributions. In Proceedings of the Workshop on Value Function Approximation, Machine Learning Conference, page 85, 1995.

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 2018.

R. S. Sutton, D. Precup, and S. P.
Singh. Intra-option learning about temporally abstract actions. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML 1998), pages 556-564. Morgan Kaufmann Publishers Inc., 1998.

R. S. Sutton, C. Szepesvári, and H. R. Maei. A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. Advances in Neural Information Processing Systems 21 (NIPS-08), 21:1609-1616, 2008.

R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML 2009), pages 993-1000. ACM, 2009.

R. S. Sutton, A. R. Mahmood, and M. White. An emphatic approach to the problem of off-policy temporal-difference learning. Journal of Machine Learning Research, 17(73):1-29, 2016.

J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674-690, 1997.

H. van Hasselt. Double Q-learning. Advances in Neural Information Processing Systems, 23:2613-2621, 2010.

H. van Hasselt and R. S. Sutton. Learning to predict independent of span. CoRR, abs/1508.04582, 2015.

H. van Hasselt, A. R. Mahmood, and R. S. Sutton. Off-policy TD(λ) with a true online equivalence. In Uncertainty in Artificial Intelligence, 2014.

H. van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with Double Q-learning. AAAI, 2016.

H. van Hasselt, Y. Doron, F. Strub, M. Hessel, N. Sonnerat, and J. Modayil. Deep reinforcement learning and the deadly triad. CoRR, abs/1812.02648, 2018.

H. van Seijen and R. Sutton. A deeper look at planning as learning from replay. In International Conference on Machine Learning, pages 2314-2322, 2015.

N.
Wagener, C.-A. Cheng, J. Sacks, and B. Boots. An online learning approach to model predictive control. arXiv preprint arXiv:1902.08967, 2019.

Y. Wan, M. Zaheer, A. White, M. White, and R. S. Sutton. Planning with expectation models. CoRR, abs/1904.01191, 2019.

Z. Wang, N. de Freitas, T. Schaul, M. Hessel, H. van Hasselt, and M. Lanctot. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, New York, NY, USA, 2016.

C. J. C. H. Watkins. Learning from delayed rewards. PhD thesis, University of Cambridge, England, 1989.

C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279-292, 1992.

T. Weber, S. Racanière, D. P. Reichert, L. Buesing, A. Guez, D. J. Rezende, A. P. Badia, O. Vinyals, N. Heess, Y. Li, R. Pascanu, P. Battaglia, D. Silver, and D. Wierstra. Imagination-augmented agents for deep reinforcement learning. CoRR, abs/1707.06203, 2017.

R. J. Williams and L. C. Baird III. Analysis of some incremental variants of policy iteration: First steps toward understanding actor-critic learning systems. Technical Report NU-CCS-93-11, Northeastern University, College of Computer Science, Boston, MA, 1993.