{"title": "Successor Uncertainties: Exploration and Uncertainty in Temporal Difference Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 4507, "page_last": 4516, "abstract": "Posterior sampling for reinforcement learning (PSRL) is an effective method for balancing exploration and exploitation in reinforcement learning. Randomised value functions (RVF) can be viewed as a promising approach to scaling PSRL. However, we show that most contemporary algorithms combining RVF with neural network function approximation do not possess the properties which make PSRL effective, and provably fail in sparse reward problems. Moreover, we find that propagation of uncertainty, a property of PSRL previously thought important for exploration, does not preclude this failure. We use these insights to design Successor Uncertainties (SU), a cheap and easy to implement RVF algorithm that retains key properties of PSRL. SU is highly effective on hard tabular exploration benchmarks. Furthermore, on the Atari 2600 domain, it surpasses human performance on 38 of 49 games tested (achieving a median human normalised score of 2.09), and outperforms its closest RVF competitor, Bootstrapped DQN, on 36 of those.", "full_text": "Successor Uncertainties: Exploration and\n\nUncertainty in Temporal Difference Learning\n\nDavid Janz\u2217\u2020\n\nUniversity of Cambridge\n\ndj343@cam.ac.uk\n\nJiri Hron\u2217\n\nUniversity of Cambridge\n\njh2084@cam.ac.uk\n\nKatja Hofmann\nMicrosoft Research\n\nJos\u00e9 Miguel Hern\u00e1ndez-Lobato\n\nUniversity of Cambridge\n\nAlan Turing Institute\nMicrosoft Research\n\nPrzemys\u0142aw Mazur\nWayve Technologies\n\nSebastian Tschiatschek\n\nMicrosoft Research\n\nAbstract\n\nPosterior sampling for reinforcement learning (PSRL) is an effective method for\nbalancing exploration and exploitation in reinforcement learning. 
Randomised\nvalue functions (RVF) can be viewed as a promising approach to scaling PSRL.\nHowever, we show that most contemporary algorithms combining RVF with neural\nnetwork function approximation do not possess the properties which make PSRL\neffective, and provably fail in sparse reward problems. Moreover, we \ufb01nd that\npropagation of uncertainty, a property of PSRL previously thought important for ex-\nploration, does not preclude this failure. We use these insights to design Successor\nUncertainties (SU), a cheap and easy to implement RVF algorithm that retains key\nproperties of PSRL. SU is highly effective on hard tabular exploration benchmarks.\nFurthermore, on the Atari 2600 domain, it surpasses human performance on 38\nof 49 games tested (achieving a median human normalised score of 2.09), and\noutperforms its closest RVF competitor, Bootstrapped DQN, on 36 of those.\n\n1\n\nIntroduction\n\nPerhaps the most important open question within reinforcement learning is how to effectively balance\nexploration of an unknown environment with exploitation of the already accumulated knowledge\n(Kaelbling et al., 1996; Sutton et al., 1998; Busoniu et al., 2010). In this paper, we study this in the\nclassic setting where the unknown environment is modelled as a Markov Decision Process (MDP).\nSpeci\ufb01cally, we focus on developing an algorithm that combines effective exploration with neural\nnetwork function approximation. Our approach is inspired by Posterior Sampling for Reinforcement\nLearning (PSRL; Strens, 2000; Osband et al., 2013). PSRL approaches the exploration/exploitation\ntrade-off by explicitly accounting for uncertainty about the true underlying MDP. In tabular settings,\nPSRL achieves impressive results and close to optimal regret (Osband et al., 2013; Osband & Van Roy,\n2016). However, many existing attempts to scale PSRL and combine it with neural network function\napproximation sacri\ufb01ce the very aspects that make PSRL effective. 
In this work, we examine several\nof these algorithms in the context of PSRL and:\n\n1. Prove that a previous avenue of research, propagation of uncertainty (O\u2019Donoghue et al.,\n2018), is neither sufficient nor necessary for effective exploration under posterior sampling.\n2. Introduce Successor Uncertainties (SU), a cheap and scalable model-free exploration\nalgorithm that retains crucial elements of the PSRL algorithm.\n3. Show that SU is highly effective on hard tabular exploration problems.\n4. Present Atari 2600 results: SU outperforms Bootstrapped DQN (Osband et al., 2016a) on\n36/49 and Uncertainty Bellman Equation (O\u2019Donoghue et al., 2018) on 43/49 games.\n\n\u2217Equal contribution\n\u2020Work partly done during an internship at Microsoft Research Cambridge\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n2 Background\n\nWe use the following notation: for X a random variable, we denote its distribution by P_X. Further, if\nf is a measurable function, then f(X) follows the distribution f_#P_X (the pushforward of P_X by f).\nWe consider finite MDPs: a tuple (S, A, T), where S is a finite state space, A a finite action space,\nand T : S \u00d7 A \u2192 P(S \u00d7 R) a transition probability kernel mapping from the state-action space\nS \u00d7 A to the set of probability distributions P(S \u00d7 R) on the product space of states S and rewards\nR \u2282 \u211d; R is assumed to be bounded throughout. For each time step t \u2208 \u2115, the agent selects an\naction A_t by sampling from a distribution specified by its policy \u03c0 : S \u2192 P(A) for the current state\nS_t, and receives a new state and reward (S_{t+1}, R_{t+1}) \u223c T(S_t, A_t). This gives rise to a Markov\nprocess (S_t, A_t)_{t\u22650} and a reward process (R_t)_{t\u22651}. The task of solving an MDP amounts to finding\na policy \u03c0* which maximises the expected return E(\u2211_{\u03c4=0}^\u221e \u03b3^\u03c4 R_{\u03c4+1}) with \u03b3 \u2208 [0, 1).\nCrucial to many so-called model-free methods for solving MDPs is the state-action value function\n(Q function) for a policy \u03c0: Q^\u03c0_t := E_t(\u2211_{\u03c4=t}^\u221e \u03b3^{\u03c4\u2212t} R_{\u03c4+1}) = E_t(R_{t+1}) + \u03b3 E_t(Q^\u03c0_{t+1}), where E_t is\nused to denote an expectation conditional on (S_\u03c4, A_\u03c4)_{\u03c4\u2264t}. Model-free methods use the recursive\nnature of the Bellman equation to construct a model \u02c6Q^\u03c0 : S \u00d7 A \u2192 \u211d, which estimates Q^\u03c0_t for any\ngiven (S_t = s, A_t = a), through repeated application of the Bellman operator T^\u03c0 : \u211d^{S\u00d7A} \u2192 \u211d^{S\u00d7A}:\n\n(T^\u03c0 \u02c6Q)(s, a) = E_{(S', R')\u223cT(s,a)}[R' + \u03b3 E_{A'\u223c\u03c0(S')} \u02c6Q(S', A')] .    (1)\n\nSince T^\u03c0 is a contraction on \u211d^{S\u00d7A} with a unique fixed point \u02c6Q^\u03c0, that is T^\u03c0 \u02c6Q^\u03c0 = \u02c6Q^\u03c0, the iterated\napplication of T^\u03c0 to any initial \u02c6Q \u2208 \u211d^{S\u00d7A} yields \u02c6Q^\u03c0. The expectations in equation (1) can be\nestimated via Monte Carlo using experiences (s, a, r, s') obtained through interaction with the MDP.\nA key challenge is then in obtaining experiences that are highly informative about the optimal policy.\nA simple and effective approach to collecting such experiences is PSRL, a model-based algorithm\nbased on two components: (i) a distribution over rewards and transition dynamics P_{\u02c6T} obtained using\na Bayesian modelling approach, treating rewards and transition probabilities as random variables;\nand (ii) the posterior sampling exploration algorithm (Thompson, 1933; Dearden et al., 1998) which\nsamples \u02c6T \u223c P_{\u02c6T}, computes the optimal policy \u02c6\u03c0 with respect to the sampled \u02c6T, and follows \u02c6\u03c0 for\nthe duration of a single episode. 
The collected data are then used to update the P \u02c6T model, and\nthe whole process is iterated until convergence.\nWhile PSRL performs very well on tabular problems, it is computationally expensive and does not\nutilise any additional information about the state space structure (e.g. visual similarity when states\nare represented by images). A family of methods called Randomised Value Functions (RVF; Osband\net al., 2016b) attempt to overcome these issues by directly modelling a distribution over Q functions,\nP \u02c6Q, instead of over MDPs, P \u02c6T . Rather than acting greedily with respect to a sampled MDP as in\nPSRL, the agent then acts greedily with respect to a sample \u02c6Q \u223c P \u02c6Q drawn at the beginning of each\nepisode, removing the main computational bottleneck. Since a parametric model is often chosen for\nP \u02c6Q, the switch to Q function modelling also directly facilitates use of function approximation and\nthus generalisation between states.\n\n3 Exploration under function approximation\n\nMany exploration methods, including (Osband et al., 2016b,a; Moerland et al., 2017; O\u2019Donoghue\net al., 2018; Azizzadenesheli et al., 2018), can be interpreted as combining the concept of RVF with\nneural network function approximation. While the use of neural network function approximation\nallows these methods to scale to problems too complex for PSRL, it also brings about conceptual\ndif\ufb01culties not present within PSRL and tabular RVF methods. Speci\ufb01cally, because a Q function is\nde\ufb01ned with respect to a particular policy, constructing P \u02c6Q requires selection of a reference policy\nor distribution over policies. Methods that utilise a distribution over reference policies typically\n\n2\n\n\fU P\n\nDOWN\n\n...\n\ns5\n\nU P\n\nDOWN\n\ns4\n\ns3\n\ns0\n\nU P\n\nDOWN\n\ns2\n\ns1\n\nFigure 1: Binary tree MDP of size L. States S = {s0, . . . 
, s2L} are one-hot encoded; actions\nA = {a1, a2} are mapped to movements {UP, DOWN} according to a random mapping drawn\nindependently for each state. Reward of one is obtained after reaching s2L and zero otherwise. States\nwith odd indices and s2L are terminal.\n\nemploy a bootstrapped estimator of the Q function as we will discuss in more depth later. For\nnow, we focus on methods that employ a single reference policy which commonly interleave two\nsteps: (i) inference of P_{\u02c6Q^{\u03c0_i}} for a given policy \u03c0_i using the available data (value prediction step);\n(ii) estimation of an improved policy \u03c0_{i+1} based on P_{\u02c6Q^{\u03c0_i}} (policy improvement step). While a\ncommon policy improvement choice is \u03c0_{i+1} : s \u21a6 E_{P_{\u02c6Q^{\u03c0_i}}}[G(\u02c6Q)(s)], methods vary greatly in how\nthey implement value prediction. To gain a better insight into the value prediction step, we examine\nits idealised implementation: Suppose we have access to a belief over MDPs, P_{\u02c6T} (as in PSRL),\nand want to compute the implied distribution P_{\u02c6Q^\u03c0} for a single policy \u03c0. The intuitive (albeit still\ncomputationally expensive) procedure is to: (i) draw \u02c6T \u223c P_{\u02c6T}; and (ii) repeatedly apply the Bellman\noperator T^\u03c0 to an initial \u02c6Q for the drawn \u02c6T until convergence. Denoting by F^\u03c0: \u02c6T \u21a6 \u02c6Q^\u03c0 the map\nfrom \u02c6T to the corresponding \u02c6Q^\u03c0 for a policy \u03c0, the distribution of resulting samples is P_{\u02c6Q^\u03c0} = F^\u03c0_# P_{\u02c6T}.\nThis idealised value prediction step motivates, for example, the Uncertainty Bellman Equation\n(UBE; O\u2019Donoghue et al., 2018). O\u2019Donoghue et al. argue that to achieve effective exploration, it is\nnecessary that the uncertainty about each \u02c6Q^\u03c0(s, a), quantified by variance, is equal to the uncertainty\nabout the immediate reward and the next state\u2019s Q value. This requirement can be formalised as\nfollows:\nDefinition 1 (Propagation of uncertainty). For a given distribution P_{\u02c6T} and policy \u03c0, we say that\na model P_{\u02c6Q^\u03c0} propagates uncertainty according to P_{\u02c6T} if for each (s, a) \u2208 S \u00d7 A and p = 1, 2\n\nE_{P_{\u02c6Q^\u03c0}}[\u02c6Q^\u03c0(s, a)^p] = E_{F^\u03c0_# P_{\u02c6T}}[\u02c6Q^\u03c0(s, a)^p] = E_{P_{\u02c6T}}{[E_{(R', S')\u223c\u02c6T(s,a)} R' + E_{A'\u223c\u03c0(S')} F^\u03c0(\u02c6T)(S', A')]^p} .\n\nIn words, propagation of uncertainty requires that the first two moments behave consistently under\napplication of the Bellman operator.\nPropagation of uncertainty is a desirable property when using Upper Confidence Bounds (UCB; Auer,\n2002) for exploration, since UCB methods rely only on the first two moments of P_{\u02c6Q^\u03c0}. However,\npropagation of uncertainty is not sufficient for effective exploration under posterior sampling. We\nshow this in the context of the binary tree MDP depicted in figure 1. To solve the MDP, the agent\nmust execute a sequence of L uninterrupted UP movements. In the following proposition, we show\nthat any algorithm combining factorised symmetric distributions with posterior sampling (e.g. UBE)\nwill solve this MDP with probability of at most 2^{\u2212L} per episode, thus failing to outperform a uniform\nexploration policy. Importantly, the sizes of marginal variances have no bearing on this result,\nmeaning that propagation of uncertainty on its own does not preclude this failure mode.\nProposition 1. Let |A| > 1, and P_{\u02c6Q} be a factorised distribution, i.e. for \u02c6Q \u223c P_{\u02c6Q}, \u02c6Q(s, a) and\n\u02c6Q(s', a') are independent, \u2200(s, a) \u2260 (s', a'), with symmetric marginals. 
Assume that for each s \u2208 S,\nthe marginal distributions of {\u02c6Q(s, a): a \u2208 A} are all symmetric around the same value c_s \u2208 \u211d.\nThen the probability of executing any given sequence of L actions under \u02c6\u03c0 \u223c G_# P_{\u02c6Q} is at most 2^{\u2212L}.\nPropagation of uncertainty is furthermore not necessary for posterior sampling. To see this, first note\nthat for any given P_{\u02c6Q^\u03c0}, the posterior sampling procedure only depends on the induced distribution\nover greedy policies, i.e. the pushforward of P_{\u02c6Q^\u03c0} by the greedy operator G. This means that from\nthe point of view of posterior sampling, two Q function models are equivalent as long as they induce\nthe same distribution over greedy policies. In what follows, we formalise this equivalence relationship\n(definition 2), and then show that each of the induced equivalence classes contains a model that\ndoes not propagate uncertainty (proposition 2), implying that posterior sampling does not rely on\npropagation of uncertainty.\nDefinition 2 (Posterior sampling policy matching). For a given distribution P_{\u02c6T} and a policy \u03c0, we say\nthat a model P_{\u02c6Q^\u03c0} matches the posterior sampling policy implied by P_{\u02c6T} if G_# P_{\u02c6Q^\u03c0} = (G \u25e6 F^\u03c0)_# P_{\u02c6T}.\nProposition 2. For any distribution P_{\u02c6T} and policy \u03c0 such that the variance V_{F^\u03c0_# P_{\u02c6T}}[\u02c6Q^\u03c0(s, a)] is\ngreater than zero for some (s, a), there exists a distribution P_{\u02c6Q^\u03c0} which matches the posterior sampling\npolicy (definition 2), but does not propagate uncertainty (definition 1), according to P_{\u02c6T}.\nWe conclude by addressing a potential criticism of proposition 1, i.e. that the described issues may be\ncircumvented by initialising expected Q values to a value higher than the maximal attainable Q value\nin the given MDP, an approach known as optimistic initialisation (Osband et al., 2016b). 
In such a case,\nsymmetries in the Q function may break as updates move the distribution towards more realistic\nQ values. However, when neural network function approximation is used, the effect of optimistic\ninitialisation can disappear quickly with optimisation (Osband et al., 2018). In particular, with\nnon-orthogonal state-action embeddings, Q value estimates may decrease for yet unseen state-action\npairs, and estimates for different state-action pairs can move in tandem. In practice, most recent\nmodels employing neural network function approximation do not use optimistic initialisation (Osband\net al., 2016a; Azizzadenesheli et al., 2018; Moerland et al., 2017; O\u2019Donoghue et al., 2018).\n\n4 Successor Uncertainties\n\nWe present Successor Uncertainties, an algorithm which both propagates uncertainty and matches\nthe posterior sampling policy. As our work is motivated by PSRL, we focus on its use with posterior\nsampling, leaving combination with other exploration algorithms for future research.\n\n4.1 Q function model definition\nSuppose we are given an embedding function \u03c6: S \u00d7 A \u2192 \u211d^d, such that for all (s, a), \u2016\u03c6(s, a)\u2016_2 = 1\nand \u03c6(s, a) \u2265 0 elementwise, and E_t R_{t+1} = \u27e8\u03c6_t, w\u27e9 for some w \u2208 \u211d^d. 
Denote \u03c6_t = \u03c6(S_t, A_t).\nThen we can express Q^\u03c0_t as an inner product of w and \u03c8^\u03c0_t = E_t[\u2211_{\u03c4=t}^\u221e \u03b3^{\u03c4\u2212t} \u03c6_\u03c4], the (discounted)\nexpected future occurrence of each \u03c6(s, a) feature under a policy \u03c0, as follows:\n\nQ^\u03c0_t = E_t \u2211_{\u03c4=t}^\u221e \u03b3^{\u03c4\u2212t} R_{\u03c4+1} = E_t \u2211_{\u03c4=t}^\u221e \u03b3^{\u03c4\u2212t} \u27e8\u03c6_\u03c4, w\u27e9 = \u27e8E_t \u2211_{\u03c4=t}^\u221e \u03b3^{\u03c4\u2212t} \u03c6_\u03c4, w\u27e9 = \u27e8\u03c8^\u03c0_t, w\u27e9 ,    (2)\n\nwhere the second equality follows from the tower property of conditional expectation and the third\nfrom the dominated convergence theorem combined with the unit norm assumption.\nThe quantity \u03c8^\u03c0_t is known in the literature as the successor features (Dayan, 1993; Barreto et al.,\n2017). Noting that \u03c8^\u03c0_t = \u03c6_t + \u03b3 E_t \u03c8^\u03c0_{t+1}, an estimator of the successor features, \u02c6\u03c8^\u03c0, can be obtained\nby applying standard temporal difference learning techniques. The other quantity involved, w, can\nbe estimated by regressing embeddings of observed states \u03c6_t onto the corresponding rewards. We\nperform Bayesian linear regression to infer a distribution over rewards, using N(0, \u03b8I) as the prior\nover w and N(\u27e8\u03c6, w\u27e9, \u03b2) as the likelihood, which leads to posterior N(\u03bc_w, \u03a3_w) over w with known\nanalytical expressions for both \u03bc_w and \u03a3_w. This induces posterior distribution over \u02c6Q^\u03c0_SU given by\n\n\u02c6Q^\u03c0_SU \u223c N(\u02c6\u03a8^\u03c0 \u03bc_w, \u02c6\u03a8^\u03c0 \u03a3_w (\u02c6\u03a8^\u03c0)^\u22a4) ,    (3)\n\nwhere \u02c6\u03a8^\u03c0 = [\u02c6\u03c8^\u03c0(s, a)]^\u22a4_{(s,a)\u2208S\u00d7A}. This is our Successor Uncertainties (SU) model for the Q function.\nThe final element of the SU model is the selection of a sequence of reference policies (\u03c0_i)_{i\u22651} for\nwhich the Q function model is learnt. We follow O\u2019Donoghue et al. 
(2018) in constructing these\niteratively as \u03c0_{i+1}(s) = E_{\u02c6\u03c0\u223cG_# P_{\u02c6Q^{\u03c0_i}}}[\u02c6\u03c0(s)].\n\n4.2 Properties of the model\n\nThe non-diagonal covariance matrix of the SU Q function model (see equation (3)) means that SU\ndoes not suffer from the shortcomings of previous methods with factorised posterior distributions\ndescribed in proposition 1. Moreover, note that \u02c6Q^\u03c0_SU \u223c F^\u03c0_# P_{\u02c6T} for the MDP model P_{\u02c6T} composed of\na delta distribution concentrated on empirical transition frequencies, and the Bayesian linear model\nfor rewards (assuming convergence of successor features, i.e. \u02c6\u03c8^\u03c0 = \u03c8^\u03c0). SU thus both propagates\nuncertainty and matches the posterior sampling policy according to this choice of P_{\u02c6T}.\nHowever, due to its use of a point estimate for the transition probabilities, SU may underestimate\nQ function uncertainty, and a good model of transition probabilities which scales beyond tabular\nsettings can lead to improved performance. Furthermore, SU estimates P_{\u02c6Q^{\u03c0_{i+1}}} for a single policy,\nwhich we choose to be \u03c0_{i+1}(s) = E_{\u02c6\u03c0\u223cG_# P_{\u02c6Q^{\u03c0_i}}}[\u02c6\u03c0(s)]. This approach may not adequately capture the\nuncertainty over \u02c6\u03c0 implied by P_{\u02c6Q^{\u03c0_i}}. We expect that incorporation of this uncertainty, or an improved\nmethod of choosing \u03c0_{i+1}, may further improve the SU algorithm.\n\n4.3 Neural network function approximation\n\nOne of the main assumptions we made so far is that the embedding function \u03c6 is known a priori. This\nsection considers the scenario where \u03c6 is to be estimated jointly with the other quantities using neural\nnetwork function approximation. 
For reference, the pseudocode is included in appendix C.\nLet \u02c6\u03c6: S \u00d7 A \u2192 \u211d^d_+ be the current estimate of \u03c6, (s_t, a_t) the state-action pair observed at step t, r_{t+1}\nthe reward observed after taking action a_t in state s_t. Suppose we want to estimate the Q function\nof some given policy \u03c0, and denote \u02c6\u03c6_t := \u02c6\u03c6(s_t, a_t), \u02c6\u03c8_t := \u02c6\u03c8^\u03c0(s_t, a_t). We propose to jointly learn \u02c6\u03c6\nand \u02c6\u03c8 by enforcing the known relationships between \u03c6_t, \u03c8^\u03c0_t and E_t R_{t+1}:\n\nmin_{\u02c6\u03c6, \u02c6\u03c8, \u02c6w}  \u2016\u02c6\u03c8_t \u2212 \u02c6\u03c6_t \u2212 \u03b3(\u02c6\u03c8_{t+1})^\u2020\u2016_2^2  [successor feature loss]\n    + |\u27e8\u02c6w, \u02c6\u03c6_t\u27e9 \u2212 r_{t+1}|^2  [reward loss]\n    + |\u27e8\u02c6w, \u02c6\u03c8_t\u27e9 \u2212 \u03b3(\u27e8\u02c6w, \u02c6\u03c8_{t+1}\u27e9)^\u2020 \u2212 r_{t+1}|^2  [Q value loss]    (4)\n\nin expectation over the observed data {(s_t, a_t, r_{t+1}, s_{t+1}): t = 0, . . . , N} with a_{t+1} \u223c \u03c0(s_{t+1});\n\u02c6\u03c6_t, \u02c6\u03c8_t \u2208 \u211d^d_+, \u2016\u02c6\u03c6_t\u2016_2 = 1, \u2200t, are respectively ensured by the use of ReLU activations and explicit\nnormalisation. The \u02c6w \u2208 \u211d^d are the final layer weights shared by the reward and the Q value\nnetworks. Quantities superscripted with \u2020 are treated as fixed during optimisation.\nThe need for the successor feature and reward losses follows directly from the definition of the SU\nmodel. We add the explicit Q value loss to ensure accuracy of Q value predictions. Assuming that\nthere exists a (ReLU) network that achieves zero successor feature and reward loss, the added Q value\nloss has no effect. However, finding such an optimal solution is difficult in practice and empirically\nthe addition of the Q value loss improves performance. Our modelling assumptions cause all\nconstituent losses in equation (4) to have similar scale, and thus we found it unnecessary to introduce\nweighting factors. 
Furthermore, unlike in previous work utilising successor features (Kulkarni\net al., 2016; Machado et al., 2017, 2018), SU does not rely on any auxiliary state reconstruction or\nstate-transition prediction tasks for learning, which simplifies implementation and greatly reduces\nthe required amount of computation.\nWe employ the neural network output weights \u02c6w in prediction of the mean Q function, and use\nthe Bayesian linear model only to provide uncertainty estimates. In estimating the covariance matrix\n\u03a3_w, we decay the contribution of old data-points, \u02c6\u03a3_w = (\u03b6^N \u03b8^{\u22121} I + \u03b2^{\u22121} \u2211_{i=0}^N \u03b6^{N\u2212i} \u02c6\u03c6_i \u02c6\u03c6_i^\u22a4)^{\u22121}, \u03b6 \u2208\n[0, 1], so as to counter non-stationarity of the learnt state-action embeddings \u02c6\u03c6.\n\n4.4 Comparison to existing methods\n\nWe discuss two popular classes of Q function models compatible with neural network function\napproximation: methods relying on Bayesian linear Q function models and methods based on\nbootstrapping. We omit variational Q-learning methods such as (Gal, 2016; Lipton et al., 2018), as\nconceptual issues with these algorithms have already been identified in an illuminating line of work\nby Osband et al. (2016a, 2018).\nBayesian linear Q function models encompass our SU algorithm, UBE (O\u2019Donoghue et al., 2018)\nimplemented with value function approximation, Bayesian Deep Q Networks (BDQN; Azizzadenesheli\net al., 2018), and a range of other related work (Levine et al., 2017; Moerland et al., 2017). The\nalgorithms within this category tend to use a Q function model of the form \u02c6Q^\u03c0(s, a) = \u27e8\u02c6\u03c6^\u03c0_s, w_a\u27e9, where\n\u02c6\u03c6^\u03c0_s are state embeddings and w_a \u223c P_{w_a} are weights of a Bayesian linear model. The embeddings \u02c6\u03c6^\u03c0_s\nare produced by a neural network, and are usually optimised using a temporal difference algorithm\napplied to Q values. 
However, these methods do not enforce any explicit structure within the embed-\ndings \u02c6\u03c6\u03c0\ns which would be required for posterior sampling policy matching, and prevent these methods\nfrom falling victim to proposition 1. SU can thus be viewed as a simple and computationally cheap\nalternative \ufb01xing the issues of existing Bayesian linear Q function models.\nBootstrapped DQN (Osband et al., 2016a, 2018) is a model which consists of an ensemble of K\nstandard Q networks, each initialised independently and trained on a random subset of the observed\n\n5\n\n\fdata. Each network is augmented with a \ufb01xed additive prior network, so as to ensure the ensemble\ndistribution does not collapse in sparse environments. If all networks within the ensemble are trained\nto estimate the Q function for a single policy \u03c0, then Bootstrapped DQN both propagates uncertainty\nand matches the posterior sampling policy for a distribution over MDPs formed by the mixture over\nempirical MDPs corresponding to each subsample of the data. In practice, Bootstrapped DQN does\nnot assume a single policy \u03c0 and instead each network learns for its corresponding greedy policy.\nBootstrapped DQN is, however, more computationally expensive: its performance increases with\nthe size of the ensemble K, but so does the amount of computation required. Our experiments show\nthat SU is much cheaper computationally, and that despite using only a single reference policy, it\nmanages to outperform Bootstrapped DQN on a wide range of exploration tasks (see section 5).\n\n5 Tabular experiments\n\nWe present results for: (i) the binary tree MDP accompanied by theoretical analysis showing how\nSU succeeds and avoids the pitfalls identi\ufb01ed in proposition 1; (ii) a hard exploration task proposed\nby Osband et al. 
(2018) together with the Boostrapped DQN algorithm which SU outperforms by\na signi\ufb01cant margin.3 We also provide an analysis explaining why some of the previously discussed\nalgorithms perform well on seemingly similar experiments present in existing literature.\n\n5.1 Binary tree MDP\n\nWe study the behaviour of SU and its competitors on the binary tree MDP introduced in \ufb01gure 1.\nFigure 2 shows the empirical performance of each algorithm as a function of the tree size L. Evidently,\nboth BDQN and UBE fail to outperform a uniform exploration policy. For UBE, this is a consequence\nof proposition 1, and the similarly poor behaviour of BDQN suggests it may suffer from an analogous\nissue. In contrast, SU and Bootstrapped DQN are able to succeed on large binary trees despite the\nvery sparse reward structure and randomised action effects. However, Bootstrapped DQN requires\napproximately 25 times more computation than SU to approach similar levels of performance due to\nthe necessity to train a whole ensemble of Q networks.\n\nFigure 2: Median number of episodes required to learn the optimal policy on the tree MDP. Blue\npoints indicate all 5 seeds succeeded within 5000 episodes, orange indicates only some of the runs\nsucceeded, and red all runs failed. Dashed lines correspond to the median for a uniform exploration\npolicy. Note the reduced size of the x-axis for BDQN and UBE.\n\nThe next proposition and its proof provide intuition for the success of SU on the tree MDP. The proof\nis based on a lemma stated just after the proposition (see appendix B.1 for formal treatment).\nProposition 3 (Informal statement). Assume the SU model with: (i) \ufb01xed one-hot state-action\nembeddings \u03c6, (ii) uniform exploration thus far, (iii) successor representations learnt to convergence\nfor a uniform policy. Let sk for 2 \u2264 k < 2L, even, be a state visited N times thus far. Then\nthe probability of selecting UP in sk, given UP was selected in s0, s2, . . . 
, sk\u22122, is greater than one\nhalf with probability greater than 1 \u2212 \u0001N , where \u0001N decreases exponentially with N.\nLemma 4 (Informal statement). Under the SU model \u02c6Q \u223c P \u02c6Q\u03c0 for the uniform policy \u03c0, the proba-\nbility that the greedy policy \u02c6\u03c0 = G( \u02c6Q) selects UP in sk, given UP was selected in s0, s2, . . . , sk\u22122, is\ngreater than one half if there exists an even 0 \u2264 j < k such that\n\nCov( \u02c6Q(sk, UP), \u02c6Q(sj, UP)) > Cov( \u02c6Q(sk, DOWN), \u02c6Q(sj, UP)) .\n\n3Code for the tabular experiments: https://djanz.org/successor_uncertainties/tabular_code\n\n6\n\n01020problem scale L025005000median learning timeBDQN01020problem scale LUBE0100200problem scale LBootstrap+Prior (1x compute)0100200problem scale LBootstrap+Prior (25x compute)0100200problem scale LSU (1x compute)\fSketch proof of proposition 3. Under SU \u02c6Q(sj, UP) = \u02c6r(sj, UP)+. . .+\u03c1 \u02c6Q(sk, UP)+\u03c1 \u02c6Q(sk, DOWN)\nwith \u03c1 = 2\u2212( k\u2212j\n2 ) the probability of getting from sj to sk under the uniform policy. Note that\n\u02c6Q(sj, UP) and \u02c6Q(sk, DOWN) only share the \u02c6Q(sk, DOWN) = \u02c6r(sk, DOWN) term, whereas \u02c6Q(sk, UP)\nand \u02c6Q(sj, UP) share \u02c6r(sj, UP), . . . , \u02c6r(sp, DOWN), where sp is the state with the highest index seen\nso far. Thus covariance between \u02c6Q(sk, UP) and \u02c6Q(sj, UP) is higher than that between \u02c6Q(sk, DOWN)\nand \u02c6Q(sj, UP) with high probability (at least 1 \u2212 \u0001N ), and the result follows from lemma 4.\n\nProposition 3 implies that (at least under the simplifying assumption of prior exploration being\nuniform) SU is likely to assign higher probability to Q functions for which a greedy policy leads\ntowards the furthest visited state (cf. the role of the state sp in the sketch proof). 
This is a strategy\nactively aimed for in exploration algorithms such as Go-Explore where the agent uses imitation\nlearning to return to the furthest discovered states (Ecoffet et al., 2019).\n\n5.2 Chain MDP from (Osband et al., 2018)\n\nWe present results on the chain environment introduced by Osband et al. (2018), described in detail\nin appendix C.1. Osband et al. describe their MDP as being \u201cakin to looking for a piece of hay in a\nneedle-stack\u201d and state that it \u201cmay seem like an impossible task\u201d. Figure 3 shows the scaling for\nSuccessor Uncertainties and Bootstrap+Prior for this problem. Learning time T scales empirically as\nO(L2.5) for SU, versus O(L3) for Bootstrap+Prior (as reported in Osband et al., 2018).\n\nFigure 3: Learning time T for SU and Bootstrap+Prior for a range of problem sizes L on the chain\nMDP. Curve for SU is log10 T = 2.5 log10 L\u22120.95. Curve for Bootstrap+Prior is taken from \ufb01gure 8\nin (Osband et al., 2018).\n\n5.3 On the success of BDQN in environments with tied actions\n\nWe brie\ufb02y address prior results in the literature where BDQN is seen solving problems seemingly\nsimilar to our binary tree MDP with ease (as in, for example, \ufb01gure 1 of Touati et al., 2018).\nThe discrepancy occurs because previous work often does not randomise the effects of actions\n(for example Osband et al., 2016a; Plappert et al., 2018; Touati et al., 2018), i.e. if a1 leads UP\nin any state sk, then a1 leads UP in all states. We refer to this as the tied actions setting. In the\nfollowing proposition, we show that MDPs with tied actions are trivial for BDQN with strictly\npositive activations (e.g. sigmoid). We offer a similar result for ReLU in appendix B.2.\nProposition 5. Let \u02c6Q(s, a) = h\u03c6(s), wai be a Bayesian Q function model with \u03c6(s) = \u03d5(U1s) \u2208 Rd,\n1s a one-hot encoding of s, and \u03d5 a strictly positive activation function (e.g. 
sigmoid) applied\nwI), Uhs \u223c N (0, \u03c32\nelementwise. Then sampling independently from the prior wa \u223c N (0, \u03c32\nu) solves\na tied action binary tree of size L in T \u2264 \u2212[log2(1 \u2212 2\u2212d)]\u22121 median number of episodes.\n\nProof. De\ufb01ne \u2206 := wUP \u2212 wDOWN and observe UP is selected if \u02c6Q(s, UP) \u2212 \u02c6Q(s, DOWN) =\nh\u03c6(s), wUP \u2212 wDOWNi > 0. By strict positivity of \u03d5, the probability that UP is always selected\n\n{h\u03c6(s2j), \u2206i >0} | \u2206 >0(cid:3)P(\u2206 >0) = P(\u2206 > 0) ,\n\nP(cid:2)L\u22121\\\n\n{ \u02c6Q(s2j, UP) > \u02c6Q(s2j, DOWN)}(cid:3)\u2265P(cid:2)L\u22121\\\n\nj=0\n\nj=0\n\nwhere \u2206 > 0 is to be interpreted elementwise. As \u2206 \u223c N (0, 2\u03c32\n\nwI), P(\u2206 > 0) = 2\u2212d for all L.\n\n7\n\n20406080100120140160180200problem scale L02500050000median learning time1.41.61.82.02.2log10 problem scale L2.53.03.54.04.55.0Bootstrap+Prior fitSuccessor Uncertainties fitSuccessor Uncertainties data\fFigure 4: Bars show the difference in human normalised score between SU and Bootstrap DQN (top),\nUBE (middle) and DQN (bottom) for each of the 49 Atari 2600 games. Blue indicates SU performed\nbetter, red worse. SU outperforms the baselines on 36/49, 43/49 and 42/49 games respectively.\nY-axis values have been clipped to [\u22122.5, 2.5].\n\nA single layer BDQN with one neuron can thus solve a tied action binary tree of any size L in one\nepisode (median) while completely ignoring all state information. That such an approach can be\nsuccessful implies tied actions MDPs generally do not make for good exploration benchmarks.\n\n6 Atari 2600 experiments\n\nWe have tested the SU algorithm on the standard set of 49 games from the Arcade Learning Environ-\nment, with the aim of showing that SU can be scaled to complex domains that require generalisation\nbetween states. 
We use a standard network architecture as in (Mnih et al., 2015; Van Hasselt et al.,\n2016) endowed with an extra head for prediction of \u03c6\u0302 and one-step value updates. More detail on our\nimplementation, network architecture and training procedure can be found in appendix C.2 (see footnote 4).\nSU obtains a median human normalised score of 2.09 (averaged over 3 seeds) after 200M training\nframes under the \u2018no-ops start 30 minute emulator time\u2019 test protocol described in (Hessel et al.,\n2018). Table 1 shows that SU outperforms the competing methods at every reported percentile and in the\nshare of games with superhuman performance. The raw scores are reported\nin table 2 (appendix), and the difference in human normalised score between SU and the competing\nalgorithms for individual games is charted in figure 4. Since Azizzadenesheli et al. (2018) only report\nscores for a small subset of the games and use a non-standard testing procedure, we do not compare\nagainst BDQN. Osband et al. (2018), who introduce Bootstrap+Prior, do not report Atari results; we\nthus compare with results for the original plain Bootstrapped DQN (Osband et al., 2016a) instead.\n\nTable 1: Human normalised Atari scores. 
Superhuman performance is the percentage of games on\nwhich each algorithm surpasses human performance (as reported in Mnih et al., 2015).\n\n                           Human normalised score percentiles\nAlgorithm                  25%     50%     75%     Superhuman performance %\nSuccessor Uncertainties    1.06    2.09    5.95    77.55%\nBootstrapped DQN           0.76    1.60    5.16    67.35%\nUBE                        0.38    1.07    4.14    51.02%\nDQN + \u03b5-greedy             0.50    1.00    3.41    48.98%\n\n[Figure 4 plot. Panels: Bootstrap, UBE, DQN. X-axis: Atari 2600 games, alphabetical. Y-axis: difference in human normalised score versus Successor Uncertainties.]\n\n4 Code for the Atari experiments: djanz.org/successor_uncertainties/atari_code\n\n7 Conclusion\n\nWe studied the Posterior Sampling for Reinforcement Learning algorithm and its extensions within the\nRandomised Value Function framework, focusing on use with neural network function approximation.\nWe have shown theoretically that exploration techniques based on the concept of propagation of uncertainty\nare neither sufficient nor necessary for posterior sampling exploration in sparse environments.\nWe instead proposed posterior sampling policy matching, a property motivated by the probabilistic\nmodel over rewards and state transitions within the PSRL algorithm. Based on the theoretical insights,\nwe developed Successor Uncertainties, a randomised value function algorithm that avoids some of\nthe pathologies present within previous work. 
We showed empirically that on hard tabular examples,\nSU significantly outperforms competing methods, and provided theoretical analysis of its behaviour.\nOn Atari 2600, we demonstrated Successor Uncertainties is also highly effective when combined\nwith neural network function approximation.\n\nPerformance on the hardest exploration tasks often benefits greatly from multi-step temporal difference\nlearning (Precup, 2000; Munos et al., 2016; O\u2019Donoghue et al., 2018), which we believe is\nthe most promising direction for improving Successor Uncertainties. Since modification of existing\nmodels to incorporate Successor Uncertainties is relatively simple, other standard techniques used\nwithin model-free reinforcement learning, such as prioritized experience replay (Schaul et al., 2015) and\ndueling network architectures (Wang et al., 2016), can be leveraged\nto obtain further gains. This paper thus opens many exciting directions for future research which we\nhope will translate into both further performance improvements and a more thorough understanding\nof exploration in modern reinforcement learning.\n\nAcknowledgements\n\nWe thank Matej Balog and the anonymous reviewers for their helpful comments and suggestions. Jiri\nHron acknowledges support by a Nokia CASE Studentship.\n\nReferences\n\nAuer, P. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397\u2013422, 2002.\n\nAzizzadenesheli, K., Brunskill, E., and Anandkumar, A. Efficient exploration through Bayesian deep Q-networks. arXiv preprint arXiv:1802.04412, 2018.\n\nBarreto, A., Dabney, W., Munos, R., Hunt, J. J., Schaul, T., van Hasselt, H. P., and Silver, D. Successor features for transfer in reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2017.\n\nBusoniu, L., Babuska, R., Schutter, B. D., and Ernst, D. Reinforcement Learning and Dynamic Programming Using Function Approximators. CRC Press, 2010.\n\nDayan, P. 
Improving generalization for temporal difference learning: the successor representation. Neural Computation, 5(4):613\u2013624, 1993.\n\nDearden, R., Friedman, N., and Russell, S. J. Bayesian Q-Learning. In AAAI/IAAI, pp. 761\u2013768. AAAI Press / The MIT Press, 1998.\n\nEcoffet, A., Huizinga, J., Lehman, J., Stanley, K. O., and Clune, J. Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995, 2019.\n\nGal, Y. Uncertainty in deep learning. PhD thesis, University of Cambridge, 2016.\n\nHessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M. G., and Silver, D. Rainbow: combining improvements in deep reinforcement learning. In AAAI Conference on Artificial Intelligence, 2018.\n\nKaelbling, L. P., Littman, M. L., and Moore, A. W. Reinforcement learning: a survey. Journal of Artificial Intelligence Research, 4:237\u2013285, 1996.\n\nKingma, D. P. and Ba, J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\n\nKulkarni, T. D., Saeedi, A., Gautam, S., and Gershman, S. J. Deep successor reinforcement learning. arXiv preprint arXiv:1606.02396, 2016.\n\nLevine, N., Zahavy, T., Mankowitz, D. J., Tamar, A., and Mannor, S. Shallow updates for deep reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2017.\n\nLipton, Z. C., Li, X., Gao, J., Li, L., Ahmed, F., and Deng, L. BBQ-Networks: efficient exploration in deep reinforcement learning for task-oriented dialogue systems. In AAAI Conference on Artificial Intelligence, 2018.\n\nMachado, M. C., Rosenbaum, C., Guo, X., Liu, M., Tesauro, G., and Campbell, M. Eigenoption discovery through the deep successor representation. arXiv preprint arXiv:1710.11089, 2017.\n\nMachado, M. C., Bellemare, M. G., and Bowling, M. Count-based exploration with the successor representation. 
arXiv preprint arXiv:1807.11622, 2018.\n\nMnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.\n\nMoerland, T. M., Broekens, J., and Jonker, C. M. Efficient exploration with double uncertain value networks. arXiv preprint arXiv:1711.10789, 2017.\n\nMunos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2016.\n\nO\u2019Donoghue, B., Osband, I., Munos, R., and Mnih, V. The uncertainty Bellman equation and exploration. In International Conference on Machine Learning (ICML), 2018.\n\nOsband, I. and Van Roy, B. On lower bounds for regret in reinforcement learning. arXiv preprint arXiv:1608.02732, 2016.\n\nOsband, I., Russo, D., and Van Roy, B. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, 2013.\n\nOsband, I., Blundell, C., Pritzel, A., and Van Roy, B. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems (NeurIPS), 2016a.\n\nOsband, I., Van Roy, B., and Wen, Z. Generalization and exploration via randomized value functions. In International Conference on Machine Learning (ICML), 2016b.\n\nOsband, I., Aslanides, J., and Cassirer, A. Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2018.\n\nPlappert, M., Houthooft, R., Dhariwal, P., Sidor, S., Chen, R. Y., Chen, X., Asfour, T., Abbeel, P., and Andrychowicz, M. Parameter space noise for exploration. In International Conference on Learning Representations (ICLR), 2018.\n\nPrecup, D. Eligibility traces for off-policy policy evaluation. 
Computer Science Department Faculty Publication Series, 2000.\n\nSchaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.\n\nStrens, M. A Bayesian framework for reinforcement learning. In Conference on Machine Learning (ICML), 2000.\n\nSutton, R. S., Barto, A. G., et al. Reinforcement learning: An introduction. MIT press, 1998.\n\nThompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285\u2013294, 1933.\n\nTouati, A., Satija, H., Romoff, J., Pineau, J., and Vincent, P. Randomized value functions via multiplicative normalizing flows. arXiv preprint arXiv:1806.02315, 2018.\n\nVan Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. In AAAI Conference on Artificial Intelligence, 2016.\n\nWang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanctot, M., and De Freitas, N. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning (ICML), 2016.\n", "award": [], "sourceid": 2543, "authors": [{"given_name": "David", "family_name": "Janz", "institution": "University of Cambridge"}, {"given_name": "Jiri", "family_name": "Hron", "institution": "University of Cambridge"}, {"given_name": "Przemys\u0142aw", "family_name": "Mazur", "institution": "Wayve"}, {"given_name": "Katja", "family_name": "Hofmann", "institution": "Microsoft Research"}, {"given_name": "Jos\u00e9 Miguel", "family_name": "Hern\u00e1ndez-Lobato", "institution": "University of Cambridge"}, {"given_name": "Sebastian", "family_name": "Tschiatschek", "institution": "Microsoft Research"}]}