{"title": "Deep Exploration via Bootstrapped DQN", "book": "Advances in Neural Information Processing Systems", "page_first": 4026, "page_last": 4034, "abstract": "Efficient exploration remains a major challenge for reinforcement learning (RL). Common dithering strategies for exploration, such as epsilon-greedy, do not carry out temporally-extended (or deep) exploration; this can lead to exponentially larger data requirements. However, most algorithms for statistically efficient RL are not computationally tractable in complex environments. Randomized value functions offer a promising approach to efficient exploration with generalization, but existing algorithms are not compatible with nonlinearly parameterized value functions. As a first step towards addressing such contexts we develop bootstrapped DQN. We demonstrate that bootstrapped DQN can combine deep exploration with deep neural networks for exponentially faster learning than any dithering strategy. In the Arcade Learning Environment bootstrapped DQN substantially improves learning speed and cumulative performance across most games.", "full_text": "Deep Exploration via Bootstrapped DQN\n\nIan Osband1,2, Charles Blundell2, Alexander Pritzel2, Benjamin Van Roy1\n\n1Stanford University, 2Google DeepMind\n\n{iosband, cblundell, apritzel}@google.com, bvr@stanford.edu\n\nAbstract\n\nEcient exploration remains a major challenge for reinforcement learning\n(RL). Common dithering strategies for exploration, such as \u2018-greedy, do\nnot carry out temporally-extended (or deep) exploration; this can lead\nto exponentially larger data requirements. However, most algorithms for\nstatistically ecient RL are not computationally tractable in complex en-\nvironments. Randomized value functions oer a promising approach to\necient exploration with generalization, but existing algorithms are not\ncompatible with nonlinearly parameterized value functions. 
As a \ufb01rst step\ntowards addressing such contexts we develop bootstrapped DQN. We demon-\nstrate that bootstrapped DQN can combine deep exploration with deep\nneural networks for exponentially faster learning than any dithering strat-\negy. In the Arcade Learning Environment bootstrapped DQN substantially\nimproves learning speed and cumulative performance across most games.\n\n1 Introduction\nWe study the reinforcement learning (RL) problem where an agent interacts with an unknown\nenvironment. The agent takes a sequence of actions in order to maximize cumulative rewards.\nUnlike standard planning problems, an RL agent does not begin with perfect knowledge\nof the environment, but learns through experience. This leads to a fundamental trade-o\nof exploration versus exploitation; the agent may improve its future rewards by exploring\npoorly understood states and actions, but this may require sacri\ufb01cing immediate rewards. To\nlearn eciently an agent should explore only when there are valuable learning opportunities.\nFurther, since any action may have long term consequences, the agent should reason about\nthe informational value of possible observation sequences. Without this sort of temporally\nextended (deep) exploration, learning times can worsen by an exponential factor.\nThe theoretical RL literature oers a variety of provably-ecient approaches to deep explo-\nration [9]. However, most of these are designed for Markov decision processes (MDPs) with\nsmall \ufb01nite state spaces, while others require solving computationally intractable planning\ntasks [8]. These algorithms are not practical in complex environments where an agent must\ngeneralize to operate eectively. 
For this reason, large-scale applications of RL have relied\nupon statistically inecient strategies for exploration [12] or even no exploration at all [23].\nWe review related literature in more detail in Section 4.\nCommon dithering strategies, such as \u2018-greedy, approximate the value of an action by\na single number. Most of the time they pick the action with the highest estimate, but\nsometimes they choose another action at random. In this paper, we consider an alternative\napproach to ecient exploration inspired by Thompson sampling. These algorithms have\nsome notion of uncertainty and instead maintain a distribution over possible values. They\nexplore by randomly select a policy according to the probability it is the optimal policy.\nRecent work has shown that randomized value functions can implement something similar\nto Thompson sampling without the need for an intractable exact posterior update. However,\nthis work is restricted to linearly-parameterized value functions [16]. We present a natural\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fextension of this approach that enables use of complex non-linear generalization methods\nsuch as deep neural networks. We show that the bootstrap with random initialization can\nproduce reasonable uncertainty estimates for neural networks at low computational cost.\nBootstrapped DQN leverages these uncertainty estimates for ecient (and deep) exploration.\nWe demonstrate that these bene\ufb01ts can extend to large scale problems that are not designed\nto highlight deep exploration. Bootstrapped DQN substantially reduces learning times and\nimproves performance across most games. 
This algorithm is computationally efficient and parallelizable; on a single machine our implementation runs roughly 20% slower than DQN.\n\n2 Uncertainty for neural networks\n\nDeep neural networks (DNN) represent the state of the art in many supervised and reinforcement learning domains [12]. We want an exploration strategy that is statistically and computationally efficient together with a DNN representation of the value function. To explore efficiently, the first step is to quantify uncertainty in value estimates so that the agent can judge potential benefits of exploratory actions. The neural network literature presents a sizable body of work on uncertainty quantification founded on parametric Bayesian inference [3, 7]. We actually found the simple non-parametric bootstrap with random initialization [5] more effective in our experiments, but the main ideas of this paper would apply with any other approach to uncertainty in DNNs.\n\nThe bootstrap principle is to approximate a population distribution by a sample distribution [6]. In its most common form, the bootstrap takes as input a data set D and an estimator \u00c2. To generate a sample from the bootstrapped distribution, a data set \u02dcD of cardinality equal to that of D is sampled uniformly with replacement from D. The bootstrap sample estimate is then taken to be \u00c2(\u02dcD). The bootstrap is widely hailed as a great advance of 20th century applied statistics and even comes with theoretical guarantees [2]. In Figure 1a we present an efficient and scalable method for generating bootstrap samples from a large and deep neural network. The network consists of a shared architecture with K bootstrapped \u201cheads\u201d branching off independently. Each head is trained only on its bootstrapped sub-sample of the data and represents a single bootstrap sample \u00c2(\u02dcD).
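The resampling step just described can be sketched in a few lines of numpy (the data set and estimator below are illustrative, not from the paper):

```python
import numpy as np

def bootstrap_samples(data, estimator, k=10, seed=None):
    """Draw k bootstrap estimates: resample |D| points with replacement, re-estimate."""
    rng = np.random.default_rng(seed)
    n = len(data)
    return [estimator(data[rng.integers(0, n, size=n)]) for _ in range(k)]

# Illustrative: bootstrap distribution of the sample mean.
data = np.array([1.0, 2.0, 4.0, 8.0])
estimates = bootstrap_samples(data, np.mean, k=500, seed=0)
spread = float(np.std(estimates))  # sample-based proxy for uncertainty in the estimator
```

The spread of the k estimates stands in for the sampling distribution of the estimator; the shared-network architecture of Figure 1a amortizes this idea across K heads of one network.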
The shared network learns a joint feature representation across all the data, which can provide significant computational advantages at the cost of lower diversity between heads. This type of bootstrap can be trained efficiently in a single forward/backward pass; it can be thought of as a data-dependent dropout, where the dropout mask for each head is fixed for each data point [19].\n\nFigure 1: Bootstrapped neural nets can produce reasonable posterior estimates for regression. (a) Shared network architecture; (b) Gaussian process posterior; (c) Bootstrapped neural nets.\n\nFigure 1 presents an example of uncertainty estimates from bootstrapped neural networks on a regression task with noisy data. We trained fully-connected 2-layer neural networks with 50 rectified linear units (ReLU) in each layer on 50 bootstrapped samples from the data. As is standard, we initialize these networks with random parameter values; this induces an important initial diversity in the models. We were unable to generate effective uncertainty estimates for this problem using the dropout approach in prior literature [7]. Further details are provided in Appendix A.\n\n3 Bootstrapped DQN\n\nFor a policy \u03c0 we define the value of an action a in state s as Q\u03c0(s, a) := E_{s,a,\u03c0}[\u2211_{t=1}^\u221e \u03b3^t r_t], where \u03b3 \u2208 (0, 1) is a discount factor that balances immediate versus future rewards r_t. This expectation indicates that the initial state is s, the initial action is a, and thereafter actions are selected by the policy \u03c0. The optimal value is Q*(s, a) := max_\u03c0 Q\u03c0(s, a). To scale to large problems, we learn a parameterized estimate of the Q-value function Q(s, a; \u03b8) rather than a tabular encoding. We use a neural network to estimate this value.\n\nThe Q-learning update from state s_t, action a_t, reward r_t and new state s_{t+1} is given by\n\n\u03b8_{t+1} \u2190 \u03b8_t + \u03b1 (y_t^Q \u2212 Q(s_t, a_t; \u03b8_t)) \u2207_\u03b8 Q(s_t, a_t; \u03b8_t)   (1)\n\nwhere \u03b1 is the scalar learning rate and y_t^Q is the target value r_t + \u03b3 max_a Q(s_{t+1}, a; \u03b8\u207b). \u03b8\u207b are target network parameters fixed at \u03b8\u207b = \u03b8_t.\n\nSeveral important modifications to the Q-learning update improve stability for DQN [12]. First, the algorithm learns from sampled transitions from an experience buffer, rather than learning fully online. Second, the algorithm uses a target network with parameters \u03b8\u207b that are copied from the learning network (\u03b8\u207b \u2190 \u03b8_t) only every \u03c4 time steps and then kept fixed in between updates. Double DQN [25] modifies the target y_t^Q and helps further1:\n\ny_t^Q \u2190 r_t + \u03b3 Q(s_{t+1}, arg max_a Q(s_{t+1}, a; \u03b8_t); \u03b8\u207b)   (2)\n\nBootstrapped DQN modifies DQN to approximate a distribution over Q-values via the bootstrap. At the start of each episode, bootstrapped DQN samples a single Q-value function from its approximate posterior. The agent then follows the policy which is optimal for that sample for the duration of the episode. This is a natural adaptation of the Thompson sampling heuristic to RL that allows for temporally extended (or deep) exploration [21, 13]. We implement this algorithm efficiently by building up K \u2208 \u2115 bootstrapped estimates of the Q-value function in parallel as in Figure 1a. Importantly, each one of these value function heads Q_k(s, a; \u03b8) is trained against its own target network Q_k(s, a; \u03b8\u207b). This means that each of Q_1, .., Q_K provides a temporally extended (and consistent) estimate of the value uncertainty via TD estimates. In order to keep track of which data belongs to which bootstrap head we store flags w_1, .., w_K \u2208 {0, 1} indicating which heads are privy to which data.
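As a minimal sketch of the mechanics just described (per-episode head sampling, bootstrap flags, and per-head Double-DQN-style targets), here is a tabular stand-in for the K-headed network; the sizes, rewards and learning rate are ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
K, S, A, gamma = 10, 5, 2, 0.99            # heads, states, actions, discount

q = rng.normal(0.0, 0.1, size=(K, S, A))   # K online heads; random init gives diversity
q_target = q.copy()                        # per-head target parameters (theta minus)

def start_episode():
    """Sample one head uniformly and follow it greedily for the whole episode."""
    return int(rng.integers(K))

def masked_update(j, s, a, r, s_next, w, alpha=0.1):
    """Per-head Double-DQN-style update, applied only where the bootstrap flag is set."""
    if not w[j]:
        return
    a_star = int(np.argmax(q[j, s_next]))          # online head selects the action
    y = r + gamma * q_target[j, s_next, a_star]    # its own target head evaluates it
    q[j, s, a] += alpha * (y - q[j, s, a])

k = start_episode()                        # head followed this episode
w = rng.binomial(1, 0.5, size=K)           # flags w_1..w_K ~ Ber(p), stored with the data
for j in range(K):
    masked_update(j, s=0, a=1, r=1.0, s_next=1, w=w)
```

In the full algorithm the tabular arrays are replaced by the shared-trunk network of Figure 1a, and the masks live in the replay buffer alongside each transition.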
We approximate a bootstrap sample by selecting k \u0153{ 1, .., K} uniformly at\nrandom and following Qk for the duration of that episode. We present a detailed algorithm\nfor our implementation of bootstrapped DQN in Appendix B.\n4 Related work\nThe observation that temporally extended exploration is necessary for ecient reinforcement\nlearning is not new. For any prior distribution over MDPs, the optimal exploration strategy\nis available through dynamic programming in the Bayesian belief state space. However, the\nexact solution is intractable even for very simple systems[8]. Many successful RL applications\nfocus on generalization and planning but address exploration only via inecient exploration\n[12] or even none at all [23]. However, such exploration strategies can be highly inecient.\nMany exploration strategies are guided by the principle of \u201coptimism in the face of uncertainty\u201d\n(OFU). These algorithms add an exploration bonus to values of state-action pairs that\nmay lead to useful learning and select actions to maximize these adjusted values. This\napproach was \ufb01rst proposed for \ufb01nite-armed bandits [11], but the principle has been extended\nsuccessfully across bandits with generalization and tabular RL [9]. Except for particular\ndeterministic contexts [27], OFU methods that lead to ecient RL in complex domains\nhave been computationally intractable. The work of [20] aims to add an eective bonus\nthrough a variation of DQN. The resulting algorithm relies on a large number of hand-tuned\nparameters and is only suitable for application to deterministic problems. We compare our\nresults on Atari to theirs in Appendix D and \ufb01nd that bootstrapped DQN oers a signi\ufb01cant\nimprovement over previous methods.\nPerhaps the oldest heuristic for balancing exploration with exploitation is given by Thompson\nsampling [24]. 
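In its classic Beta-Bernoulli bandit form, the Thompson sampling heuristic can be sketched as follows (the arm means and step count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def thompson_step(successes, failures):
    """One Thompson sampling step for a Bernoulli bandit: sample a mean from
    each arm's Beta posterior (uniform Beta(1,1) prior), play the argmax."""
    theta = rng.beta(successes + 1, failures + 1)
    return int(np.argmax(theta))

# Two arms with true means 0.2 and 0.8 (illustrative).
true_means = np.array([0.2, 0.8])
s, f = np.zeros(2), np.zeros(2)
for _ in range(500):
    a = thompson_step(s, f)
    reward = rng.random() < true_means[a]
    s[a] += reward
    f[a] += 1 - reward
```

The algorithm plays each arm in proportion to its posterior probability of being optimal, so play concentrates on the better arm as evidence accumulates.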
This bandit algorithm takes a single sample from the posterior at every time\nstep and chooses the action which is optimal for that time step. To apply the Thompson\nsampling principle to RL, an agent should sample a value function from its posterior. Naive\napplications of Thompson sampling to RL which resample every timestep can be extremely\n\n1In this paper we use the DDQN update for all DQN variants unless explicitly stated.\n\n3\n\n\finecient. The agent must also commit to this sample for several time steps in order to\nachieve deep exploration [21, 8]. The algorithm PSRL does exactly this, with state of the\nart guarantees [13, 14]. However, this algorithm still requires solving a single known MDP,\nwhich will usually be intractable for large systems.\nOur new algorithm, bootstrapped DQN, approximates this approach to exploration via\nrandomized value functions sampled from an approximate posterior. Recently, authors have\nproposed the RLSVI algorithm which accomplishes this for linearly parameterized value\nfunctions. Surprisingly, RLSVI recovers state of the art guarantees in the setting with\ntabular basis functions, but its performance is crucially dependent upon a suitable linear\nrepresentation of the value function [16]. We extend these ideas to produce an algorithm\nthat can simultaneously perform generalization and exploration with a \ufb02exible nonlinear\nvalue function representation. Our method is simple, general and compatible with almost all\nadvances in deep RL at low computational cost and with few tuning parameters.\n\n5 Deep Exploration\n\nUncertainty estimates allow an agent to direct its exploration at potentially informative states\nand actions. In bandits, this choice of directed exploration rather than dithering generally\ncategorizes ecient algorithms. The story in RL is not as simple, directed exploration is not\nenough to guarantee eciency; the exploration must also be deep. 
Deep exploration means exploration which is directed over multiple time steps; it can also be called \u201cplanning to learn\u201d or \u201cfar-sighted\u201d exploration. Unlike bandit problems, which balance actions that are immediately rewarding against actions that are immediately informative, RL settings require planning over several time steps [10]. For exploitation, this means that an efficient agent must consider the future rewards over several time steps and not simply the myopic rewards. In exactly the same way, efficient exploration may require taking actions which are neither immediately rewarding, nor immediately informative.\n\nTo illustrate this distinction, consider a simple deterministic chain {s\u22123, .., s+3} with a three-step horizon starting from state s0. This MDP is known to the agent a priori, with deterministic actions \u201cleft\u201d and \u201cright\u201d. All states have zero reward, except for the leftmost state s\u22123, which has known reward \u03b5 > 0, and the rightmost state s3, which is unknown. In order to reach either a rewarding state or an informative state within three steps from s0 the agent must plan a consistent strategy over several time steps. Figure 2 depicts the planning and look-ahead trees for several algorithmic approaches in this example MDP. The action \u201cleft\u201d is gray, the action \u201cright\u201d is black. Rewarding states are depicted in red, informative states in blue. Dashed lines indicate that the agent can plan ahead for either rewards or information. Unlike bandit algorithms, an RL agent can plan to exploit future rewards. Only an RL agent with deep exploration can plan to learn.\n\nFigure 2: Planning, learning and exploration in RL. (a) Bandit algorithm; (b) RL+dithering; (c) RL+shallow explore; (d) RL+deep explore.\n\n5.1 Testing for deep exploration\n\nWe now present a series of didactic computational experiments designed to highlight the need for deep exploration.
These environments can be described by chains of length N > 3 in Figure 3. Each episode of interaction lasts N + 9 steps, after which point the agent resets to the initial state s2. These are toy problems intended to be expository rather than entirely realistic. Balancing a well-known and mildly successful strategy against an unknown, but potentially more rewarding, approach can emerge in many practical applications.\n\nFigure 3: Scalable environments that require deep exploration.\n\nThese environments may be described by a finite tabular MDP. However, we consider algorithms which interact with the MDP only through raw pixel features. We consider two feature mappings \u03c61hot(s_t) := 1{x = s_t} and \u03c6therm(s_t) := 1{x \u2264 s_t} in {0, 1}^N. We present results for \u03c6therm, which worked better for all DQN variants due to better generalization, but the difference was relatively small (see Appendix C). Thompson DQN is the same as bootstrapped DQN, but resamples every timestep. Ensemble DQN uses the same architecture as bootstrapped DQN, but with an ensemble policy.\n\nWe say that the algorithm has successfully learned the optimal policy when it has completed one hundred episodes with the optimal reward of 10. For each chain length, we ran each learning algorithm for 2000 episodes across three seeds. We plot the median time to learn in Figure 4, together with a conservative lower bound of 99 + 2^(N\u221211) on the expected time to learn for any shallow exploration strategy [16].
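A minimal version of such a chain environment, with the thermometer encoding \u03c6therm, might look like this; the reward magnitudes below are our stand-ins for the small known reward and the large distal reward:

```python
import numpy as np

class Chain:
    """Toy chain of length N: start at s=2; small known reward at s=1, big reward at s=N."""
    def __init__(self, n):
        self.n, self.horizon = n, n + 9       # episodes last N + 9 steps
    def reset(self):
        self.s, self.t = 2, 0
        return self.phi_therm(self.s)
    def step(self, action):                   # action 0 = left, 1 = right
        self.s = max(1, self.s - 1) if action == 0 else min(self.n, self.s + 1)
        self.t += 1
        r = 0.001 if self.s == 1 else (1.0 if self.s == self.n else 0.0)
        return self.phi_therm(self.s), r, self.t >= self.horizon
    def phi_therm(self, s):                   # thermometer features 1{x <= s}
        return (np.arange(1, self.n + 1) <= s).astype(np.float32)

env = Chain(10)
x0 = env.reset()                              # features for the start state s = 2
```

A dithering agent must string together N - 2 consecutive "right" actions by chance to ever see the distal reward, which is the source of the exponential lower bound quoted above.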
Only bootstrapped DQN demonstrates a\ngraceful scaling to long chains which require deep exploration.\n\nFigure 4: Only Bootstrapped DQN demonstrates deep exploration.\n\n5.2 How does bootstrapped DQN drive deep exploration?\nBootstrapped DQN explores in a manner similar to the provably-ecient algorithm PSRL\n[13] but it uses a bootstrapped neural network to approximate a posterior sample for the value.\nUnlike PSRL, bootstrapped DQN directly samples a value function and so does not require\nfurther planning steps. This algorithm is similar to RLSVI, which is also provably-ecient\n[16], but with a neural network instead of linear value function and bootstrap instead of\nGaussian sampling. The analysis for the linear setting suggests that this nonlinear approach\nwill work well so long as the distribution {Q1, .., QK} remains stochastically optimistic [16],\nor at least as spread out as the \u201ccorrect\u201d posterior.\nBootstrapped DQN relies upon random initialization of the network weights as a prior\nto induce diversity. Surprisingly, we found this initial diversity was enough to maintain\ndiverse generalization to new and unseen states for large and deep neural networks. This\nis eective for our experimental setting, but will not work in all situations. In general it\nmay be necessary to maintain some more rigorous notion of \u201cprior\u201d, potentially through\nthe use of arti\ufb01cial prior data to maintain diversity [15]. One potential explanation for the\necacy of simple random initialization is that unlike supervised learning or bandits, where\nall networks \ufb01t the same data, each of our Qk heads has a unique target network. 
This, together with stochastic minibatches and flexible nonlinear representations, means that even small differences at initialization may become bigger as they refit to unique TD errors.\n\nBootstrapped DQN does not require that any single network Q_k is initialized to the correct policy of \u201cright\u201d at every step, which would be exponentially unlikely for large chains N. For the algorithm to be successful in this example we only require that the networks generalize in a diverse way to the actions they have never chosen in the states they have not visited very often. Imagine that, in the example above, the network has made it as far as state \u02dcN < N, but never observed the action right a = 2. As long as one head k imagines Q(\u02dcN, 2) > Q(\u02dcN, 1), then TD bootstrapping can propagate this signal back to s = 1 through the target network to drive deep exploration. The expected time for these estimates at n to propagate to at least one head grows gracefully in n, even for relatively small K, as our experiments show. We expand upon this intuition with a video designed to highlight how bootstrapped DQN demonstrates deep exploration: https://youtu.be/e3KuV_d0EMk. We present further evaluation on a difficult stochastic MDP in Appendix C.\n\n6 Arcade Learning Environment\n\nWe now evaluate our algorithm across 49 Atari games on the Arcade Learning Environment [1]. Importantly, and unlike the experiments in Section 5, these domains are not specifically designed to showcase our algorithm. In fact, many Atari games are structured so that small rewards always indicate part of an optimal policy. This may be crucial for the strong performance observed by dithering strategies2. We find that exploration via bootstrapped DQN produces significant gains versus \u03b5-greedy in this setting. Bootstrapped DQN reaches peak performance roughly similar to DQN.
However, our improved exploration means we reach human performance on average 30% faster across all games. This translates to significantly improved cumulative rewards through learning.\n\nWe follow the setup of [25] for our network architecture and benchmark our performance against their algorithm. Our network structure is identical to the convolutional structure of DQN [12] except we split into 10 separate bootstrap heads after the convolutional layer as per Figure 1a. Recently, several authors have provided architectural and algorithmic improvements to DDQN [26, 18]. We do not compare our results to these since their advances are orthogonal to our concern and could easily be incorporated into our bootstrapped DQN design. Full details of our experimental set-up are available in Appendix D.\n\n6.1 Implementing bootstrapped DQN at scale\n\nWe now examine how to generate online bootstrap samples for DQN in a computationally efficient manner. We focus on three key questions: how many heads do we need, how should we pass gradients to the shared network, and how should we bootstrap data online? We make significant compromises in order to maintain computational cost comparable to DQN.\n\nFigure 5a presents the cumulative reward of bootstrapped DQN on the game Breakout for different numbers of heads K. More heads leads to faster learning, but even a small number of heads captures most of the benefits of bootstrapped DQN. We choose K = 10.\n\nFigure 5: Examining the sensitivities of bootstrapped DQN. (a) Number of bootstrap heads K; (b) probability of data sharing p.\n\nThe shared network architecture allows us to train this combined network via backpropagation. Feeding K network heads to the shared convolutional network effectively increases the learning rate for this portion of the network. In some games, this leads to premature and sub-optimal convergence.
We found the best final scores by normalizing the gradients by 1/K, but this also leads to slower early learning. See Appendix D for more details.\n\n2By contrast, imagine that the agent received a small immediate reward for dying; dithering strategies would be hopeless at solving this problem, just like Section 5.\n\nTo implement an online bootstrap we use an independent Bernoulli mask w_1, .., w_K \u223c Ber(p) for each head in each episode3. These flags are stored in the memory replay buffer and identify which heads are trained on which data. However, when trained using a shared minibatch the algorithm will also require effectively 1/p times more iterations; this is undesirable computationally. Surprisingly, we found the algorithm performed similarly irrespective of p and all variants outperformed DQN, as shown in Figure 5b. This is strange, and we discuss this phenomenon in Appendix D. However, in light of this empirical observation for Atari, we chose p = 1 to save on minibatch passes. As a result bootstrapped DQN runs at a similar computational speed to vanilla DQN on identical hardware4.\n\n6.2 Efficient exploration in Atari\n\nWe find that bootstrapped DQN drives efficient exploration in several Atari games. For the same amount of game experience, bootstrapped DQN generally outperforms DQN with \u03b5-greedy exploration. Figure 6 demonstrates this effect for a diverse selection of games.\n\nFigure 6: Bootstrapped DQN drives more efficient exploration.\n\nOn games where DQN performs well, bootstrapped DQN typically performs better. Bootstrapped DQN does not reach human performance on Amidar (DQN does) but does on Beam Rider and Battle Zone (DQN does not). To summarize this improvement in learning time we consider the number of frames required to reach human performance. If bootstrapped DQN reaches human performance in 1/x of the frames DQN requires, we say it has improved by x.
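The 1/K gradient rescaling on the shared trunk can be sketched as follows; the per-head gradients here are illustrative arrays standing in for backpropagated trunk gradients:

```python
import numpy as np

def shared_trunk_gradient(head_grads, normalize=True):
    """Aggregate per-head gradients into the shared trunk.
    With K heads the trunk receives roughly K times the gradient signal of a
    single-head network; rescaling the sum by 1/K keeps its effective learning
    rate comparable to vanilla DQN."""
    g = np.sum(head_grads, axis=0)
    return g / len(head_grads) if normalize else g

grads = [np.ones(4) * (i + 1) for i in range(10)]   # illustrative per-head gradients
g_norm = shared_trunk_gradient(grads)               # mean over the 10 heads
```

Without the rescaling the trunk's update is the raw sum over heads, which is what effectively inflates its learning rate in the text above.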
Figure 7\nshows that Bootstrapped DQN typically reaches human performance signi\ufb01cantly faster.\n\nFigure 7: Bootstrapped DQN reaches human performance faster than DQN.\n\nOn most games where DQN does not reach human performance, bootstrapped DQN does\nnot solve the problem by itself. On some challenging Atari games where deep exploration is\nconjectured to be important [25] our results are not entirely successful, but still promising.\nIn Frostbite, bootstrapped DQN reaches the second level much faster than DQN but network\ninstabilities cause the performance to crash. In Montezuma\u2019s Revenge, bootstrapped DQN\nreaches the \ufb01rst key after 20m frames (DQN never observes a reward even after 200m\nframes) but does not properly learn from this experience5. Our results suggest that improved\nexploration may help to solve these remaining games, but also highlight the importance of\nother problems like network instability, reward clipping and temporally extended rewards.\n\n3p=0.5 is double-or-nothing bootstrap [17], p=1 is ensemble with no bootstrapping at all.\n4Our implementation K=10, p=1 ran with less than a 20% increase on wall-time versus DQN.\n5An improved training method, such as prioritized replay [18] may help solve this problem.\n\n7\n\n\f6.3 Overall performance\nBootstrapped DQN is able to learn much faster than DQN. Figure 8 shows that bootstrapped\nDQN also improves upon the \ufb01nal score across most games. However, the real bene\ufb01ts to\necient exploration mean that bootstrapped DQN outperforms DQN by orders of magnitude\nin terms of the cumulative rewards through learning (Figure 9. In both \ufb01gures we normalize\nperformance relative to a fully random policy. The most similar work to ours presents\nseveral other approaches to improved exploration in Atari [20] they optimize for AUC-20, a\nnormalized version of the cumulative returns after 20m frames. 
According to their metric,\naveraged across the 14 games they consider, we improve upon both base DQN (0.29) and\ntheir best method (0.37) to obtain 0.62 via bootstrapped DQN. We present these results\ntogether with results tables across all 49 games in Appendix D.4.\n\nFigure 8: Bootstrapped DQN typically improves upon the best policy.\n\nFigure 9: Bootstrapped DQN improves cumulative rewards by orders of magnitude.\n\n6.4 Visualizing bootstrapped DQN\nWe now present some more insight to how bootstrapped DQN drives deep exploration in Atari.\nIn each game, although each head Q1, .., Q10 learns a high scoring policy, the policies they\n\ufb01nd are quite distinct. In the video https://youtu.be/Zm2KoT82O_M we show the evolution\nof these policies simultaneously for several games. Although each head performs well, they\neach follow a unique policy. By contrast, \u2018-greedy strategies are almost indistinguishable for\nsmall values of \u2018 and totally ineectual for larger values. We believe that this deep exploration\nis key to improved learning, since diverse experiences allow for better generalization.\nDisregarding exploration, bootstrapped DQN may be bene\ufb01cial as a purely exploitative\npolicy. We can combine all the heads into a single ensemble policy, for example by choosing\nthe action with the most votes across heads. This approach might have several bene\ufb01ts.\nFirst, we \ufb01nd that the ensemble policy can often outperform any individual policy. Second,\nthe distribution of votes across heads to give a measure of the uncertainty in the optimal\npolicy. Unlike vanilla DQN, bootstrapped DQN can know what it doesn\u2019t know. In an\napplication where executing a poorly-understood action is dangerous this could be crucial. In\nthe video https://youtu.be/0jvEcC5JvGY we visualize this ensemble policy across several\ngames. 
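A majority-vote ensemble policy of the kind described above can be sketched as follows (the Q-values are illustrative):

```python
import numpy as np

def ensemble_action(q_heads):
    """Majority vote over each head's greedy action; the spread of the votes
    serves as an uncertainty signal over the optimal policy."""
    votes = np.argmax(q_heads, axis=1)                     # greedy action per head
    counts = np.bincount(votes, minlength=q_heads.shape[1])
    return int(np.argmax(counts)), counts / len(votes)

# Illustrative Q-values for K = 5 heads over 3 actions.
q = np.array([[1, 2, 0], [0, 3, 1], [2, 1, 0], [0, 5, 2], [1, 4, 0]], dtype=float)
a, agreement = ensemble_action(q)
```

High agreement corresponds to the clearly crucial decision points noted above; low agreement flags states where the optimal action is still uncertain.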
We \ufb01nd that the uncertainty in this policy is surprisingly interpretable: all heads\nagree at clearly crucial decision points, but remain diverse at other less important steps.\n7 Closing remarks\nIn this paper we present bootstrapped DQN as an algorithm for ecient reinforcement\nlearning in complex environments. We demonstrate that the bootstrap can produce useful\nuncertainty estimates for deep neural networks. Bootstrapped DQN is computationally\ntractable and also naturally scalable to massive parallel systems. We believe that, beyond\nour speci\ufb01c implementation, randomized value functions represent a promising alternative to\ndithering for exploration. Bootstrapped DQN practically combines ecient generalization\nwith exploration for complex nonlinear value functions.\n\n8\n\n\fReferences\n[1] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning\nenvironment: An evaluation platform for general agents. arXiv preprint arXiv:1207.4708, 2012.\n[2] Peter J Bickel and David A Freedman. Some asymptotic theory for the bootstrap. The Annals\n\n[3] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty\n\nof Statistics, pages 1196\u20131217, 1981.\n\nin neural networks. ICML, 2015.\n\n[4] Christoph Dann and Emma Brunskill. Sample complexity of episodic \ufb01xed-horizon reinforcement\n\nlearning. In Advances in Neural Information Processing Systems, pages 2800\u20132808, 2015.\n\n[5] Bradley Efron. The jackknife, the bootstrap and other resampling plans, volume 38. SIAM,\n\n1982.\n\n[6] Bradley Efron and Robert J Tibshirani. An introduction to the bootstrap. CRC press, 1994.\n[7] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model\n\nuncertainty in deep learning. arXiv preprint arXiv:1506.02142, 2015.\n\n[8] Arthur Guez, David Silver, and Peter Dayan. Ecient bayes-adaptive reinforcement learning\nusing sample-based search. 
In Advances in Neural Information Processing Systems, pages\n1025\u20131033, 2012.\n\n[9] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement\n\nlearning. Journal of Machine Learning Research, 11:1563\u20131600, 2010.\n\n[10] Sham Kakade. On the Sample Complexity of Reinforcement Learning. PhD thesis, University\n\nCollege London, 2003.\n\n[11] Tze Leung Lai and Herbert Robbins. Asymptotically ecient adaptive allocation rules. Advances\n\nin applied mathematics, 6(1):4\u201322, 1985.\n\n[12] Volodymyr et al. Mnih. Human-level control through deep reinforcement learning. Nature,\n\n518(7540):529\u2013533, 2015.\n\n[13] Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) ecient reinforcement learning via\n\nposterior sampling. In NIPS, pages 3003\u20133011. Curran Associates, Inc., 2013.\n\n[14] Ian Osband and Benjamin Van Roy. Model-based reinforcement learning and the eluder\n\ndimension. In Advances in Neural Information Processing Systems, pages 1466\u20131474, 2014.\n\n[15] Ian Osband and Benjamin Van Roy. Bootstrapped thompson sampling and deep exploration.\n\narXiv preprint arXiv:1507.00300, 2015.\n\n[16] Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and exploration via randomized\n\nvalue functions. arXiv preprint arXiv:1402.0635, 2014.\n\n[17] Art B Owen, Dean Eckles, et al. Bootstrapping data arrays of arbitrary order. The Annals of\n\n[18] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay.\n\nApplied Statistics, 6(3):895\u2013927, 2012.\n\narXiv preprint arXiv:1511.05952, 2015.\n\n[19] Nitish Srivastava, Georey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov.\nDropout: A simple way to prevent neural networks from over\ufb01tting. The Journal of Machine\nLearning Research, 15(1):1929\u20131958, 2014.\n\n[20] Bradly C Stadie, Sergey Levine, and Pieter Abbeel. Incentivizing exploration in reinforcement\n\nlearning with deep predictive models. 
arXiv preprint arXiv:1507.00814, 2015.\n\n[21] Malcolm J. A. Strens. A bayesian framework for reinforcement learning. In ICML, pages\n\n943\u2013950, 2000.\n\nMarch 1998.\n\n38(3):58\u201368, 1995.\n\n[22] Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press,\n\n[23] Gerald Tesauro. Temporal dierence learning and td-gammon. Communications of the ACM,\n\n[24] W.R. Thompson. On the likelihood that one unknown probability exceeds another in view of\n\nthe evidence of two samples. Biometrika, 25(3/4):285\u2013294, 1933.\n\n[25] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double\n\nq-learning. arXiv preprint arXiv:1509.06461, 2015.\n\n[26] Ziyu Wang, Nando de Freitas, and Marc Lanctot. Dueling network architectures for deep\n\nreinforcement learning. arXiv preprint arXiv:1511.06581, 2015.\n\n[27] Zheng Wen and Benjamin Van Roy. Ecient exploration and value function generalization in\n\ndeterministic systems. In NIPS, pages 3021\u20133029, 2013.\n\n9\n\n\f", "award": [], "sourceid": 2017, "authors": [{"given_name": "Ian", "family_name": "Osband", "institution": "DeepMind"}, {"given_name": "Charles", "family_name": "Blundell", "institution": "DeepMind"}, {"given_name": "Alexander", "family_name": "Pritzel", "institution": "Google Deepmind"}, {"given_name": "Benjamin", "family_name": "Van Roy", "institution": "Stanford University"}]}