{"title": "Unifying Count-Based Exploration and Intrinsic Motivation", "book": "Advances in Neural Information Processing Systems", "page_first": 1471, "page_last": 1479, "abstract": "We consider an agent's uncertainty about its environment and the problem of generalizing this uncertainty across states. Specifically, we focus on the problem of exploration in non-tabular reinforcement learning. Drawing inspiration from the intrinsic motivation literature, we use density models to measure uncertainty, and propose a novel algorithm for deriving a pseudo-count from an arbitrary density model. This technique enables us to generalize count-based exploration algorithms to the non-tabular case. We apply our ideas to Atari 2600 games, providing sensible pseudo-counts from raw pixels. We transform these pseudo-counts into exploration bonuses and obtain significantly improved exploration in a number of hard games, including the infamously difficult Montezuma's Revenge.", "full_text": "Unifying Count-Based Exploration and Intrinsic Motivation\n\nMarc G. Bellemare\nbellemare@google.com\n\nSriram Srinivasan\n\nsrsrinivasan@google.com\n\nGeorg Ostrovski\n\nostrovski@google.com\n\nTom Schaul\n\nschaul@google.com\n\nDavid Saxton\n\nsaxton@google.com\n\nR\u00b4emi Munos\n\nmunos@google.com\n\nGoogle DeepMind\n\nLondon, United Kingdom\n\nAbstract\n\nWe consider an agent\u2019s uncertainty about its environment and the problem of gen-\neralizing this uncertainty across states. Speci\ufb01cally, we focus on the problem of\nexploration in non-tabular reinforcement learning. Drawing inspiration from the\nintrinsic motivation literature, we use density models to measure uncertainty, and\npropose a novel algorithm for deriving a pseudo-count from an arbitrary density\nmodel. This technique enables us to generalize count-based exploration algo-\nrithms to the non-tabular case. We apply our ideas to Atari 2600 games, providing\nsensible pseudo-counts from raw pixels. 
We transform these pseudo-counts into exploration bonuses and obtain significantly improved exploration in a number of hard games, including the infamously difficult MONTEZUMA'S REVENGE.

1 Introduction

Exploration algorithms for Markov Decision Processes (MDPs) are typically concerned with reducing the agent's uncertainty over the environment's reward and transition functions. In a tabular setting, this uncertainty can be quantified using confidence intervals derived from Chernoff bounds, or inferred from a posterior over the environment parameters. In fact, both confidence intervals and posteriors shrink as the inverse square root of the state-action visit count N(x, a), making this quantity fundamental to most theoretical results on exploration.

Count-based exploration methods directly use visit counts to guide an agent's behaviour towards reducing uncertainty. For example, Model-based Interval Estimation with Exploration Bonuses (MBIE-EB; Strehl and Littman, 2008) solves the augmented Bellman equation

V(x) = max_{a∈A} [ R̂(x, a) + γ E_{P̂}[V(x′)] + β N(x, a)^{−1/2} ],

involving the empirical reward R̂, the empirical transition function P̂, and an exploration bonus proportional to N(x, a)^{−1/2}. This bonus accounts for uncertainties in both transition and reward functions and enables a finite-time bound on the agent's suboptimality.

In spite of their pleasant theoretical guarantees, count-based methods have not played a role in the contemporary successes of reinforcement learning (e.g. Mnih et al., 2015). Instead, most practical methods still rely on simple rules such as ε-greedy.
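The augmented Bellman equation above can be exercised with a few lines of tabular value iteration. The following is our own illustrative sketch, not the authors' implementation; the two-state MDP, its rewards, and its visit counts are invented for the example.

```python
import numpy as np

def mbie_eb_values(R, P, N, beta=1.0, gamma=0.9, iters=500):
    """Iterate V(x) = max_a [ R(x,a) + gamma * E_P[V(x')] + beta * N(x,a)^(-1/2) ]."""
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        # Q has shape (states, actions); the bonus term shrinks as counts grow.
        Q = R + gamma * np.einsum('xay,y->xa', P, V) + beta / np.sqrt(N)
        V = Q.max(axis=1)
    return V

# Two states, two actions, deterministic transitions:
# in each state, action 0 goes to state 0 and action 1 goes to state 1.
R = np.array([[0.0, 0.0], [1.0, 1.0]])
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
# Action 1 in state 0 has only been tried once, so its bonus dominates.
N = np.array([[100.0, 1.0], [100.0, 100.0]])
V_bonus = mbie_eb_values(R, P, N, beta=1.0)
V_plain = mbie_eb_values(R, P, N, beta=0.0)
```

With β = 0 this is ordinary value iteration; with β > 0 every value rises, and rarely tried actions rise the most, which is exactly the optimism the bonus is meant to inject.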
The issue is that visit counts are not directly useful in large domains, where states are rarely visited more than once.

Answering a different scientific question, intrinsic motivation aims to provide qualitative guidance for exploration (Schmidhuber, 1991; Oudeyer et al., 2007; Barto, 2013). This guidance can be summarized as "explore what surprises you". A typical approach guides the agent based on change in prediction error, or learning progress. If en(A) is the error made by the agent at time n over some event A, and en+1(A) the same error after observing a new piece of information, then learning progress is

en(A) − en+1(A).

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Intrinsic motivation methods are attractive as they remain applicable in the absence of the Markov property or of a tabular representation, both of which are required by count-based algorithms. Yet the theoretical foundations of intrinsic motivation remain largely absent from the literature, which may explain its slow rate of adoption as a standard approach to exploration.

In this paper we provide formal evidence that intrinsic motivation and count-based exploration are but two sides of the same coin. Specifically, we consider a frequently used measure of learning progress, information gain (Cover and Thomas, 1991). Defined as the Kullback-Leibler divergence of a prior distribution from its posterior, information gain can be related to the confidence intervals used in count-based exploration. Our contribution is to propose a new quantity, the pseudo-count, which connects information-gain-as-learning-progress and count-based exploration.

We derive our pseudo-count from a density model over the state space. This is a departure from more traditional approaches to intrinsic motivation that consider learning progress with respect to a transition model.
We expose the relationship between pseudo-counts, a variant of Schmidhuber's compression progress we call prediction gain, and information gain. Combined with Kolter and Ng's negative result on the frequentist suboptimality of Bayesian bonuses, our result highlights the theoretical advantages of pseudo-counts compared to many existing intrinsic motivation methods.

The pseudo-counts we introduce here are best thought of as "function approximation for exploration". We bring them to bear on Atari 2600 games from the Arcade Learning Environment (Bellemare et al., 2013), focusing on games where myopic exploration fails. We extract our pseudo-counts from a simple density model and use them within a variant of MBIE-EB. We apply them to an experience replay setting and to an actor-critic setting, and find improved performance in both cases. Our approach produces dramatic progress on the reputedly most difficult Atari 2600 game, MONTEZUMA'S REVENGE: within a fraction of the training time, our agent explores a significant portion of the first level and obtains significantly higher scores than previously published agents.

2 Notation

We consider a countable state space X. We denote a sequence of length n from X by x1:n ∈ X^n, the set of finite sequences from X by X*, write x1:nx to mean the concatenation of x1:n and a state x ∈ X, and denote the empty sequence by ε. A model over X is a mapping from X* to probability distributions over X. That is, for each x1:n ∈ X^n the model provides a probability distribution

ρn(x) := ρ(x ; x1:n).

Note that we do not require ρn(x) to be strictly positive for all x and x1:n. When it is, however, we may understand ρn(x) to be the usual conditional probability of Xn+1 = x given X1 ... Xn = x1:n.

We will take particular interest in the empirical distribution μn derived from the sequence x1:n.
If Nn(x) := N(x, x1:n) is the number of occurrences of a state x in the sequence x1:n, then

μn(x) := μ(x ; x1:n) := Nn(x) / n.

We call Nn the empirical count function, or simply the empirical count. The above notation extends to state-action spaces, and we write Nn(x, a) to explicitly refer to the number of occurrences of a state-action pair when the argument requires it. When x1:n is generated by an ergodic Markov chain, for example if we follow a fixed policy in a finite-state MDP, then the limit point of μn is the chain's stationary distribution.

In our setting, a density model is any model that assumes states are independently (but not necessarily identically) distributed; a density model is thus a particular kind of generative model. We emphasize that a density model differs from a forward model, which takes into account the temporal relationship between successive states. Note that μn is itself a density model.

3 From Densities to Counts

In the introduction we argued that the visit count Nn(x) (and consequently, Nn(x, a)) is not directly useful in practical settings, since states are rarely revisited. Specifically, Nn(x) is almost always zero and cannot help answer the question "How novel is this state?" Nor is the problem solved by a Bayesian approach: even variable-alphabet models (e.g. Hutter, 2013) must assign a small, diminishing probability to yet-unseen states. To estimate the uncertainty of an agent's knowledge, we must instead look for a quantity which generalizes across states. Guided by ideas from the intrinsic motivation literature, we now derive such a quantity. We call it a pseudo-count as it extends the familiar notion from Bayesian estimation.

3.1 Pseudo-Counts and the Recoding Probability

We are given a density model ρ over X. This density model may be approximate, biased, or even inconsistent.
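The empirical quantities from Section 2 can be grounded in a few lines. This is our own minimal sketch (the helper names are not the paper's): the empirical count Nn and the empirical distribution μn(x) = Nn(x)/n, which is itself a tabular density model.

```python
from collections import Counter

def empirical_counts(seq):
    """N_n(x): number of occurrences of each state in x_{1:n}."""
    return Counter(seq)

def empirical_distribution(seq):
    """mu_n(x) = N_n(x) / n, itself a (tabular) density model."""
    n = len(seq)
    return {x: c / n for x, c in empirical_counts(seq).items()}

# n = 5 observations: N(a) = 3, N(b) = 1, N(c) = 1.
mu = empirical_distribution(list("aabac"))
```

Here `mu["a"]` is 3/5 = 0.6, and the probabilities sum to one, as a distribution must.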
We begin by introducing the recoding probability of a state x:

ρ′n(x) := ρ(x ; x1:nx).

This is the probability assigned to x by our density model after observing a new occurrence of x. The term "recoding" is inspired by the statistical compression literature, where coding costs are inversely related to probabilities (Cover and Thomas, 1991). When ρ admits a conditional probability distribution,

ρ′n(x) = Pr_ρ(Xn+2 = x | X1 ... Xn = x1:n, Xn+1 = x).

We now postulate two unknowns: a pseudo-count function N̂n(x), and a pseudo-count total n̂. We relate these two unknowns through two constraints:

ρn(x) = N̂n(x) / n̂,    ρ′n(x) = (N̂n(x) + 1) / (n̂ + 1).    (1)

In words: we require that, after observing one instance of x, the density model's increase in prediction of that same x should correspond to a unit increase in pseudo-count. The pseudo-count itself is derived from solving the linear system (1):

N̂n(x) = ρn(x)(1 − ρ′n(x)) / (ρ′n(x) − ρn(x)) = n̂ ρn(x).    (2)

Note that the equations (1) yield N̂n(x) = 0 (with n̂ = ∞) when ρn(x) = ρ′n(x) = 0, and are inconsistent when ρn(x) < ρ′n(x) = 1. These cases may arise from poorly behaved density models, but are easily accounted for. From here onwards we will assume a consistent system of equations.

Definition 1 (Learning-positive density model). A density model ρ is learning-positive if for all x1:n ∈ X^n and all x ∈ X, ρ′n(x) ≥ ρn(x).

By inspecting (2), we see that

1. N̂n(x) ≥ 0 if and only if ρ is learning-positive;
2. N̂n(x) = 0 if and only if ρn(x) = 0; and
3. N̂n(x) = ∞ if and only if ρn(x) = ρ′n(x).

In many cases of interest, the pseudo-count N̂n(x) matches our intuition. If ρn = μn then N̂n = Nn. Similarly, if ρn is a Dirichlet estimator then N̂n recovers the usual notion of pseudo-count. More importantly, if the model generalizes across states then so do pseudo-counts.

3.2 Estimating the Frequency of a Salient Event in FREEWAY

As an illustrative example, we employ our method to estimate the number of occurrences of an infrequent event in the Atari 2600 video game FREEWAY (Figure 1, screenshot). We use the Arcade Learning Environment (Bellemare et al., 2013). We will demonstrate the following:

1. Pseudo-counts are roughly zero for novel events,
2. they exhibit credible magnitudes,
3. they respect the ordering of state frequency,
4. they grow linearly (on average) with real counts,
5. they are robust in the presence of nonstationary data.

Figure 1: Pseudo-counts obtained from a CTS density model applied to FREEWAY, along with a frame representative of the salient event (crossing the road). Shaded areas depict periods during which the agent observes the salient event, dotted lines interpolate across periods during which the salient event is not observed. The reported values are 10,000-frame averages.

These properties suggest that pseudo-counts provide an appropriate generalized notion of visit counts in non-tabular settings.

In FREEWAY, the agent must navigate a chicken across a busy road. As our example, we consider estimating the number of times the chicken has reached the very top of the screen. As is the case for many Atari 2600 games, this naturally salient event is associated with an increase in score, which ALE translates into a positive reward. We may reasonably imagine that knowing how certain we are about this part of the environment is useful.
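Equation (2) from Section 3.1 is easy to exercise directly. The sketch below (our own helper, not the paper's code) solves the linear system (1) for N̂n(x), and checks the claim that the empirical distribution recovers the true count: with ρn(x) = Nn(x)/n and ρ′n(x) = (Nn(x) + 1)/(n + 1), the formula returns exactly Nn(x).

```python
def pseudo_count(rho, rho_prime):
    """Solve constraints (1): rho = N̂/n̂ and rho' = (N̂ + 1)/(n̂ + 1).

    The solution is equation (2): N̂ = rho * (1 - rho') / (rho' - rho).
    Assumes a learning-positive model, i.e. rho' > rho.
    """
    return rho * (1.0 - rho_prime) / (rho_prime - rho)

# Empirical model: state x seen N = 3 times out of n = 10 observations.
N, n = 3, 10
rho = N / n                    # mu_n(x)
rho_prime = (N + 1) / (n + 1)  # mu'_n(x), after one more occurrence of x
```

Here `pseudo_count(rho, rho_prime)` recovers N = 3 (up to floating point). The interesting case is a model that generalizes: it assigns ρn(x) > 0 to a state it has never seen, so a novel but familiar-looking state receives a nonzero pseudo-count.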
After crossing, the chicken is teleported back to the\nbottom of the screen.\nTo highlight the robustness of our pseudo-count, we consider a nonstationary policy which waits for\n250,000 frames, then applies the UP action for 250,000 frames, then waits, then goes UP again. The\nsalient event only occurs during UP periods. It also occurs with the cars in different positions, thus\nrequiring generalization. As a point of reference, we record the pseudo-counts for both the salient\nevent and visits to the chicken\u2019s start position.\nWe use a simpli\ufb01ed, pixel-level version of the CTS model for Atari 2600 frames proposed by Belle-\nmare et al. (2014), ignoring temporal dependencies. While the CTS model is rather impoverished in\ncomparison to state-of-the-art density models for images (e.g. Van den Oord et al., 2016), its count-\nbased nature results in extremely fast learning, making it an appealing candidate for exploration.\nFurther details on the model may be found in the appendix.\nExamining the pseudo-counts depicted in Figure 1 con\ufb01rms that they exhibit the desirable properties\nlisted above. In particular, the pseudo-count is almost zero on the \ufb01rst occurrence of the salient event;\nit increases slightly during the 3rd period, since the salient and reference events share some common\nstructure; throughout, it remains smaller than the reference pseudo-count. The linearity on average\nand robustness to nonstationarity are immediate from the graph. Note, however, that the pseudo-\ncounts are a fraction of the real visit counts (inasmuch as we can de\ufb01ne \u201creal\u201d): by the end of the\ntrial, the start position has been visited about 140,000 times, and the topmost part of the screen, 1285\ntimes. Furthermore, the ratio of recorded pseudo-counts differs from the ratio of real counts. 
Both effects are quantifiable, as we shall show in Section 5.

4 The Connection to Intrinsic Motivation

Having argued that pseudo-counts appropriately generalize visit counts, we will now show that they are closely related to information gain, which is commonly used to quantify novelty or curiosity and consequently as an intrinsic reward. Information gain is defined in relation to a mixture model ξ over a class of density models M. This model predicts according to a weighted combination from M:

ξn(x) := ξ(x ; x1:n) := ∫_{ρ∈M} wn(ρ) ρ(x ; x1:n) dρ,

with wn(ρ) the posterior weight of ρ. This posterior is defined recursively, starting from a prior distribution w0 over M:

wn+1(ρ) := wn(ρ, xn+1),    wn(ρ, x) := wn(ρ) ρ(x ; x1:n) / ξn(x).    (3)

Information gain is then the Kullback-Leibler divergence from prior to posterior that results from observing x:

IGn(x) := IG(x ; x1:n) := KL( wn(·, x) ‖ wn ).

Computing the information gain of a complex density model is often impractical, if not downright intractable. However, a quantity which we call the prediction gain provides us with a good approximation of the information gain. We define the prediction gain of a density model ρ (and in particular, ξ) as the difference between the recoding log-probability and log-probability of x:

PGn(x) := log ρ′n(x) − log ρn(x).

Prediction gain is nonnegative if and only if ρ is learning-positive. It is related to the pseudo-count:

N̂n(x) ≈ ( e^{PGn(x)} − 1 )^{−1},

with equality when ρ′n(x) → 0. As the following theorem shows, prediction gain allows us to relate pseudo-count and information gain.

Theorem 1. Consider a sequence x1:n ∈ X^n. Let ξ be a mixture model over a class of learning-positive models M. Let N̂n be the pseudo-count derived from ξ (Equation 2). For this model,

IGn(x) ≤ PGn(x) ≤ N̂n(x)^{−1}    and    PGn(x) ≤ N̂n(x)^{−1/2}.

Theorem 1 suggests that using an exploration bonus proportional to N̂n(x)^{−1/2}, similar to the MBIE-EB bonus, leads to a behaviour at least as exploratory as one derived from an information gain bonus. Since pseudo-counts correspond to empirical counts in the tabular setting, this approach also preserves known theoretical guarantees. In fact, we are confident pseudo-counts may be used to prove similar results in non-tabular settings.

On the other hand, it may be difficult to provide theoretical guarantees about existing bonus-based intrinsic motivation approaches. Kolter and Ng (2009) showed that no algorithm based on a bonus upper bounded by βNn(x)^{−1} for any β > 0 can guarantee PAC-MDP optimality. Again considering the tabular setting and combining their result with Theorem 1, we conclude that bonuses proportional to immediate information (or prediction) gain are insufficient for theoretically near-optimal exploration: to paraphrase Kolter and Ng, these methods explore too little in comparison to pseudo-count bonuses. By inspecting (2) we come to a similar negative conclusion for bonuses proportional to the L1 or L2 distance between ξ′n and ξn.

Unlike many intrinsic motivation algorithms, pseudo-counts also do not rely on learning a forward (transition and/or reward) model.
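The approximation N̂n(x) ≈ (e^{PGn(x)} − 1)^{−1} can be checked numerically. Below is our own sketch, using an empirical model rather than a mixture: it compares the exact pseudo-count of equation (2) with the prediction-gain approximation, and checks the bound PGn(x) ≤ N̂n(x)^{−1}, which holds for any learning-positive model (since log(1 + u) ≤ u).

```python
import math

def pseudo_count(rho, rho_prime):
    """Equation (2): N̂ = rho * (1 - rho') / (rho' - rho)."""
    return rho * (1.0 - rho_prime) / (rho_prime - rho)

def prediction_gain(rho, rho_prime):
    """PG_n(x) = log rho'_n(x) - log rho_n(x)."""
    return math.log(rho_prime) - math.log(rho)

# Empirical model: state seen N = 2 times out of n = 1000 observations.
N, n = 2, 1000
rho, rho_prime = N / n, (N + 1) / (n + 1)
exact = pseudo_count(rho, rho_prime)               # = N = 2
pg = prediction_gain(rho, rho_prime)
approx = 1.0 / math.expm1(pg)                      # (e^PG - 1)^(-1)
```

Algebraically, (e^{PG} − 1)^{−1} = ρn/(ρ′n − ρn) = N̂n/(1 − ρ′n), so the approximation always slightly overestimates the pseudo-count and becomes exact as ρ′n(x) → 0, which is the regime of rarely seen states where the bonus matters most.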
This point is especially important because a number of powerful density models for images exist (Van den Oord et al., 2016), and because optimality guarantees cannot in general exist for intrinsic motivation algorithms based on forward models.

5 Asymptotic Analysis

In this section we analyze the limiting behaviour of the ratio N̂n/Nn. We use this analysis to assert the consistency of pseudo-counts derived from tabular density models, i.e. models which maintain per-state visit counts. In the appendix we use the same result to bound the approximation error of pseudo-counts derived from directed graphical models, of which our CTS model is a special case.

Consider a fixed, infinite sequence x1, x2, ... from X. We define the limit of a sequence of functions (f(x ; x1:n) : n ∈ N) with respect to the length n of the subsequence x1:n. We additionally assume that the empirical distribution μn converges pointwise to a distribution μ, and write μ′n(x) for the recoding probability of x under μn. We begin with two assumptions on our density model.

Assumption 1. The limits

(a) r(x) := lim_{n→∞} ρn(x) / μn(x)    and    (b) ṙ(x) := lim_{n→∞} (ρ′n(x) − ρn(x)) / (μ′n(x) − μn(x))

exist for all x; furthermore, ṙ(x) > 0.

Assumption (a) states that ρ should eventually assign a probability to x proportional to the limiting empirical distribution μ(x). In particular there must be a state x for which r(x) < 1, unless ρn → μ. Assumption (b), on the other hand, imposes a restriction on the learning rate of ρ relative to μ's. As both r(x) and μ(x) exist, Assumption 1 also implies that ρn(x) and ρ′n(x) have a common limit.

Theorem 2. Under Assumption 1, the limit of the ratio of pseudo-counts N̂n(x) to empirical counts Nn(x) exists for all x.
This limit is

lim_{n→∞} N̂n(x) / Nn(x) = (r(x) / ṙ(x)) · (1 − μ(x)r(x)) / (1 − μ(x)).

The model's relative rate of change, whose convergence to ṙ(x) we require, plays an essential role in the ratio of pseudo- to empirical counts. To see this, consider a sequence (xn : n ∈ N) generated i.i.d. from a distribution μ over a finite state space, and a density model defined from a sequence of nonincreasing step-sizes (αn : n ∈ N):

ρn(x) = (1 − αn) ρn−1(x) + αn I{xn = x},

with initial condition ρ0(x) = |X|^{−1}. For αn = n^{−1}, this density model is the empirical distribution. For αn = n^{−2/3}, we may appeal to well-known results from stochastic approximation (e.g. Bertsekas and Tsitsiklis, 1996) and find that almost surely

lim_{n→∞} ρn(x) = μ(x)    but    lim_{n→∞} (ρ′n(x) − ρn(x)) / (μ′n(x) − μn(x)) = ∞.

Since μ′n(x) − μn(x) = n^{−1}(1 − μ′n(x)), we may think of Assumption 1(b) as also requiring ρ to converge at a rate of Θ(1/n) for a comparison with the empirical count Nn to be meaningful. Note, however, that a density model that does not satisfy Assumption 1(b) may still yield useful (but incommensurable) pseudo-counts.

Corollary 1. Let φ(x) > 0 with Σ_{x∈X} φ(x) < ∞ and consider the count-based estimator

ρn(x) = (Nn(x) + φ(x)) / (n + Σ_{x′∈X} φ(x′)).

If N̂n is the pseudo-count corresponding to ρn then N̂n(x)/Nn(x) → 1 for all x with μ(x) > 0.

6 Empirical Evaluation

In this section we demonstrate the use of pseudo-counts to guide exploration.
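Corollary 1 can be illustrated numerically. Below is our own sketch with φ ≡ 1 over a two-state space: we build the smoothed estimator at two sequence lengths, derive the pseudo-count via equation (2), and watch the ratio N̂n(x)/Nn(x) approach 1 as n grows.

```python
def smoothed_rho(count, n, total_phi, phi_x=1.0):
    """Count-based estimator of Corollary 1: (N_n(x) + phi(x)) / (n + sum_x' phi(x'))."""
    return (count + phi_x) / (n + total_phi)

def pseudo_count(rho, rho_prime):
    """Equation (2)."""
    return rho * (1.0 - rho_prime) / (rho_prime - rho)

def ratio(count, n, total_phi=2.0):
    """N̂_n(x) / N_n(x) for the smoothed estimator (phi = 1 on each of two states)."""
    rho = smoothed_rho(count, n, total_phi)
    rho_prime = smoothed_rho(count + 1, n + 1, total_phi)
    return pseudo_count(rho, rho_prime) / count

r_small = ratio(5, 10)        # short sequence: noticeable bias
r_large = ratio(5000, 10000)  # long sequence: ratio near 1
```

With this particular φ the pseudo-count works out to Nn(x) + 1 exactly, so the ratio is (N + 1)/N: 1.2 for the short sequence, 1.0002 for the long one, converging to 1 as the corollary predicts.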
We return to the Arcade Learning Environment, now using the CTS model to generate an exploration bonus.

6.1 Exploration in Hard Atari 2600 Games

From 60 games available through the Arcade Learning Environment we selected five "hard" games, in the sense that an ε-greedy policy is inefficient at exploring them. We used a bonus of the form

R+n(x, a) := β(N̂n(x) + 0.01)^{−1/2},    (4)

where β = 0.05 was selected from a coarse parameter sweep. We also compared our method to the optimistic initialization trick proposed by Machado et al. (2015). We trained our agents' Q-functions with Double DQN (van Hasselt et al., 2016), with one important modification: we mixed the Double Q-Learning target with the Monte Carlo return. This modification led to improved results both with and without exploration bonuses (details in the appendix).

Figure 2: Average training score with and without exploration bonus or optimistic initialization in 5 Atari 2600 games. Shaded areas denote inter-quartile range, dotted lines show min/max scores.

Figure 3: "Known world" of a DQN agent trained for 50 million frames with (right) and without (left) count-based exploration bonuses, in MONTEZUMA'S REVENGE.

Figure 2 depicts the result of our experiment, averaged across 5 trials. Although optimistic initialization helps in FREEWAY, it otherwise yields performance similar to DQN. By contrast, the count-based exploration bonus enables us to make quick progress on a number of games, most dramatically in MONTEZUMA'S REVENGE and VENTURE.

MONTEZUMA'S REVENGE is perhaps the hardest Atari 2600 game available through the ALE. The game is infamous for its hostile, unforgiving environment: the agent must navigate a number of different rooms, each filled with traps.
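The bonus (4) itself is a one-liner. This sketch (our own helper) uses the paper's reported β = 0.05 and the 0.01 constant, which guards against a zero pseudo-count.

```python
def exploration_bonus(pseudo_n, beta=0.05):
    """Bonus (4): R+_n(x, a) = beta * (N̂_n(x) + 0.01) ** (-1/2)."""
    return beta * (pseudo_n + 0.01) ** -0.5

# A completely novel state (pseudo-count 0) earns the maximum bonus,
# beta / sqrt(0.01) = 10 * beta; the bonus decays as the pseudo-count grows.
bonuses = [exploration_bonus(c) for c in (0.0, 1.0, 100.0)]
```

With β = 0.05 the maximum bonus is 0.5 reward per step for a never-seen state, shrinking monotonically toward zero as the state becomes familiar.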
Due to its sparse reward function, most published agents\nachieve an average score close to zero and completely fail to explore most of the 24 rooms that\nconstitute the \ufb01rst level (Figure 3, top). By contrast, within 50 million frames our agent learns a\npolicy which consistently navigates through 15 rooms (Figure 3, bottom). Our agent also achieves a\nscore higher than anything previously reported, with one run consistently achieving 6600 points by\n100 million frames (half the training samples used by Mnih et al. (2015)). We believe the success of\nour method in this game is a strong indicator of the usefulness of pseudo-counts for exploration.1\n\n6.2 Exploration for Actor-Critic Methods\n\nWe next used our exploration bonuses in conjunction with the A3C (Asynchronous Advantage\nActor-Critic) algorithm of Mnih et al. (2016). One appeal of actor-critic methods is their explicit\nseparation of policy and Q-function parameters, which leads to a richer behaviour space. This very\nseparation, however, often leads to de\ufb01cient exploration: to produce any sensible results, the A3C\npolicy must be regularized with an entropy cost. We trained A3C on 60 Atari 2600 games, with and\nwithout the exploration bonus (4). We refer to our augmented algorithm as A3C+. Full details and\nadditional results may be found in the appendix.\nWe found that A3C fails to learn in 15 games, in the sense that the agent does not achieve a score\n50% better than random. In comparison, there are only 10 games for which A3C+ fails to improve on\nthe random agent; of these, 8 are games where DQN fails in the same sense. We normalized the two\nalgorithms\u2019 scores so that 0 and 1 are respectively the minimum and maximum of the random agent\u2019s\nand A3C\u2019s end-of-training score on a particular game. Figure 4 depicts the in-training median score\nfor A3C and A3C+, along with 1st and 3rd quartile intervals. 
Not only does A3C+ achieve slightly superior median performance, but it also significantly outperforms A3C on at least a quarter of the games. This is particularly important given the large proportion of Atari 2600 games for which an ε-greedy policy is sufficient for exploration.

Figure 4: Median and interquartile performance across 60 Atari 2600 games for A3C and A3C+.

¹A video of our agent playing is available at https://youtu.be/0yI2wJ6F8r0.

7 Related Work

Information-theoretic quantities have been repeatedly used to describe intrinsically motivated behaviour. Closely related to prediction gain is Schmidhuber (1991)'s notion of compression progress, which equates novelty with an agent's improvement in its ability to compress its past. More recently, Lopes et al. (2012) showed the relationship between time-averaged prediction gain and visit counts in a tabular setting; their result is a special case of Theorem 2. Orseau et al. (2013) demonstrated that maximizing the sum of future information gains does lead to optimal behaviour, even though maximizing immediate information gain does not (Section 4). Finally, there may be a connection between sequential normalized maximum likelihood estimators and our pseudo-count derivation (see e.g. Ollivier, 2015).

Intrinsic motivation has also been studied in reinforcement learning proper, in particular in the context of discovering skills (Singh et al., 2004; Barto, 2013). Recently, Stadie et al. (2015) used a squared prediction error bonus for exploring in Atari 2600 games. Closest to our work is Houthooft et al. (2016)'s variational approach to intrinsic motivation, which is equivalent to a second order Taylor approximation to prediction gain.
Mohamed and Rezende (2015) also considered a variational approach to the different problem of maximizing an agent's ability to influence its environment.

Aside from Orseau et al.'s above-cited work, it is only recently that theoretical guarantees for exploration have emerged for non-tabular, stateful settings. We note Pazis and Parr (2016)'s PAC-MDP result for metric spaces and Leike et al. (2016)'s asymptotic analysis of Thompson sampling in general environments.

8 Future Directions

The last few years have seen tremendous advances in learning representations for reinforcement learning. Surprisingly, these advances have yet to carry over to the problem of exploration. In this paper, we reconciled counts, the fundamental unit of uncertainty, with prediction-based heuristics and intrinsic motivation. Combining our work with more ideas from deep learning and better density models seems a plausible avenue for quick progress in practical, efficient exploration. We now conclude by outlining a few research directions we believe are promising.

Induced metric. We did not address the question of where the generalization comes from. Clearly, the choice of density model induces a particular metric over the state space. A better understanding of this metric should allow us to tailor the density model to the problem of exploration.

Compatible value function. There may be a mismatch in the learning rates of the density model and the value function: DQN learns much more slowly than our CTS model. As such, it should be beneficial to design value functions compatible with density models (or vice-versa).

The continuous case. Although we focused here on countable state spaces, we can as easily define a pseudo-count in terms of probability density functions.
At present it is unclear whether this provides us with the right notion of counts for continuous spaces.

Acknowledgments

The authors would like to thank Laurent Orseau, Alex Graves, Joel Veness, Charles Blundell, Shakir Mohamed, Ivo Danihelka, Ian Osband, Matt Hoffman, Greg Wayne, Will Dabney, and Aäron van den Oord for their excellent feedback early and late in the writing, and Pierre-Yves Oudeyer and Yann Ollivier for pointing out additional connections to the literature.

References

Barto, A. G. (2013). Intrinsic motivation and reinforcement learning. In Intrinsically Motivated Learning in Natural and Artificial Systems, pages 17–47. Springer.

Bellemare, M., Veness, J., and Talvitie, E. (2014). Skip context tree switching. In Proceedings of the 31st International Conference on Machine Learning, pages 1458–1466.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. (2013). The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279.

Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific.

Cover, T. M. and Thomas, J. A. (1991). Elements of information theory. John Wiley & Sons.

Houthooft, R., Chen, X., Duan, Y., Schulman, J., De Turck, F., and Abbeel, P. (2016). Variational information maximizing exploration.

Hutter, M. (2013). Sparse adaptive Dirichlet-multinomial-like processes. In Proceedings of the Conference on Online Learning Theory.

Kolter, Z. J. and Ng, A. Y. (2009). Near-Bayesian exploration in polynomial time. In Proceedings of the 26th International Conference on Machine Learning.

Leike, J., Lattimore, T., Orseau, L., and Hutter, M. (2016). Thompson sampling is asymptotically optimal in general environments.
In Proceedings of the Conference on Uncertainty in Arti\ufb01cial Intelligence.\n\nLopes, M., Lang, T., Toussaint, M., and Oudeyer, P.-Y. (2012). Exploration in model-based reinforcement\nlearning by empirically estimating learning progress. In Advances in Neural Information Processing Systems\n25.\n\nMachado, M. C., Srinivasan, S., and Bowling, M. (2015). Domain-independent optimistic initialization for\n\nreinforcement learning. AAAI Workshop on Learning for General Competency in Video Games.\n\nMnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., and Kavukcuoglu, K.\n(2016). Asynchronous methods for deep reinforcement learning. In Proceedings of the International Con-\nference on Machine Learning.\n\nMnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M.,\nFidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning.\nNature, 518(7540):529\u2013533.\n\nMohamed, S. and Rezende, D. J. (2015). Variational information maximisation for intrinsically motivated\n\nreinforcement learning. In Advances in Neural Information Processing Systems 28.\n\nOllivier, Y. (2015). Laplace\u2019s rule of succession in information geometry. arXiv preprint arXiv:1503.04304.\nOrseau, L., Lattimore, T., and Hutter, M. (2013). Universal knowledge-seeking agents for stochastic environ-\n\nments. In Proceedings of the Conference on Algorithmic Learning Theory.\n\nOudeyer, P., Kaplan, F., and Hafner, V. (2007). Intrinsic motivation systems for autonomous mental develop-\n\nment. IEEE Transactions on Evolutionary Computation, 11(2):265\u2013286.\n\nPazis, J. and Parr, R. (2016). Ef\ufb01cient PAC-optimal exploration in concurrent, continuous state MDPs with\n\ndelayed updates. In Proceedings of the 30th AAAI Conference on Arti\ufb01cial Intelligence.\n\nSchmidhuber, J. (1991). 
A possibility for implementing curiosity and boredom in model-building neural controllers. In From animals to animats: proceedings of the first international conference on simulation of adaptive behavior.

Singh, S., Barto, A. G., and Chentanez, N. (2004). Intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems 16.

Stadie, B. C., Levine, S., and Abbeel, P. (2015). Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814.

Strehl, A. L. and Littman, M. L. (2008). An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331.

Van den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. (2016). Pixel recurrent neural networks. In Proceedings of the 33rd International Conference on Machine Learning.

van Hasselt, H., Guez, A., and Silver, D. (2016). Deep reinforcement learning with double Q-learning. In Proceedings of the 30th AAAI Conference on Artificial Intelligence.
", "award": [], "sourceid": 822, "authors": [{"given_name": "Marc", "family_name": "Bellemare", "institution": "Google DeepMind"}, {"given_name": "Sriram", "family_name": "Srinivasan", "institution": "Google DeepMind"}, {"given_name": "Georg", "family_name": "Ostrovski", "institution": "Google DeepMind"}, {"given_name": "Tom", "family_name": "Schaul", "institution": "Google DeepMind"}, {"given_name": "David", "family_name": "Saxton", "institution": "Google DeepMind"}, {"given_name": "Remi", "family_name": "Munos", "institution": "Google DeepMind"}]}