{"title": "#Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2753, "page_last": 2762, "abstract": "Count-based exploration algorithms are known to perform near-optimally when used in conjunction with tabular reinforcement learning (RL) methods for solving small discrete Markov decision processes (MDPs). It is generally thought that count-based methods cannot be applied in high-dimensional state spaces, since most states will only occur once. Recent deep RL exploration strategies are able to deal with high-dimensional continuous state spaces through complex heuristics, often relying on optimism in the face of uncertainty or intrinsic motivation. In this work, we describe a surprising finding: a simple generalization of the classic count-based approach can reach near state-of-the-art performance on various high-dimensional and/or continuous deep RL benchmarks. States are mapped to hash codes, which allows to count their occurrences with a hash table. These counts are then used to compute a reward bonus according to the classic count-based exploration theory. We find that simple hash functions can achieve surprisingly good results on many challenging tasks. Furthermore, we show that a domain-dependent learned hash code may further improve these results. Detailed analysis reveals important aspects of a good hash function: 1) having appropriate granularity and 2) encoding information relevant to solving the MDP. 
This exploration strategy achieves near state-of-the-art performance on both continuous control tasks and Atari 2600 games, hence providing a simple yet powerful baseline for solving MDPs that require considerable exploration.", "full_text": "#Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning\n\nHaoran Tang1∗, Rein Houthooft3,4∗, Davis Foote2, Adam Stooke2, Xi Chen2†, Yan Duan2†, John Schulman4, Filip De Turck3, Pieter Abbeel2†\n\n1 UC Berkeley, Department of Mathematics\n\n2 UC Berkeley, Department of Electrical Engineering and Computer Sciences\n\n3 Ghent University – imec, Department of Information Technology\n\n4 OpenAI\n\nAbstract\n\nCount-based exploration algorithms are known to perform near-optimally when used in conjunction with tabular reinforcement learning (RL) methods for solving small discrete Markov decision processes (MDPs). It is generally thought that count-based methods cannot be applied in high-dimensional state spaces, since most states will only occur once. Recent deep RL exploration strategies are able to deal with high-dimensional continuous state spaces through complex heuristics, often relying on optimism in the face of uncertainty or intrinsic motivation. In this work, we describe a surprising finding: a simple generalization of the classic count-based approach can reach near state-of-the-art performance on various high-dimensional and/or continuous deep RL benchmarks. States are mapped to hash codes, which allows their occurrences to be counted with a hash table. These counts are then used to compute a reward bonus according to the classic count-based exploration theory. We find that simple hash functions can achieve surprisingly good results on many challenging tasks. Furthermore, we show that a domain-dependent learned hash code may further improve these results. 
Detailed analysis reveals important aspects of a good hash function: 1) having appropriate granularity and 2) encoding information relevant to solving the MDP. This exploration strategy achieves near state-of-the-art performance on both continuous control tasks and Atari 2600 games, hence providing a simple yet powerful baseline for solving MDPs that require considerable exploration.\n\n1 Introduction\n\nReinforcement learning (RL) studies an agent acting in an initially unknown environment, learning through trial and error to maximize rewards. It is impossible for the agent to act near-optimally until it has sufficiently explored the environment and identified all of the opportunities for high reward, in all scenarios. A core challenge in RL is how to balance exploration—actively seeking out novel states and actions that might yield high rewards and lead to long-term gains—and exploitation—maximizing short-term rewards using the agent’s current knowledge. While there are exploration techniques for finite MDPs that enjoy theoretical guarantees, there are no fully satisfying techniques for high-dimensional state spaces; therefore, developing more general and robust exploration techniques is an active area of research.\n\n∗These authors contributed equally. Correspondence to: Haoran Tang, Rein Houthooft. †Work done at OpenAI.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nMost of the recent state-of-the-art RL results have been obtained using simple exploration strategies such as uniform sampling [21] and i.i.d./correlated Gaussian noise [19, 30]. Although these heuristics are sufficient in tasks with well-shaped rewards, the sample complexity can grow exponentially (with state space size) in tasks with sparse rewards [25]. 
Recently developed exploration strategies for deep RL have led to significantly improved performance on environments with sparse rewards. Bootstrapped DQN [24] led to faster learning in a range of Atari 2600 games by training an ensemble of Q-functions. Intrinsic motivation methods using pseudo-counts achieve state-of-the-art performance on Montezuma’s Revenge, an extremely challenging Atari 2600 game [4]. Variational Information Maximizing Exploration (VIME, [13]) encourages the agent to explore by acquiring information about environment dynamics, and performs well on various robotic locomotion problems with sparse rewards. However, we have not seen a very simple and fast method that can work across different domains.\n\nSome of the classic, theoretically-justified exploration methods are based on counting state-action visitations, and turning this count into a bonus reward. In the bandit setting, the well-known UCB algorithm of [18] chooses the action a_t at time t that maximizes r̂(a_t) + √(2 log t / n(a_t)), where r̂(a_t) is the estimated reward, and n(a_t) is the number of times action a_t was previously chosen. In the MDP setting, some of the algorithms have similar structure, for example, Model Based Interval Estimation–Exploration Bonus (MBIE-EB) of [34] counts state-action pairs with a table n(s, a) and adds a bonus reward of the form β/√n(s, a) to encourage exploring less visited pairs. [16] show that the inverse-square-root dependence is optimal. MBIE and related algorithms assume that the augmented MDP is solved analytically at each timestep, which is only practical for small finite state spaces.\n\nThis paper presents a simple approach for exploration, which extends classic counting-based methods to high-dimensional, continuous state spaces. We discretize the state space with a hash function and apply a bonus based on the state-visitation count. 
The hash function can be chosen to appropriately balance generalization across states with distinction between states. We select problems from rllab [8] and Atari 2600 [3] featuring sparse rewards, and demonstrate near state-of-the-art performance on several games known to be hard for naïve exploration strategies. The main strength of the presented approach is that it is fast, flexible and complementary to most existing RL algorithms.\n\nIn summary, this paper proposes a generalization of classic count-based exploration to high-dimensional spaces through hashing (Section 2); demonstrates its effectiveness on challenging deep RL benchmark problems and analyzes key components of well-designed hash functions (Section 4).\n\n2 Methodology\n\n2.1 Notation\n\nThis paper assumes a finite-horizon discounted Markov decision process (MDP), defined by (S, A, P, r, ρ0, γ, T), in which S is the state space, A the action space, P a transition probability distribution, r : S × A → R a reward function, ρ0 an initial state distribution, γ ∈ (0, 1] a discount factor, and T the horizon. The goal of RL is to maximize the total expected discounted reward E_{π,P}[Σ_{t=0}^T γ^t r(s_t, a_t)] over a policy π, which outputs a distribution over actions given a state.\n\n2.2 Count-Based Exploration via Static Hashing\n\nOur approach discretizes the state space with a hash function φ : S → Z. An exploration bonus r+ : S → R is added to the reward function, defined as\n\nr+(s) = β / √n(φ(s)),   (1)\n\nwhere β ∈ R≥0 is the bonus coefficient. Initially the counts n(·) are set to zero for the whole range of φ. For every state s_t encountered at time step t, n(φ(s_t)) is increased by one. 
The agent is trained with rewards (r + r+), while performance is evaluated as the sum of rewards without bonuses.\n\nAlgorithm 1: Count-based exploration through static hashing, using SimHash\n1 Define state preprocessor g : S → R^D\n2 (In case of SimHash) Initialize A ∈ R^{k×D} with entries drawn i.i.d. from the standard Gaussian distribution N(0, 1)\n3 Initialize a hash table with values n(·) ≡ 0\n4 for each iteration j do\n5   Collect a set of state-action samples {(s_m, a_m)}_{m=0}^M with policy π\n6   Compute hash codes through any LSH method, e.g., for SimHash, φ(s_m) = sgn(A g(s_m))\n7   Update the hash table counts ∀m : 0 ≤ m ≤ M as n(φ(s_m)) ← n(φ(s_m)) + 1\n8   Update the policy π using rewards {r(s_m, a_m) + β/√n(φ(s_m))}_{m=0}^M with any RL algorithm\n\nNote that our approach is a departure from count-based exploration methods such as MBIE-EB since we use a state-space count n(s) rather than a state-action count n(s, a). State-action counts n(s, a) are investigated in the Supplementary Material, but no significant performance gains over state counting could be witnessed. A possible reason is that the policy itself is sufficiently random to try most actions at a novel state.\n\nClearly the performance of this method will strongly depend on the choice of hash function φ. One important choice we can make regards the granularity of the discretization: we would like for “distant” states to be counted separately while “similar” states are merged. If desired, we can incorporate prior knowledge into the choice of φ, if there is a set of salient state features which are known to be relevant. A short discussion on this matter is given in the Supplementary Material.\n\nAlgorithm 1 summarizes our method. 
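The counting scheme of Algorithm 1 can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not the paper's implementation: the state dimension D, code length k, bonus coefficient β, and random seed are placeholder values, and the preprocessor g is taken to be the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
D, k, beta = 8, 16, 0.01          # state dim, code length, bonus coefficient (illustrative values)
A = rng.standard_normal((k, D))   # fixed projection matrix, entries drawn i.i.d. from N(0, 1)
counts = {}                       # hash table n(.), implicitly initialized to zero

def phi(s):
    """SimHash code phi(s) = sgn(A g(s)); here g is the identity preprocessor."""
    return tuple(np.sign(A @ s).astype(int))

def exploration_bonus(s):
    """Count the visit of s's hash bucket, then return r+ = beta / sqrt(n(phi(s)))."""
    key = phi(s)
    counts[key] = counts.get(key, 0) + 1
    return beta / np.sqrt(counts[key])

s = rng.standard_normal(D)
b1 = exploration_bonus(s)   # first visit: bonus = beta
b2 = exploration_bonus(s)   # second visit: bonus = beta / sqrt(2)
```

Revisiting the same hash bucket shrinks the bonus as 1/√n, so frequently seen regions of the state space stop being rewarded.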
The main idea is to use locality-sensitive hashing (LSH) to convert continuous, high-dimensional data to discrete hash codes. LSH is a popular class of hash functions for querying nearest neighbors based on certain similarity metrics [2]. A computationally efficient type of LSH is SimHash [6], which measures similarity by angular distance. SimHash retrieves a binary code of state s ∈ S as\n\nφ(s) = sgn(A g(s)) ∈ {−1, 1}^k,   (2)\n\nwhere g : S → R^D is an optional preprocessing function and A is a k × D matrix with i.i.d. entries drawn from a standard Gaussian distribution N(0, 1). The value for k controls the granularity: higher values lead to fewer collisions and are thus more likely to distinguish states.\n\n2.3 Count-Based Exploration via Learned Hashing\n\nWhen the MDP states have a complex structure, as is the case with image observations, measuring their similarity directly in pixel space fails to provide the semantic similarity measure one would desire. Previous work in computer vision [7, 20, 36] introduced manually designed feature representations of images that are suitable for semantic tasks including detection and classification. More recent methods learn complex features directly from data by training convolutional neural networks [12, 17, 31]. Considering these results, it may be difficult for a method such as SimHash to cluster states appropriately using only raw pixels.\n\nTherefore, rather than using SimHash, we propose to use an autoencoder (AE) to learn meaningful hash codes in one of its hidden layers as a more advanced LSH method. This AE takes as input states s and contains one special dense layer comprised of D sigmoid functions. By rounding the sigmoid activations b(s) of this layer to their closest binary number ⌊b(s)⌉ ∈ {0, 1}^D, any state s can be binarized. 
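This rounding step is trivial to implement; a minimal sketch with hypothetical activation values:

```python
import numpy as np

def binarize(b):
    """Round sigmoid activations b(s) in [0, 1] to the nearest binary code."""
    return (np.asarray(b) >= 0.5).astype(int)

b_s = np.array([0.93, 0.08, 0.51, 0.46])   # hypothetical sigmoid activations b(s)
code = binarize(b_s)                        # nearest binary code: [1, 0, 1, 0]
```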
This is illustrated in Figure 1 for a convolutional AE.\n\nA problem with this architecture is that dissimilar inputs s_i, s_j can map to identical hash codes ⌊b(s_i)⌉ = ⌊b(s_j)⌉, but the AE still reconstructs them perfectly. For example, if b(s_i) and b(s_j) have values 0.6 and 0.7 at a particular dimension, the difference can be exploited by deconvolutional layers in order to reconstruct s_i and s_j perfectly, although that dimension rounds to the same binary value. One can imagine replacing the bottleneck layer b(s) with the hash codes ⌊b(s)⌉, but then gradients cannot be back-propagated through the rounding function. A solution proposed by Gregor et al. [10] and Salakhutdinov & Hinton [28] is to inject uniform noise U(−a, a) into the sigmoid activations.\n\nFigure 1: The autoencoder (AE) architecture for ALE; the solid block represents the dense sigmoidal binary code layer, after which noise U(−a, a) is injected.\n\nAlgorithm 2: Count-based exploration using learned hash codes\n1 Define state preprocessor g : S → {0, 1}^D as the binary code resulting from the autoencoder (AE)\n2 Initialize A ∈ R^{k×D} with entries drawn i.i.d. from the standard Gaussian distribution N(0, 1)\n3 Initialize a hash table with values n(·) ≡ 0\n4 for each iteration j do\n5   Collect a set of state-action samples {(s_m, a_m)}_{m=0}^M with policy π\n6   Add the state samples {s_m}_{m=0}^M to a FIFO replay pool R\n7   if j mod j_update = 0 then\n8     Update the AE loss function in Eq. (3) using samples drawn from the replay pool {s_n}_{n=1}^N ∼ R, for example using stochastic gradient descent\n9   Compute g(s_m) = ⌊b(s_m)⌉, the D-dim rounded hash code for s_m learned by the AE\n10  Project g(s_m) to a lower dimension k via SimHash as φ(s_m) = sgn(A g(s_m))\n11  Update the hash table counts ∀m : 0 ≤ m ≤ M as n(φ(s_m)) ← n(φ(s_m)) + 1\n12  Update the policy π using rewards {r(s_m, a_m) + β/√n(φ(s_m))}_{m=0}^M with any RL algorithm\n\nBy choosing uniform noise with a > 1/4, the AE is only capable of (always) reconstructing distinct state inputs s_i ≠ s_j if it has learned to spread the sigmoid outputs sufficiently far apart, |b(s_i) − b(s_j)| > ε, in order to counteract the injected noise. As such, the loss function over a set of collected states {s_n}_{n=1}^N is defined as\n\nL({s_n}_{n=1}^N) = −(1/N) Σ_{n=1}^N [ log p(s_n) − (λ/K) Σ_{i=1}^D min{(1 − b_i(s_n))², b_i(s_n)²} ],   (3)\n\nwith p(s_n) the AE output. This objective function consists of a negative log-likelihood term and a term that pressures the binary code layer to take on binary values, scaled by λ ∈ R≥0. The reasoning behind this latter term is that it might happen that for particular states, a certain sigmoid unit is never used. Therefore, its value might fluctuate around 1/2, causing the corresponding bit in the binary code ⌊b(s)⌉ to flip over the agent lifetime. Adding this second loss term ensures that an unused bit takes on an arbitrary binary value.\n\nFor Atari 2600 image inputs, since the pixel intensities are discrete values in the range [0, 255], we make use of a pixel-wise softmax output layer [37] that shares weights between all pixels. 
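The binary-pressure term of Eq. (3) is easy to compute in isolation. The sketch below is illustrative only: λ is a placeholder value and K is assumed equal to the code length, which is one plausible reading of the scaling in Eq. (3).

```python
import numpy as np

def binary_pressure(b, lam=10.0, K=None):
    """Second term of Eq. (3): (lam / K) * sum_i min{(1 - b_i)^2, b_i^2}.
    Penalizes sigmoid activations hovering between 0 and 1. Here K is assumed
    to be the code length and lam is a placeholder scaling coefficient."""
    b = np.asarray(b, dtype=float)
    K = K if K is not None else b.size
    return (lam / K) * np.minimum((1.0 - b) ** 2, b ** 2).sum()

saturated = np.array([0.99, 0.01, 0.98])  # bits already near binary: small penalty
undecided = np.array([0.5, 0.5, 0.5])     # unused bits fluctuating around 1/2: maximal penalty
assert binary_pressure(saturated) < binary_pressure(undecided)
```

The penalty vanishes for fully binary codes and is largest when every activation sits at 1/2, which is exactly the "fluctuating unused bit" case the text describes.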
The architectural details are described in the Supplementary Material and are depicted in Figure 1. Because the code dimension often needs to be large in order to correctly reconstruct the input, we apply a downsampling procedure to the resulting binary code ⌊b(s)⌉, which can be done through random projection to a lower-dimensional space via SimHash as in Eq. (2).\n\nOn the one hand, the mapping from state to code needs to remain relatively consistent over time, which is nontrivial as the AE is constantly updated according to the latest data (Algorithm 2 line 8). A solution is to downsample the binary code to a very low dimension, or to slow down the training process. On the other hand, the code has to remain relatively unique for states that are both distinct and close together on the image manifold. This is tackled both by the second term in Eq. (3) and by the saturating behavior of the sigmoid units. States already well represented by the AE tend to saturate the sigmoid activations, causing the resulting loss gradients to be close to zero, making the code less prone to change.\n\n3 Related Work\n\nClassic count-based methods such as MBIE [33], MBIE-EB [34], and that of [16] solve an approximate Bellman equation as an inner loop before the agent takes an action. As such, bonus rewards are propagated immediately throughout the state-action space. In contrast, contemporary deep RL algorithms propagate the bonus signal based on rollouts collected from interacting with environments, with value-based [21] or policy gradient-based [22, 30] methods, at limited speed. In addition, although our proposed method is intended to work with contemporary deep RL algorithms, it differs from classical count-based methods in that it relies on visiting unseen states first, before the bonus reward can be assigned, making uninformed exploration strategies still a necessity at the beginning. 
Filling the gaps between our method and classic theories is an important direction of future research.\n\nA related line of classical exploration methods is based on the idea of optimism in the face of uncertainty [5] but is not restricted to using counting to implement “optimism”, e.g., R-Max [5], UCRL [14], and E3 [15]. These methods, similar to MBIE and MBIE-EB, have theoretical guarantees in tabular settings.\n\nBayesian RL methods [9, 11, 16, 35], which keep track of a distribution over MDPs, are an alternative to optimism-based methods. Extensions to continuous state space have been proposed by [27] and [25].\n\nAnother type of exploration is curiosity-based exploration. These methods try to capture the agent’s surprise about transition dynamics. As the agent tries to optimize for surprise, it naturally discovers novel states. We refer the reader to [29] and [26] for an extensive review on curiosity and intrinsic rewards.\n\nSeveral exploration strategies for deep RL have recently been proposed to handle high-dimensional state spaces. [13] propose VIME, in which information gain is measured in Bayesian neural networks modeling the MDP dynamics, and is used as an exploration bonus. [32] propose to use the prediction error of a learned dynamics model as an exploration bonus. Thompson sampling through bootstrapping is proposed by [24], using bootstrapped Q-functions.\n\nThe most related exploration strategy is proposed by [4], in which an exploration bonus is added inversely proportional to the square root of a pseudo-count quantity. A state pseudo-count is derived from its log-probability improvement according to a density model over the state space, which in the limit converges to the empirical count. Our method is similar to the pseudo-count approach in the sense that both methods perform approximate counting to obtain the necessary generalization over unseen states. 
The difference is that a density model has to be designed and learned to achieve good generalization for pseudo-counts, whereas in our case generalization is obtained by a wide range of simple hash functions (not necessarily SimHash). Another interesting connection is that our method also implies a density model ρ(s) = n(φ(s))/N over all visited states, where N is the total number of states visited. Another method similar to hashing is proposed by [1], which clusters states and counts cluster centers instead of the true states, but this method has yet to be tested on standard exploration benchmark problems.\n\n4 Experiments\n\nExperiments were designed to investigate and answer the following research questions:\n\n1. Can count-based exploration through hashing improve performance significantly across different domains? How does the proposed method compare to the current state of the art in exploration for deep RL?\n\n2. What is the impact of learned or static state preprocessing on the overall performance when image observations are used?\n\nTo answer question 1, we run the proposed method on deep RL benchmarks (rllab and ALE) that feature sparse rewards, and compare it to other state-of-the-art algorithms. Question 2 is answered by trying out different image preprocessors on Atari 2600 games. Trust Region Policy Optimization (TRPO, [30]) is chosen as the RL algorithm for all experiments, because it can handle both discrete and continuous action spaces, can conveniently ensure stable improvement in the policy performance, and is relatively insensitive to hyperparameter changes. The hyperparameter settings are reported in the Supplementary Material.\n\n4.1 Continuous Control\n\nThe rllab benchmark [8] consists of various control tasks to test deep RL algorithms. 
We selected several variants of the basic and locomotion tasks that use sparse rewards, as shown in Figure 2, and adopted the experimental setup as defined in [13]—a description can be found in the Supplementary Material. These tasks are all highly difficult to solve with naïve exploration strategies, such as adding Gaussian noise to the actions.\n\nFigure 2: Illustrations of the rllab tasks used in the continuous control experiments, namely MountainCar, CartPoleSwingup, SwimmerGather, and HalfCheetah; taken from [8].\n\n(a) MountainCar  (b) CartPoleSwingup  (c) SwimmerGather  (d) HalfCheetah\n\nFigure 3: Mean average return of different algorithms on rllab tasks with sparse rewards. The solid line represents the mean average return, while the shaded area represents one standard deviation, over 5 seeds for the baseline and SimHash (the baseline curves happen to overlap with the axis).\n\nFigure 3 shows the results of TRPO (baseline), TRPO-SimHash, and VIME [13] on the classic tasks MountainCar and CartPoleSwingup, the locomotion task HalfCheetah, and the hierarchical task SwimmerGather. Using count-based exploration with hashing is capable of reaching the goal in all environments (which corresponds to a nonzero return), while baseline TRPO with Gaussian control noise fails completely. Although TRPO-SimHash picks up the sparse reward on HalfCheetah, it does not perform as well as VIME. In contrast, the performance of SimHash is comparable with VIME on MountainCar, while it outperforms VIME on SwimmerGather.\n\n4.2 Arcade Learning Environment\n\nThe Arcade Learning Environment (ALE, [3]), which consists of Atari 2600 video games, is an important benchmark for deep RL due to its high-dimensional state space and wide variety of games. 
In order to demonstrate the effectiveness of the proposed exploration strategy, six games are selected featuring long horizons while requiring significant exploration: Freeway, Frostbite, Gravitar, Montezuma’s Revenge, Solaris, and Venture. The agent is trained for 500 iterations in all experiments, with each iteration consisting of 0.1 M steps (the TRPO batch size, corresponding to 0.4 M frames). Policies and value functions are neural networks with identical architectures to [22]. Although the policy and baseline take into account the previous four frames, the counting algorithm only looks at the latest frame.\n\nTable 1: Atari 2600: average total reward after training for 50 M time steps. Boldface numbers indicate best results; italic numbers are the best among our methods.\n\nAlgorithm | Freeway | Frostbite | Gravitar | Montezuma | Solaris | Venture\nTRPO (baseline) | 16.5 | 2869 | 486 | 0 | 2758 | 121\nTRPO-pixel-SimHash | 31.6 | 4683 | 468 | 0 | 2897 | 263\nTRPO-BASS-SimHash | 28.4 | 3150 | 604 | 238 | 1201 | 616\nTRPO-AE-SimHash | 33.5 | 5214 | 482 | 75 | 4467 | 445\nDouble-DQN | 33.3 | 1683 | 412 | 0 | 3068 | 98.0\nDueling network | 0.0 | 4672 | 588 | 0 | 2251 | 497\nGorila | 11.7 | 605 | 1054 | 4 | N/A | 1245\nDQN Pop-Art | 33.4 | 3469 | 483 | 0 | 4544 | 1172\nA3C+ | 27.3 | 507 | 246 | 142 | 2175 | 0\npseudo-count | 29.2 | 1450 | – | 3439 | – | 369\n\nBASS  To compare with the autoencoder-based learned hash code, we propose using Basic Abstraction of the ScreenShots (BASS, also called Basic; see [3]) as a static preprocessing function g. BASS is a hand-designed feature transformation for images in Atari 2600 games. 
BASS builds on the following observations specific to Atari: 1) the game screen has a low resolution, 2) most objects are large and monochrome, and 3) winning depends mostly on knowing object locations and motions. We designed an adapted version of BASS3 that divides the RGB screen into square cells, computes the average intensity of each color channel inside a cell, and assigns the resulting values to bins that uniformly partition the intensity range [0, 255]. Mathematically, let C be the cell size (width and height), B the number of bins, (i, j) the cell location, (x, y) the pixel location, and z the channel; then\n\nfeature(i, j, z) = ⌊(B / (255 C²)) Σ_{(x,y) ∈ cell(i,j)} I(x, y, z)⌋.   (4)\n\nAfterwards, the resulting integer-valued feature tensor is converted to an integer hash code (φ(s_t) in line 6 of Algorithm 1). A BASS feature can be regarded as a miniature that efficiently encodes object locations, but remains invariant to negligible object motions. It is easy to implement and introduces little computation overhead. However, it is designed for generic Atari game images and may not capture the structure of each specific game very well.\n\nWe compare our results to double DQN [39], dueling network [40], A3C+ [4], double DQN with pseudo-counts [4], Gorila [23], and DQN Pop-Art [38] on the “null op” metric4. We show training curves in Figure 4 and summarize all results in Table 1. Surprisingly, TRPO-pixel-SimHash already outperforms the baseline by a large margin and beats the previous best result on Frostbite. 
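The cell-averaging-and-binning step of Eq. (4) can be sketched as follows. This is an illustrative sketch, not the authors' code: the cell size C, bin count B, and the solid-gray test screen are placeholder choices.

```python
import numpy as np

def bass_features(img, C=20, B=20):
    """Adapted BASS (Eq. 4): average each CxC cell per color channel, binned into B levels.
    img: H x W x 3 uint8 RGB screen; C and B are illustrative values."""
    H, W, _ = img.shape
    feats = np.zeros((H // C, W // C, 3), dtype=int)
    for i in range(H // C):
        for j in range(W // C):
            cell = img[i*C:(i+1)*C, j*C:(j+1)*C, :].astype(float)
            # feature(i, j, z) = floor(B * sum of I(x, y, z) over the cell / (255 * C^2))
            feats[i, j, :] = np.floor(B * cell.sum(axis=(0, 1)) / (255 * C * C)).astype(int)
    return feats

screen = np.full((40, 40, 3), 128, dtype=np.uint8)  # hypothetical solid-gray screen
f = bass_features(screen)                            # 2 x 2 x 3 grid of binned cell intensities
```

The resulting small integer tensor is what would then be flattened into the hash key φ(s), as the text describes.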
TRPO-BASS-SimHash achieves significant improvement over TRPO-pixel-SimHash on Montezuma’s Revenge and Venture, where it captures object locations better than other methods.5 TRPO-AE-SimHash achieves near state-of-the-art performance on Freeway, Frostbite and Solaris.\n\nAs observed in Table 1, preprocessing images with BASS or using a learned hash code through the AE leads to much better performance on Gravitar, Montezuma’s Revenge and Venture. Therefore, a static or adaptive preprocessing step can be important for a good hash function.\n\nIn conclusion, our count-based exploration method is able to achieve remarkable performance gains even with simple hash functions like SimHash on the raw pixel space. If coupled with domain-dependent state preprocessing techniques, it can sometimes achieve far better results.\n\n3The original BASS exploits the fact that at most 128 colors can appear on the screen. Our adapted version does not make this assumption.\n4The agent takes no action for a random number (within 30) of frames at the beginning of each episode.\n5We provide videos of example game play and visualizations of the difference between Pixel-SimHash and BASS-SimHash at https://www.youtube.com/playlist?list=PLAd-UMX6FkBQdLNWtY8nH1-pzYJA_1T55\n\n(a) Freeway  (b) Frostbite  (c) Gravitar  (d) Montezuma’s Revenge  (e) Solaris  (f) Venture\n\nFigure 4: Atari 2600 games: the solid line is the mean average undiscounted return per iteration, while the shaded areas represent the one standard deviation, over 5 seeds for the baseline, TRPO-pixel-SimHash, and TRPO-BASS-SimHash, while over 3 seeds for TRPO-AE-SimHash.\n\nA reason why our proposed method does not achieve state-of-the-art performance on all games is that TRPO does not reuse off-policy experience, in contrast to DQN-based algorithms [4, 23, 38], and is hence less efficient in harnessing extremely sparse rewards. 
This explanation is corroborated by the\nexperiments done in [4], in which A3C+ (an on-policy algorithm) scores much lower than DQN (an\noff-policy algorithm), while using the exact same exploration bonus.\n\n5 Conclusions\n\nThis paper demonstrates that a generalization of classical counting techniques through hashing is able\nto provide an appropriate signal for exploration, even in continuous and/or high-dimensional MDPs\nusing function approximators, resulting in near state-of-the-art performance across benchmarks. It\nprovides a simple yet powerful baseline for solving MDPs that require informed exploration.\n\nAcknowledgments\n\nWe would like to thank our colleagues at Berkeley and OpenAI for insightful discussions. This\nresearch was funded in part by ONR through a PECASE award. Yan Duan was also supported by a\nBerkeley AI Research lab Fellowship and a Huawei Fellowship. Xi Chen was also supported by a\nBerkeley AI Research lab Fellowship. We gratefully acknowledge the support of the NSF through\ngrant IIS-1619362 and of the ARC through a Laureate Fellowship (FL110100281) and through\nthe ARC Centre of Excellence for Mathematical and Statistical Frontiers. Adam Stooke gratefully\nacknowledges funding from a Fannie and John Hertz Foundation fellowship. Rein Houthooft was\nsupported by a Ph.D. Fellowship of the Research Foundation - Flanders (FWO).\n\nReferences\n[1] Abel, David, Agarwal, Alekh, Diaz, Fernando, Krishnamurthy, Akshay, and Schapire, Robert E.\nExploratory gradient boosting for reinforcement learning in complex domains. arXiv preprint\narXiv:1603.04119, 2016.\n\n[2] Andoni, Alexandr and Indyk, Piotr. Near-optimal hashing algorithms for approximate near-\nest neighbor in high dimensions. In Proceedings of the 47th Annual IEEE Symposium on\nFoundations of Computer Science (FOCS), pp. 459\u2013468, 2006.\n\n[3] Bellemare, Marc G, Naddaf, Yavar, Veness, Joel, and Bowling, Michael. 
The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 06 2013.\n\n[4] Bellemare, Marc G, Srinivasan, Sriram, Ostrovski, Georg, Schaul, Tom, Saxton, David, and Munos, Remi. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems 29 (NIPS), pp. 1471–1479, 2016.\n\n[5] Brafman, Ronen I and Tennenholtz, Moshe. R-max – a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213–231, 2002.\n\n[6] Charikar, Moses S. Similarity estimation techniques from rounding algorithms. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC), pp. 380–388, 2002.\n\n[7] Dalal, Navneet and Triggs, Bill. Histograms of oriented gradients for human detection. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 886–893, 2005.\n\n[8] Duan, Yan, Chen, Xi, Houthooft, Rein, Schulman, John, and Abbeel, Pieter. Benchmarking deep reinforcement learning for continuous control. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pp. 1329–1338, 2016.\n\n[9] Ghavamzadeh, Mohammad, Mannor, Shie, Pineau, Joelle, and Tamar, Aviv. Bayesian reinforcement learning: A survey. Foundations and Trends in Machine Learning, 8(5-6):359–483, 2015.\n\n[10] Gregor, Karol, Besse, Frederic, Jimenez Rezende, Danilo, Danihelka, Ivo, and Wierstra, Daan. Towards conceptual compression. In Advances in Neural Information Processing Systems 29 (NIPS), pp. 
3549–3557, 2016.\n\n[11] Guez, Arthur, Heess, Nicolas, Silver, David, and Dayan, Peter. Bayes-adaptive simulation-based\nsearch with value function approximation. In Advances in Neural Information Processing\nSystems 27 (NIPS), pp. 451–459, 2014.\n\n[12] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image\nrecognition. arXiv preprint arXiv:1512.03385, 2015.\n\n[13] Houthooft, Rein, Chen, Xi, Duan, Yan, Schulman, John, De Turck, Filip, and Abbeel, Pieter.\nVIME: Variational information maximizing exploration. In Advances in Neural Information\nProcessing Systems 29 (NIPS), pp. 1109–1117, 2016.\n\n[14] Jaksch, Thomas, Ortner, Ronald, and Auer, Peter. Near-optimal regret bounds for reinforcement\nlearning. Journal of Machine Learning Research, 11:1563–1600, 2010.\n\n[15] Kearns, Michael and Singh, Satinder. Near-optimal reinforcement learning in polynomial time.\nMachine Learning, 49(2-3):209–232, 2002.\n\n[16] Kolter, J Zico and Ng, Andrew Y. Near-bayesian exploration in polynomial time. In Proceedings\nof the 26th International Conference on Machine Learning (ICML), pp. 513–520, 2009.\n\n[17] Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep\nconvolutional neural networks. In Advances in Neural Information Processing Systems 25\n(NIPS), pp. 1097–1105, 2012.\n\n[18] Lai, Tze Leung and Robbins, Herbert. Asymptotically efficient adaptive allocation rules.\nAdvances in Applied Mathematics, 6(1):4–22, 1985.\n\n[19] Lillicrap, Timothy P, Hunt, Jonathan J, Pritzel, Alexander, Heess, Nicolas, Erez, Tom, Tassa,\nYuval, Silver, David, and Wierstra, Daan. Continuous control with deep reinforcement learning.\narXiv preprint arXiv:1509.02971, 2015.\n\n[20] Lowe, David G. Object recognition from local scale-invariant features. In Proceedings of the\n7th IEEE International Conference on Computer Vision (ICCV), pp. 
1150–1157, 1999.\n\n[21] Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare,\nMarc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, et al.\nHuman-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.\n\n[22] Mnih, Volodymyr, Badia, Adria Puigdomenech, Mirza, Mehdi, Graves, Alex, Lillicrap, Timo-\nthy P, Harley, Tim, Silver, David, and Kavukcuoglu, Koray. Asynchronous methods for deep\nreinforcement learning. arXiv preprint arXiv:1602.01783, 2016.\n\n[23] Nair, Arun, Srinivasan, Praveen, Blackwell, Sam, Alcicek, Cagdas, Fearon, Rory, De Maria,\nAlessandro, Panneershelvam, Vedavyas, Suleyman, Mustafa, Beattie, Charles, Petersen,\nStig, et al. Massively parallel methods for deep reinforcement learning. arXiv preprint\narXiv:1507.04296, 2015.\n\n[24] Osband, Ian, Blundell, Charles, Pritzel, Alexander, and Van Roy, Benjamin. Deep exploration\nvia bootstrapped DQN. In Advances in Neural Information Processing Systems 29 (NIPS), pp.\n4026–4034, 2016.\n\n[25] Osband, Ian, Van Roy, Benjamin, and Wen, Zheng. Generalization and exploration via random-\nized value functions. In Proceedings of the 33rd International Conference on Machine Learning\n(ICML), pp. 2377–2386, 2016.\n\n[26] Oudeyer, Pierre-Yves and Kaplan, Frederic. What is intrinsic motivation? A typology of\ncomputational approaches. Frontiers in Neurorobotics, 1:6, 2007.\n\n[27] Pazis, Jason and Parr, Ronald. PAC optimal exploration in continuous space Markov decision\nprocesses. In Proceedings of the 27th AAAI Conference on Artificial Intelligence (AAAI), 2013.\n\n[28] Salakhutdinov, Ruslan and Hinton, Geoffrey. Semantic hashing. International Journal of\nApproximate Reasoning, 50(7):969–978, 2009.\n\n[29] Schmidhuber, Jürgen. 
Formal theory of creativity, fun, and intrinsic motivation (1990–2010).\nIEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010.\n\n[30] Schulman, John, Levine, Sergey, Moritz, Philipp, Jordan, Michael I, and Abbeel, Pieter. Trust\nregion policy optimization. In Proceedings of the 32nd International Conference on Machine\nLearning (ICML), 2015.\n\n[31] Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale\nimage recognition. arXiv preprint arXiv:1409.1556, 2014.\n\n[32] Stadie, Bradly C, Levine, Sergey, and Abbeel, Pieter. Incentivizing exploration in reinforcement\nlearning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.\n\n[33] Strehl, Alexander L and Littman, Michael L. A theoretical analysis of model-based interval\nestimation. In Proceedings of the 22nd International Conference on Machine Learning (ICML),\npp. 856–863, 2005.\n\n[34] Strehl, Alexander L and Littman, Michael L. An analysis of model-based interval estimation for\nMarkov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.\n\n[35] Sun, Yi, Gomez, Faustino, and Schmidhuber, Jürgen. Planning to be surprised: Optimal\nBayesian exploration in dynamic environments. In Proceedings of the 4th International Confer-\nence on Artificial General Intelligence (AGI), pp. 41–51, 2011.\n\n[36] Tola, Engin, Lepetit, Vincent, and Fua, Pascal. DAISY: An efficient dense descriptor applied to\nwide-baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(5):\n815–830, 2010.\n\n[37] van den Oord, Aaron, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural\nnetworks. In Proceedings of the 33rd International Conference on Machine Learning (ICML),\npp. 1747–1756, 2016.\n\n[38] van Hasselt, Hado, Guez, Arthur, Hessel, Matteo, and Silver, David. Learning values across\nmany orders of magnitude. 
arXiv preprint arXiv:1602.07714, 2016.\n\n[39] van Hasselt, Hado, Guez, Arthur, and Silver, David. Deep reinforcement learning with double\nQ-learning. In Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI), 2016.\n\n[40] Wang, Ziyu, de Freitas, Nando, and Lanctot, Marc. Dueling network architectures for deep\nreinforcement learning. In Proceedings of the 33rd International Conference on Machine\nLearning (ICML), pp. 1995–2003, 2016.\n