{"title": "Sample-Efficient Deep Reinforcement Learning via Episodic Backward Update", "book": "Advances in Neural Information Processing Systems", "page_first": 2112, "page_last": 2121, "abstract": "We propose Episodic Backward Update (EBU) \u2013 a novel deep reinforcement learning algorithm with a direct value propagation. In contrast to the conventional use of the experience replay with uniform random sampling, our agent samples a whole episode and successively propagates the value of a state to its previous states. Our computationally efficient recursive algorithm allows sparse and delayed rewards to propagate directly through all transitions of the sampled episode. We theoretically prove the convergence of the EBU method and experimentally demonstrate its performance in both deterministic and stochastic environments. Especially in 49 games of Atari 2600 domain, EBU achieves the same mean and median human normalized performance of DQN by using only 5% and 10% of samples, respectively.", "full_text": "Sample-Ef\ufb01cient Deep Reinforcement Learning via\n\nEpisodic Backward Update\n\nSu Young Lee,\n\nSchool of Electrical Engineering, KAIST, Republic of Korea\n\n{suyoung.l, si_choi, schung}@kaist.ac.kr\n\nSungik Choi,\n\nSae-Young Chung\n\nAbstract\n\nWe propose Episodic Backward Update (EBU) \u2013 a novel deep reinforcement learn-\ning algorithm with a direct value propagation. In contrast to the conventional\nuse of the experience replay with uniform random sampling, our agent samples\na whole episode and successively propagates the value of a state to its previous\nstates. 
Our computationally efficient recursive algorithm allows sparse and delayed rewards to propagate directly through all transitions of the sampled episode. We theoretically prove the convergence of the EBU method and experimentally demonstrate its performance in both deterministic and stochastic environments. Especially in 49 games of the Atari 2600 domain, EBU achieves the same mean and median human-normalized performance of DQN by using only 5% and 10% of samples, respectively.\n\n1 Introduction\n\nDeep reinforcement learning (DRL) has been successful in many complex environments such as the Arcade Learning Environment [2] and Go [18]. Despite DRL's impressive achievements, it is still impractical in terms of sample efficiency. To achieve human-level performance in the Arcade Learning Environment, Deep Q-Network (DQN) [14] requires 200 million frames of experience for training, which corresponds to 39 days of gameplay in real time. Clearly, there is still a tremendous gap between the learning process of humans and that of deep reinforcement learning agents. This problem is even more crucial for tasks such as autonomous driving, where we cannot risk many trials and errors due to the high cost of samples.\n\nOne of the reasons why DQN suffers from such low sample efficiency is the sampling method from the replay memory. In many practical problems, an RL agent observes sparse and delayed rewards. There are two main problems when we sample one-step transitions uniformly at random. (1) We have a low chance of sampling a transition with a reward, due to its sparsity. The transitions with rewards should always be updated to assign credits for actions that maximize the expected return. (2) In the early stages of training, when all values are initialized to zero, there is no point in updating the values of one-step transitions with zero rewards if the values of future transitions with nonzero rewards have not been updated yet. 
Without the future reward signals propagated, the sampled transition will always be trained to return a zero value.\n\nIn this work, we propose Episodic Backward Update (EBU) to present solutions to the problems raised above. When we observe an event, we scan through our memory and seek the past event that caused the later one. Such an episodic control method is how humans normally recognize cause-and-effect relationships [10]. Inspired by this, we can solve the first problem (1) by sampling transitions in an episodic manner. Then, we can be assured that at least one transition with a non-zero reward is used for the value update. We can solve the second problem (2) by updating the values of transitions in a backward manner, opposite to the order in which the transitions were made.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nAfterward, we can perform an efficient reward propagation without any meaningless updates. This method faithfully follows the principle of dynamic programming.\n\nAs mentioned by the authors of DQN, updating correlated samples in a sequence is vulnerable to overestimation. In Section 3, we deal with this issue by adopting a diffusion factor to mediate between the learned values from the future transitions and the current sample reward. In Section 4, we theoretically prove the convergence of our method for both deterministic and stochastic MDPs. In Section 5, we empirically show the superiority of our method on the 2D MNIST Maze Environment and the 49 games of the Atari 2600 domain. In particular, on the 49 Atari games, our method requires only 10M frames to achieve the same mean human-normalized score reported in Nature DQN [14], and 20M frames to achieve the same median human-normalized score. 
Remarkably, EBU achieves such improvements with comparable computational complexity, by only modifying the target generation procedure for the value update from the original DQN.\n\n2 Background\n\nThe goal of reinforcement learning (RL) is to learn the optimal policy that maximizes the expected sum of rewards in an environment that is often modeled as a Markov decision process (MDP) M = (S, A, P, R). S denotes the state space, A denotes the action space, P : S \u00d7 A \u00d7 S \u2192 R denotes the transition probability distribution, and R : S \u00d7 A \u2192 R denotes the reward function. Q-learning [22] is one of the most widely used methods to solve RL tasks. The objective of Q-learning is to estimate the state-action value function Q(s, a), or the Q-function, which is characterized by the Bellman optimality equation: Q\u2217(st, a) = E[rt + \u03b3 max_a' Q\u2217(st+1, a')].\n\nThere are two major inefficiencies in traditional on-line Q-learning. First, each experience is used only once to update the Q-function. Second, learning from experiences in a chronologically forward order is much more inefficient than learning in a chronologically backward order, because the value of st+1 is required to update the value of st. Experience replay [12] was proposed to overcome these inefficiencies. After observing a transition (st, at, rt, st+1), the agent stores the transition into its replay buffer. In order to learn the Q-values, the agent samples transitions from the replay buffer.\n\nIn practice, the state space S is extremely large, so it is impractical to tabularize the Q-values of all state-action pairs. Deep Q-Network (DQN) [14] overcomes this issue by using deep neural networks to approximate the Q-function. DQN adopts experience replay to use each transition for multiple updates. Since DQN uses a function approximator, consecutive states output similar Q-values. 
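As an illustrative sketch (ours, not the paper's code), the one-step Q-learning update with uniform replay sampling described above can be written as follows; `Q` is a tabular dict-of-dicts stand-in for the Q-function, and the transition tuple format is an assumption made for this example.

```python
import random

def uniform_replay_update(Q, replay, gamma=0.9, alpha=0.5):
    """Sample one stored transition uniformly at random and regress
    Q(s, a) toward the one-step target r + gamma * max_a' Q(s', a')."""
    s, a, r, s_next, done = random.choice(replay)
    bootstrap = 0.0 if done else max(Q[s_next].values())
    target = r + gamma * bootstrap
    Q[s][a] += alpha * (target - Q[s][a])
    return target

# Toy usage: a single rewarded terminal transition in the buffer.
Q = {0: {'left': 0.0, 'right': 0.0}, 1: {'left': 0.0, 'right': 0.0}}
replay = [(0, 'right', 1.0, 1, True)]
uniform_replay_update(Q, replay)  # Q[0]['right'] moves from 0.0 to 0.5
```

With a sparse reward, most uniformly sampled transitions carry r = 0 and a zero bootstrap, so this update makes no progress until rewarded transitions happen to be drawn first.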
If DQN updates transitions in a chronologically backward order, overestimation errors often cumulate and degrade the performance. Therefore, DQN does not sample transitions in a backward order, but uniformly at random. This process breaks down the correlations between consecutive transitions and reduces the variance of updates.\n\nThere have been a variety of methods proposed to improve the performance of DQN in terms of stability, sample efficiency, and runtime. Some methods propose new network architectures. The dueling network architecture [21] contains two streams of separate Q-networks to estimate the value functions and the advantage functions. Neural episodic control [16] and model-free episodic control [5] use episodic memory modules to estimate the state-action values. RUDDER [1] introduces an LSTM network with contribution analysis for an efficient return decomposition. Ephemeral Value Adjustments (EVA) [7] combines the values of two separate networks, where one is the standard DQN and the other is a trajectory-based value network.\n\nSome methods tackle the uniform random sampling replay strategy of DQN. Prioritized experience replay [17] assigns non-uniform probability to sample transitions, where greater probability is assigned to transitions with higher temporal difference (TD) error. Inspired by Lin's backward use of replay memory, some methods try to aggregate TD values with Monte-Carlo (MC) returns. Q(\u03bb) [23], Q\u2217(\u03bb) [6] and Retrace(\u03bb) [15] modify the target values to allow samples to be used interchangeably for on-policy and off-policy learning. The count-based exploration method combined with intrinsic motivation [3] takes a mixture of the one-step return and the MC return to set up the target value. Optimality Tightening [8] applies constraints on the target using the values of several neighboring transitions. 
Simply by adding a few penalty terms to the loss, it efficiently propagates reliable values to achieve fast convergence.\n\nFigure 1: A motivating example where the uniform sampling method fails but EBU does not. (a): A simple navigation domain with 4 states and a single rewarded transition. Circled numbers indicate the order of sample updates. QU and QE stand for the Q-values learned by the uniform random sampling method and the EBU method, respectively. (b): The probability of learning the optimal path (s1 \u2192 s2 \u2192 s3 \u2192 s4) after updating the Q-values with sample transitions.\n\nOur goal is to improve the sample efficiency of deep reinforcement learning by making a simple yet effective modification. Without a single change to the network structure, training schemes, and hyperparameters of the original DQN, we only modify the target generation method. Instead of using a limited number of transitions, our method samples a whole episode from the replay memory and propagates the values sequentially through the entire transitions of the sampled episode in a backward manner. By using a temporary backward Q-table with a diffusion coefficient, our novel algorithm effectively reduces the errors generated from the consecutive updates of correlated states.\n\n3 Proposed Methods\n\n3.1 Episodic Backward Update for Tabular Q-Learning\n\nLet us imagine a simple tabular MDP with a single rewarded transition (Figure 1, (a)), where an agent can only take one of two actions: 'left' and 'right'. In this example, s1 is the initial state, and s4 is the terminal state. A reward of 1 is gained only when the agent reaches the terminal state, and a reward of 0 is gained from any other transition. To make it simple, assume that we have only one episode stored in the experience memory: (s1 \u2192 s2 \u2192 s3 \u2192 s2 \u2192 s3 \u2192 s4). The Q-values of all transitions are initialized to zero. 
With a discount \u03b3 \u2208 (0, 1), the optimal policy is to take the action 'right' in all states. When sampling transitions uniformly at random as in Nature DQN, the key transitions (s1 \u2192 s2), (s2 \u2192 s3) and (s3 \u2192 s4) may not be sampled for updates. Even when those transitions are sampled, there is no guarantee that the update of the transition (s3 \u2192 s4) is done before the update of (s2 \u2192 s3). We can speed up the reward propagation by updating all transitions within the episode in a backward manner. Such a recursive update is also computationally efficient.\n\nWe can calculate the probability of learning the optimal path (s1 \u2192 s2 \u2192 s3 \u2192 s4) as a function of the number of sample transitions trained. With the tabular Episodic Backward Update stated in Algorithm 1, which is a special case of Lin's algorithm [11] with recency parameter \u03bb = 0, the agent can figure out the optimal policy after just 5 updates of Q-values. However, the uniform sampling method requires more than 40 transitions to learn the optimal path with probability close to 1 (Figure 1, (b)).\n\nNote that this method differs from standard n-step Q-learning [22]. In n-step Q-learning, the number of future steps for the target generation is fixed as n. In contrast, our method considers T future values, where T is the length of the sampled episode. N-step Q-learning takes a max operator at the n-th step only, whereas our method takes a max operator at every iterative backward step, which can propagate high values faster. 
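To make the example concrete, here is a minimal sketch (our own illustration, with \u03b3 = 0.9 chosen arbitrarily) of the single backward sweep over the stored episode; one pass propagates the terminal reward all the way back to s1:

```python
ACTIONS = ('left', 'right')

def tabular_ebu(Q, episode, gamma=0.9):
    """One backward sweep with learning rate 1: each Q(s_t, a_t) is
    overwritten with r_t + gamma * max_a' Q(s_{t+1}, a')."""
    for s, a, r, s_next in reversed(episode):
        Q[(s, a)] = r + gamma * max(Q[(s_next, b)] for b in ACTIONS)

# The stored episode s1 -> s2 -> s3 -> s2 -> s3 -> s4, reward 1 at the end.
episode = [('s1', 'right', 0, 's2'), ('s2', 'right', 0, 's3'),
           ('s3', 'left', 0, 's2'), ('s2', 'right', 0, 's3'),
           ('s3', 'right', 1, 's4')]
Q = {(s, a): 0.0 for s in ('s1', 's2', 's3', 's4') for a in ACTIONS}
tabular_ebu(Q, episode)
# After these 5 updates the greedy policy is already 'right' everywhere:
# Q[('s3','right')] = 1.0, Q[('s2','right')] = 0.9, Q[('s1','right')] = 0.81
```

Because the sweep runs backward, the value of (s3 \u2192 s4) is in place before (s2 \u2192 s3) is updated, which is exactly the ordering uniform sampling cannot guarantee.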
To avoid exponential decay of the Q-value, we set the learning rate \u03b1 = 1 within the single episode update.\n\nAlgorithm 1 Episodic Backward Update for Tabular Q-Learning (single episode, tabular)\n1: Initialize the Q-table Q \u2208 R^(S\u00d7A) as an all-zero matrix: Q(s, a) = 0 for all state-action pairs (s, a) \u2208 S \u00d7 A.\n2: Experience an episode E = {(s1, a1, r1, s2), . . . , (sT, aT, rT, sT+1)}\n3: for t = T to 1 do\n4:     Q(st, at) \u2190 rt + \u03b3 max_a' Q(st+1, a')\n5: end for\n\nAlgorithm 2 Episodic Backward Update\n1: Initialize: replay memory D to capacity N, on-line action-value function Q(\u00b7; \u03b8), target action-value function \u02c6Q(\u00b7; \u03b8\u2212)\n2: for episode = 1 to M do\n3:     for t = 1 to Terminal do\n4:         With probability \u03b5 select a random action at, otherwise select at = argmax_a Q(st, a; \u03b8)\n5:         Execute action at, observe reward rt and next state st+1\n6:         Store transition (st, at, rt, st+1) in D\n7:         Sample a random episode E = {S, A, R, S'} from D, set T = length(E)\n8:         Generate a temporary target Q-table, \u02dcQ = \u02c6Q(S', \u00b7; \u03b8\u2212)\n9:         Initialize the target vector y = zeros(T), yT \u2190 RT\n10:        for k = T \u2212 1 to 1 do\n11:            \u02dcQ[Ak+1, k] \u2190 \u03b2 yk+1 + (1 \u2212 \u03b2) \u02dcQ[Ak+1, k]\n12:            yk \u2190 Rk + \u03b3 max_a \u02dcQ[a, k]\n13:        end for\n14:        Perform a gradient descent step on (y \u2212 Q(S, A; \u03b8))^2 with respect to \u03b8\n15:        Every C steps reset \u02c6Q = Q\n16:    end for\n17: end for\n\nThere are some other multi-step methods that converge to the optimal state-action value function, such as Q(\u03bb) and Q\u2217(\u03bb). However, our algorithm neither cuts traces of trajectories as Q(\u03bb) does, nor requires the parameter \u03bb to be small enough to guarantee convergence as Q\u2217(\u03bb) does. We present a detailed discussion on the relationship between EBU and other multi-step methods in Appendix F.\n\n3.2 Episodic Backward Update for Deep Q-Learning\n\nDirectly applying the backward update algorithm to deep reinforcement learning is known to show highly unstable results due to the high correlation of consecutive samples. 
We show that the fundamental ideas of the tabular version of the backward update algorithm may be applied to its deep version with just a few modifications. The full algorithm, introduced in Algorithm 2, closely resembles that of Nature DQN [14]. Our contributions lie in the recursive backward target generation with a diffusion factor \u03b2 (starting from line number 7 of Algorithm 2), which prevents the accumulation of overestimation errors from correlated states.\n\nInstead of sampling transitions uniformly at random, we make use of all transitions within the sampled episode E = {S, A, R, S'}. Let the sampled episode start with a state S1 and contain T transitions. Then E can be denoted as a set of four length-T vectors: S = {S1, S2, . . . , ST}; A = {A1, A2, . . . , AT}; R = {R1, R2, . . . , RT} and S' = {S2, S3, . . . , ST+1}. The temporary target Q-table \u02dcQ is an |A| \u00d7 T matrix which stores the target Q-values of all states S' for all valid actions, where A is the action space of the MDP. Therefore, the j-th column of \u02dcQ is a column vector that contains \u02c6Q(Sj+1, a; \u03b8\u2212) for all valid actions a \u2208 A, where \u02c6Q is the target Q-function parametrized by \u03b8\u2212.\n\nAfter the initialization of the temporary Q-table, we perform a recursive backward update. Adopting the backward update idea, one element \u02dcQ[Ak+1, k] in the k-th column of \u02dcQ is replaced using the next transition's target yk+1. Then yk is estimated as the maximum value of the newly modified k-th column of \u02dcQ. Repeating this procedure in a recursive manner until the start of the episode, we can successfully apply the backward update algorithm for a deep Q-network. The code is available at https://github.com/suyoung-lee/Episodic-Backward-Update. 
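The backward target generation just described (lines 8-13 of Algorithm 2) can be sketched in NumPy as follows; here `q_tilde` plays the role of the temporary table already filled with target-network outputs for S', and all variable names are ours:

```python
import numpy as np

def ebu_backward_targets(q_tilde, actions, rewards, gamma=0.99, beta=0.5):
    """Recursive backward target generation with diffusion factor beta.

    q_tilde : (|A|, T) array of target-network Q-values for the next states S'.
    actions : length-T sequence of executed actions A_1..A_T (0-based indices).
    rewards : length-T sequence of rewards R_1..R_T.
    Returns the length-T regression target vector y.
    """
    q = np.array(q_tilde, dtype=float)  # temporary table, modified below
    T = len(rewards)
    y = np.zeros(T)
    y[-1] = rewards[-1]                 # y_T <- R_T (the episode ends here)
    for k in range(T - 2, -1, -1):
        # Diffuse the next backward target into the stored estimate ...
        q[actions[k + 1], k] = beta * y[k + 1] + (1.0 - beta) * q[actions[k + 1], k]
        # ... then bootstrap through the max over the modified column.
        y[k] = rewards[k] + gamma * q[:, k].max()
    return y
```

With beta = 1 and an all-zero table this reduces to the tabular backward update, and with beta = 0 the stored target-network estimates are used unchanged, matching the two limiting cases discussed below.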
The process is described in detail with a supplementary diagram in Appendix E.\n\nWe are using a function approximator and updating correlated states in a sequence. As a result, we observe overestimated values propagating and compounding through the recursive max operations. We solve this problem by introducing the diffusion factor \u03b2. By setting \u03b2 \u2208 (0, 1), we can take a weighted sum of the new backpropagated value and the pre-existing value estimate. One can regard \u03b2 as a learning rate for the temporary Q-table, or as the level of 'backwardness' of the update. This process stabilizes learning by exponentially decreasing the overestimation error. Note that Algorithm 2 with \u03b2 = 1 is identical to the tabular backward algorithm stated in Algorithm 1. When \u03b2 = 0, the algorithm is identical to episodic one-step DQN. The role of \u03b2 is investigated in detail with experiments in Section 5.3.\n\n3.3 Adaptive Episodic Backward Update for Deep Q-Learning\n\nThe optimal diffusion factor \u03b2 varies depending on the type of the environment and on how much the network has been trained. We may further improve EBU by developing an adaptive tuning scheme for \u03b2. Without increasing the sample complexity, we propose an adaptive, single-actor, multiple-learner version of EBU. We generate K learner networks with different diffusion factors, and a single actor to output a policy. For each episode, the single actor selects one of the learner networks in a regular sequence. Each learner is trained in parallel, using the same episode sampled from a shared experience replay. Even with the same training data, the learners interpret the sample differently based on their different levels of trust in backwardly propagated values. We record the episode scores of each learner during training. 
After every fixed number of steps, we synchronize all the learner networks with the parameters of the learner network with the best training score. This adaptive version of EBU is presented as pseudo-code in Appendix A. In Section 5.2, we compare the two versions of EBU, one with a constant \u03b2 and another with an adaptive \u03b2.\n\n4 Theoretical Convergence\n\n4.1 Deterministic MDPs\n\nWe prove that Episodic Backward Update with \u03b2 \u2208 (0, 1) defines a contraction operator, and converges to the optimal Q-function in finite and deterministic MDPs.\n\nTheorem 1. Given a finite, deterministic and tabular MDP M = (S, A, P, R), the Episodic Backward Update algorithm in Algorithm 2 converges to the optimal Q-function w.p. 1 as long as\n\u2022 The step size satisfies the Robbins-Monro condition;\n\u2022 The sample trajectories are finite in length l: E[l] < \u221e;\n\u2022 Every (state, action) pair is visited infinitely often.\n\nWe state the proof of Theorem 1 in Appendix G. Furthermore, even in stochastic environments, we can guarantee the convergence of the episodic backward algorithm for a sufficiently small \u03b2.\n\n4.2 Stochastic MDPs\n\nTheorem 2. Given a finite, tabular and stochastic MDP M = (S, A, P, R), define R^sto_max(s, a) as the maximal return of a trajectory that starts from state s \u2208 S and action a \u2208 A. In a similar way, define r^sto_min(s, a) and r^sto_mean(s, a) as the minimum and mean of the possible rewards from selecting action a in state s. Define A_sub(s) = {a' \u2208 A | Q\u2217(s, a') < max_a Q\u2217(s, a)} as the set of suboptimal actions in state s \u2208 S, and define A_opt(s) = A \\ A_sub(s). Then, under the conditions of Theorem 1, and\n\n\u03b2 \u2264 inf_{s \u2208 S} inf_{a' \u2208 A_sub(s)} inf_{a \u2208 A_opt(s)} [Q\u2217(s, a) \u2212 Q\u2217(s, a')] / [R^sto_max(s, a') \u2212 Q\u2217(s, a')],    (1)\n\n\u03b2 \u2264 inf_{s \u2208 S} inf_{a' \u2208 A_sub(s)} inf_{a \u2208 A_opt(s)} [Q\u2217(s, a) \u2212 Q\u2217(s, a')] / [r^sto_mean(s, a) \u2212 r^sto_min(s, a)],    (2)\n\nthe Episodic Backward Update algorithm in Algorithm 2 converges to the optimal Q-function w.p. 1.\n\nFigure 2: (a) & (b): Median of 50 relative lengths of EBU and baselines. EBU outperforms other baselines significantly in the low sample regime and for high wall density. (c): Median relative lengths of EBU and other baseline algorithms in MNIST maze with stochastic transitions.\n\nThe main intuition behind this theorem is that \u03b2 acts as a learning rate of the backward target and therefore mitigates the collision between the max operator and stochastic transitions.\n\n5 Experimental Results\n\n5.1 2D MNIST Maze (Deterministic/Stochastic MDPs)\n\nFigure 3: 2D MNIST Maze\n\nWe test our algorithm in the 2D Maze Environment. Starting from the initial position at (0, 0), the agent has to navigate through the maze to reach the goal position at (9, 9). To minimize the correlation between neighboring states, we use the MNIST dataset [9] for the state representation. The agent receives the coordinates of the position in two MNIST images as the state representation. The training environments are 10 by 10 mazes with randomly placed walls. We assign a reward of 1000 for reaching the goal, and a reward of -1 for bumping into a wall. The wall density indicates the probability of having a wall at each position. For each wall density, we generate 50 random mazes with different wall locations. We train a total of 50 independent agents, one for each maze, over 200,000 steps. 
The performance metric, relative length, is defined as lrel = lagent/loracle, which is the ratio between the length of the agent's path lagent and the length of the ground-truth shortest path loracle to reach the goal. The details of the hyperparameters and the network structure are described in Appendix D.\n\nWe compare EBU to uniform-random-sampling one-step DQN and to n-step DQN. For n-step DQN, we set the value of n as the length of the episode. Since all three algorithms eventually achieve median relative lengths of 1 at the end of training, we report the relative lengths at 100,000 steps in Table 1. One-step DQN performs the worst in all configurations, implying the inefficiency of uniform sampling updates in environments with sparse and delayed rewards. As the wall density increases, it becomes more important for the agent to learn the correct decisions at bottleneck positions. N-step DQN shows the best performance with a low wall density, but as the wall density increases, EBU significantly outperforms n-step DQN.\n\nIn addition, we run experiments with stochastic transitions. We assign a 10% probability to each side action for all four valid actions. For example, when an agent takes the action 'up', there is a 10% chance of transiting to the left state and a 10% chance of transiting to the right state. 
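The stochastic transition model above can be sketched as follows (a hypothetical helper of ours; the action names and seeded generator are illustrative, not from the paper's code):

```python
import random

# Perpendicular "side" directions for each intended move in the maze.
SIDE = {'up': ('left', 'right'), 'down': ('left', 'right'),
        'left': ('up', 'down'), 'right': ('up', 'down')}

def executed_action(intended, rng):
    """Return the action actually executed: a 10% chance of slipping to
    either perpendicular side, otherwise the intended direction."""
    u = rng.random()
    if u < 0.1:
        return SIDE[intended][0]
    if u < 0.2:
        return SIDE[intended][1]
    return intended

rng = random.Random(0)
outcomes = [executed_action('up', rng) for _ in range(10000)]
```

Over many draws, roughly 80% of the executed actions match the intended one, with the remainder split between the two side slips.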
In Figure 2 (c), we see that the EBU agent outperforms the baselines in the stochastic environment as well.\n\nTable 1: Relative lengths (Mean & Median) of 50 deterministic MNIST Maze after 100,000 steps\n\nWall density\n20%\n30%\n40%\n50%\n\nEBU (\u03b2 = 1.0) One-step DQN N-step DQN\n2.24\n3.26\n3.32\n8.88\n8.96\n3.50\n11.32 3.12\n\n9.25\n21.03\n22.71\n16.62\n\n14.40\n25.63\n25.45\n22.36\n\n5.44\n8.14\n8.61\n5.51\n\n2.42\n3.03\n2.52\n2.34\n\nFigure 4: Relative score of adaptive EBU (4 random seeds) compared to Nature DQN (8 random seeds) in percents (%), both trained for 10M frames.\n\n5.2 49 Games of Atari 2600 Environment (Deterministic MDPs)\n\nThe Arcade Learning Environment [2] is one of the most popular RL benchmarks for its diverse set of challenging tasks. We use the same set of 49 Atari 2600 games which was evaluated in the Nature DQN paper [14].\n\nWe select \u03b2 = 0.5 for EBU with a constant diffusion factor. For adaptive EBU, we train K = 11 parallel learners with diffusion factors 0.0, 0.1, . . ., and 1.0. We synchronize the learners at the end of each epoch (0.25M frames). We compare our algorithm to four baselines: Nature DQN [14], Prioritized Experience Replay (PER) [17], Retrace(\u03bb) [15] and Optimality Tightening (OT) [8]. We train EBU and the baselines for 10M frames (an additional 20M frames for adaptive EBU) on 49 Atari games with the same network structure, hyperparameters, and evaluation methods used in Nature DQN. 
The choice of such a small number of training steps is made to investigate the sample efficiency of each algorithm, following [16, 8]. We report the mean result from 4 random seeds for adaptive EBU and 8 random seeds for all other baselines. Detailed specifications for each baseline are described in Appendix D.\n\nFirst, we show the improvement of adaptive EBU over Nature DQN at 10M frames for all 49 games in Figure 4. To compare the performance of an agent to its baseline's, we use the following relative score, (ScoreAgent \u2212 ScoreBaseline) / (max{ScoreHuman, ScoreBaseline} \u2212 ScoreRandom) [21]. This measure shows how well an agent performs a task compared to the task's level of difficulty. EBU (\u03b2 = 0.5) and adaptive EBU outperform Nature DQN in 33 and 39 out of 49 games, respectively. The large improvements in games such as \u201cAtlantis,\u201d \u201cBreakout,\u201d and \u201cVideo Pinball\u201d far outweigh the minor losses in a few games.\n\nWe use the human-normalized score, (ScoreAgent \u2212 ScoreRandom) / |ScoreHuman \u2212 ScoreRandom| [20], which is the most widely used metric to make an apples-to-apples comparison in the Atari domain. We report the mean and the median human-normalized scores of the 49 games in Table 2. The result signifies that our algorithm outperforms the baselines in both the mean and the median of the human-normalized scores. PER and Retrace(\u03bb) do not show much improvement for a training budget as small as 10M frames. Since OT has to calculate the Q-values of neighboring states and compare them to generate the penalty term, it requires about 3 times more training time than Nature DQN. In contrast, since EBU performs iterative episodic updates using the temporary Q-table that is shared by all transitions in the episode, EBU has almost the same computational cost as that of Nature DQN.\n\nFigure 5: Episode scores and average Q-values of all state-action pairs in \u201cGopher\u201d and \u201cBreakout\u201d.\n\nThe most significant result is that EBU (\u03b2 = 0.5) requires only 10M frames of training to achieve the mean human-normalized score reported in Nature DQN, which is trained for 200M frames. Although 10M frames are not enough to achieve the same median score, adaptive EBU trained for 20M frames achieves the median normalized score. These results signify the efficacy of backward value propagation in the early stages of training. Raw scores for all 49 games are summarized in Appendix B. Learning curves of adaptive EBU for all 49 games are reported in Appendix C.\n\nTable 2: Summary of training time and human-normalized performance. Training time refers to the total time required to train 49 games of 10M frames using a single NVIDIA TITAN Xp for a single random seed. We use multiple GPUs to train the learners of adaptive EBU in parallel. (*) The result of OT differs from the result reported in [8] due to different evaluation methods (i.e. not limiting the maximum number of steps for a test episode and taking the maximum score over random seeds). 
(**) We report the scores of Nature DQN (200M) from [14].\n\nAlgorithm (frames) | Training Time (hours) | Mean (%) | Median (%)\nEBU (\u03b2 = 0.5) (10M) | 152 | 253.55 | 51.55\nEBU (adaptive \u03b2) (10M) | 203 | 275.78 | 63.80\nNature DQN (10M) | 138 | 133.95 | 40.42\nPER (10M) | 146 | 156.57 | 40.86\nRetrace(\u03bb) (10M) | 154 | 93.77 | 41.99\nOT (10M)* | 407 | 162.66 | 49.42\nEBU (adaptive \u03b2) (20M) | 450 | 347.99 | 92.50\nNature DQN (200M)** | - | 241.06 | 93.52\n\n5.3 Analysis on the Role of the Diffusion Factor \u03b2\n\nIn this section, we make comparisons among our own EBU variants. EBU (\u03b2 = 1.0) works the best in the MNIST Maze environment because we use MNIST images for the state representation, which allows consecutive states to exhibit little correlation. However, in the Atari domain, consecutive states often differ in only a few pixels. As a consequence, EBU (\u03b2 = 1.0) underperforms EBU (\u03b2 = 0.5) in most of the Atari games. In order to analyze this phenomenon, we evaluate the Q-values learned at the end of each training epoch. We report the test episode score and the corresponding mean Q-values of all transitions within the test episode (Figure 5). We notice that EBU (\u03b2 = 1.0) is trained to output highly overestimated Q-values compared to its actual return. Since the EBU method performs recursive max operations, EBU outputs higher (possibly overestimated) Q-values than Nature DQN. This result indicates that sequentially updating correlated states with overestimated values may destabilize the learning process. 
In contrast, this result implies that EBU (β = 0.5) is relatively free from the overestimation problem.

Next, we investigate the efficacy of using an adaptive diffusion factor. In Figure 6, we present how adaptive EBU adapts its diffusion factor during the course of training in "Breakout". In the early stage of training, the agent barely succeeds in breaking a single brick. With a high β close to 1, values can be directly propagated from the rewarded state to the state where the agent has to bounce the ball up. Note that the performance of adaptive EBU follows that of EBU (β = 1.0) up to about 5M frames. As training proceeds, the agent encounters more rewards and more varied trajectories, which may cause overestimation. As a consequence, adaptive EBU anneals the diffusion factor to a lower value of 0.5. The trend of how the diffusion factor adapts differs from game to game. Refer to the diffusion-factor curves for all 49 games in Appendix C to see how adaptive EBU selects the best diffusion factor.

Figure 6: (a) Test scores in "Breakout". Mean and standard deviation over 4 random seeds are plotted. (b) Adaptive diffusion factor of adaptive EBU in "Breakout".

6 Conclusion

In this work, we propose Episodic Backward Update, which samples transitions episode by episode and updates values recursively in a backward manner. Our algorithm achieves fast and stable learning due to its efficient value propagation. We theoretically prove the convergence of our method and experimentally show that our algorithm outperforms other baselines in many complex domains, requiring only about 10% of the samples. Since our work differs from DQN only in the target generation, we hope to make further improvements by combining it with other successful deep reinforcement learning methods.

Acknowledgments

This work was supported by the ICT R&D program of MSIP/IITP.
[2016-0-00563, Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion]

References

[1] Arjona-Medina, J. A., Gillhofer, M., Widrich, M., Unterthiner, T., and Hochreiter, S. RUDDER: Return decomposition for delayed rewards. arXiv preprint arXiv:1806.07857, 2018.

[2] Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253-279, 2013.

[3] Bellemare, M. G., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems (NIPS), 1471-1479, 2016.

[4] Bertsekas, D. P., and Tsitsiklis, J. N. Neuro-Dynamic Programming. Athena Scientific, 1996.

[5] Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J. Z., Rae, J., Wierstra, D., and Hassabis, D. Model-free episodic control. arXiv preprint arXiv:1606.04460, 2016.

[6] Harutyunyan, A., Bellemare, M. G., Stepleton, T., and Munos, R. Q(λ) with off-policy corrections. In International Conference on Algorithmic Learning Theory (ALT), 305-320, 2016.

[7] Hansen, S., Pritzel, A., Sprechmann, P., Barreto, A., and Blundell, C. Fast deep reinforcement learning using online adjustments from the past. In Advances in Neural Information Processing Systems (NIPS), 10590-10600, 2018.

[8] He, F. S., Liu, Y., Schwing, A. G., and Peng, J. Learning to play in a day: Faster deep reinforcement learning by optimality tightening.
In International Conference on Learning Representations (ICLR), 2017.

[9] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

[10] Lengyel, M., and Dayan, P. Hippocampal Contributions to Control: The Third Way. In Advances in Neural Information Processing Systems (NIPS), 889-896, 2007.

[11] Lin, L.-J. Programming Robots Using Reinforcement Learning and Teaching. In Association for the Advancement of Artificial Intelligence (AAAI), 781-786, 1991.

[12] Lin, L.-J. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8:293-321, 1992.

[13] Melo, F. S. Convergence of Q-learning: A simple proof. Institute of Systems and Robotics, Tech. Rep., 2001.

[14] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

[15] Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. G. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), 1046-1054, 2016.

[16] Pritzel, A., Uria, B., Srinivasan, S., Puigdomenech, A., Vinyals, O., Hassabis, D., Wierstra, D., and Blundell, C. Neural Episodic Control. In International Conference on Machine Learning (ICML), 2827-2836, 2017.

[17] Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized Experience Replay. In International Conference on Learning Representations (ICLR), 2016.

[18] Silver, D., Huang, A., Maddison, C.
J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484-489, 2016.

[19] Sutton, R. S., and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 1998.

[20] van Hasselt, H., Guez, A., and Silver, D. Deep Reinforcement Learning with Double Q-learning. In Association for the Advancement of Artificial Intelligence (AAAI), 2094-2100, 2016.

[21] Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., and de Freitas, N. Dueling Network Architectures for Deep Reinforcement Learning. In International Conference on Machine Learning (ICML), 1995-2003, 2016.

[22] Watkins, C. J. C. H. Learning from delayed rewards. Ph.D. thesis, University of Cambridge, England, 1989.

[23] Watkins, C. J. C. H., and Dayan, P. Q-learning. Machine Learning, 8:279-292, 1992.