{"title": "Using a Logarithmic Mapping to Enable Lower Discount Factors in Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 14134, "page_last": 14144, "abstract": "In an effort to better understand the different ways in which the discount factor affects the optimization process in reinforcement learning, we designed a set of experiments to study each effect in isolation. Our analysis reveals that the common perception that poor performance of low discount factors is caused by (too) small action-gaps requires revision. We propose an alternative hypothesis that identifies the size-difference of the action-gap across the state-space as the primary cause. We then introduce a new method that enables more homogeneous action-gaps by mapping value estimates to a logarithmic space. We prove convergence for this method under standard assumptions and demonstrate empirically that it indeed enables lower discount factors for approximate reinforcement-learning methods. This in turn allows tackling a class of reinforcement-learning problems that are challenging to solve with traditional methods.", "full_text": "Using a Logarithmic Mapping to Enable Lower\nDiscount Factors in Reinforcement Learning\n\nHarm van Seijen\n\nMicrosoft Research Montr\u00e9al\n\nharm.vanseijen@microsoft.com\n\nMehdi Fatemi\n\nMicrosoft Research Montr\u00e9al\n\nmehdi.fatemi@microsoft.com\n\nArash Tavakoli\n\nImperial College London\n\na.tavakoli@imperial.ac.uk\n\nAbstract\n\nIn an effort to better understand the different ways in which the discount factor\naffects the optimization process in reinforcement learning, we designed a set of\nexperiments to study each effect in isolation. Our analysis reveals that the common\nperception that poor performance of low discount factors is caused by (too) small\naction-gaps requires revision. 
We propose an alternative hypothesis that identi\ufb01es\nthe size-difference of the action-gap across the state-space as the primary cause.\nWe then introduce a new method that enables more homogeneous action-gaps by\nmapping value estimates to a logarithmic space. We prove convergence for this\nmethod under standard assumptions and demonstrate empirically that it indeed\nenables lower discount factors for approximate reinforcement-learning methods.\nThis in turn allows tackling a class of reinforcement-learning problems that are\nchallenging to solve with traditional methods.\n\n1\n\nIntroduction\n\nIn reinforcement learning (RL), the objective that one wants to optimize for is often best described\nas an undiscounted sum of rewards (e.g., maximizing the total score in a game) and a discount\nfactor is merely introduced so as to avoid some of the optimization challenges that can occur when\ndirectly optimizing on an undiscounted objective [Bertsekas and Tsitsiklis, 1996]. In this scenario, the\ndiscount factor plays the role of a hyper-parameter that can be tuned to obtain a better performance\non the true objective. Furthermore, for practical reasons, a policy can only be evaluated for a \ufb01nite\namount of time, making the effective performance metric a \ufb01nite-horizon, undiscounted objective.1\nTo gain a better understanding of the interaction between the discount factor and a \ufb01nite-horizon,\nundiscounted objective, we designed a number of experiments to study this relation. One surprising\n\ufb01nding is that for some problems a low discount factor can result in better asymptotic performance,\nwhen a \ufb01nite-horizon, undiscounted objective is indirectly optimized through the proxy of an in\ufb01nite-\nhorizon, discounted sum. 
This motivates us to look deeper into the effect of the discount factor on the optimization process.

We analyze why in practice the performance of low discount factors tends to fall flat when combined with function approximation, especially in tasks with long horizons. Specifically, we refute a number of common hypotheses and present a new one instead, identifying the primary culprit to be the size-difference of the action gap (i.e., the difference between the values of the best and the second-best actions of a state) across the state-space.

Our main contribution is a new method that yields more homogeneous action-gap sizes for sparse-reward problems. This is achieved by mapping the update target to a logarithmic space and performing updates in that space instead. We prove convergence of this method under standard conditions. Finally, we demonstrate empirically that our method achieves much better performance for low discount factors than previously possible, providing supporting evidence for our new hypothesis. Combining this with our analytical result that there exist tasks where low discount factors outperform higher ones asymptotically suggests that our method can unlock a performance on certain problems that is not achievable by contemporary RL methods.

[1] As an example, in the seminal work of Mnih et al. [2015], the (undiscounted) score of Atari games is reported with a time-limit of 5 minutes per game.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

2 Problem Setting

Consider a Markov decision process (MDP, [Puterman, 1994]) M = ⟨S, A, P, R, S₀⟩, where S denotes the set of states, A the set of actions, R the reward function R : S × A × S → ℝ, P the transition probability function P : S × A × S → [0, 1], and S₀ the starting-state distribution. At each time step t, the agent observes state s_t ∈ S and takes action a_t ∈ A.
The agent then observes the next state s_{t+1}, drawn from the transition probability distribution P(s_t, a_t, ·), and a reward r_t = R(s_t, a_t, s_{t+1}). A terminal state is one that, once entered, terminates the interaction with the environment; mathematically, it can be interpreted as an absorbing state that transitions only to itself, with a corresponding reward of 0. The behavior of an agent is defined by a policy π, which, at time step t, takes as input the history of states, actions, and rewards, s₀, a₀, r₀, s₁, a₁, ..., r_{t−1}, s_t, and outputs a distribution over actions, according to which action a_t is selected. If action a_t depends only on the current state s_t, we call the policy stationary; if it depends on more than the current state s_t, we call the policy non-stationary.

We define a task to be the combination of an MDP M and a performance metric F. The metric F is a function that takes as input a policy π and outputs a score that represents the performance of π on M. By contrast, we define the learning metric F_l to be the metric that the agent optimizes. Within the context of this paper, unless otherwise stated, the performance metric F considers the expected, finite-horizon, undiscounted sum of rewards over the start-state distribution, while the learning metric F_l considers the expected, infinite-horizon, discounted sum of rewards:

$$F(\pi, M) = \mathbb{E}\Bigl[\,\sum_{i=0}^{h-1} r_i \;\Big|\; \pi, M\Bigr]; \qquad F_l(\pi, M) = \mathbb{E}\Bigl[\,\sum_{i=0}^{\infty} \gamma^i r_i \;\Big|\; \pi, M\Bigr] \qquad (1)$$

where the horizon h and the discount factor γ are hyper-parameters of F and F_l, respectively. The optimal policy of a task, π*, is the policy that maximizes the metric F on the MDP M. Note that in general π* will be a non-stationary policy.
In particular, the optimal policy depends not only on the current state but also on the time step. We denote the policy that is optimal w.r.t. the learning metric F_l by π*_l. Because F_l is not a finite-horizon objective, there exists a stationary optimal policy for it, which considerably simplifies the learning problem.[2] Due to the difference between the learning and performance metrics, the policy that is optimal w.r.t. the learning metric need not be optimal w.r.t. the performance metric. We call the difference in performance between π*_l and π*, as measured by F, the metric gap:

$$\Delta_F = F(\pi^*, M) - F(\pi^*_l, M)$$

The relation between γ and the metric gap is analyzed in Section 3.1.

[2] This is the main reason why optimizing an infinite-horizon objective, rather than a finite-horizon one, is an attractive choice.

We consider model-free, value-based methods. These methods aim to find a good policy by iteratively improving an estimate of the optimal action-value function Q*, which, generally, predicts the expected discounted sum of rewards under the optimal policy π*_l, conditioned on state–action pairs. The canonical example is Q-learning [Watkins and Dayan, 1992], which updates its estimates as follows:

$$Q_{t+1}(s_t, a_t) := (1 - \alpha)\,Q_t(s_t, a_t) + \alpha\,\Bigl(r_t + \gamma \max_{a'} Q_t(s_{t+1}, a')\Bigr) \qquad (2)$$

where α ∈ [0, 1] is the step-size. The action-value function is commonly estimated using a function approximator with weight vector θ: Q(s, a; θ).
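As a concrete illustration, the tabular update (2) can be sketched in a few lines. This is a minimal sketch; the state and action encodings and the hyper-parameter values are placeholders, not the paper's experimental setup:

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.96):
    """One tabular Q-learning update, following Eq. (2)."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target

# Toy usage: a single update on a zero-initialized table.
Q = defaultdict(float)
q_learning_update(Q, s=0, a='aL', r=1.0, s_next=1, actions=['aL', 'aR'])
print(Q[(0, 'aL')])  # 0.1: one-tenth of the target r + gamma * 0 = 1.0
```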
Deep Q-Networks (DQN) [Mnih et al., 2015] use a deep neural network as function approximator and iteratively improve an estimate of Q* by minimizing a sequence of loss functions:

$$L_i(\theta_i) = \mathbb{E}_{s,a,r,s'}\bigl[(y_i^{DQN} - Q(s, a; \theta_i))^2\bigr], \qquad (3)$$
$$\text{with} \quad y_i^{DQN} = r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}). \qquad (4)$$

The weight vector from the previous iteration, θ_{i−1}, is encoded using a separate target network.

3 Analysis of Discount Factor Effects

3.1 Effect on Metric Gap

Figure 1: Illustrations of three different tasks (blue diamond: starting position; green circle: positive object; red circle: negative object; gray arrows: wind direction; numbers indicate rewards). The graphs show the performance (as measured by F) on these tasks for π* (black, dotted line) and π*_l (red, solid line) as a function of the discount factor of the learning metric. The difference between the two represents the metric gap.

The question that is central to this section is the following: given a finite-horizon, undiscounted performance metric, what can be said about the relation between the discount factor of the learning metric and the metric gap?

To study this problem, we designed a variety of tasks and measured the dependence between the metric gap and the discount factor. In Figure 1, we illustrate three of those tasks, as well as the metric gap on those tasks as a function of the discount factor. In each task, an agent, starting from a particular position, has to collect rewards by collecting the positive objects while avoiding the negative ones. The transition dynamics of tasks A and B are deterministic, whereas in task C wind blows in the direction of the arrows, moving the agent to the left with a 40% chance, regardless of the action performed.
For all three tasks, the horizon of the performance metric is 12. On task A, where a small negative reward has to be traded off against a large positive reward that is received later, high discount factors result in a smaller metric gap. By contrast, on task B, low discount factors result in a smaller metric gap. The reason is that for high discount factors the optimal learning policy takes the longer route, first trying to collect the large object before going to the small one. However, with a performance-metric horizon of 12, there is not enough time to take the long route and get both rewards. With a low discount factor, the agent takes a shorter route by first going to the smaller object and is able to collect all objects in time. On task C, a trade-off has to be made between the risk of falling into the negative object (due to domain stochasticity) and taking a longer detour that minimizes this risk. On this task, the optimal policy π* is non-stationary (the optimal action depends on the time step). However, because the learning objective F_l is not finite-horizon, it has a stationary optimal policy π*_l. Hence, the metric gap cannot be reduced to 0 for any value of the discount factor. The best discount factor is one that is neither too high nor too low.

While the policy π*_l is derived from an infinite-horizon metric, this does not preclude it from being learned with finite-length training episodes. As an example, consider using Q-learning to learn π*_l for any of the tasks from Figure 1. With a uniformly random behavior policy and training episodes of length 12 (the same as the horizon of the performance metric), each state–action pair has a non-zero probability of being visited within an episode. Hence, with the right step-size decay schedule, convergence in the limit can be guaranteed [Jaakkola et al., 1994].
A key detail to enable this is that the state reached at the final time step is not treated as a terminal state (which has value 0 by default); instead, normal bootstrapping occurs [Pardo et al., 2018].

A finite-horizon performance metric is not essential to observe a strong dependence of the metric gap on γ. For example, if on task B the performance metric instead measured the number of steps it takes to collect all objects, a similar graph would be obtained. In general, the examples in this section demonstrate that the best discount factor is task-dependent and can be anywhere in the range between 0 and 1.

Figure 2: Chain task consisting of 50 states and two terminal ones. Each (non-terminal) state has two actions: a_L, which results in transitioning to the left with probability 1 − p and to the right with probability p, and vice versa for the other action, a_R. All rewards are 0, except for transitioning to the far-left or far-right terminal states, which results in r_L and r_R, respectively.

3.2 Optimization Effects

The performance of π*_l gives the theoretical limit of what the agent can achieve given its learning metric. However, the discount factor also affects the optimization process; for some discount factors, finding π*_l can be more challenging than for others. In this section, using the task shown in Figure 2, we evaluate the correlation between the discount factor and how hard it is to find π*_l. It is easy to see that the policy that always takes the left action a_L maximizes both the discounted and the undiscounted sum of rewards, for any discount factor or horizon value, respectively. We define the learning metric F_l as before (1), but use a different performance metric F. Specifically, we define F to be 1 if the policy takes a_L in every state, and 0 otherwise. The metric gap for this setting of F and F_l is 0, with the optimal performance (for π* and π*_l) being 1.

To study the optimization effects under function approximation, we use linear function approximation with features constructed by tile-coding [Sutton, 1996], using tile-widths of 1, 2, 3, and 5. A tile-width of w corresponds to a binary feature that is non-zero for w neighbouring states and zero for the remaining ones. The number and offset of the tilings are such that any value function can be represented. Hence, error-free reconstruction of the optimal action-value function is possible in principle, for any discount factor. Note that for a width of 1, the representation reduces to a tabular one.

To keep the experiment as simple as possible, we remove exploration effects by performing update sweeps over the entire state–action space (using a step-size of 0.001) and measure performance at the end of each update sweep. Figure 3 shows the performance during early learning (average performance over the first 10,000 sweeps) as well as the final performance (average between sweeps 100,000 and 110,000).

These experiments demonstrate a common empirical observation: when using function approximation, low discount factors do not work well in sparse-reward domains. More specifically, the main observations are: 1) there is a sharp drop in final performance for discount factors below some threshold; 2) this threshold value depends on the tile-width, with larger widths resulting in worse (i.e., higher) threshold values; and 3) the tabular representation performs well for all discount factors.

It is commonly believed that the action gap has a strong influence on the optimization process [Bellemare et al., 2016, Farahmand, 2011].
The action gap of a state s is defined as the difference in Q* between the best and the second-best actions at that state. To examine this common belief, we start by evaluating two straightforward hypotheses involving the action gap: 1) lower discount factors cause poor performance because they result in smaller action gaps; 2) lower discount factors cause poor performance because they result in smaller relative action gaps (i.e., the action gap of a state divided by the maximum action-value of that state).

Figure 3: Early performance (top) and final performance (bottom) on the chain task.

Since both hypotheses are supported by the results from Figure 3, we performed more experiments to test them. To test the first hypothesis, we performed the same experiment as above, but with rewards that are a factor 100 larger. This in turn increases the action gaps by a factor 100 as well. Hence, to validate the first hypothesis, this change should improve (i.e., lower) the threshold value at which the performance falls flat. To test the second hypothesis, we pushed all action-values up by 100 through additional rewards, reducing the relative action gap. Hence, to validate the second hypothesis, performance should degrade under this variation. However, neither modification caused significant changes to the early or final performance, invalidating both hypotheses. The corresponding graphs can be found in the supplementary material.

Because our two naïve action-gap hypotheses failed, we propose an alternative hypothesis: lower discount factors cause poor performance because they result in a larger difference in action-gap sizes across the state-space.
To illustrate the statement about the difference in action-gap sizes, we define a metric, which we call the action-gap deviation κ, that aims to capture the notion of action-gap variation. Specifically, let X be a random variable and let S⁺ ⊆ S be the subset of states that have a non-zero action gap. X draws uniformly at random a state s ∈ S⁺ and outputs log₁₀(AG(s)), where AG(s) is the action gap of state s. We define κ to be the standard deviation of the variable X. Figure 4 plots κ as a function of the discount factor for the task in Figure 2.

Figure 4: Action-gap deviation as a function of the discount factor.

To test this new hypothesis, we have to develop a method that reduces the action-gap deviation κ for low discount factors, without changing the optimal policy. We do so in the next section.

4 Logarithmic Q-learning

In this section, we introduce our new method, logarithmic Q-learning, which reduces the action-gap deviation κ for sparse-reward domains. We present the method in three steps, in each step adding a layer of complexity in order to extend the generality of the method. In the supplementary material, we prove convergence of the method in its most general form. As the first step, we consider domains with deterministic dynamics and rewards that are either positive or zero.

4.1 Deterministic Domains with Positive Rewards

Our method is based on the same general approach as used by Pohlen et al. [2018]: mapping the update target to a different space and performing updates in that space instead. We denote the mapping function by f, and its inverse by f⁻¹.
Values in the mapping space are updated as follows:

$$\tilde{Q}_{t+1}(s_t, a_t) := (1 - \alpha)\,\tilde{Q}_t(s_t, a_t) + \alpha\, f\Bigl(r_t + \gamma \max_{a'} f^{-1}\bigl(\tilde{Q}_t(s_{t+1}, a')\bigr)\Bigr). \qquad (5)$$

Note that Q̃ in this equation is not an estimate of an expected return; it is an estimate of an expected return mapped to a different space. To obtain a regular Q-value, the inverse mapping has to be applied to Q̃. Because the updates occur in the mapping space, κ is now measured w.r.t. Q̃. That is, the action gap of state s is now defined in the mapping space as Q̃(s, a_best) − Q̃(s, a_2nd-best).

To reduce κ, we propose the following logarithmic mapping function:

$$f(x) := c \ln(x + \gamma^k) + d, \qquad (6)$$

with inverse function f⁻¹(x) = e^{(x−d)/c} − γ^k, where c, d, and k are mapping hyper-parameters.

To understand the effect of (6) on κ, we plot κ, based on action gaps in the logarithmic space, for a variation of the chain task (Figure 2) that uses r_R = 0 and p = 0. We also plot κ based on action gaps in the regular space.

Figure 5: Action-gap deviation as a function of the discount factor, for the regular space ('reg') and for the logarithmic space with k = 40, k = 50, and k = 200.

Figure 5 shows that with an appropriate value of k, the action-gap deviation can be reduced to almost 0 for low values of γ. Setting k too high increases the deviation a little, while setting it too low increases it a lot for low discount factors. In short, k controls the smallest Q-value that can still be accurately represented (i.e., for which the action gap in the log-space is still significant). Roughly, the smallest value that can still be accurately represented is about γ^k. In other words, the cut-off point lies approximately at a state from which it takes k time steps to experience a +1 reward.
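The mapping (6) and its inverse can be written down directly. The following is a minimal sketch; the default hyper-parameter values are illustrative, not the tuned settings used later in the paper:

```python
import math

def f(x, c=1.0, d=0.0, k=100, gamma=0.96):
    """Logarithmic mapping, Eq. (6): f(x) = c * ln(x + gamma^k) + d."""
    return c * math.log(x + gamma ** k) + d

def f_inv(y, c=1.0, d=0.0, k=100, gamma=0.96):
    """Inverse mapping: f^{-1}(y) = e^{(y - d)/c} - gamma^k."""
    return math.exp((y - d) / c) - gamma ** k

# The inverse undoes the mapping for any non-negative input; values far
# below gamma^k are compressed toward f(0) and become hard to distinguish.
assert abs(f_inv(f(0.25)) - 0.25) < 1e-9
```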
Setting k too high causes actions that have 0 value in the regular space to have a large negative value in the log-space. This can increase the action gap substantially for the corresponding states, resulting in an overall increase of the action-gap deviation.

The parameters c and d scale and shift values in the logarithmic space and do not have any effect on the action-gap deviation. The parameter d controls the initialization of the Q-values. Setting d as follows:

$$d = -c \ln(q_{init} + \gamma^k), \qquad (7)$$

ensures that f⁻¹(0) = q_init for any value of c, k, and γ. This can be useful in practice, e.g., when using neural networks to represent Q̃, as it enables standard initialization methods (which produce output values around 0) while still ensuring that the initialized Q̃-values correspond to q_init in the regular space. The parameter c scales values in the log-space. For most tabular and linear methods, scaling values does not affect the optimization process. However, deep RL methods commonly use more advanced optimization techniques, and there such scaling can impact the optimization process significantly. In all our experiments, except the deep RL experiments, we fixed d according to the equation above with q_init = 0 and used c = 1.

In stochastic environments, the approach described in this section causes issues, because averaging over stochastic samples in the log-space produces an underestimate compared to averaging in the regular space and then mapping the result to the log-space. Specifically, if X is a random variable, E[ln(X)] ≤ ln(E[X]) (i.e., Jensen's inequality).
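As a quick numeric illustration of this underestimate, with made-up sample values:

```python
import math

# A stochastic return that is 0.5 or 2.0 with equal probability.
samples = [0.5, 2.0]

log_of_mean = math.log(sum(samples) / len(samples))                # ln(E[X]) = ln(1.25)
mean_of_logs = sum(math.log(x) for x in samples) / len(samples)    # E[ln X] = 0.0

# Averaging in log-space (mean_of_logs) underestimates the log of the
# true mean (log_of_mean), exactly as Jensen's inequality predicts.
assert mean_of_logs < log_of_mean
```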
Fortunately, within our specific context, there is a way around this limitation, which we discuss in the next section.

4.2 Stochastic Domains with Positive Rewards

The step-size α generally conflates two forms of averaging: averaging of stochastic update targets due to environment stochasticity and, in the case of function approximation, averaging over different states. To amend our method for stochastic environments, we would ideally separate these two forms of averaging and perform the averaging over stochastic update targets in the regular space and the averaging over different states in the log-space. While such a separation is hard to achieve, the approach presented below, which is inspired by this observation, achieves many of the same benefits. In particular, it enables convergence of Q̃ to f(Q*), even when the environment is stochastic.

Let β_log be the step-size for averaging in the log-space, and β_reg the step-size for averaging in the regular space. We amend the approach from the previous section by computing an alternative update target that is based on performing an averaging operation in the regular space. Specifically, the update target U_t is transformed into an alternative update target Û_t as follows:

$$\hat{U}_t := f^{-1}\bigl(\tilde{Q}_t(s_t, a_t)\bigr) + \beta_{reg}\Bigl(U_t - f^{-1}\bigl(\tilde{Q}_t(s_t, a_t)\bigr)\Bigr), \qquad (8)$$

with U_t := r_t + γ max_{a'} f⁻¹(Q̃_t(s_{t+1}, a')). The modified update target Û_t is used for the update in the log-space:

$$\tilde{Q}_{t+1}(s_t, a_t) := \tilde{Q}_t(s_t, a_t) + \beta_{log}\Bigl(f(\hat{U}_t) - \tilde{Q}_t(s_t, a_t)\Bigr). \qquad (9)$$

Note that if β_reg = 1, then Û_t = U_t, and update (9) reduces to update (5) from the previous section, with α = β_log.

The conditions for convergence are discussed in the next section, but one of them is that β_reg should go to 0 in the limit. From a more practical point of view, when using fixed values for β_reg and β_log, β_reg should be set sufficiently small to keep the underestimation of values due to the averaging in the log-space under control. To illustrate this, we plot the RMS error on a positive-reward variant of the chain task (r_R = 0, r_L = +1, p = 0.25). The RMS error is based on the difference between f⁻¹(Q̃(s, a)) and Q*(s, a) over all state–action pairs. We used a tile-width of 1, corresponding to a tabular representation, and used k = 200. Note that for β_reg = 1, which reduces the method to the one from the previous section, the error never comes close to zero.

Figure 6: RMS error over update sweeps, for (β_log, β_reg) = (0.001, 1), (0.01, 0.1), and (1, 0.001).

4.3 Stochastic Domains with Positive and/or Negative Rewards

We now consider the general case where rewards can be positive, negative, or zero. It might seem that we can generalize to negative rewards simply by replacing x in the mapping function (6) by x + D, where D is a sufficiently large constant that prevents x + D from becoming negative. The problem with this approach, as we will demonstrate empirically, is that it does not decrease κ for low discount factors.
Hence, in this section we present an alternative approach, based on decomposing the Q-value function into two functions.

Consider a decomposition of the reward r_t into two components, r⁺_t and r⁻_t, as follows:

$$r^+_t := \begin{cases} r_t & \text{if } r_t \geq 0 \\ 0 & \text{otherwise} \end{cases}; \qquad r^-_t := \begin{cases} |r_t| & \text{if } r_t < 0 \\ 0 & \text{otherwise.} \end{cases} \qquad (10)$$

Note that r⁺_t and r⁻_t are always non-negative and that r_t = r⁺_t − r⁻_t at all times. By decomposing the observed reward in this manner, the two reward components can be used to train two separate Q-value functions: Q̃⁺, which represents the value function in the mapping space corresponding to r⁺_t, and Q̃⁻, which plays the same role for r⁻_t. To train these value functions, we construct the following update targets:

$$U^+_t := r^+_t + \gamma f^{-1}\bigl(\tilde{Q}^+_t(s_{t+1}, \tilde{a}_{t+1})\bigr); \qquad U^-_t := r^-_t + \gamma f^{-1}\bigl(\tilde{Q}^-_t(s_{t+1}, \tilde{a}_{t+1})\bigr), \qquad (11)$$

with $\tilde{a}_{t+1} := \arg\max_{a'} \bigl( f^{-1}(\tilde{Q}^+_t(s_{t+1}, a')) - f^{-1}(\tilde{Q}^-_t(s_{t+1}, a')) \bigr)$. These update targets are modified into Û⁺_t and Û⁻_t, respectively, based on (8), and are then used to update Q̃⁺ and Q̃⁻, respectively, based on (9). Action selection at time t is based on Q_t, which we define as follows:

$$Q_t(s, a) := f^{-1}\bigl(\tilde{Q}^+_t(s, a)\bigr) - f^{-1}\bigl(\tilde{Q}^-_t(s, a)\bigr). \qquad (12)$$

In the supplementary material, we prove convergence of logarithmic Q-learning under conditions similar to those for regular Q-learning. In particular, the product β_log,t · β_reg,t has to satisfy the same conditions as α_t does for regular Q-learning.
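In code, the decomposition (10) and the recombination used for action selection (12) amount to the following sketch; the mapping hyper-parameters are illustrative values, not the paper's tuned settings:

```python
import math

GAMMA, K = 0.96, 100  # illustrative values

def f_inv(y, c=1.0, d=0.0):
    """Inverse of the logarithmic mapping, Eq. (6)."""
    return math.exp((y - d) / c) - GAMMA ** K

def decompose(r):
    """Eq. (10): split a reward into two non-negative components."""
    return (r, 0.0) if r >= 0 else (0.0, -r)

def combined_q(q_plus, q_minus):
    """Eq. (12): regular-space Q-value from the two log-space heads."""
    return f_inv(q_plus) - f_inv(q_minus)

# The components are non-negative and always recombine to the original reward.
for r in (-1.5, 0.0, 2.0):
    r_plus, r_minus = decompose(r)
    assert r_plus >= 0.0 and r_minus >= 0.0
    assert r_plus - r_minus == r
```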
There is one additional condition on β_reg,t: it should go to zero in the limit.

We now compute κ for the full version of the chain task. Because there are now two functions, Q̃⁺ and Q̃⁻, we have to generalize the definition of κ to this situation. We consider three generalizations: 1) κ is based on the action gaps of Q̃⁺ ('log plus-only'); 2) κ is based on the action gaps of Q̃⁻ ('log min-only'); and 3) κ is based on the action gaps of both Q̃⁻ and Q̃⁺ ('log both'). Furthermore, we plot a version that resolves the issue of negative rewards naïvely, by adding a value D = 1 to the input of the log-function ('log bias'). We plot κ for these variants in Figure 7, using k = 200, together with κ for regular Q-learning ('reg').

Figure 7: Action-gap deviation as a function of the discount factor.

Interestingly, only for the 'log plus-only' variant is κ small for all discount factors. Further analysis showed that the reason for this is that, under the optimal policy, the chance that the agent moves from a state close to the positive terminal state all the way to the negative terminal state is very small, which means that k = 200 is too small to make the action gaps for Q̃⁻ homogeneous. However, as we will see in the next section, the performance with k = 200 is good for all discount factors, demonstrating that not having homogeneous action gaps for Q̃⁻ is not a huge issue.
We argue that this could be due to the different nature of positive and negative rewards: it might be worthwhile to travel a long distance to collect a positive reward, but avoiding a negative reward is typically a short-horizon challenge.

5 Experiments

We test our method by returning to the full version of the chain task and the same performance metric F as used in Section 3.2, which measures whether or not the greedy policy is optimal. We used k = 200, β_reg = 0.1, and β_log = 0.01 (the value of β_reg · β_log is equal to the value of α used in Section 3.2). Figure 8 plots the results for early learning as well as the final performance. Comparing these graphs with those of Figure 3 shows that logarithmic Q-learning has successfully resolved the optimization issues of regular Q-learning related to the use of low discount factors in conjunction with function approximation.

Combined with the observation from Section 3.1 that the best discount factor is task-dependent, and with the convergence proof in the supplementary material, which guarantees that logarithmic Q-learning converges to the same policy as regular Q-learning, these results demonstrate that logarithmic Q-learning is able to solve tasks that are challenging for Q-learning: specifically, tasks where a finite-horizon performance metric is used and the metric gap is substantially smaller for lower discount factors, but where performance falls flat for these discount factors due to function approximation.

Finally, we test our approach in a more complex setting by comparing the performance of DQN [Mnih et al., 2015] with a variant of it that implements our method, which we refer to as LogDQN.[3] To enable easy baseline comparisons, we used the Dopamine framework for our experiments [Castro et al., 2018].
This framework not only contains open-source code of several important deep RL methods, but also the results obtained with these methods for a set of 60 games from the Arcade Learning Environment [Bellemare et al., 2013, Machado et al., 2018]. This means that direct comparison to some important baselines is possible.

Figure 8: Early performance (top) and final performance (bottom) on the chain task.

Our LogDQN implementation consists of a modification of Dopamine's DQN code. Specifically, in order to adapt DQN's model to provide estimates of both Q̃+ and Q̃−, the final output layer is doubled in size: half of it is used to estimate Q̃+ while the other half estimates Q̃−. All the other layers are shared between Q̃+ and Q̃− and remain unchanged. Because both Q̃+ and Q̃− are updated using the same samples, the replay memory does not require modification, so the memory footprint does not change. Furthermore, because Q̃+ and Q̃− are updated simultaneously using a single pass through the model, the computational cost of LogDQN is similar to that of DQN. Further implementation details are provided in the supplementary material.

The published Dopamine baselines are obtained on a stochastic version of Atari using sticky actions [Machado et al., 2018], where with 25% probability the environment executes the action from the previous time step instead of the agent's new action.
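The output-layer doubling described above can be sketched as follows. This is a hedged NumPy illustration, not the actual Dopamine/TensorFlow code: a shared feature vector (standing in for the convolutional trunk) feeds one final linear layer of doubled width, whose output is split into a Q̃+ half and a Q̃− half, so both heads are obtained in a single pass.

```python
import numpy as np

n_actions = 4   # illustrative action count
n_features = 8  # illustrative size of the shared penultimate layer
rng = np.random.default_rng(0)

# Final linear layer of doubled size: first half -> Q~+, second half -> Q~-.
W = rng.normal(size=(n_features, 2 * n_actions))
b = np.zeros(2 * n_actions)

def two_head_output(features):
    """One pass through the doubled output layer, split into the two heads."""
    out = features @ W + b       # shape: (2 * n_actions,)
    q_plus = out[:n_actions]     # estimates of Q~+ (positive-reward component)
    q_minus = out[n_actions:]    # estimates of Q~- (negative-reward component)
    return q_plus, q_minus

features = rng.normal(size=n_features)  # stand-in for shared-layer activations
q_plus, q_minus = two_head_output(features)
```

Because the two heads share all earlier layers and are produced by the same forward pass, the per-step compute matches plain DQN up to the slightly wider output layer, consistent with the similar-cost observation above.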
Hence, we conducted all our LogDQN experiments on this stochastic version of Atari as well.

While Dopamine provides baselines for 60 games in total, we only consider the subset of 55 games for which human scores have been published, because only for these games can a 'human-normalized score' be computed, which is defined as:

(ScoreAgent − ScoreRandom) / (ScoreHuman − ScoreRandom).    (13)

We use Table 2 from Wang et al. [2016] to retrieve the human and random scores.

We optimized hyper-parameters using a subset of 6 games. In particular, we performed a scan over the discount factor γ between γ = 0.84 and γ = 0.99. For DQN, γ = 0.99 was optimal; for LogDQN, the best value in this range was γ = 0.96. We tried lower γ values as well, such as γ = 0.1 and γ = 0.5, but this did not improve the overall performance over these 6 games. For the other hyper-parameters of LogDQN we used k = 100, c = 0.5, βlog = 0.0025, and βreg = 0.1. The product of βlog and βreg is 0.00025, which is the same value as the (default) step-size α of DQN. We used different values of d for the positive and negative heads: we set d based on (7) with qinit = 1 for the positive head, and qinit = 0 for the negative head.

Figure 10: Relative performance of LogDQN w.r.t. DQN (positive percentage means LogDQN outperforms DQN). Orange bars indicate a performance difference larger than 50%; dark-blue bars indicate a performance difference between 10% and 50%; light-blue bars indicate a performance difference smaller than 10%.

³ The code for the experiments can be found at: https://github.com/microsoft/logrl
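Equation (13) above is straightforward to compute; the helper below is a minimal sketch, with purely illustrative score values (not results from the paper).

```python
def human_normalized_score(agent, human, random):
    """Human-normalized score, Eq. (13): (agent - random) / (human - random)."""
    return (agent - random) / (human - random)

# Illustrative numbers: an agent halfway between random and human play scores 0.5.
score = human_normalized_score(agent=550.0, human=1000.0, random=100.0)
```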
Results from the hyper-parameter optimization, as well as further implementation details, are provided in the supplementary material.

Figure 10 shows the performance of LogDQN compared to DQN per game, using the same comparison equation as used by Wang et al. [2016]:

(ScoreLogDQN − ScoreDQN) / (max(ScoreDQN, ScoreHuman) − ScoreRandom),

where ScoreLogDQN/DQN is computed by averaging over the last 10% of each learning curve (i.e., the last 20 epochs).

Figure 9 shows the mean and median of the human-normalized scores of LogDQN, as well as DQN. We also plot the performance of the other baselines that Dopamine provides: C51 [Bellemare et al., 2017], Implicit Quantile Networks [Dabney et al., 2018], and Rainbow [Hessel et al., 2018]. These baselines are just for reference; we have not attempted to combine our technique with the techniques that these other baselines make use of.

6 Discussion and Future Work

Our results provide strong evidence for our hypothesis that large differences in action-gap sizes are detrimental to the performance of approximate RL. A possible explanation could be that optimizing on the L2-norm (3) might drive towards an average squared error that is similar across the state-space. However, the error-landscape required to bring the approximation error below the action-gap everywhere has a very different shape if the action-gap differs by orders of magnitude across the state-space. This mismatch between the required error-landscape and the one produced by the L2-norm might lead to an ineffective use of the function approximator. Further experiments are needed to confirm this.

The strong performance we observed for γ = 0.96 in the deep RL setting is unlikely to be solely due to a difference in metric gap. We suspect that there are also other effects at play that make LogDQN as effective as it is.
On the other hand, at (much) lower discount factors, the performance was not as good as it was for the high discount factors. We believe a possible reason could be that, since such low values are very different from the original DQN settings, some of the other DQN hyper-parameters might no longer be ideal in the low-discount-factor region. An interesting future direction would be to re-evaluate those hyper-parameters in this region.

Figure 9: Human-normalized mean (left) and median (right) scores on 55 Atari games for LogDQN and various other algorithms.

Acknowledgments

We would like to thank Kimia Nadjahi for her contributions to a convergence proof of an early version of logarithmic Q-learning. This early version was ultimately replaced by a significantly improved version that required a different convergence proof.

References

Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Martin L. Puterman.
Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 1994.

Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3):279–292, 1992.

Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. In Advances in Neural Information Processing Systems, pages 703–710, 1994.

Fabio Pardo, Arash Tavakoli, Vitaly Levdik, and Petar Kormushev. Time limits in reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 4045–4054, 2018.

Richard S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems, pages 1038–1044, 1996.

Marc G. Bellemare, Georg Ostrovski, Arthur Guez, Philip Thomas, and Rémi Munos. Increasing the action gap: New operators for reinforcement learning. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, pages 1476–1483, 2016.

Amir-massoud Farahmand. Action-gap phenomenon in reinforcement learning. In Advances in Neural Information Processing Systems, pages 172–180, 2011.

Tobias Pohlen, Bilal Piot, Todd Hester, Mohammad Gheshlaghi Azar, Dan Horgan, David Budden, Gabriel Barth-Maron, Hado van Hasselt, John Quan, Mel Večerík, Matteo Hessel, Rémi Munos, and Olivier Pietquin. Observe and look further: Achieving consistent performance on Atari. arXiv preprint arXiv:1805.11593, 2018.

Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G. Bellemare. Dopamine: A research framework for deep reinforcement learning. arXiv preprint arXiv:1812.06110, 2018.

Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents.
Journal of Artificial Intelligence Research, 47:253–279, 2013.

Marlos C. Machado, Marc G. Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2018.

Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, volume 48, pages 1995–2003, 2016.

Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 449–458, 2017.

Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for distributional reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 1096–1105, 2018.

Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pages 3215–3222, 2018.