{"title": "Biasing Approximate Dynamic Programming with a Lower Discount Factor", "book": "Advances in Neural Information Processing Systems", "page_first": 1265, "page_last": 1272, "abstract": "Most algorithms for solving Markov decision processes rely on a discount factor, which ensures their convergence. In fact, it is often used in problems with is no intrinsic motivation. In this paper, we show that when used in approximate dynamic programming, an artificially low discount factor may significantly improve the performance on some problems, such as Tetris. We propose two explanations for this phenomenon. Our first justification follows directly from the standard approximation error bounds: using a lower discount factor may decrease the approximation error bounds. However, we also show that these bounds are loose, a thus their decrease does not entirely justify a better practical performance. We thus propose another justification: when the rewards are received only sporadically (as it is the case in Tetris), we can derive tighter bounds, which support a significant performance increase with a decrease in the discount factor.", "full_text": "Biasing Approximate Dynamic Programming with a\n\nLower Discount Factor\n\nMarek Petrik\n\nDepartment of Computer Science\n\nUniversity of Massachusetts Amherst\n\nAmherst, MA 01003\n\npetrik@cs.umass.edu\n\nBruno Scherrer\n\nLORIA Campus Scienti\ufb01que B.P. 239\n54506 Vandoeuvre-les-Nancy, France\n\nbruno.scherrer@loria.fr\n\nAbstract\n\nMost algorithms for solving Markov decision processes rely on a discount factor,\nwhich ensures their convergence. It is generally assumed that using an arti\ufb01cially\nlow discount factor will improve the convergence rate, while sacri\ufb01cing the solu-\ntion quality. We however demonstrate that using an arti\ufb01cially low discount factor\nmay signi\ufb01cantly improve the solution quality, when used in approximate dynamic\nprogramming. We propose two explanations of this phenomenon. 
The \ufb01rst jus-\nti\ufb01cation follows directly from the standard approximation error bounds: using\na lower discount factor may decrease the approximation error bounds. However,\nwe also show that these bounds are loose, thus their decrease does not entirely\njustify the improved solution quality. We thus propose another justi\ufb01cation: when\nthe rewards are received only sporadically (as in the case of Tetris), we can derive\ntighter bounds, which support a signi\ufb01cant improvement in the solution quality\nwith a decreased discount factor.\n\n1 Introduction\n\nApproximate dynamic programming methods often offer surprisingly good performance in practical\nproblems modeled as Markov Decision Processes (MDP) [6, 2]. To achieve this performance, the\nparameters of the solution algorithms typically need to be carefully tuned. One such important pa-\nrameter of MDPs is the discount factor \u03b3. Discount factors are important in in\ufb01nite-horizon MDPs,\nin which they determine how the reward is counted. The motivation for the discount factor originally\ncomes from economic models, but has often no meaning in reinforcement learning problems. Nev-\nertheless, it is commonly used to ensure that the rewards are bounded and that the Bellman operator\nis a contraction [8]. In this paper, we focus on the quality of the solutions obtained by approximate\ndynamic programming algorithms. For simplicity, we disregard the computational time, and use\nperformance to refer to the quality of the solutions that are eventually obtained.\nIn addition to regularizing the rewards, using an arti\ufb01cially low discount factor sometimes has a\nsigni\ufb01cant effect on the performance of the approximate algorithms. Speci\ufb01cally, we have observed\na signi\ufb01cant improvement of approximate value iteration when applied to Tetris, a common rein-\nforcement learning benchmark problem. 
The natural discount factor in Tetris is 1, since the received\nrewards have the same importance, independently of when received. Currently, the best results\nachieved with approximate dynamic programming algorithms are on average about 6000 lines re-\nmoved in a single game [4, 3]. Our results, depicted in Figure 1, with approximate value iteration\nand standard features [1] show that setting the discount factor to \u03b3 \u2208 (0.84, 0.88) gives the best\nexpected total number of removed lines, a bit more than 20000. That is \ufb01ve times the performance\nwith discount factor of \u03b3 = 1 (about 4000). The improved performance for \u03b3 \u2208 (0.84, 0.88) is sur-\nprising, since computing a policy for this discount factor dramatically improves the return calculated\nwith \u03b3 = 1.\n\n\fFigure 1: Performance of approximate value iteration on Tetris with different discount factors. For\neach value of \u03b3, we ran the experiments 10 times and recorded the evolution of the score (the\nevaluation of the policy with \u03b3 = 1) on the 100 games, and averaged over 10 learning runs.\n\nIn this paper, we study why using a lower discount factor improves the quality of the solution with\nregard to a higher discount factor. First, in Section 2, we de\ufb01ne the framework for our analysis.\nIn Section 3 we analyze the in\ufb02uence of the discount factor on the standard approximation error\nbounds [2]. Then in Section 4 we argue that, in the context of this paper, the existing approximation\nerror bounds are loose. Though these bounds may be tightened by a lower discount factor, they are\nnot suf\ufb01cient to explain the improved performance. Finally, to explain the improved performance,\nwe identify a speci\ufb01c property of Tetris in Section 5 that enables the improvement. 
In particular, the rewards in Tetris are received sparsely, while the approximation error is incurred in every step; this makes the value function less sensitive to the discount factor than the approximation error is.

2 Framework and Notations

In this section we formalize the problem of adjusting the discount factor in approximate dynamic programming. We assume γ-discounted infinite-horizon problems with γ < 1. Tetris does not directly fit in this class, since its natural discount factor is 1. It has been shown, however, that undiscounted infinite-horizon problems with a finite total reward can be treated as discounted problems [7]. Blackwell optimality implies that there exists γ* < 1 such that for all γ > γ* the γ-discounted problem and the undiscounted problem have the same optimal policy. We therefore treat Tetris as a discounted problem with a discount factor γ* < 1 near one. The analysis is based on Markov decision processes, defined as follows.

Definition 1. A Markov decision process is a tuple (S, A, P, r), where S is the set of states, A is the set of actions, P : S × S × A → [0, 1] is the transition function (P(s′, s, a) is the probability of transitioning to state s′ from state s given action a), and r : S × A → R+ is a (non-negative) reward function.

We assume that the number of states and actions is finite, but possibly very large. For the sake of simplicity, we also assume that the rewards are non-negative; our analysis can be extended to arbitrary rewards in a straightforward way. We write ‖r‖∞ to denote the maximal reward for any action and state.

Given a Markov decision process (S, A, P, r) and some discount factor γ, the objective is to find a policy, i.e. a mapping π : S → A, with the maximal value from any initial state s.
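Definition 1 and the policy-evaluation objective can be made concrete with a short sketch (the two-state transition and reward tables below are our own toy illustration, not from the paper):

```python
# Toy MDP in the spirit of Definition 1 (hypothetical data, not from the paper).
# P[a][s] is the distribution over next states after taking action a in state s.
P = {0: [[1.0, 0.0], [1.0, 0.0]],   # action 0: always move to state 0
     1: [[0.0, 1.0], [0.0, 1.0]]}   # action 1: always move to state 1
r = {0: [0.0, 1.0], 1: [0.0, 1.0]}  # r[a][s]: reward for action a in state s

def evaluate_policy(pi, gamma, iters=2000):
    """Fixed-point iteration for v = r_pi + gamma * P_pi v."""
    n = len(pi)
    v = [0.0] * n
    for _ in range(iters):
        v = [r[pi[s]][s] + gamma * sum(P[pi[s]][s][t] * v[t] for t in range(n))
             for s in range(n)]
    return v

# The policy that always plays action 1 loops in state 1 and earns 1 forever,
# so its value at state 1 is 1 / (1 - gamma).
v = evaluate_policy([1, 1], gamma=0.9)
```

With γ = 0.9 this gives v ≈ (9, 10), i.e. 1/(1 − γ) = 10 at the rewarding state and γ/(1 − γ) = 9 one step away.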
The value vπ(s) of π from state s is defined as the γ-discounted infinite-horizon return:

vπ(s) := E[ Σ_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s, a_0 = π(s_0), ..., a_t = π(s_t) ].

It is well known [7, 2] that this problem can be solved by computing the optimal value function v*, which is the fixed point of the Bellman operator Lv = max_π (r_π + γ P_π v). Here r_π is the vector on S with components r(s, π(s)), and P_π is the stochastic matrix associated with a policy π.

We consider in this paper that the MDP is solved with 1) an approximate dynamic programming algorithm and 2) a different discount factor β < γ. In particular, our analysis applies to approximate value and policy iteration with existing error bounds. These methods invariably generate a sequence of approximate value functions, which we denote ṽ_β. Then, π_β is a policy greedy with respect to the approximate value function ṽ_β.

As we have two different discount factors, we use a subscript to denote the discount factor used in calculating the value. Let δ be a discount factor and π any policy. We use v^π_δ to represent the value of policy π calculated with the discount factor δ; when π is the optimal policy corresponding to the discount δ, we simply denote its value v_δ. As mentioned above, our objective is to compare, for the discount factor γ, the value v_γ of the optimal policy and the value v^{π_β}_γ. Here, π_β is the policy derived from the approximate β-discounted value function. The following shows how this error may be decomposed in order to simplify the analysis.
Most of our analysis is stated in terms of the L∞ norm, mainly because this is the most common measure used in the existing error bounds. The results could be extended to the L2 norm in a rather straightforward way without a qualitative difference.

From the optimality of v_γ we have v_γ ≥ v^{π_β}_γ, and from the non-negativity of the rewards it is easy to show that the value function is monotone with respect to the discount factor, and therefore v^{π_β}_γ ≥ v^{π_β}_β. Thus 0 ≤ v_γ − v^{π_β}_γ ≤ v_γ − v^{π_β}_β, and consequently:

e(β) := ‖v_γ − v^{π_β}_γ‖∞ ≤ ‖v_γ − v^{π_β}_β‖∞ ≤ ‖v_γ − v_β‖∞ + ‖v_β − v^{π_β}_β‖∞ = e_d(β) + e_a(β),

where e_d(β) := ‖v_γ − v_β‖∞ denotes the discount error and e_a(β) := ‖v_β − v^{π_β}_β‖∞ the approximation error. In other words, a bound on the loss due to using π_β instead of the optimal policy for discount factor γ is the sum of the error on the optimal value function due to the change of discount and the error due to the approximation for discount β. In the remainder of the paper, we analyze each of these error terms.

3 Error Bounds

In this section, we develop a discount error bound and review the existing approximation error bounds. We also show how these bounds motivate decreasing the discount factor in the majority of MDPs. First, we bound the discount error as follows.

Theorem 2.
The discount error due to using a discount factor β instead of γ satisfies:

e_d(β) = ‖v_γ − v_β‖∞ ≤ (γ − β) / ((1 − β)(1 − γ)) · ‖r‖∞.

Proof. Let L_γ and L_β be the Bellman operators for the corresponding discount factors. We have:

‖v_γ − v_β‖∞ = ‖L_γ v_γ − L_β v_β‖∞ = ‖L_γ v_γ − L_β v_γ + L_β v_γ − L_β v_β‖∞
≤ ‖L_γ v_γ − L_β v_γ‖∞ + ‖L_β v_γ − L_β v_β‖∞ ≤ ‖L_γ v_γ − L_β v_γ‖∞ + β ‖v_γ − v_β‖∞.

Let P_γ, r_γ and P_β, r_β be the transition matrices and rewards of the policies greedy with respect to v_γ for the discount factors γ and β respectively. Then:

L_γ v_γ − L_β v_γ = (γ P_γ v_γ + r_γ) − (β P_β v_γ + r_β) ≤ (γ − β) P_γ v_γ,
L_γ v_γ − L_β v_γ = (γ P_γ v_γ + r_γ) − (β P_β v_γ + r_β) ≥ (γ − β) P_β v_γ.

Finally, the bound follows from the above, using ‖v_γ‖∞ ≤ ‖r‖∞/(1 − γ):

‖v_γ − v_β‖∞ ≤ (1/(1 − β)) max{ ‖(γ − β) P_γ v_γ‖∞, ‖(γ − β) P_β v_γ‖∞ } ≤ (γ − β) / ((1 − γ)(1 − β)) · ‖r‖∞.

Remark 3. This bound is trivially tight; that is, there exists a problem for which the bound holds with equality.
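Theorem 2 can be sanity-checked numerically; the following sketch (a random test MDP of our own, solved by value iteration) verifies the bound on one instance:

```python
import random

def random_mdp(n_states=5, n_actions=3, seed=0):
    """Random MDP with rewards in [0, 1] (a made-up test instance)."""
    rng = random.Random(seed)
    P = [[None] * n_states for _ in range(n_actions)]
    for a in range(n_actions):
        for s in range(n_states):
            w = [rng.random() for _ in range(n_states)]
            tot = sum(w)
            P[a][s] = [x / tot for x in w]
    r = [[rng.random() for _ in range(n_actions)] for _ in range(n_states)]
    return P, r

def optimal_value(P, r, delta, iters=3000):
    """Value iteration: v <- max_a (r_a + delta * P_a v)."""
    n, m = len(r), len(r[0])
    v = [0.0] * n
    for _ in range(iters):
        v = [max(r[s][a] + delta * sum(P[a][s][t] * v[t] for t in range(n))
                 for a in range(m)) for s in range(n)]
    return v

gamma, beta = 0.9, 0.8
P, r = random_mdp()
v_gamma, v_beta = optimal_value(P, r, gamma), optimal_value(P, r, beta)
r_max = max(max(row) for row in r)
gap = max(abs(x - y) for x, y in zip(v_gamma, v_beta))
bound = (gamma - beta) / ((1 - beta) * (1 - gamma)) * r_max  # Theorem 2
assert gap <= bound
```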
It is, however, also straightforward to construct a problem in which the bound is not tight.

Figure 2: Example of the function e(β) in a problem with γ = 0.9, ε = 0.01, and ‖r‖∞ = 10.

Figure 3: The dependence of ε on γ needed for the improvement in Proposition 6.

3.1 Approximation Error Bound

We now discuss the dependence of the approximation error e_a(β) on the discount factor β. Approximate dynamic programming algorithms like approximate value and policy iteration build a sequence of value functions (ṽ^k_β)_{k≥0}, with π^k_β being the policy greedy with respect to ṽ^k_β. These algorithms are approximate because at each iteration the value ṽ^k_β is an approximation of some target value v^k_β, which is hard to compute. The analysis of [2] (see Section 6.5.3 and Proposition 6.1 for value iteration, and Proposition 6.2 for policy iteration) bounds the loss of using the policies π^k_β instead of the optimal policy:

lim sup_{k→∞} ‖v_β − v^{π^k_β}_β‖∞ ≤ (2β / (1 − β)²) · sup_k ‖ṽ^k_β − v^k_β‖∞.   (1)

To completely describe how Eq. (1) depends on the discount factor, we need to bound the one-step approximation error ‖ṽ^k_β − v^k_β‖∞ in terms of β. Though this specific error depends on the particular approximation framework used and is in general difficult to estimate, we propose to make the following assumption.

Assumption 4.
There exists ε ∈ (0, 1/2) such that for all k, the single-step approximation error is bounded by:

‖ṽ^k_β − v^k_β‖∞ ≤ (ε / (1 − β)) ‖r‖∞.

We consider only ε ≤ 1/2 because the above assumption holds with ε = 1/2 for the trivial constant approximation ṽ^k_β = ‖r‖∞ / (2(1 − β)).

Remark 5. As an alternative to Assumption 4, we could assume that the approximation error is constant in the discount factor β, i.e. ‖ṽ^k_β − v^k_β‖∞ ≤ ε = O(1) for some ε and all β. We believe that such a bound is unlikely in practice. To see this, consider an MDP with two states s0 and s1 and a single action. The transitions loop from each state to itself, and the rewards are r(s0) = 0 and r(s1) = 1. Assume a linear least-squares approximation with basis M = [1/√2; 1/√2]. The approximation error in terms of β is 1/(2(1 − β)) = O(1/(1 − β)).

If Assumption 4 holds, we see from Eq. (1) that the approximation error e_a is bounded as:

e_a(β) ≤ (2β / (1 − β)³) ε ‖r‖∞.

3.2 Global Error Bound

Using the results above, and assuming that Assumption 4 holds, the cumulative error bound when using approximate dynamic programming with a discount factor β < γ is:

e(β) ≤ e_d(β) + e_a(β) ≤ (γ − β) / ((1 − β)(1 − γ)) ‖r‖∞ + (2β / (1 − β)³) ε ‖r‖∞.

An example of this error bound is shown in Figure 2: the bound is minimized for β ≈ 0.8 < γ. This is because the approximation error bound decreases rapidly in comparison with the increase of the discount error bound.
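The minimization behind Figure 2 is easy to reproduce (a sketch with the same constants γ = 0.9, ε = 0.01, ‖r‖∞ = 10; the grid search is our own):

```python
# Evaluate the global bound e(beta) <= ed(beta) + ea(beta) and locate its minimum.
gamma, eps, r_max = 0.9, 0.01, 10.0

def e_bound(beta):
    ed = (gamma - beta) / ((1 - beta) * (1 - gamma)) * r_max  # discount error bound
    ea = 2 * beta / (1 - beta) ** 3 * eps * r_max             # approximation error bound
    return ed + ea

grid = [i / 10000 for i in range(0, 8999)]  # beta in [0, gamma)
best = min(grid, key=e_bound)
# The minimizer falls strictly below gamma (around 0.77-0.78 here), as in Figure 2.
```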
More generally, the following proposition suggests how we should choose β.

Proposition 6. If the approximation factor ε introduced in Assumption 4 is sufficiently large, precisely if ε > (1 − γ)² / (2(1 + 2γ)), then the best error bound e(β) is achieved for the discount factor β− = (2ε + 1) − √((2ε + 1)² + (2ε − 1)) < γ.

Figure 3 shows the approximation error fraction necessary for the improvement. Notice that the fraction decreases rapidly as γ → 1.

Proof. The minimum of β ↦ e(β) can be derived analytically by taking its derivative:

e′(β) = [−(1 − β)^{−2} + 2ε(1 − β)^{−3} + 6βε(1 − β)^{−4}] ‖r‖∞
      = [−(1 − β)² + 2ε(1 − β) + 6βε] / (1 − β)⁴ · ‖r‖∞
      = [−β² + 2(2ε + 1)β + 2ε − 1] / (1 − β)⁴ · ‖r‖∞.

So we want to know when −β² + 2(2ε + 1)β + (2ε − 1) equals 0. The discriminant Δ = (2ε + 1)² + (2ε − 1) = 2ε(2ε + 3) is always positive. Therefore e′(β) equals 0 at the points β− = (2ε + 1) − √Δ and β+ = (2ε + 1) + √Δ, is positive in between, and is negative outside. This means that β− is a local minimum of e and β+ a local maximum.

It is clear that β+ > 1 > γ. From the definition of Δ and the fact (cf. Assumption 4) that ε ≤ 1/2, we see that β− ≥ 0.
Then, the condition β− < γ is satisfied if and only if:

β− < γ ⇔ (2ε + 1) − √((2ε + 1)² + (2ε − 1)) < γ
⇔ 1 − γ/(2ε + 1) < √(1 + (2ε − 1)/(2ε + 1)²)
⇔ 1 − 2γ/(2ε + 1) + γ²/(2ε + 1)² < 1 + (2ε − 1)/(2ε + 1)²
⇔ −2γ(2ε + 1) + γ² < 2ε − 1
⇔ (1 − γ)²/(1 + 2γ) < 2ε,

where squaring preserves the inequality since both sides are positive.

4 Bound Tightness

We show in this section that the bounds on the approximation error e_a(β) are very loose for β → 1, and thus the analysis above does not fully explain the improved performance. In particular, there exists a naive bound on the approximation error that is dramatically tighter than the standard bounds when β is close to 1.

Lemma 7. There exists a constant c ∈ R+ such that for all β we have ‖v_β − ṽ_β‖∞ ≤ c/(1 − β).

Proof. Let P*, r* and P̂, r̂ be the transition and reward functions of the optimal and the approximate policies respectively. These may depend on the discount factor, but we omit that dependence to simplify the notation.
Then the approximation error is:

‖v_β − v̂_β‖∞ = ‖(I − βP*)⁻¹ r* − (I − βP̂)⁻¹ r̂‖∞ ≤ (1/(1 − β)) (‖r*‖∞ + ‖r̂‖∞).

Thus setting c = 2 max_π ‖r_π‖∞ proves the lemma.

Lemma 7 implies that for every MDP there exists a discount factor β such that Eq. (1) is not tight. Suppose even that the single-step approximation error is bounded by a constant, such that lim sup_{k→∞} ‖ṽ^k_β − v^k_β‖∞ ≤ ε. This is impractical, as discussed in Remark 5, but it tightens the bound to e_a(β) ≤ 2βε/(1 − β)². From Lemma 7, even this bound is loose when 2βε/(1 − β)² > c/(1 − β). Thus there exists β < 1 for which the standard approximation error bounds are loose, whenever ε > 0. The looseness of the bound is more apparent in problems with high discount factors. For example, in the MDP formulation of Blackjack [5] the discount factor is γ = 0.999, in which case the error bound may overestimate the true error by a factor of up to 1/(1 − γ) = 1000.

The looseness of the approximation error bounds may seem to contradict Example 6.4 in [2], which shows that Eq. (1) is tight. The discrepancy arises because in our analysis we assume that the MDP has fixed rewards and a fixed number of states, while the example in [2] assumes that the reward depends on the discount factor and the number of states is potentially infinite.

Figure 4: Looseness of the Bellman error bound.

Figure 5: Bellman error bound as a function of β for a problem with γ = 0.9.

Figure 6: The approximation error with a = ṽ_β and b = v_γ.
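The 1/(1 − β) scaling in Lemma 7 can be illustrated with the two-state chain of Remark 5 (a sketch; the least-squares fit onto the constant basis reduces to taking the mean of the exact values):

```python
# Remark 5's chain: two self-looping states with rewards (0, 1), so the exact
# values are v = (0, 1/(1-beta)). The least-squares fit with the constant basis
# [1/sqrt(2); 1/sqrt(2)] is the constant vector equal to the mean of v.
c = 2 * 1.0  # c = 2 * max_pi ||r_pi||_inf, as in Lemma 7
for beta in [0.5, 0.9, 0.99, 0.999]:
    v = [0.0, 1.0 / (1.0 - beta)]
    fit = sum(v) / 2.0
    approx_err = max(abs(x - fit) for x in v)  # equals 1/(2(1-beta))
    assert approx_err <= c / (1.0 - beta)      # naive O(1/(1-beta)) bound
```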
Another way to put it is to say that Example 6.4 shows that for any discount factor β there exists an MDP (which depends on β) for which the bound Eq. (1) is tight. We, on the other hand, show that there does not exist a fixed MDP such that the bound Eq. (1) is tight for all discount factors β.

Proposition 6 justifies the improved performance with a lower discount factor by a more rapid decrease of e_a with β than the corresponding increase of e_d. The naive bound from Lemma 7, however, shows that e_a may scale with 1/(1 − β), the same as e_d. As a result, while the approximation error will decrease, the decrease may not be sufficient to offset the increase in the discount error.

Some of the standard approximation error bounds may be tightened by using a lower discount factor. For example, consider the standard a-posteriori approximation error bound for the value function ṽ_β [7]:

‖v_β − v^{π̃_β}_β‖∞ ≤ (1/(1 − β)) ‖L_β ṽ_β − ṽ_β‖∞,

where π̃_β is greedy with respect to ṽ_β. This bound is widely used and known as the Bellman error bound. The following example demonstrates that the Bellman error bound may also be loose for β close to 1. Consider a two-state, two-action problem with one policy per action, where each row of a transition matrix gives the transition probabilities from one state:

P1 = ( 1 0 ; 0 1 ),  r1 = ( 1  2 ),
P2 = ( 0 1 ; 0 1 ),  r2 = ( 2  2 ).

Assume that the current value function is the value of the policy with transition matrix and reward P1, r1, while the optimal policy has transition matrix and reward P2, r2. The looseness of the bound is depicted in Figure 4. The approximation error bound scales with 1/(1 − γ)², while the true error scales with 1/(1 − γ). As a result, for γ = 0.999, the bound is 1000 times the true error in this example.
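The looseness in this example can be checked directly at γ = 0.999 (a sketch; we take P1 as the identity on the two states and P2 as sending both states to the second state, with r1 = (1, 2) and r2 = (2, 2), consistent with the stated scalings):

```python
# Bellman error bound vs. true error for the two-policy example, gamma = 0.999.
gamma = 0.999
v1 = [1.0 / (1.0 - gamma), 2.0 / (1.0 - gamma)]  # value of policy 1 (self-loops)
v2 = [2.0 / (1.0 - gamma), 2.0 / (1.0 - gamma)]  # value of policy 2 (optimal)

# Bellman residual of v1: in state 0, switching to action 2 looks much better;
# in state 1, both actions lead back to state 1 with reward 2.
res_0 = max(1.0 + gamma * v1[0], 2.0 + gamma * v1[1]) - v1[0]
res_1 = (2.0 + gamma * v1[1]) - v1[1]
residual = max(res_0, res_1)

bellman_bound = residual / (1.0 - gamma)              # scales as 1/(1-gamma)^2
true_error = max(abs(a - b) for a, b in zip(v2, v1))  # scales as 1/(1-gamma)
ratio = bellman_bound / true_error                    # about 1/(1-gamma) = 1000
```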
The intuitive reason for the looseness of the bound is that the bound treats each state as recurrent, even when it is transient.

The global error bound may also be tightened by using a lower discount factor β as follows:

‖v_γ − v^{π̃_β}_γ‖∞ ≤ (1/(1 − β)) ‖L_β ṽ_β − ṽ_β‖∞ + (γ − β) / ((1 − β)(1 − γ)) ‖r‖∞.

Finding the discount factor β that minimizes this error is difficult, because the function may not be convex or differentiable; thus the most practical approach is a sub-gradient optimization method. The global error bound for the MDP example above is depicted in Figure 5.

5 Sparse Rewards

In this section, we propose an alternative explanation for the performance improvement in Tetris that does not rely on the loose approximation error bounds. A specific property of Tetris is that the rewards are not received in every step, i.e. they are sparse. The value function, on the other hand, is approximated in every step. As a result, the return should be less sensitive to the discount factor than the approximation error. Decreasing the discount factor will thus reduce the approximation error more significantly than it increases the discount error. The following assumption formalizes this intuition.

Assumption 8 (Sparse rewards). There exists an integer q such that for all m ≥ 0 and all instantiations r_i with non-zero probability: Σ_{i=0}^m r_i ≤ ⌊m/q⌋ and r_i ∈ {0, 1}.

Now define u_β = Σ_{i=0}^∞ β^i t_i, where t_i = 1 when i ≡ 0 (mod q) and t_i = 0 otherwise; note that u_β = 1/(1 − β^q). Then let I_m = {i : r_i = 1, i ≤ m} and J_m = {j : t_j = 1, j ≤ m}, and let I = I_∞ and J = J_∞.
From the definition, these two sets satisfy |I_m| ≤ |J_m|. First we show the following lemma.

Lemma 9. Given the sets I_m and J_m, there exists an injective function f : I → J such that f(i) ≤ i.

Proof. By induction on m. The base case m = 0 is trivial. For the inductive case, consider the following two cases. 1) r_{m+1} = 0: by the inductive assumption there exists an injective function that maps I_m to J_m, and this is also an injective function that maps I_{m+1} = I_m to J_{m+1}. 2) r_{m+1} = 1: let j* = max J_{m+1}. If j* = m + 1, then the function f : I_m → J_m can be extended by setting f(m + 1) = j*. If j* ≤ m, then since |J_{m+1}| − 1 = |J_{j*−1}| ≥ |I_m|, such an injective function exists by the inductive assumption.

In the following, let R_i be the random variable representing the reward received in step i. It is possible to prove that the discount error scales with a coefficient lower than the one in Theorem 2:

Theorem 10. Let β ≤ γ − φ, let k = −log(1 − γ) / (log(γ) − log(γ − φ)), and let ρ = E[Σ_{i=0}^k γ^i R_i]. Then, assuming the reward structure defined in Assumption 8, we have:

‖v_γ − v_β‖∞ ≤ γ^k ‖u_γ − u_β‖∞ + ρ ≤ γ^k (γ^q − β^q) / ((1 − γ^q)(1 − β^q)) + ρ.

Proof. Let π be the optimal policy for the discount factor γ. Then we have 0 ≤ v_γ − v_β ≤ v^π_γ − v^π_β. In the remainder of the proof, we drop the superscript π for simplicity; that is, v_β = v^π_β, not the optimal value function. Intuitively, the proof is based on "moving" the rewards to earlier steps to obtain a regular reward structure.
A small technical problem with this approach is that moving the rewards that are close to the initial time step decreases the bound. Therefore, we treat these rewards separately within the constant ρ. First, we show that γ^i − β^i ≤ γ^{f(i)} − β^{f(i)} whenever f(i) ≥ k. Let j = f(i) = i − d for some d ≥ 0. Then γ^j − β^j ≥ γ^{j+d} − β^{j+d} holds whenever

j ≥ max_{β ∈ [0, γ−φ]} [log(1 − β^d) − log(1 − γ^d)] / (log(γ) − log(β)),

and this maximum is at most −log(1 − γ^d) / (log(γ) − log(γ − φ)) ≤ k, with the maximization used to obtain a sufficient condition independent of β. Since the function f maps at most ⌊k/q⌋ values of I_m to indices j < k, there is a set I_z with |I_z| ≤ ⌊k/q⌋ such that f(x) ≥ k for all x ∈ I_m \ I_z. Then we have:

0 ≤ v_γ − v_β = lim_{m→∞} E[ Σ_{i ∈ I_m} (γ^i − β^i) ] ≤ ρ + lim_{m→∞} E[ Σ_{i ∈ I_m \ I_z} (γ^{f(i)} − β^{f(i)}) t_{f(i)} ] ≤ ρ + Σ_{j=k}^∞ (γ^j − β^j) t_j ≤ ρ + γ^k (u_γ − u_β).

Because the playing board in Tetris is 10 squares wide and each piece has 4 squares, it takes on average 2.5 moves to remove a line. Since Theorem 10 applies only to integer values of q, we use a Tetris formulation in which dropping each piece requires two steps. A proper Tetris action is taken in the first step, and there is no action in the second one. To make this model identical to the original formulation, we change the discount factor to γ^{1/2}. Then the upper bound from Theorem 10 on the discount error is:

‖v_γ − v_β‖∞ ≤ γ^k (γ^{2.5} − β^{2.5}) / ((1 − γ^{2.5})(1 − β^{2.5})) + ρ.

Notice that ρ is a constant; it is independent of the new discount factor β.

The sparse-rewards property can now be used to motivate the performance increase, even if the approximation error is bounded by ε/(1 − β) instead of ε/(1 − β)³ (as Lemma 7 suggests). The approximation error will not, in most cases, satisfy the sparsity assumption, as the errors are typically distributed almost uniformly over the state space and are, as a result, incurred in every step. Therefore, for sparse rewards, the discount error increase will typically be offset by the larger decrease in the approximation error.

The cumulative error bounds derived above predict that it is beneficial to reduce the discount factor to β when:

‖v_γ − v_β‖∞ + ε/(1 − β) ≤ γ^k (γ^{2.5} − β^{2.5}) / ((1 − γ^{2.5})(1 − β^{2.5})) + ρ + ε/(1 − β) < ε/(1 − γ).

The effective discount factor γ* in Tetris is not known, but consider for example that it is γ* = 0.99. Assuming φ = 0.1, we have k = 48, which means that the first ⌊48/2.5⌋ rewards must be excluded and included in ρ. The bounds then predict that for ε ≥ 0.4 the performance of approximate value iteration may be expected to improve using β ≤ γ − φ.

We end by empirically illustrating the influence of reward sparsity in a more general context. Consider a simple 7-state chain problem with a single policy, and two reward instances: one with a single reward of 1, and the other with randomly generated rewards.
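A sketch of this chain comparison (our own reconstruction — the paper does not list the exact rewards; here the chain is a deterministic 7-state cycle, so values have a closed form):

```python
# 7-state cycle chain, one policy; compare the discount error under sparse vs.
# dense rewards. v(s) = sum_{d<n} delta^d r((s+d) % n) / (1 - delta^n) on a cycle.
def chain_values(rewards, delta):
    n = len(rewards)
    return [sum(delta ** d * rewards[(s + d) % n] for d in range(n)) / (1 - delta ** n)
            for s in range(n)]

gamma = 0.95
sparse = [1, 0, 0, 0, 0, 0, 0]                   # a single reward of 1
dense = [0.3, 0.1, 0.25, 0.05, 0.2, 0.15, 0.4]   # arbitrary dense rewards (hypothetical)

def discount_error(rewards, beta):
    vg, vb = chain_values(rewards, gamma), chain_values(rewards, beta)
    return max(abs(a - b) for a, b in zip(vg, vb))

# The discount error shrinks as beta approaches gamma, for both reward patterns.
errs_sparse = [discount_error(sparse, b) for b in (0.5, 0.7, 0.9)]
errs_dense = [discount_error(dense, b) for b in (0.5, 0.7, 0.9)]
```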
Figure 6 compares the effect of a lower discount factor on these two examples. The dotted line represents the global error with sparse rewards, and the solid line the cumulative error with dense rewards. Sparsity of the rewards makes a decrease of the discount factor more beneficial.

6 Conclusion and Future Work

We showed in this paper that some common approximation error bounds may be tightened with a lower discount factor. We also identified a class of problems in which a lower discount factor is likely to increase the performance of approximate dynamic programming algorithms, namely problems in which the rewards are received relatively sparsely. We concentrated on a theoretical analysis of the influence of the discount factor, not on specific methods for determining one. The actual dependence of the performance on the discount factor may be non-trivial, and therefore hard to predict based on simple bounds. The most practical approach is thus to first predict an improving discount factor based on the theoretical analysis, and then use line search to find a discount factor that ensures good performance. This is feasible since the discount factor is a single-dimensional variable with a limited range.

The central point of our analysis is based on bounds that are in general quite loose. An important future direction is to analyze the approximation error more carefully. We plan experiments to gain insight into the form (i.e. the distribution) of the error in several settings (problems and approximation architectures). If such errors follow some law, it might be possible to use it to tighten the bounds.

Acknowledgements. This work was supported in part by the Air Force Office of Scientific Research Grant No. FA9550-08-1-0171 and by the National Science Foundation Grant No. 0535061. The first author was also
The \ufb01rst author was also\nsupported by a University of Massachusetts Graduate Fellowship.\n\nReferences\n[1] Dimitri P. Bertsekas and Sergey Ioffe. Temporal differences-based policy iteration and applications in\n\nneuro-dynamic programming. Technical Report LIDS-P-2349, LIDS, 1997.\n\n[2] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-dynamic programming. Athena Scienti\ufb01c, 1996.\n[3] V.F. Farias and B. Van Roy. Probabilistic and Randomized Methods for Design Under Uncertainty, chapter\n\n6: Tetris: A Study of Randomized Constraint Sampling. Springer-Verlag, 2006.\n\n[4] Sham Machandranath Kakade. A Natural Policy Gradient. In Advances in neural information processing\n\nsystems, pages 1531\u20131538. MIT Press, 2001.\n\n[5] Ronald Parr, Lihong Li, Gavin Taylor, Christopher Painter-Wake\ufb01eld, and Michael L. Littman. An analysis\nof linear models, linear value function approximation, and feature selection for reinforcement learning. In\nInternational Conference on Machine Learning, 2008.\n\n[6] Warren B. Powell. Approximate Dynamic Programming. Wiley-Interscience, 2007.\n[7] Martin L. Puterman. Markov decision processes: Discrete stochastic dynamic programming. John Wiley\n\n& Sons, Inc., 2005.\n\n[8] Richard S. Sutton and Andrew Barto. Reinforcement learning. MIT Press, 1998.\n\n\f", "award": [], "sourceid": 597, "authors": [{"given_name": "Marek", "family_name": "Petrik", "institution": null}, {"given_name": "Bruno", "family_name": "Scherrer", "institution": null}]}